Nicola Tonellotto created NUTCH-2172:
----------------------------------------

             Summary: Parsing whitespace not just tabs in contenttype-mapping.txt
                 Key: NUTCH-2172
                 URL: https://issues.apache.org/jira/browse/NUTCH-2172
             Project: Nutch
          Issue Type: Bug
          Components: metadata
    Affects Versions: 1.10
         Environment: Mac OS X, Java 8
            Reporter: Nicola Tonellotto
            Priority: Minor


The index-more plugin uses the conf/contenttype-mapping.txt file to build the 
mimeMap hash table (in the readConfiguration() method).
Each line is split around "\t", so lines whose fields are separated by plain 
spaces or by more than one tab are silently skipped (see line 325).
Replacing the single-character string "\t" with the regex "\\s+" should fix this.
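
The difference between the two separators can be illustrated with a small, 
self-contained sketch. This is not the actual index-more code: the class name, 
the parse() helper, the simplified two-column line format and the sample lines 
are all assumptions made for illustration. It only shows how String.split 
behaves with "\t" versus "\\s+".

```java
import java.util.HashMap;
import java.util.Map;

public class ContentTypeMappingSketch {

    // Hypothetical helper (not the Nutch implementation): parses lines of the
    // assumed form "<key><separator><value>" into a map. Lines that do not
    // split into exactly two parts are skipped without any warning.
    public static Map<String, String> parse(String[] lines, String separatorRegex) {
        Map<String, String> map = new HashMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue; // ignore blank lines and comments
            }
            String[] parts = trimmed.split(separatorRegex);
            if (parts.length != 2) {
                continue; // silently skipped, as described in the report
            }
            map.put(parts[0], parts[1]);
        }
        return map;
    }

    public static void main(String[] args) {
        // Made-up sample lines: one tab, several spaces, two tabs.
        String[] lines = {
            "application/pdf\tpdf",
            "text/html    html",
            "image/png\t\tpng"
        };
        // Splitting around a single tab keeps only the first line.
        System.out.println(parse(lines, "\t").size());    // prints 1
        // Splitting around any run of whitespace keeps all three.
        System.out.println(parse(lines, "\\s+").size());  // prints 3
    }
}
```

With "\t" the space-separated line yields one part and the double-tab line 
yields three (with an empty string in the middle), so both are dropped. With 
"\\s+" any run of spaces or tabs counts as a single separator.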



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
