Nicola Tonellotto created NUTCH-2172:
----------------------------------------
Summary: Parsing whitespace not just tabs in contenttype-mapping.txt
Key: NUTCH-2172
URL: https://issues.apache.org/jira/browse/NUTCH-2172
Project: Nutch
Issue Type: Bug
Components: metadata
Affects Versions: 1.10
Environment: Mac OS X, Java 8
Reporter: Nicola Tonellotto
Priority: Minor
The index-more plugin reads the conf/contenttype-mapping.txt file to build
the mimeMap hash table (in the readConfiguration() method).
Each line is split on "\t" only, so lines whose fields are separated by
plain spaces or by more than one tab are silently skipped (see line 325).
Replacing the single-character string "\t" with the regex "\\s+" should fix this.
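To illustrate the difference between the two delimiters, here is a minimal
standalone sketch. It is not the plugin's code: the class MappingSplitSketch,
the parse() helper, the sample lines, and the assumption that the first field
is the target type and the remaining fields are its aliases are all
hypothetical. Only String.split() behaves exactly as it does in the JDK.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MappingSplitSketch {

    // Hypothetical helper: maps every alias (fields 1..n) to the target
    // type (field 0). Lines that yield fewer than two fields are skipped
    // silently, which is the failure mode described in this issue.
    static Map<String, String> parse(List<String> lines, String delimiterRegex) {
        Map<String, String> map = new HashMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue; // ignore blank lines and comments
            }
            String[] parts = trimmed.split(delimiterRegex);
            if (parts.length < 2) {
                continue; // no delimiter matched: the line is dropped
            }
            for (int i = 1; i < parts.length; i++) {
                map.put(parts[i].trim(), parts[0].trim());
            }
        }
        return map;
    }

    public static void main(String[] args) {
        // Made-up sample lines: single tab, single space, double tab.
        List<String> lines = Arrays.asList(
            "text/html\tapplication/xhtml+xml",
            "application/pdf application/x-pdf",
            "image/jpeg\t\timage/jpg");

        System.out.println("split on \\t:   " + parse(lines, "\t"));
        System.out.println("split on \\s+: " + parse(lines, "\\s+"));
    }
}
```

In this sketch, splitting on "\t" drops the space-separated line entirely, and
the double-tab line produces an empty field between the two tabs. Splitting
on "\\s+" treats any run of spaces or tabs as one delimiter, so all three
lines yield a clean mapping.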
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)