Luis Lopez created NUTCH-2033:
---------------------------------
Summary: parse-tika skips valid documents.
Key: NUTCH-2033
URL: https://issues.apache.org/jira/browse/NUTCH-2033
Project: Nutch
Issue Type: Bug
Components: parser
Affects Versions: 1.10
Reporter: Luis Lopez
Fix For: 1.11
If we run:
bin/nutch parsechecker -dumpText http://ngdc.noaa.gov/geoportal/openSearchDescription
we’ll get:
Status: failed(2,0): Can't retrieve Tika parser for mime-type application/opensearchdescription+xml
The same occurs for:
bin/nutch parsechecker -dumpText http://petstore.swagger.io/v2/swagger.json
Both are perfectly valid documents, and both would parse if they were returned
as "application/xml" and "text/plain" respectively.
This happens because parse-tika uses the mime type to retrieve a suitable
parser, and some composite mime types are not in the list of types it has a
parser for, even though the documents are perfectly valid and parsable. On top
of that, servers often return incorrect mime types for the documents requested.
We created a helper class as a workaround for this issue. The class uses
regular expressions to define synonyms. In the first case above, any mime type
that matches "application/(.*)\+xml" is replaced by "application/xml", so
parse-tika parses the document just fine.
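The synonym idea can be sketched roughly as follows. This is a minimal
illustration and not the helper class mentioned above: the class name
MimeTypeSynonyms and the methods addRule, resolve and withXmlRule are
hypothetical, and only the "application/(.*)\+xml" rule comes from this
report. The sketch assumes the resolved type would be used for the parser
lookup in place of the type reported by the server.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; names are assumptions,
// not part of Nutch or of the workaround described in this issue.
final class MimeTypeSynonyms {

    // Rules are kept in insertion order so the first matching rule wins.
    private final Map<Pattern, String> rules = new LinkedHashMap<>();

    // Registers a synonym: any mime type fully matching the regex
    // is replaced by the given replacement type.
    MimeTypeSynonyms addRule(String regex, String replacement) {
        rules.put(Pattern.compile(regex), replacement);
        return this;
    }

    // Returns the synonym for the given mime type, or the (trimmed)
    // type itself when no rule matches.
    String resolve(String mimeType) {
        if (mimeType == null) {
            return null;
        }
        String candidate = mimeType.trim();
        for (Map.Entry<Pattern, String> rule : rules.entrySet()) {
            if (rule.getKey().matcher(candidate).matches()) {
                return rule.getValue();
            }
        }
        return candidate;
    }

    // Convenience factory holding only the rule quoted in this report.
    static MimeTypeSynonyms withXmlRule() {
        return new MimeTypeSynonyms()
                .addRule("application/(.*)\\+xml", "application/xml");
    }
}
```

With this sketch, MimeTypeSynonyms.withXmlRule().resolve(...) maps
"application/opensearchdescription+xml" to "application/xml" and leaves types
that match no rule unchanged. Further synonyms, for example one for the second
case above, would be added with addRule.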
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)