[ 
https://issues.apache.org/jira/browse/NUTCH-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2033:
-----------------------------------
    Fix Version/s:     (was: 1.16)
                   1.17

> parse-tika skips valid documents.
> ---------------------------------
>
>                 Key: NUTCH-2033
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2033
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.10
>            Reporter: Luis Lopez
>            Assignee: Lewis John McGibbney
>            Priority: Major
>              Labels: mime-type, parse-tika, parser, tika
>             Fix For: 1.17
>
>
> If we run:
> {code}
> bin/nutch parsechecker -dumpText 
> http://ngdc.noaa.gov/geoportal/openSearchDescription
> {code}
> we’ll get:
> {code}
> Status: failed(2,0): Can't retrieve Tika parser for mime-type 
> application/opensearchdescription+xml
> {code}
> The same occurs for:
> {code}
> bin/nutch parsechecker -dumpText http://petstore.swagger.io/v2/swagger.json
> {code}
> Both are perfectly valid documents that would parse if they were returned as 
> "application/xml" and "text/plain" respectively. 
> This happens because parse-tika uses the mime type to retrieve a suitable 
> parser, and some composite mime types are not included in its list even though 
> the documents are perfectly valid and parsable. This does not even take into 
> account that servers often return incorrect mime types for the documents 
> requested.
> We created a helper class as a workaround for this issue. The class uses 
> regular expressions to define synonyms. In the first case, any mime type that 
> matches "application/(.*)\+xml" is replaced by "application/xml", so 
> parse-tika parses the document just fine.
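
The synonym idea described above can be sketched roughly as follows. This is not the helper class attached to the issue. The class name MimeTypeSynonyms, the method normalize, and the JSON rule are hypothetical and chosen only for illustration. Only the "application/(.*)\+xml" rule comes from the description.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical sketch: maps mime types to a synonym that a parser is known to handle.
final class MimeTypeSynonyms {

    // Ordered rules: the first pattern that matches the whole base type wins.
    private static final Map<Pattern, String> RULES = new LinkedHashMap<>();

    static {
        // Rule quoted in the issue description.
        RULES.put(Pattern.compile("application/(.*)\\+xml"), "application/xml");
        // Assumption: JSON documents are treated as plain text, as the
        // description suggests for swagger.json.
        RULES.put(Pattern.compile("application/(.*\\+)?json"), "text/plain");
    }

    // Returns the synonym for the given mime type, or the input unchanged
    // when no rule matches.
    static String normalize(String mimeType) {
        if (mimeType == null) {
            return null;
        }
        // Drop parameters such as "; charset=UTF-8" and compare case-insensitively.
        String base = mimeType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        for (Map.Entry<Pattern, String> rule : RULES.entrySet()) {
            if (rule.getKey().matcher(base).matches()) {
                return rule.getValue();
            }
        }
        return mimeType;
    }

    public static void main(String[] args) {
        System.out.println(normalize("application/opensearchdescription+xml"));
        System.out.println(normalize("application/json; charset=UTF-8"));
        System.out.println(normalize("text/html"));
    }
}
```

A blanket "+xml" rule also rewrites types such as application/xhtml+xml that may already have a dedicated parser, so a real implementation would probably apply the synonym only after the lookup for the original type has failed.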



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
