[ 
https://issues.apache.org/jira/browse/NUTCH-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luis Lopez updated NUTCH-2033:
------------------------------
    Description: 
If we run:
```
bin/nutch parsechecker -dumpText http://ngdc.noaa.gov/geoportal/openSearchDescription
```

we’ll get:

Status: failed(2,0): Can't retrieve Tika parser for mime-type application/opensearchdescription+xml

The same occurs for:
{code}
bin/nutch parsechecker -dumpText http://petstore.swagger.io/v2/swagger.json
{code}

Both are perfectly valid documents and would be parsed if they were returned as 
"application/xml" and "text/plain" respectively.

This happens because parse-tika uses the mime type to retrieve a suitable 
parser, and some composite mime types are not included in its list even though 
the documents are perfectly valid and parsable. On top of that, servers often 
return incorrect mime types for the documents requested.

We created a helper class as a workaround for this issue. The class uses 
regular expressions to define synonyms. In the first case, any mime type that 
matches "application/(.*)\+xml" is replaced by "application/xml", so 
parse-tika parses the document just fine.
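
A minimal sketch of what such a synonym lookup could look like. This is not the helper class attached to the issue: the class name MimeTypeSynonyms and the methods addRule and resolve are hypothetical, and the sketch assumes the mime type is passed in without parameters such as "; charset=utf-8".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only, not the class from NUTCH-2033.
public class MimeTypeSynonyms {

    // One synonym rule: a regex over the full mime type and its replacement.
    private static final class Rule {
        final Pattern pattern;
        final String replacement;

        Rule(Pattern pattern, String replacement) {
            this.pattern = pattern;
            this.replacement = replacement;
        }
    }

    // Rules are checked in insertion order; the first match wins.
    private final List<Rule> rules = new ArrayList<>();

    public MimeTypeSynonyms addRule(String regex, String replacement) {
        rules.add(new Rule(Pattern.compile(regex), replacement));
        return this;
    }

    // Returns the synonym of the first rule matching the whole mime type,
    // or the mime type unchanged when no rule matches.
    public String resolve(String mimeType) {
        if (mimeType == null) {
            return null;
        }
        for (Rule rule : rules) {
            if (rule.pattern.matcher(mimeType).matches()) {
                return rule.replacement;
            }
        }
        return mimeType;
    }

    public static void main(String[] args) {
        MimeTypeSynonyms synonyms = new MimeTypeSynonyms()
            .addRule("application/(.*)\\+xml", "application/xml");

        // The composite type from the first example is mapped to its synonym.
        System.out.println(
            synonyms.resolve("application/opensearchdescription+xml"));
        // Types without a matching rule pass through untouched.
        System.out.println(synonyms.resolve("text/html"));
    }
}
```

The resolved type would then be the one handed to the parser lookup in place of the type reported by the server. A rule for the second example could be added the same way once the type the server actually returns is known.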



  was:
If we run:
bin/nutch parsechecker -dumpText 
http://ngdc.noaa.gov/geoportal/openSearchDescription

we’ll get:

Status: failed(2,0): Can't retrieve Tika parser for mime-type 
application/opensearchdescription+xml

the same occurs  for:
bin/nutch parsechecker -dumpText http://petstore.swagger.io/v2/swagger.json

Both perfectly valid documents if they were returned as "application/xml" and 
"text/plain" respectively. 

This happens because parse-tika uses the mime type to retrieve a suitable 
parser, some composite mime types are not included in this list even though 
they are perfectly valid and parsable documents. This not taking into account 
that servers often return incorrect mime types for the documents requested.

We created a helper class as a workaround for this issue. The class uses regex 
expressions to define synonyms. In the first case any mime type that matches 
"application/(.*)\+xml" will be replaced by "application/xml". This way 
parse-tika will parse the document just fine.




> parse-tika skips valid documents.
> ---------------------------------
>
>                 Key: NUTCH-2033
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2033
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.10
>            Reporter: Luis Lopez
>              Labels: mime-type, parse-tika, parser, tika
>             Fix For: 1.11
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
