[ 
https://issues.apache.org/jira/browse/NUTCH-1259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13204556#comment-13204556
 ] 

Markus Jelsma commented on NUTCH-1259:
--------------------------------------

Hi,

Consider the following URL, which produces bad output. It is not the only 
one: we've seen countless examples that produce funky values in both content 
meta and parse meta, or no value at all.

http://kam.mff.cuni.cz/conferences/GraDR/

The current Nutch trunk shows us the following meta data for this URL obtained 
via parsechecker with only parse-tika enabled:

{code}
Content Metadata: Vary=negotiate,accept,Accept-Encoding Date=Thu, 09 Feb 2012 
14:37:47 GMT Content-Length=4911 TCN=choice Content-Encoding=gzip 
Content-Location=index.html.bak Content-Type=application/x-trash 
Connection=close Accept-Ranges=bytes Server=Apache/2.2.9 (Debian) 
mod_auth_kerb/5.3 PHP/5.2.6-1+lenny14 with Suhosin-Patch mod_ssl/2.2.9 
OpenSSL/0.9.8g 
Parse Metadata: Content-Encoding=ISO-8859-1
{code}

According to content meta it's an application/x-trash, and parse meta contains 
no Content-Type at all. But it's just an ordinary HTML page. That cannot be 
right: from an indexing point of view we would never know that this is an HTML 
page. With this patch applied we get the following output:

{code}
Content Metadata: Vary=negotiate,accept,Accept-Encoding Date=Thu, 09 Feb 2012 
14:40:15 GMT Content-Length=4911 TCN=choice Content-Encoding=gzip 
Content-Location=index.html.bak Content-Type=application/x-trash 
Connection=close Accept-Ranges=bytes Server=Apache/2.2.9 (Debian) 
mod_auth_kerb/5.3 PHP/5.2.6-1+lenny14 with Suhosin-Patch mod_ssl/2.2.9 
OpenSSL/0.9.8g 
Parse Metadata: Content-Encoding=ISO-8859-1 Content-Type=text/html
{code}

For us, this solves all problems, as we now rely only on Tika's MIME-detector 
and store its result in parse meta. The value in content meta cannot be 
trusted. It's the same as with languages: when we do not use Tika to detect 
the language, we get all sorts of crap.

Since the upgrade to Tika 1.0 and with NUTCH-1230 we obtain the detected 
MIME-type, but it was not added to the parse meta. Now it is.
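
To illustrate the idea (this is not the code from the attached patch), here is 
a minimal Java sketch of the behaviour described above. The class 
ParseMetaSketch and the method addDetectedType are hypothetical names, plain 
maps stand in for Nutch's metadata objects, and the detected type is passed in 
as a string, assuming it was already obtained from Tika's detector:

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch; plain maps stand in for Nutch's metadata classes. */
public class ParseMetaSketch {

    /**
     * Stores the detected MIME type in parse meta under Content-Type.
     * The Content-Type from the HTTP headers (content meta) is not consulted.
     */
    static void addDetectedType(Map<String, String> parseMeta, String detectedType) {
        if (detectedType != null && !detectedType.isEmpty()) {
            parseMeta.put("Content-Type", detectedType);
        }
    }

    public static void main(String[] args) {
        // Content meta as reported by the server in the example above.
        Map<String, String> contentMeta = new LinkedHashMap<>();
        contentMeta.put("Content-Type", "application/x-trash");

        // Parse meta before the detected type is added.
        Map<String, String> parseMeta = new LinkedHashMap<>();
        parseMeta.put("Content-Encoding", "ISO-8859-1");

        // Assumption: "text/html" is what the detector returned for this page.
        addDetectedType(parseMeta, "text/html");

        System.out.println("Content Metadata: " + contentMeta);
        System.out.println("Parse Metadata: " + parseMeta);
    }
}
```

Content meta is left untouched here, so the header value stays visible for 
debugging while anything that indexes on type can read parse meta instead.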

Do you have another suggestion? 
                
> TikaParser should not add Content-Type from HTTP Headers to Nutch Metadata
> --------------------------------------------------------------------------
>
>                 Key: NUTCH-1259
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1259
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.4
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.5
>
>         Attachments: NUTCH-1259-1.5-1.patch
>
>
> The MIME-type detected by Tika's Detect() API is never added to a Parse's 
> ContentMetaData or ParseMetaData. Because of this, bad Content-Types will 
> end up in the documents. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
