[jira] [Commented] (NUTCH-1259) TikaParser should not add Content-Type from HTTP Headers to Nutch Metadata

Julien Nioche (Commented) (JIRA) Thu, 09 Feb 2012 08:20:28 -0800

    [ 
https://issues.apache.org/jira/browse/NUTCH-1259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13204619#comment-13204619
 ]


Julien Nioche commented on NUTCH-1259:
--------------------------------------

Thanks for the example. Here is a summary of what is happening.
The correct MIME type guessed by Tika is stored in the Content object. This is 
what is then used during the parsing step to determine which parser 
implementation should be used. This value is what you can see displayed by the 
parser checker, e.g.

{noformat} 
fetching: http://kam.mff.cuni.cz/conferences/GraDR/
parsing: http://kam.mff.cuni.cz/conferences/GraDR/
contentType: text/html
signature: 575aecee981b1aa03a145e3dc5b4de72
{noformat}

This is different from the value displayed in the content metadata, which 
corresponds to what is returned in the protocol headers. It is also different 
from the value found in the parse metadata, which is what can be found in the 
content itself. Note that there is no guarantee that either of these two values 
can be found.

Now the problem with [https://issues.apache.org/jira/browse/NUTCH-1258] is that 
while the ParserFilters have access to the Content object, this is not the case 
for the IndexingFilters. One option would be to have a bespoke Parser 
implementation that copies the CT from the Content object (i.e. the one Tika 
guessed) into a custom metadata entry, then use that in the indexing filter. 
That's unnecessarily messy.

I think a cleaner approach would be to store the guessed content-type in the 
crawldatum metadata. This way we:
* can access it from the indexing filters (the parsing filter would still get 
it from Content if necessary)
* do not override the value stored in parse metadata
* can access it regardless of whether a document has been parsed or not
* have a mechanism which is independent from the actual parser used (html / 
tika / other)
* have the possibility of taking a different decision as to which value should 
be used (guessed vs protocol vs content)
* keep a trace of why a particular parser was used on a given document

This would be done in the output method of the class Fetcher.
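
As a rough illustration of the idea above, the sketch below uses plain Java maps 
as hypothetical stand-ins for the crawldatum, content and parse metadata, so it 
runs without Nutch on the classpath. The class name, the key name 
Content-Type-Guessed and the preference order (guessed, then protocol, then 
content) are assumptions made for the example; none of them come from Nutch or 
from the attached patch.

```java
import java.util.Map;

// Hypothetical sketch: plain maps stand in for the metadata carried by a
// CrawlDatum, a Content object and a Parse. This is not the Nutch API.
final class GuessedContentType {

    // Assumed key under which the detected type is kept in the crawldatum
    // metadata; the comment above does not name one.
    static final String GUESSED_KEY = "Content-Type-Guessed";

    // Key used here for both the protocol header and the in-content value.
    static final String HEADER_KEY = "Content-Type";

    private GuessedContentType() {
    }

    /**
     * Records the detected type in the crawldatum metadata without touching
     * any other entry. Would be called where the fetcher writes its output.
     */
    static void store(Map<String, String> datumMeta, String guessedType) {
        if (guessedType != null && !guessedType.isEmpty()) {
            datumMeta.put(GUESSED_KEY, guessedType);
        }
    }

    /**
     * Picks a content type for indexing. The order used here (guessed,
     * then protocol header, then in-content value) is only one possible
     * policy. Returns null when none of the three sources has a value.
     */
    static String choose(Map<String, String> datumMeta,
                         Map<String, String> contentMeta,
                         Map<String, String> parseMeta) {
        String guessed = datumMeta.get(GUESSED_KEY);
        if (guessed != null) {
            return guessed;
        }
        String fromProtocol = contentMeta.get(HEADER_KEY);
        if (fromProtocol != null) {
            return fromProtocol;
        }
        return parseMeta.get(HEADER_KEY);
    }
}
```

Because the guessed value lives under its own key, the protocol and in-content 
values stay untouched, and an indexing filter could apply a different policy 
than the one in choose() if needed.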

What do you think?

 


                
> TikaParser should not add Content-Type from HTTP Headers to Nutch Metadata
> --------------------------------------------------------------------------
>
>                 Key: NUTCH-1259
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1259
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.4
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.5
>
>         Attachments: NUTCH-1259-1.5-1.patch
>
>
> The MIME-type detected by Tika's Detect() API is never added to a Parse's 
> ContentMetaData or ParseMetaData. Because of this, bad Content-Types will end 
> up in the documents. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
