[ https://issues.apache.org/jira/browse/NUTCH-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13591045#comment-13591045 ]

Sebastian Nagel commented on NUTCH-1537:
----------------------------------------

Removing duplicated code could be done in a few ways:
# let 
[o.a.nutch.metadata.Metadata|http://nutch.apache.org/apidocs-1.6/org/apache/nutch/metadata/Metadata.html]
implement all interfaces in 
[o.a.tika.metadata|http://tika.apache.org/1.3/api/org/apache/tika/metadata/package-summary.html]:
there are many because Tika is about providing metadata. But Nutch is mostly 
used to fill an index with content and a few meta fields (those most useful to 
the user). So, do we really need all those predefined meta fields? If some 
users want them, this can still be done in a plugin.
# {{nutch.metadata.Metadata extends tika.Metadata implements 
nutch.metadata.Nutch}}: that would mean replacing the Nutch implementation of 
the Metadata class with Tika's. Metadata is frequently used in Nutch simply 
as a store mapping a key to multiple values. A dependency on Tika may cause 
trouble if Tika decides to change this class.
# keep nutch.metadata.Metadata and the classes holding the string constants 
related to crawling (metadata.Nutch and HttpHeaders). References from plugins 
(e.g., feed or creativecommons) can be removed if these refer directly to 
tika-core (a minor drawback: each of these plugins will then contain 
tika-core.jar).
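
To make option 2 more concrete, here is a minimal Java sketch. It does not use the real Nutch or Tika classes: {{TikaLikeMetadata}}, {{CrawlKeys}}, {{NutchLikeMetadata}} and the key strings are hypothetical stand-ins invented for illustration. It only shows the shape of the idea, namely that the key-to-multiple-values store would be inherited while the Nutch side contributes just the crawl-specific constants.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MetadataOptionsSketch {

    /** Hypothetical stand-in for Tika's metadata class: a key -> multiple values store. */
    static class TikaLikeMetadata {
        private final Map<String, List<String>> values = new LinkedHashMap<>();

        /** Appends a value to the list stored under the given name. */
        public void add(String name, String value) {
            values.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
        }

        /** Returns the first value for the name, or null if there is none. */
        public String get(String name) {
            List<String> v = values.get(name);
            return (v == null || v.isEmpty()) ? null : v.get(0);
        }

        /** Returns all values for the name (empty array if there are none). */
        public String[] getValues(String name) {
            List<String> v = values.get(name);
            return v == null ? new String[0] : v.toArray(new String[0]);
        }
    }

    /** Hypothetical stand-in for the crawl-specific constants (cf. metadata.Nutch). */
    interface CrawlKeys {
        String SEGMENT_KEY = "example.segment.name"; // illustrative key, not a real Nutch constant
        String SCORE_KEY = "example.crawl.score";    // illustrative key, not a real Nutch constant
    }

    /** Option 2: inherit the store, add only the crawler-specific vocabulary. */
    static class NutchLikeMetadata extends TikaLikeMetadata implements CrawlKeys {
        // Intentionally empty: all storage behaviour comes from the superclass,
        // which is exactly the coupling the option-2 drawback is about.
    }

    public static void main(String[] args) {
        NutchLikeMetadata md = new NutchLikeMetadata();
        md.add(CrawlKeys.SEGMENT_KEY, "segment-a");
        md.add("title", "first title");
        md.add("title", "second title");
        System.out.println(md.get(CrawlKeys.SEGMENT_KEY));
        System.out.println(md.getValues("title").length);
    }
}
```

In this arrangement any change to the superclass's {{add}}/{{get}} behaviour is inherited by every caller that uses the subclass as a plain multi-value map, which is the risk named in option 2.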

These possibilities are not mutually exclusive, and surely there are even more. 
I would vote to keep the metadata package as legacy code but try to make it 
smaller and more crawler-specific by removing the most obvious shared classes 
(@[~lewismc]: the amount of duplicated code is striking).
                
> Legacy metadata package needs to take advantage of Apache Tika metadata 
> package more.
> -------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1537
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1537
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.6, 2.1
>            Reporter: Lewis John McGibbney
>            Priority: Minor
>             Fix For: 1.7, 2.2
>
>
> In Nutch, classes from the metadata package are used in quite a number of 
> places. The package does not currently reflect the work going on in Apache 
> Tika, and we need to better leverage the vocabularies available to us through 
> the dependency on Apache Tika.
> The introduction of TikaCoreProperties in Tika 1.2 is not currently leveraged 
> in Nutch. This is just one example of an improved way for us to add metadata 
> to Nutch documents.

