[ 
https://issues.apache.org/jira/browse/NUTCH-887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898647#action_12898647
 ] 

Andrzej Bialecki  commented on NUTCH-887:
-----------------------------------------

bq. If there's something missing that Nutch needs, we'll add it to Tika and 
roll it into 0.8.

There is something missing in Tika: support for compound documents, and it's 
not likely to be added in 0.8. Nutch has no such support at the moment either; 
it fell victim to the trunk/nutchbase switch, but it should be added back 
soon. I'd keep the "feed" plugin around for a while as an interim solution 
until Tika supports compound documents. +1 to getting rid of parse-rss.

> Delegate parsing of feeds to Tika
> ---------------------------------
>
>                 Key: NUTCH-887
>                 URL: https://issues.apache.org/jira/browse/NUTCH-887
>             Project: Nutch
>          Issue Type: Wish
>          Components: parser
>    Affects Versions: 2.0
>            Reporter: Julien Nioche
>             Fix For: 2.0
>
>
> [Starting a new thread from https://issues.apache.org/jira/browse/NUTCH-874]
> One of the plugins which hasn't been ported yet is the feed parser. We could 
> rely on the one we recently added to Tika. There is a substantial 
> difference, though: the Tika feed parser generates a simple XHTML 
> representation of the document in which the feed entries are simply 
> represented as anchors, whereas the Nutch version created a new document for 
> each entry.
> There is also the parse-rss plugin in Nutch, which is quite similar. What's 
> the difference between it and the feed one again? Since the Tika parser 
> would handle all sorts of feed formats, why not simply rely on it? 
> Any thoughts on this?
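
To make the distinction in the issue description concrete, here is a minimal
Java sketch of the two output styles: one document whose feed entries are
anchors, versus one document per entry. It uses only the JDK's XML parser and
assumes a plain RSS 2.0 input. The class FeedSketch and its methods
(parseEntries, toAnchors, toDocuments) are hypothetical names made up for this
illustration; this is not the code of the Tika feed parser or of the Nutch
"feed" plugin.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Illustration only: none of these names exist in Tika or Nutch.
final class FeedSketch {

    /** One RSS <item>, reduced to the two fields this sketch needs. */
    static final class Entry {
        final String title;
        final String link;

        Entry(String title, String link) {
            this.title = title;
            this.link = link;
        }
    }

    /** Reads the <item> elements of an RSS 2.0 string (assumed input format). */
    static List<Entry> parseEntries(String rss) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(rss)));
            NodeList items = doc.getElementsByTagName("item");
            List<Entry> entries = new ArrayList<>();
            for (int i = 0; i < items.getLength(); i++) {
                Element item = (Element) items.item(i);
                entries.add(new Entry(text(item, "title"), text(item, "link")));
            }
            return entries;
        } catch (Exception e) {
            throw new IllegalArgumentException("not a parseable feed", e);
        }
    }

    /** Text of the first descendant with the given tag, or "" if absent. */
    private static String text(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent().trim();
    }

    /** Anchor style: the whole feed stays ONE document; entries become links.
     *  No HTML escaping is done here, to keep the sketch short. */
    static String toAnchors(List<Entry> entries) {
        StringBuilder html = new StringBuilder("<ul>");
        for (Entry e : entries) {
            html.append("<li><a href=\"").append(e.link).append("\">")
                .append(e.title).append("</a></li>");
        }
        return html.append("</ul>").toString();
    }

    /** Compound style: each entry becomes its OWN document, keyed by its URL. */
    static Map<String, String> toDocuments(List<Entry> entries) {
        Map<String, String> docs = new LinkedHashMap<>();
        for (Entry e : entries) {
            docs.put(e.link, e.title);
        }
        return docs;
    }

    public static void main(String[] args) {
        String rss = "<rss version=\"2.0\"><channel><title>demo</title>"
            + "<item><title>First</title><link>http://example.org/a</link></item>"
            + "<item><title>Second</title><link>http://example.org/b</link></item>"
            + "</channel></rss>";
        List<Entry> entries = parseEntries(rss);
        System.out.println(toAnchors(entries));
        System.out.println(toDocuments(entries));
    }
}
```

With the anchor style, a crawler sees a single parsed page and only discovers
the entries as outlinks; with the compound style, each entry can be indexed as
a separate document straight from the feed.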

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
