[jira] Commented: (NUTCH-887) Delegate parsing of feeds to Tika

Andrzej Bialecki (JIRA) Sun, 15 Aug 2010 00:02:50 -0700

    [ 
https://issues.apache.org/jira/browse/NUTCH-887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898667#action_12898667
 ]


Andrzej Bialecki  commented on NUTCH-887:
-----------------------------------------

bq. Huh, what do you mean? Nick just added a bunch of code to handle Compound 
document detection, and parsing

Ah, good - I missed that, I need to take a closer look at this...

bq. I'm starting to feel the creep of parsing plugins make their way back into 
Nutch instead of just jumping over into Tika

The "creep" so far is just parse-html, which we were forced to add back because 
Tika HTML parsing was totally inadequate for our needs. I know there has been 
some progress on this front, but I suspect it's still not sufficient. The 
ultimate goal is still to use Tika for all formats that it can handle, 
preferably "all formats" without further qualifiers ;)

> Delegate parsing of feeds to Tika
> ---------------------------------
>
>                 Key: NUTCH-887
>                 URL: https://issues.apache.org/jira/browse/NUTCH-887
>             Project: Nutch
>          Issue Type: Wish
>          Components: parser
>    Affects Versions: 2.0
>            Reporter: Julien Nioche
>             Fix For: 2.0
>
>
> [Starting a new thread from https://issues.apache.org/jira/browse/NUTCH-874]
> One of the plugins which hasn't been ported yet is the feed parser. We could 
> rely on the one we recently added to Tika, knowing that there is a 
> substantial difference in the sense that the Tika feed parser generates a 
> simple XHTML representation of the document where the feeds are simply 
> represented as anchors whereas the Nutch version created new documents for 
> each feed.
> There is also the parse-rss plugin in Nutch which is quite similar - what's 
> the difference with the feed one again? Since the Tika parser would handle 
> all sorts of feed formats why not simply rely on it? 
> Any thoughts on this?
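
To make the difference described in the issue concrete, here is a minimal
sketch of the two output styles: one XHTML document where each entry is just
an anchor, versus one standalone document per entry. It is an illustration
only and does not use the Tika or Nutch APIs; the sample feed, the function
names and the dictionary layout are all assumptions made up for this example,
and it reads "new documents for each feed" as "one document per feed entry".

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# Hypothetical sample feed, invented for this illustration.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First post</title>
      <link>http://example.com/1</link>
      <description>Body of the first post</description>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/2</link>
      <description>Body of the second post</description>
    </item>
  </channel>
</rss>"""


def read_items(rss_text):
    """Return (title, link, description) tuples for each RSS 2.0 <item>."""
    root = ET.fromstring(rss_text)
    return [
        (
            item.findtext("title", default=""),
            item.findtext("link", default=""),
            item.findtext("description", default=""),
        )
        for item in root.iter("item")
    ]


def as_single_xhtml(rss_text):
    """Style 1: a single XHTML document, each entry reduced to an anchor."""
    anchors = "".join(
        "<li><a href=%s>%s</a></li>" % (quoteattr(link), escape(title))
        for title, link, _ in read_items(rss_text)
    )
    return "<html><body><ul>%s</ul></body></html>" % anchors


def as_separate_documents(rss_text):
    """Style 2: one standalone document per entry, keyed by the entry URL."""
    return {
        link: {"title": title, "text": description}
        for title, link, description in read_items(rss_text)
    }
```

Calling as_single_xhtml(SAMPLE_RSS) yields one page whose only content is
the list of links, so the entries would be reached as outlinks; calling
as_separate_documents(SAMPLE_RSS) keeps each entry's title and text as its
own record. Which of the two behaviours is wanted is the open question in
this issue.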

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
