I am in need of feedback/ideas. ;)

What would be the cleanest way to return several ParseData (or Parse)
objects from a getParse, rather than only one, and still use the rest
of the framework? Has anybody done this?
I have looked at the different classes to see where it could be done,
but I always find myself breaking the whole process and having to change
the code in a lot of places.

The use case is the following:
An RSS document has items; the goal is to index the items and not the
channel, as parse-rss does.
So the steps would be:
- extract outlinks, keywords ...  for one item
- do the same for all the items in the Content.
I think it would then require a different ParseImpl, ParseSegment,
Indexer, signature, DeleteDuplicate, etc.
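
To make the steps above concrete, here is a minimal sketch of the
splitting step only, written against the plain JDK XML parser rather than
any Nutch API. ItemParse and RssItemSplitter are hypothetical names, and
keying each result by the item's <link> is an assumption about how the
items would be told apart downstream; the single outlink per item is
likewise just a placeholder for real outlink extraction.

```java
import java.io.ByteArrayInputStream;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/** Hypothetical per-item result; stands in for one Parse/ParseData pair. */
final class ItemParse {
    final String url;
    final String title;
    final String text;
    final List<String> outlinks;

    ItemParse(String url, String title, String text, List<String> outlinks) {
        this.url = url;
        this.title = title;
        this.text = text;
        this.outlinks = outlinks;
    }
}

/** Hypothetical splitter: one RSS document in, one ItemParse per item out. */
final class RssItemSplitter {

    /** Returns the per-item results keyed by item URL, in document order. */
    static Map<String, ItemParse> parseItems(byte[] rssContent) {
        Document doc;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Reject DOCTYPE declarations so an untrusted feed cannot pull in external entities.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            doc = factory.newDocumentBuilder().parse(new ByteArrayInputStream(rssContent));
        } catch (Exception e) {
            throw new IllegalArgumentException("Not a parseable RSS document", e);
        }

        Map<String, ItemParse> parses = new LinkedHashMap<>();
        NodeList items = doc.getElementsByTagName("item");
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            String link = childText(item, "link");
            if (link.isEmpty()) {
                continue; // no URL to key this item on, so skip it
            }
            List<String> outlinks = new ArrayList<>();
            outlinks.add(link); // placeholder: only the item's own link
            parses.put(link, new ItemParse(link, childText(item, "title"),
                    childText(item, "description"), outlinks));
        }
        return parses;
    }

    /** Text of the first direct child element named tag, or "" if absent. */
    private static String childText(Element parent, String tag) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE && tag.equals(n.getNodeName())) {
                return n.getTextContent().trim();
            }
        }
        return "";
    }
}
```

The open question is still how such a map of per-item results would flow
through ParseSegment, the indexer and deduplication, since the sketch
stops at producing one entry per item.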

Am I completely wrong?
I am trying to use as much of Nutch as possible, since I also use it for
HTML. Otherwise, I'll go mostly with Hadoop and some sort of
light-nutch with a homemade scheduler/adaptive fetch/crawldb/parser.

Your thoughts would be much appreciated, to help my brain at the end of
a Friday afternoon... ;)
Thanks!

Jeremy.
