I am in need of feedback/ideas. ;) What would be the cleanest way to return several ParseData (or Parse) objects from a getParse, instead of only one, and still use the rest of the framework? Has anybody done this? I have looked at the different classes to see where it could be done, but I always find myself breaking the whole process and having to change the code in a lot of places.
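
A rough sketch of the kind of return shape being asked about: one fetched document containing several entries (feed items, say) yields one parse per entry, keyed by the entry's URL, instead of a single Parse. Everything in it (MultiParseSketch, FeedItem, ItemParse, parseAll, extractOutlinks) is a hypothetical stand-in and not a Nutch class, and the href regex is only a placeholder for real outlink extraction.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: none of these types are Nutch classes.
public class MultiParseSketch {

    /** Hypothetical stand-in for one entry already extracted from the fetched content. */
    public static final class FeedItem {
        public final String link;
        public final String title;
        public final String description;

        public FeedItem(String link, String title, String description) {
            this.link = link;
            this.title = title;
            this.description = description;
        }
    }

    /** Hypothetical stand-in for a per-entry parse (text plus outlinks). */
    public static final class ItemParse {
        public final String title;
        public final String text;
        public final List<String> outlinks;

        public ItemParse(String title, String text, List<String> outlinks) {
            this.title = title;
            this.text = text;
            this.outlinks = outlinks;
        }
    }

    // Placeholder outlink extraction: collects href="..." values only.
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]+)\"");

    public static List<String> extractOutlinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    /** Returns one parse per entry, keyed by the entry URL, in document order. */
    public static Map<String, ItemParse> parseAll(List<FeedItem> items) {
        Map<String, ItemParse> parses = new LinkedHashMap<>();
        for (FeedItem item : items) {
            parses.put(item.link,
                new ItemParse(item.title, item.description, extractOutlinks(item.description)));
        }
        return parses;
    }

    public static void main(String[] args) {
        List<FeedItem> items = Arrays.asList(
            new FeedItem("http://example.com/a", "First",
                "See <a href=\"http://example.com/x\">x</a>"),
            new FeedItem("http://example.com/b", "Second", "No links here"));

        // Each entry would then be handed downstream as its own document.
        for (Map.Entry<String, ItemParse> e : parseAll(items).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue().title
                + " " + e.getValue().outlinks);
        }
    }
}
```

Keying by URL is only one possible choice here; it assumes every entry has its own distinct link, which downstream steps could then treat as a separate document.
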

The use case is the following: an RSS document has items, and the goal is to index the individual items rather than the channel, which is what parse-rss indexes. So the steps would be: extract outlinks, keywords, ... for one item, and do that for all the items in the Content. I think it would then require a different ParseImpl, ParseSegment, Indexer, signature, DeleteDuplicate, etc. Am I completely wrong?

I am trying to use as much of Nutch as possible, since I also use it for HTML. Otherwise, I'll go for mostly Hadoop and some sort of light Nutch with a homemade scheduler/adaptive fetch/crawldb/parser.

Your thoughts are much appreciated to help my brain at the end of a Friday afternoon... ;)

Thanks!
Jeremy.
