Re: Parser returning several ParseData?

Andrzej Bialecki Sat, 15 Jul 2006 11:29:24 -0700

HUYLEBROECK Jeremy RD-ILAB-SSF wrote:

I am in need of feedback/ideas. ;)


What would be the cleanest way to return not just one ParseData (or
Parse) object from a getParse but several, and still use the rest
of the framework? Has anybody done this?
I looked at the different classes and where it could be done, but I always
find myself breaking the whole process and having to change the code in a
lot of places.

Well, the problem with this is that the current Nutch architecture follows several assumptions that make this difficult:

1) it enforces a strong split between protocol and parse layers: once the resource content leaves the protocol layer there is no way back to fetch additional resources (but see below),


2) it assumes that one input URL results in a single resource

3) it assumes that URLs identify independent resources (there is no composition or aggregation of resources).


4) fetching is performed breadth-first, in random order.

Of course, that's a bunch of idealistic assumptions ... ;) In reality, Nutch compromises some of them:

ad 1) some of the parse-level data gets pushed down to the protocol layer if needed, namely redirects and robot exclusion metadata from HTML meta tags (the same should be done for set-cookie, but this is not handled yet). This is further complicated by the fact that fetching and parsing don't have to be tightly coupled in a single process; they may be executed as separate batch jobs - so there are private mini-protocols between these layers to facilitate passing this info across batch runs.

ad 2) only redirects are handled now, in the sense that all data (both the response before redirect and after redirect) are stored. There is no support for returning multiple responses from a single request. RSS is a good example of why we would need to extend the API to provide this support. Exhaustive fetching scenarios (e.g. collect all URLs below that URL path) would be another case. Crawling a DB (select * from $TABLE) would be yet another case where this support would make sense.

ad 3) Nutch doesn't handle this at all now. This is sometimes frustrating, because if you get one part of a page (e.g. the top frame), you can't be sure that you got all subcomponents (images, nested frames, scripts) that match this particular version of the container-type resource. This may affect the subsequent analysis of the page, and eventually it will affect the "cached view". Support for this functionality would be a welcome addition. I intended to pursue this subject when I added ParseStatus.FAILED_MISSING_PARTS - please see the javadoc there - however, no code at the moment makes use of this.

ad 4) fetch jobs are organized along randomized fetchlists, and not along high-level instructions like "fetch depth-first starting from this url, N levels deep, at most M pages". This could be fixed by changing the Generator and Fetcher (or rather implementing alternative versions of each).
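
To make the "depth-first, N levels deep, at most M pages" instruction concrete, here is a minimal sketch of such a traversal over an in-memory link graph. It is only an illustration: the class DepthFirstPlan, its plan method and the outlinks map are hypothetical names and are not part of Nutch, and a real Generator/Fetcher pair would discover outlinks by fetching rather than by looking them up in a map.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper, not a Nutch class: plans a bounded depth-first visit order. */
final class DepthFirstPlan {

    /** Pairs a URL with its distance (in links) from the seed. */
    private static final class Entry {
        final String url;
        final int depth;

        Entry(String url, int depth) {
            this.url = url;
            this.depth = depth;
        }
    }

    /**
     * Returns URLs in depth-first order starting at seed, following links at
     * most maxDepth levels deep and stopping after maxPages URLs.
     */
    static List<String> plan(Map<String, List<String>> outlinks,
                             String seed, int maxDepth, int maxPages) {
        List<String> order = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        Deque<Entry> stack = new ArrayDeque<>();
        stack.push(new Entry(seed, 0));

        while (!stack.isEmpty() && order.size() < maxPages) {
            Entry e = stack.pop();
            if (!seen.add(e.url)) {
                continue; // already visited via another path
            }
            order.add(e.url);
            if (e.depth >= maxDepth) {
                continue; // depth limit reached, do not expand further
            }
            List<String> links = outlinks.getOrDefault(e.url, Collections.emptyList());
            // Push in reverse so that the first outlink is visited first.
            for (int i = links.size() - 1; i >= 0; i--) {
                stack.push(new Entry(links.get(i), e.depth + 1));
            }
        }
        return order;
    }
}
```

With a graph a -> [b, c], b -> [d], d -> [e], calling plan(graph, "a", 2, 10) yields a, b, d, c: the walk goes down through b to d, stops expanding at depth 2, and then backs up to c. One simplification to be aware of: a URL first reached at a deep level is not revisited if it is later found at a shallower level.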

The use case is like the following:
An RSS document has items, the goal is to index the Items and not the
channel like parse-rss does.
So the steps would be
-extract outlinks, keywords ...  for one item
And do it for all the items in the Content.
I think it would then require different ParseImpl, ParseSegment,
Indexer, signature, DeleteDuplicate etc...

I don't think all of them would have to be modified - so long as you don't change the segment format most tools should work properly. A lot of meta-information (like aggregation relationships) can be carried across in CrawlDatum.metaData or ParseData.metadata.
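
As a rough illustration of carrying an aggregation relationship in metadata, the sketch below uses a plain Map<String, String> as a stand-in for the metadata container. The class AggregationMeta and the key names "x-parent-url" and "x-part-index" are made up for this example and are not keys that Nutch defines; the point is only that each part records its container, so that a later tool can regroup the parts.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper, not a Nutch class: part-whole links stored as metadata. */
final class AggregationMeta {
    // Assumed key names, chosen for this example only.
    static final String PARENT_KEY = "x-parent-url";
    static final String INDEX_KEY = "x-part-index";

    /** Builds the metadata for one part (e.g. one RSS item) of a container resource. */
    static Map<String, String> tagPart(String parentUrl, int index) {
        Map<String, String> meta = new LinkedHashMap<>();
        meta.put(PARENT_KEY, parentUrl);
        meta.put(INDEX_KEY, Integer.toString(index));
        return meta;
    }

    /** Regroups part URLs under their container URL; entries without a parent are skipped. */
    static Map<String, List<String>> groupByParent(Map<String, Map<String, String>> metaByUrl) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (Map.Entry<String, Map<String, String>> e : metaByUrl.entrySet()) {
            String parent = e.getValue().get(PARENT_KEY);
            if (parent == null) {
                continue; // a standalone resource, not part of any container
            }
            groups.computeIfAbsent(parent, k -> new ArrayList<>()).add(e.getKey());
        }
        return groups;
    }
}
```

Because the relationship travels as ordinary key/value pairs, tools that do not know about it can ignore it, which is why the segment format would not need to change.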

Am I completely wrong?
I am trying to use as much Nutch stuff as possible as I use it for HTML
stuff also. Otherwise, I'll go for mostly hadoop and some sort of
light-nutch with a homemade scheduler/adaptive fetch/crawldb/parser.

Your thoughts are much appreciated to help my brain on a Friday end of
afternoon... ;)

Well, hard to say if it's better to work around / change the Fetcher and associated tools, or just pick some Nutch parts (crawldb, segments, parsers, protocol handlers) and write your own fetcher/generator, using hadoop as the overall framework.

Unfortunately, some seemingly simple changes (like e.g. extending the Protocol interface to return Iterator<ProtocolOutput>, and Parser to return Iterator<Parse>) have far reaching consequences across many parts of Nutch, not only from the purely mechanical view of API compatibility, but from the semantic POV (discovering new resources, updating old ones, managing part-whole relationships, etc).
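
For illustration only, this is roughly what the mechanical side of a parser returning an iterator could look like for the RSS case. Everything here is a hypothetical stand-in: ItemParse, MultiParser and RssItemParser are not Nutch types, the real Parser and Parse interfaces look different, and the "#item-N" fallback URL is just an assumption for items without a link. The sketch uses only the JDK's built-in XML parser and says nothing about the semantic issues mentioned above.

```java
import java.io.ByteArrayInputStream;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical stand-in for one parse result; not Nutch's Parse. */
final class ItemParse {
    final String url;
    final String title;
    final String text;

    ItemParse(String url, String title, String text) {
        this.url = url;
        this.title = title;
        this.text = text;
    }
}

/** Hypothetical parser contract that yields several results per input. */
interface MultiParser {
    Iterator<ItemParse> getParses(String baseUrl, byte[] content);
}

/** Emits one ItemParse per <item> element instead of one for the whole channel. */
final class RssItemParser implements MultiParser {

    @Override
    public Iterator<ItemParse> getParses(String baseUrl, byte[] content) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(content));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse feed " + baseUrl, e);
        }
        NodeList items = doc.getElementsByTagName("item");
        List<ItemParse> parses = new ArrayList<>();
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            String link = childText(item, "link");
            // Assumption: items without a link get a synthetic URL under the feed URL.
            String url = link.isEmpty() ? baseUrl + "#item-" + i : link;
            parses.add(new ItemParse(url, childText(item, "title"),
                    childText(item, "description")));
        }
        return parses.iterator();
    }

    /** Returns the trimmed text of the first matching descendant, or "" if none. */
    private static String childText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent().trim();
    }
}
```

The hard part is what the sketch leaves out: every caller that today expects exactly one result per URL (segment writing, signatures, deduplication, crawldb updates) would need a rule for what to do with the extra ones.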


--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
