HUYLEBROECK Jeremy RD-ILAB-SSF wrote:
I am in need of feedback/ideas. ;)

What would be the cleanest way to return several ParseData (or
Parse) objects from a getParse, rather than only one, and still use the
rest of the framework? Has anybody done this?
I have looked at the different classes to see where it could be done,
but I always find myself breaking the whole process and having to change
the code in a lot of places.

Well, the problem with this is that the current Nutch architecture rests on several assumptions that make this difficult:

1) it enforces a strong split between the protocol and parse layers: once the resource content leaves the protocol layer there is no way back to fetch additional resources (but see below),

2) it assumes that one input URL results in a single resource,

3) it assumes that URLs identify independent resources (there is no composition or aggregation of resources),

4) fetching is performed breadth-first, in random order.

Of course, that's a bunch of idealistic assumptions ... ;) In reality, Nutch compromises some of them:

ad 1) some of the parse-level data gets pushed down to the protocol layer if needed, namely redirects and robots exclusion metadata from HTML meta tags (the same should be done for set-cookie, but this is not handled yet). This is further complicated by the fact that fetching and parsing don't have to be tightly coupled in a single process; they may be executed as separate batch jobs. So there are private mini-protocols between these layers to pass this info across batch runs.

ad 2) only redirects are handled now, in the sense that all data (both the response before redirect and after redirect) are stored. There is no support for returning multiple responses from a single request. RSS is a good example of why we would need to extend the API to provide this support. Exhaustive fetching scenarios (e.g. collect all URLs below that URL path) would be another case. Crawling a DB (select * from $TABLE) would be yet another case where this support would make sense.

ad 3) Nutch doesn't handle this at all now. This is sometimes frustrating, because if you get one part of a page (e.g. the top frame), you can't be sure that you got all subcomponents (images, nested frames, scripts) that match this particular version of the container-type resource. This may affect the subsequent analysis of the page, and eventually it will affect the "cached view". Support for this functionality would be a welcome addition. I intended to pursue this subject when I added ParseStatus.FAILED_MISSING_PARTS - please see the javadoc there - however, no code at the moment makes use of this.

ad 4) fetch jobs are organized along randomized fetchlists, and not along high-level instructions like "fetch depth-first starting from this URL, N levels deep, at most M pages". This could be fixed by changing the Generator and Fetcher (or rather by implementing alternative versions of each).
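
To make the "depth-first, N levels deep, at most M pages" instruction concrete, here is a minimal sketch of the visiting order such a fetcher would produce. It is not Nutch code: the class DepthFirstPlanSketch, the method plan and the in-memory outlinks map are all hypothetical stand-ins, and a real implementation would have to discover outlinks by fetching and parsing as it goes.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch; none of these names exist in Nutch. */
final class DepthFirstPlanSketch {

    /** A URL together with its distance from the seed. */
    static final class Node {
        final String url;
        final int depth;

        Node(String url, int depth) {
            this.url = url;
            this.depth = depth;
        }
    }

    /**
     * Returns URLs in depth-first visiting order, going at most maxDepth
     * levels below the seed and visiting at most maxPages URLs.
     * The outlinks map stands in for what the parser would discover.
     */
    static List<String> plan(String seed, Map<String, List<String>> outlinks,
                             int maxDepth, int maxPages) {
        List<String> order = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        Deque<Node> stack = new ArrayDeque<>();
        stack.push(new Node(seed, 0));
        while (!stack.isEmpty() && order.size() < maxPages) {
            Node n = stack.pop();
            if (!seen.add(n.url)) {
                continue; // already visited via another path
            }
            order.add(n.url);
            if (n.depth == maxDepth) {
                continue; // depth limit reached, do not expand
            }
            List<String> links = outlinks.getOrDefault(n.url, Collections.emptyList());
            // Push in reverse so that the first outlink is visited first.
            for (int i = links.size() - 1; i >= 0; i--) {
                String link = links.get(i);
                if (!seen.contains(link)) {
                    stack.push(new Node(link, n.depth + 1));
                }
            }
        }
        return order;
    }
}
```

The explicit stack replaces the randomized fetchlist: each fetched page immediately contributes its outlinks to the front of the queue, which is exactly the coupling between parsing and scheduling that the batch-oriented Generator/Fetcher split does not provide.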

The use case is like the following:
An RSS document has items; the goal is to index the items and not the
channel, as parse-rss does.
So the steps would be:
- extract outlinks, keywords ... for one item,
- and do it for all the items in the Content.
I think it would then require different ParseImpl, ParseSegment,
Indexer, signature, DeleteDuplicate etc...

I don't think all of them would have to be modified - so long as you don't change the segment format most tools should work properly. A lot of meta-information (like aggregation relationships) can be carried across in CrawlDatum.metaData or ParseData.metadata.
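
As a rough illustration of carrying part-whole information in metadata, the sketch below builds one key/value map per RSS item, each pointing back at its channel. Plain maps stand in for CrawlDatum.metaData or ParseData.metadata, and the key names (x-parent-url, x-part-index, x-part-count) as well as the class AggregationMetadataSketch are invented for this example, not existing Nutch conventions.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch; the key names and classes are not Nutch API. */
final class AggregationMetadataSketch {

    // Assumed metadata keys, invented for this example.
    static final String PARENT_URL = "x-parent-url";
    static final String PART_INDEX = "x-part-index";
    static final String PART_COUNT = "x-part-count";

    /** Builds one metadata map per item, each pointing back at the channel. */
    static List<Map<String, String>> itemMetadata(String channelUrl, List<String> itemUrls) {
        List<Map<String, String>> result = new ArrayList<>();
        for (int i = 0; i < itemUrls.size(); i++) {
            Map<String, String> meta = new LinkedHashMap<>();
            meta.put("url", itemUrls.get(i));
            meta.put(PARENT_URL, channelUrl);
            meta.put(PART_INDEX, Integer.toString(i));
            meta.put(PART_COUNT, Integer.toString(itemUrls.size()));
            result.add(meta);
        }
        return result;
    }

    /** True when the number of distinct part indexes matches the declared count. */
    static boolean isComplete(List<Map<String, String>> parts) {
        if (parts.isEmpty()) {
            return false;
        }
        int expected = Integer.parseInt(parts.get(0).get(PART_COUNT));
        Set<Integer> seen = new HashSet<>();
        for (Map<String, String> meta : parts) {
            seen.add(Integer.parseInt(meta.get(PART_INDEX)));
        }
        return seen.size() == expected;
    }
}
```

Because the relationship lives entirely in metadata values, the segment format itself is unchanged; only the tools that care about aggregation (for instance an indexer that wants complete sets of parts) would need to read these keys.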

Am I completely wrong?
I am trying to use as much Nutch stuff as possible as I use it for HTML
stuff also. Otherwise, I'll go for mostly hadoop and some sort of
light-nutch with a homemade scheduler/adaptive fetch/crawldb/parser.

Your thoughts are much appreciated to help my brain on a Friday end of
afternoon... ;)

Well, it is hard to say whether it's better to work around / change the Fetcher and associated tools, or to just pick some Nutch parts (crawldb, segments, parsers, protocol handlers) and write your own fetcher/generator, using Hadoop as the overall framework.

Unfortunately, some seemingly simple changes (e.g. extending the Protocol interface to return Iterator<ProtocolOutput>, and Parser to return Iterator<Parse>) have far-reaching consequences across many parts of Nutch, not only from the purely mechanical view of API compatibility, but also from the semantic POV (discovering new resources, updating old ones, managing part-whole relationships, etc).
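
The mechanical half of such a change might look like the sketch below, which is only an illustration under stated assumptions: ParseResult, SingleParser and MultiParser are hypothetical stand-ins, not the real Nutch Parse and Parser types, and addressing each item as "#itemN" under the feed URL is an invented convention. The adapter shows how existing one-result parsers could keep working; the semantic questions (dedup, updates, part-whole tracking) are not addressed here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

/** Hypothetical sketch; these types only mimic the shape of the idea. */
final class MultiParseSketch {

    /** Stand-in for a parse result tied to a (sub-)resource URL. */
    static final class ParseResult {
        final String url;
        final String text;

        ParseResult(String url, String text) {
            this.url = url;
            this.text = text;
        }
    }

    /** Shape of a parser that yields exactly one result per input. */
    interface SingleParser {
        ParseResult parse(String url, String content);
    }

    /** Shape of the extended contract: any number of results per input. */
    interface MultiParser {
        Iterator<ParseResult> parseAll(String url, String content);
    }

    /** Wraps a one-result parser so callers can use the iterator contract. */
    static MultiParser adapt(SingleParser parser) {
        return (url, content) ->
                Collections.singletonList(parser.parse(url, content)).iterator();
    }

    /** Toy multi-parser: treats every non-blank line as a separate item. */
    static final MultiParser LINE_ITEMS = (url, content) -> {
        List<ParseResult> out = new ArrayList<>();
        int index = 0;
        for (String line : content.split("\n")) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines, they carry no item
            }
            // Assumed convention: items are addressed as fragments of the feed URL.
            out.add(new ParseResult(url + "#item" + index, line.trim()));
            index++;
        }
        return out.iterator();
    };
}
```

Every caller that today expects one Parse per Content would have to loop over the iterator and decide what URL, signature and crawl status each element gets, which is where the changes start to spread.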

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

