Chirag Chaman wrote:
Andrzej:

On the same note, let me list examples of certain analyses that should be
helpful, and I'd appreciate it if you could point out an appropriate place
to add the code. Right now these sit external for us, but it would be nice
to integrate them into Nutch.

A general note: my point of view on these issues is that one should implement the control points as early as possible in the processing chain, i.e. as soon as sufficient information is available to make an informed decision. This is to limit the amount of data to be processed, which could make a huge difference in terms of storage/cpu/bandwidth.


However, one may want to postpone decisions to later stages if some other processing (like e.g. language detection) is expensive and is run anyway in one of the later stages.


1. Content - total size < X bytes - discard and mark.

The content size is available only at the fetch stage. However...

I'm working slowly on moving the interactions between fetcher and protocol plugins to use FetchListEntry data instead of just URL (this is needed to implement dynamic re-fetch interval). In other words a Protocol would use:

        Content getContent(FetchListEntry fle)

instead of the current:

        Content getContent(String url)

because the protocol plugins will need to make protocol-dependent decisions whether to fetch the content based on metadata available during fetching (like Last-Modified or If-Modified-Since).
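
To make the idea concrete, here is a minimal sketch of a protocol that receives the whole fetch list entry and uses it to skip unmodified pages. It is not the actual Nutch API: SketchFetchListEntry, SketchContent, SketchProtocol and ConditionalProtocol are simplified stand-ins invented for illustration, the lastFetchTime field is an assumption, and a plain Map plays the role of the remote server's Last-Modified values.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;

// Hypothetical stand-in for a fetch list entry; the real Nutch class differs.
final class SketchFetchListEntry {
    final String url;
    final long lastFetchTime; // millis of the previous fetch, 0 if never fetched (assumed field)

    SketchFetchListEntry(String url, long lastFetchTime) {
        this.url = url;
        this.lastFetchTime = lastFetchTime;
    }
}

// Hypothetical stand-in for fetched content.
final class SketchContent {
    final String url;
    final byte[] data;

    SketchContent(String url, byte[] data) {
        this.url = url;
        this.data = data;
    }
}

// The proposed shape: the protocol sees the entry, not just the URL.
interface SketchProtocol {
    /** Returns the content, or null if the protocol decided not to fetch. */
    SketchContent getContent(SketchFetchListEntry fle);
}

final class ConditionalProtocol implements SketchProtocol {
    // Stand-in for the remote server: URL -> last modification time in millis.
    private final Map<String, Long> serverLastModified;

    ConditionalProtocol(Map<String, Long> serverLastModified) {
        this.serverLastModified = serverLastModified;
    }

    @Override
    public SketchContent getContent(SketchFetchListEntry fle) {
        Long modified = serverLastModified.get(fle.url);
        if (modified == null) {
            return null; // unknown URL, nothing to fetch
        }
        // Protocol-dependent decision: skip pages unchanged since the last fetch.
        if (fle.lastFetchTime > 0 && modified <= fle.lastFetchTime) {
            return null;
        }
        byte[] body = ("body of " + fle.url).getBytes(StandardCharsets.UTF_8);
        return new SketchContent(fle.url, body);
    }
}
```

With only a URL string, the protocol would have no way to know when the page was last fetched, so this kind of decision would have to live elsewhere.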

If/when I complete this change, then it will be easier to put all protocol-dependent decisions into protocol plugins (IMO a new factory, e.g. ProtocolFilterFactory should be used for that), and content-dependent decisions using ContentFilter into FetcherThread (Fetcher.java:108).
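
A content-dependent check such as the minimum size rule from point 1 could then look roughly like the sketch below. The SketchContentFilter interface, MinSizeFilter and SketchFilterChain are hypothetical names used only for illustration (no such interface is being quoted from Nutch here), and the content is reduced to a byte array to keep the example self-contained.

```java
import java.util.List;

// Hypothetical filter interface; returning null means "discard this content".
interface SketchContentFilter {
    byte[] filter(String url, byte[] content);
}

// Discards content smaller than a configured number of bytes.
final class MinSizeFilter implements SketchContentFilter {
    private final int minBytes;

    MinSizeFilter(int minBytes) {
        this.minBytes = minBytes;
    }

    @Override
    public byte[] filter(String url, byte[] content) {
        if (content == null || content.length < minBytes) {
            return null; // too small: discard
        }
        return content;
    }
}

// Runs filters in order and stops at the first one that discards the content.
final class SketchFilterChain {
    static byte[] apply(List<SketchContentFilter> filters, String url, byte[] content) {
        byte[] current = content;
        for (SketchContentFilter f : filters) {
            current = f.filter(url, current);
            if (current == null) {
                return null;
            }
        }
        return current;
    }
}
```

In this sketch the fetcher thread would call SketchFilterChain.apply() right after the protocol returns, and mark the page as discarded when the result is null.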

Some of these decisions could be delayed to the stage of updating the database or building indexes, but then you would have to either re-filter all segment data (a trivial but time- and space-consuming task) to delete unwanted content, or use some logic to "hide" it from the WebDB update stage and from the segment indexing stage. So it seems the Fetcher is still the best place to do it...


2. Content - HTML tag to content ratio < threshold -- discard and mark

Well, this is format-specific, so it could be put into the parse plugin specific to this format. But perhaps it would be simpler to centralize this kind of decision in Fetcher, so it could be implemented as a ContentFilter in Fetcher. But this adds a new requirement to the ContentFilter interface: it would also have to consider Parse results. Or we could provide a separate hook to call some other type of filter, let's say ExtendedContentFilter, after the Content has been parsed:


        Content filter(Content content, Parse parse);

This approach also has the benefit that you could replace the original content with something more suitable for web interface preview (e.g. replace PDF with HTML - currently Nutch doesn't allow you to view cached copies of non-HTML formats out of the box).
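
A minimal sketch of such a post-parse filter follows. All names in it (RawContent, ParsedText, SketchExtendedContentFilter, TextRatioFilter) are hypothetical stand-ins rather than Nutch classes. It also assumes one particular reading of the rule in point 2: the ratio is taken as the length of the extracted text divided by the length of the raw content, and pages below a minimum ratio are discarded.

```java
// Hypothetical stand-in for raw fetched content.
final class RawContent {
    final String url;
    final String html;

    RawContent(String url, String html) {
        this.url = url;
        this.html = html;
    }
}

// Hypothetical stand-in for the parse result (extracted text only).
final class ParsedText {
    final String text;

    ParsedText(String text) {
        this.text = text;
    }
}

// Filter that sees both the content and its parse; null means "discard".
interface SketchExtendedContentFilter {
    RawContent filter(RawContent content, ParsedText parse);
}

final class TextRatioFilter implements SketchExtendedContentFilter {
    private final double minRatio;

    TextRatioFilter(double minRatio) {
        this.minRatio = minRatio;
    }

    @Override
    public RawContent filter(RawContent content, ParsedText parse) {
        if (content.html.isEmpty()) {
            return null; // nothing to measure, treat as unwanted
        }
        // Assumed definition: extracted text length relative to raw length.
        double ratio = (double) parse.text.length() / content.html.length();
        return ratio < minRatio ? null : content;
    }
}
```

Because the filter returns a content object rather than a boolean, the same hook could hand back a replacement (for example an HTML rendering) instead of the original.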

I'm not sure which way is better...

3. Link analysis - incoming to outgoing link ratio is too low

You only know the outgoing links after you have parsed the Content. So it's the same situation as with the case above.



4. File Size - the max file size to fetch based on type. For example, a limit of 64k may be fine for HTML, but not for a PDF -- in Nutch this currently causes a "Fetched but cannot parse" error. Thus it would be nice to have a property in the plugin xml file that specifies the max fetch bytes, and the action if this is hit (parse or discard)

I agree. Currently there is only a single value for all types of plugins, which as you say is often inappropriate.


The PluginManifestParser supports the use of arbitrary attributes in definitions of <implementation> elements - these values are then passed to the plugin implementation (see for example the plugin in language-identifier/plugin.xml).

So, it's possible even now to modify individual plugins and set their limits separately from plugin.xml files. However, I'm a bit afraid of the configuration hassle this could bring - instead of one central config file (nutch-site.xml), which defines your runtime parameters, now you need to check many files... Perhaps a better way would be to put these limits in the nutch-default.xml config file?
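
One way to keep everything in a central config would be a per-type key that falls back to a global default. The sketch below uses plain java.util.Properties as a stand-in for the Nutch configuration, and the property names (content.limit.default, content.limit.<content type>) are made up for illustration; they are not existing Nutch properties.

```java
import java.util.Properties;

// Looks up a per-content-type size limit, falling back to a global default.
final class SketchContentLimits {
    // Hypothetical property names, not real Nutch configuration keys.
    private static final String DEFAULT_KEY = "content.limit.default";
    private static final String PREFIX = "content.limit.";
    private static final String BUILTIN_DEFAULT = "65536";

    private final Properties conf;

    SketchContentLimits(Properties conf) {
        this.conf = conf;
    }

    int maxBytesFor(String contentType) {
        // Global default first, then an optional per-type override.
        String fallback = conf.getProperty(DEFAULT_KEY, BUILTIN_DEFAULT);
        String value = conf.getProperty(PREFIX + contentType, fallback);
        return Integer.parseInt(value.trim());
    }
}
```

With content.limit.default set to 65536 and content.limit.application/pdf set to 1048576, HTML keeps the 64k limit while PDF files get 1 MB, and all of it stays in a single file.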

--
Best regards,
Andrzej Bialecki
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


