Matt Kangas wrote:
On Mon, 17 Jan 2005 16:17:46 +0100, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:

.... Or we could provide a separate hook to call some other type of filter,
let's say ExtendedContentFilter, after the Content has been parsed:

       Content filter(Content content, Parse parse);

This approach also has the benefit that you could replace the original
content with something more suitable for web interface preview (e.g.
replace PDF with HTML - currently Nutch doesn't allow you out-of-the-box
to view cached copies of non-html formats).


Andrzej, I think this is a great idea. The ContentFilter interface
would be much more useful if the parsed data was available for
analysis too. I'd suggest keeping the interface very simple -- perhaps
the above signature is all that's needed. If a given filter doesn't
care about Parse data, it can ignore it.

However, I'm not sure about content-transforming filters. Wouldn't you
want to get both Content and Parse back from filter() if this was the
goal?

In the general case, I don't know yet... ;-) Java passes object references by value, so if filter() simply reassigns both parameters to other instances, the caller never sees the replacements and you are in trouble. You could add a simple data-holding class with these two fields and use it in filter().
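To make the data-holder idea concrete, here is a minimal sketch. SimpleContent, SimpleParse, FilterResult and ReplaceBothFilter are hypothetical stand-ins invented for illustration; they are not the real Nutch Content/Parse classes, and the interface shape is only one possible reading of the proposal.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins for illustration only; NOT the real Nutch classes.
final class SimpleContent {
    final String contentType;
    final byte[] data;

    SimpleContent(String contentType, byte[] data) {
        this.contentType = contentType;
        this.data = data;
    }
}

final class SimpleParse {
    final String text;
    final Map<String, String> metadata = new HashMap<>();

    SimpleParse(String text) {
        this.text = text;
    }
}

// Simple holder with the two fields, so a filter can hand back both objects.
final class FilterResult {
    final SimpleContent content;
    final SimpleParse parse;

    FilterResult(SimpleContent content, SimpleParse parse) {
        this.content = content;
        this.parse = parse;
    }
}

// Assumed shape of the filter hook when both values may be replaced.
interface ExtendedContentFilter {
    FilterResult filter(FilterResult input);
}

// Example filter that replaces both instances and returns them via the holder.
final class ReplaceBothFilter implements ExtendedContentFilter {
    @Override
    public FilterResult filter(FilterResult input) {
        String trimmed = input.parse.text.trim();
        SimpleContent newContent =
            new SimpleContent("text/plain", trimmed.getBytes(StandardCharsets.UTF_8));
        return new FilterResult(newContent, new SimpleParse(trimmed));
    }
}
```

The caller keeps whatever the filter returns, so replacing either object no longer depends on reassigning method parameters.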


But in the case of non-HTML formats like PDF and Word, I imagine that normally you would want to keep the text and metadata from the parsing of the original format, optionally adding some metadata to the Parse instance (but without replacing it with a new instance...). In other words, I think that only the Content instance would be completely replaced (hence it needs to be returned), and the Parse instance would be only slightly modified.
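A rough sketch of that narrower case, where only the content is replaced and the parse is annotated in place. PageContent, PageParse and HtmlPreviewFilter are hypothetical names made up for this example, not Nutch classes, and the "originalContentType" metadata key is likewise an assumption.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins for illustration only; NOT the real Nutch classes.
final class PageContent {
    final String contentType;
    final byte[] data;

    PageContent(String contentType, byte[] data) {
        this.contentType = contentType;
        this.data = data;
    }
}

final class PageParse {
    final String text; // read-only here, as the parsed text is in the discussion
    final Map<String, String> metadata = new HashMap<>();

    PageParse(String text) {
        this.text = text;
    }
}

// Returns a replacement content object; the parse instance is kept and only annotated.
final class HtmlPreviewFilter {
    PageContent filter(PageContent content, PageParse parse) {
        if ("text/html".equals(content.contentType)) {
            return content; // already viewable, nothing to replace
        }
        // Record where the preview came from (assumed metadata key).
        parse.metadata.put("originalContentType", content.contentType);
        String html = "<html><body><pre>" + escape(parse.text) + "</pre></body></html>";
        return new PageContent("text/html", html.getBytes(StandardCharsets.UTF_8));
    }

    // Minimal HTML escaping so the extracted text displays literally.
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}
```

With this shape, the signature quoted earlier in the thread is enough: the new content comes back as the return value, and the caller's Parse reference still points at the same, slightly modified, instance.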

I've now had a look at the available methods in Content and Parse/ParseData: Content is basically read-only, and so is Parse.text; ParseData allows you to change only the metadata part... Hmmm. Not much can be changed here by the filter. We could add setters to these classes, though.


--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
