Andrzej Bialecki wrote:
You can download the patch from here:
http://www.getopt.org/nutch/20050507.patch
I have not yet had a chance to try this. Following are some quick comments from reading the patch. Overall I think this is great stuff.
1. Why does an HTMLMetaTags need to be passed to Parser.parse()? This seems to cross an abstraction boundary, since the Parser interface is meant to be format and protocol independent. Is it not possible to store this meta info in the getParseData().getMetadata()?
That is an old patch. Please take a look at the latest patch in JIRA, which doesn't violate the layering. There, the HTMLMetaTags is only passed as an argument to HtmlParseFilters.filter(), which one could argue is already format-specific...
Re: putting the meta tags into ParseData.metadata... this could save on re-parsing later on if some other component needs to reuse this information, but I'm not sure how to do this - I guess I would have to invent new pseudo-headers, like "X-ParseStatus-noIndex: true", etc...
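A rough sketch of what the pseudo-header idea could look like, assuming the metadata is a plain java.util.Properties-style string map. The class name, the helper methods, and the "noFollow" flag are hypothetical and only for illustration; only the "X-ParseStatus-noIndex" name comes from the example above.

```java
import java.util.Properties;

public class MetaTagHeaders {
    // Prefix borrowed from the example pseudo-header name; an assumption, not an agreed convention.
    public static final String PREFIX = "X-ParseStatus-";

    /** Records the meta-tag flags as pseudo-header entries in the metadata. */
    public static void store(Properties metadata, boolean noIndex, boolean noFollow) {
        metadata.setProperty(PREFIX + "noIndex", Boolean.toString(noIndex));
        // "noFollow" is a hypothetical second flag, added only to show the pattern.
        metadata.setProperty(PREFIX + "noFollow", Boolean.toString(noFollow));
    }

    /** Reads a flag back without re-parsing the page; a missing entry counts as false. */
    public static boolean isSet(Properties metadata, String flag) {
        return Boolean.parseBoolean(metadata.getProperty(PREFIX + flag, "false"));
    }

    public static void main(String[] args) {
        Properties metadata = new Properties();
        store(metadata, true, false);
        System.out.println(isSet(metadata, "noIndex"));
    }
}
```

The trade-off is that every consumer has to know the prefix and the flag names, which is the "inventing new pseudo-headers" cost mentioned above.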
2. I still have some concern about the transient nature of ParseStatus. It would be inexpensive to add it to ParseData, no? What if, e.g., the db update tool needed an aspect of the parse status?
Yes, I had some doubts about that, too... In the end I decided not to add it (yet), but I made it implement Writable, so this would be a simple change.
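To illustrate why this would be a small change, here is a minimal sketch of a status object being serialized as part of its containing record. SimpleWritable is a local stand-in for a Writable-style contract, and Status, Data, and their fields are hypothetical; none of this is the actual Nutch code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class ParseStatusSketch {
    /** Local stand-in for a Writable-style interface (an assumption, not the real one). */
    public interface SimpleWritable {
        void write(DataOutput out) throws IOException;
        void readFields(DataInput in) throws IOException;
    }

    /** Hypothetical status: a numeric code plus a message. */
    public static class Status implements SimpleWritable {
        public int code;
        public String message = "";
        public void write(DataOutput out) throws IOException {
            out.writeInt(code);
            out.writeUTF(message);
        }
        public void readFields(DataInput in) throws IOException {
            code = in.readInt();
            message = in.readUTF();
        }
    }

    /** Hypothetical parse data that stores the status ahead of its own fields. */
    public static class Data implements SimpleWritable {
        public Status status = new Status();
        public String title = "";
        public void write(DataOutput out) throws IOException {
            status.write(out);   // the status is persisted with the record
            out.writeUTF(title);
        }
        public void readFields(DataInput in) throws IOException {
            status.readFields(in);
            title = in.readUTF();
        }
    }

    /** Serializes a record to bytes and reads it back into a fresh instance. */
    public static Data roundTrip(Data data) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            data.write(new DataOutputStream(bytes));
            Data copy = new Data();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this shape, a later tool that reads the record gets the status back without re-parsing, at the cost of a few extra bytes per entry.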
--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
