Brian Whitman wrote:
How about an outlink filter that works during parse? In ParseOutputFormat,
it would take the parse text, parse data (etc.) of the source page and
the destination URL, then return either "filter this outlink" or
"let it through".
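A minimal sketch of what such an outlink filter could look like, in plain Java with no Nutch dependencies. The OutlinkFilter interface, the keywordFilter helper and the filterOutlinks loop are all hypothetical names made up for illustration; nothing here is an existing Nutch extension point, and wiring it into ParseOutputFormat is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical interface (not part of Nutch): decides per outlink whether
// the link from a source page to a destination URL should be kept.
interface OutlinkFilter {
    // Returns true for "let it through", false for "filter this outlink".
    boolean accept(String sourceUrl, String parseText, String destinationUrl);
}

class OutlinkFilterSketch {
    // Example filter: keep outlinks only when the source page text
    // mentions the given keyword (case-insensitive substring match).
    static OutlinkFilter keywordFilter(String keyword) {
        String needle = keyword.toLowerCase(Locale.ROOT);
        return (sourceUrl, parseText, destinationUrl) ->
                parseText.toLowerCase(Locale.ROOT).contains(needle);
    }

    // Applies the filter to every outlink of one parsed page.
    static List<String> filterOutlinks(OutlinkFilter filter, String sourceUrl,
                                       String parseText, List<String> outlinks) {
        List<String> kept = new ArrayList<>();
        for (String destination : outlinks) {
            if (filter.accept(sourceUrl, parseText, destination)) {
                kept.add(destination);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> outlinks = Arrays.asList("http://a.example/", "http://b.example/");
        OutlinkFilter filter = keywordFilter("jazz");
        System.out.println(filterOutlinks(filter, "http://src.example/",
                "All about jazz records", outlinks));
    }
}
```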
Write an HtmlParseFilter that sets an attribute in the ParseData
MetaData based on whether the page contains what you are looking for.
Then write another MR job that runs after the crawl/index cycle. This
job would need to update the CrawlDatum MetaData based on your
priority calculation (inlinks and contains text, etc.). Then hack the
Generator class around line 160 to change the sort value that it is
using based on the CrawlDatum MetaData. I would make using this new
sort value an option that you can turn on and off by using different
configuration values.
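The sort-value switch described above can be sketched roughly as follows. This is plain Java, not the actual Generator code: the Entry class stands in for a CrawlDatum with its MetaData, the "priority" key is a hypothetical metadata name, and the usePriority flag stands in for a configuration value that turns the new behaviour on or off.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class GeneratorSortSketch {
    // Hypothetical metadata key that a post-index job would have written.
    static final String PRIORITY_KEY = "priority";

    // Stand-in for a crawl entry: URL, default score, free-form metadata.
    static final class Entry {
        final String url;
        final float score;
        final Map<String, String> metaData = new HashMap<>();

        Entry(String url, float score) {
            this.url = url;
            this.score = score;
        }
    }

    // Picks the value to sort by. With the option off, or when no usable
    // priority is stored, this falls back to the default score.
    static float sortValue(Entry entry, boolean usePriority) {
        if (usePriority) {
            String raw = entry.metaData.get(PRIORITY_KEY);
            if (raw != null) {
                try {
                    return Float.parseFloat(raw);
                } catch (NumberFormatException ignored) {
                    // Malformed priority: fall through to the default score.
                }
            }
        }
        return entry.score;
    }

    // Returns the URLs of the n entries with the highest sort value.
    static List<String> topN(List<Entry> entries, int n, boolean usePriority) {
        List<Entry> sorted = new ArrayList<>(entries);
        sorted.sort((a, b) -> Float.compare(sortValue(b, usePriority),
                                            sortValue(a, usePriority)));
        List<String> urls = new ArrayList<>();
        for (Entry e : sorted.subList(0, Math.min(n, sorted.size()))) {
            urls.add(e.url);
        }
        return urls;
    }

    public static void main(String[] args) {
        Entry a = new Entry("http://a.example/", 0.9f);
        Entry b = new Entry("http://b.example/", 0.1f);
        b.metaData.put(PRIORITY_KEY, "5.0");
        System.out.println(topN(Arrays.asList(a, b), 1, false));
        System.out.println(topN(Arrays.asList(a, b), 1, true));
    }
}
```

Note that this sketch mixes default scores and priorities on one scale when only some entries carry a priority; a real job would probably want to write a priority for every entry.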
Hi Doğacan, Dennis:
Thanks for the ideas. I spent some time mentally planning out how to
implement both of these ideas by looking at the source. I'm still newish
to Nutch so excuse my naiveté.
Do either of these approaches let me get at the analyzed/indexed
contents of the page text so that I can perform Lucene queries for
filtering? From what I could tell, HtmlParseFilter (or Parse in general)
gets me at the parse tree, which I could run regexp queries on
-- but I'd rather it all be in Lucene and be influenced by the relative
ranking of terms amongst all documents. I am envisioning machine
generated queries from our classifiers that might be hundreds of tokens
long with boost values per term, and a score threshold. So I'd need to
act on the documents post-index. Unless I'm reading your suggestions
incorrectly, neither of them lets me get at that?
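To make the "machine generated queries with boost values per term" concrete, here is a small sketch that turns a term-to-boost map into a query string using the term^boost syntax of the classic Lucene query parser. The class and method names are hypothetical, the terms are assumed to be plain tokens that need no escaping, and the score threshold is not handled here: it would be applied to the hit scores after the search.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

class BoostedQuerySketch {
    // Joins each term with its boost as "term^boost", separated by spaces.
    // Assumes terms are plain tokens without query-syntax special characters.
    static String toQueryString(Map<String, Float> termBoosts) {
        StringJoiner joiner = new StringJoiner(" ");
        for (Map.Entry<String, Float> e : termBoosts.entrySet()) {
            joiner.add(e.getKey() + "^" + e.getValue());
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the classifier's term order in the output.
        Map<String, Float> boosts = new LinkedHashMap<>();
        boosts.put("jazz", 2.0f);
        boosts.put("saxophone", 1.5f);
        System.out.println(toQueryString(boosts));
    }
}
```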
You could drop the HtmlParseFilter part and simply write the
post-crawl/index MR job to update the CrawlDatum based on your Lucene
queries. You would still need to write the second part, which does the
generation based on a different sort value.
I am currently looking at PruneIndexTool -- could a modification of this
work? I could run it after a crawl/index cycle but before invertlinks
and the next generate. The one issue I see is that PruneIndexTool claims
not to affect the WebDB. Does this mean that even though the Lucene doc
will be gone, the link and its outlinks will remain in the WebDB and will
be fetched anyway?
That is correct. You will need to alter the CrawlDb to affect what is
generated and hence fetched.
If I should instead be looking harder at your recommended
HtmlParseFilter or ParseOutputFormat, please correct me.
No. If you are doing complex queries instead of something like "if this
page contains words x, y, and z", then I wouldn't do it through
HtmlParseFilter. I would probably go with the Lucene after-index approach.
Dennis Kubes
-Brian