If I understand what you are trying to do, then here is how I would
approach it.
Write an HtmlParseFilter that sets an attribute in the ParseData
MetaData based on whether the page contains what you are looking for.
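A rough sketch of the check such a filter would do, in plain Java so it compiles without Nutch. The Map stands in for the ParseData MetaData, and the class name, method name, and metadata key are all made up for illustration; the real thing would live inside your HtmlParseFilter implementation.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Stand-in for the logic an HtmlParseFilter would run. A plain Map plays the
// role of the ParseData MetaData here, so nothing below is Nutch API.
final class TopicFlagSketch {

    // Hypothetical metadata key; pick whatever name suits your setup.
    static final String TOPIC_KEY = "x-topic-match";

    // Records whether the page text contains the phrase (case-insensitive).
    static void flagPage(String pageText, String phrase, Map<String, String> parseMeta) {
        String text = pageText.toLowerCase(Locale.ROOT);
        boolean matches = text.contains(phrase.toLowerCase(Locale.ROOT));
        parseMeta.put(TOPIC_KEY, Boolean.toString(matches));
    }

    public static void main(String[] args) {
        Map<String, String> meta = new HashMap<>();
        flagPage("A page about jazz guitar lessons", "Jazz Guitar", meta);
        System.out.println(TOPIC_KEY + "=" + meta.get(TOPIC_KEY));
    }
}
```

A simple substring test is only a placeholder; the later job could just as well store a Lucene score under the same key.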
Then write another MapReduce (MR) job that runs after the crawl/index cycle. This
job would need to update the CrawlDatum MetaData based on your priority
calculation (inlinks and contains text, etc.). Then hack the Generator
class around line 160 to change the sort value that it is using based on
the CrawlDatum MetaData. I would make the new sort value an option that
you can turn on and off through configuration.
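To make the sort-value idea concrete, here is a minimal sketch in plain Java. It is not the Generator code: the method parameters stand in for values you would read from the CrawlDatum MetaData and from the configuration, and the formula (log of inlinks plus a fixed boost for on-topic pages) is only one assumed way to combine them.

```java
// Stand-in for the Generator change. Plain parameters play the role of the
// CrawlDatum MetaData and the configuration switch; nothing here is Nutch API.
final class SortValueSketch {

    // Hypothetical boost for pages the post-crawl job marked as on-topic.
    static final float TOPIC_BOOST = 10.0f;

    // defaultSort:   whatever value the Generator already sorts on
    // inlinks:       inlink count stored by the post-crawl MR job
    // onTopic:       flag stored by the post-crawl MR job
    // useCustomSort: the on/off option read from configuration
    static float sortValue(float defaultSort, int inlinks, boolean onTopic,
                           boolean useCustomSort) {
        if (!useCustomSort) {
            return defaultSort; // option off: behave exactly as before
        }
        // Dampen raw inlink counts so a topical page can still come out ahead.
        float priority = (float) Math.log1p(Math.max(0, inlinks));
        if (onTopic) {
            priority += TOPIC_BOOST;
        }
        return defaultSort + priority;
    }

    public static void main(String[] args) {
        System.out.println(sortValue(1.0f, 0, true, true));
        System.out.println(sortValue(1.0f, 1000, false, true));
    }
}
```

With these assumed numbers, an on-topic page with no inlinks sorts above an off-topic page with a thousand inlinks, which is the "put its outlinks up top" behavior asked about below. Tune the boost to taste.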
Hope this helps.
Dennis Kubes
Brian Whitman wrote:
> On Feb 17, 2007, at 12:58 PM, Dennis Kubes wrote:
>> You can use an HtmlParseFilter and then set a metadata attribute as to
>> whether or not it contains the phrase. The problem with this is that all
>> of the content is still stored. You could also change the
>> ParseOutputFormat to only write out if the word is contained, although
>> that is a bit of a hack.
>
> I'm not worried about a hack; our whole setup is very "der lauf der
> dinge" and one more plank won't matter much :) But after sending my
> question out, I realized that I would need to index the document anyway
> before being able to query it with Lucene for topicality. I don't mind having
> pages stored that don't match my query, but I would really rather the
> generator not get more outlinks from those pages.
>
> So a simple fix would be something I can write or run after a
> crawl/index cycle that can mark certain pages to not emit more URIs in
> the generator. It would query each page in an index and update some
> flag. But what is that flag, and how can I get at it?
>
> And, more advanced and for later on: the generator has smarts to prioritize
> fetching by inlink counts. Is there something I can hack to "boost"
> outlink fetches based on the source page's content? For example, I
> find a page that scores high on my Lucene query after crawl/index gets
> done. I would want the generator to put all of its outlinks up top, even
> if there aren't many inlinks to that page. Would this be a "generator
> plugin"?
>
> -Brian