Eric Osgood wrote:
Andrzej,
How would I check for a flag during fetch?
You would check for the flag during generation rather than fetch - please
check ScoringFilter.generatorSortValue(), that's where you can check for
the flag and set the sort value to -Float.MAX_VALUE (in Java,
Float.MIN_VALUE is the smallest positive float, not the most negative
one) - this way the link will never be selected for fetching.
And you would put the flag in CrawlDatum metadata when ParseOutputFormat
calls ScoringFilter.distributeScoreToOutlinks().
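
To make the two steps concrete, here is a minimal sketch of the idea, not
actual Nutch code. DatumSketch is a hypothetical stand-in for CrawlDatum
(just a metadata map), the key "example.skip" is made up, and the two
static methods only mimic what you would do inside
distributeScoreToOutlinks() and generatorSortValue() respectively:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a CrawlDatum: only a metadata map, not the Nutch class.
class DatumSketch {
    final Map<String, String> metadata = new HashMap<>();
}

class FlagSortSketch {
    // Made-up metadata key; pick whatever name suits your plugin.
    static final String SKIP_FLAG = "example.skip";

    // Parse-time step: mark an outlink's datum so the flag travels with it.
    static void markSkipped(DatumSketch datum) {
        datum.metadata.put(SKIP_FLAG, "true");
    }

    // Generation-time step: flagged links get the lowest finite sort value,
    // everything else keeps the sort value it came in with.
    static float sortValue(DatumSketch datum, float initSort) {
        if ("true".equals(datum.metadata.get(SKIP_FLAG))) {
            // Most negative finite float; Float.MIN_VALUE is a tiny positive number.
            return -Float.MAX_VALUE;
        }
        return initSort;
    }
}
```

In a real ScoringFilter the same check would read the flag from the
datum's metadata instead of from this toy map.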
Maybe this explanation can shed some light:
Ideally, I would like to check the list of links for each page while
still keeping a total of X links per page: if I find the links I want, I
add them to the list up to X; if I don't reach X, I add other links
until X is reached. This way, I don't waste crawl time on non-relevant
links.
You can modify the collection of target links passed to
distributeScoreToOutlinks() - this way you can affect both which links
are stored and what kind of metadata each of them gets.
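
The "relevant links first, then fill up to X" selection could look roughly
like the sketch below. It works on plain URL strings to stay
self-contained; the class name, the select() method and the relevance
predicate are all hypothetical, and in a real filter you would apply the
same logic to the targets collection handed to
distributeScoreToOutlinks():

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

class OutlinkSelectionSketch {
    // Keep at most maxLinks URLs: relevant ones first, then others as filler.
    static List<String> select(List<String> links, Predicate<String> isRelevant, int maxLinks) {
        List<String> selected = new ArrayList<>();
        // First pass: relevant links, in their original order.
        for (String link : links) {
            if (selected.size() >= maxLinks) break;
            if (isRelevant.test(link)) selected.add(link);
        }
        // Second pass: fill the remaining slots with non-relevant links.
        for (String link : links) {
            if (selected.size() >= maxLinks) break;
            if (!isRelevant.test(link)) selected.add(link);
        }
        return selected;
    }
}
```

For example, select(links, u -> u.contains("/news/"), 3) keeps every
"/news/" link it can fit and only then tops up with the rest.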
As I said, you can also use just plain URLFilters to filter out unwanted
links, but that API gives you much less control because it's a simple
yes/no that considers just the URL string. The advantage is that it's much
easier to implement than a ScoringFilter.
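
For comparison, the yes/no style of filtering boils down to something
like this sketch. It assumes the usual URLFilter contract (return the URL
to keep it, null to reject it) but deliberately does not implement the
Nutch interface, so the class name and the suffix list are purely
illustrative:

```java
import java.util.Locale;

class SuffixUrlFilterSketch {
    // Return the URL to accept it, or null to reject it. The decision can
    // only look at the URL string itself, nothing about the page or its score.
    static String filter(String urlString) {
        String lower = urlString.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".jpg") || lower.endsWith(".zip")) {
            return null; // rejected: unwanted file type
        }
        return urlString; // accepted unchanged
    }
}
```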
--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com