Andrzej,
How would I check for a flag during fetch?
Maybe this explanation can shed some light:
Ideally, I would like to check the list of links found on each page
while still keeping a total of X links per page. If I find links I
want, I add them to the list until X is reached; if I don't reach X, I
add other links until X is reached. This way, I don't waste crawl time
on non-relevant links.
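Roughly, the selection I have in mind looks like the sketch below. It
is plain Java and not tied to any Nutch API; the names LinkSelector,
selectLinks, isRelevant and maxLinks are made up for illustration, and
"relevant" is whatever test I end up using.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

public class LinkSelector {
    /**
     * Picks at most maxLinks outlinks from one page: relevant links
     * first (in page order), then other links as filler until the
     * limit is reached. Duplicate URLs are kept only once.
     */
    public static List<String> selectLinks(List<String> outlinks,
                                           Predicate<String> isRelevant,
                                           int maxLinks) {
        Set<String> selected = new LinkedHashSet<>();
        // First pass: take the links I actually want.
        for (String url : outlinks) {
            if (selected.size() >= maxLinks) break;
            if (isRelevant.test(url)) selected.add(url);
        }
        // Second pass: fill any remaining slots with the other links.
        for (String url : outlinks) {
            if (selected.size() >= maxLinks) break;
            if (!isRelevant.test(url)) selected.add(url);
        }
        return new ArrayList<>(selected);
    }
}
```

So with X = 3 and a page that has two wanted links and two others, I
would keep both wanted links plus the first of the others.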
Thanks,
Eric Osgood
---------------------------------------------
Cal Poly - Computer Engineering, Moon Valley Software
---------------------------------------------
eosg...@calpoly.edu, e...@lakemeadonline.com
---------------------------------------------
www.calpoly.edu/eosgood, www.lakemeadonline.com
On Oct 6, 2009, at 1:04 PM, Andrzej Bialecki wrote:
Eric Osgood wrote:
Is there a way to inspect the list of links that Nutch finds per
page and then at that point choose which links I want to include /
exclude? That is the ideal remedy to my problem.
Yes, look at ParseOutputFormat, you can make this decision there.
There are two standard extension points where you can hook up -
URLFilters and ScoringFilters.
Please note that if you use URLFilters to filter out URLs too early
then they will be rediscovered again and again. A better method to
handle this, but also more complicated, is to still include such
links but give them a special flag (in metadata) that prevents
fetching. This requires that you implement a custom scoring plugin.
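To illustrate the idea only, here is a minimal sketch in plain Java. It
does not use the real Nutch classes; FlaggedCrawlDb, NO_FETCH_KEY and
both method names are hypothetical and just model the behaviour: an
unwanted link stays in the db with a flag, so it is never treated as
new again, and it is skipped when the fetch list is built.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FlaggedCrawlDb {
    // Hypothetical metadata key; not a name defined by Nutch.
    static final String NO_FETCH_KEY = "example.nofetch";

    // URL -> metadata; stands in for the crawl db in this sketch.
    private final Map<String, Map<String, String>> entries = new LinkedHashMap<>();

    /** Records a discovered link. Returns true only if the URL was not known before. */
    public boolean addDiscovered(String url, boolean wanted) {
        if (entries.containsKey(url)) {
            return false; // already known, flagged or not: no rediscovery
        }
        Map<String, String> meta = new HashMap<>();
        if (!wanted) {
            meta.put(NO_FETCH_KEY, "true"); // keep the entry, but mark it
        }
        entries.put(url, meta);
        return true;
    }

    /** Builds a fetch list that skips every flagged entry. */
    public List<String> generateFetchList() {
        List<String> fetchList = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> e : entries.entrySet()) {
            if (!"true".equals(e.getValue().get(NO_FETCH_KEY))) {
                fetchList.add(e.getKey());
            }
        }
        return fetchList;
    }
}
```

In a real plugin the flag would presumably be written to the entry's
metadata by the scoring plugin and checked before the URL is selected
for fetching; where exactly to do that depends on your Nutch version.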
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com