Andrzej,

How would I check for a flag during fetch?

Maybe this explanation can shed some light:
Ideally, I would like to check the list of links for each page while still ending up with a total of X links per page. If I find the links I want, I add them to the list, up to X; if I don't reach X, I add other links until X is reached. This way, I don't waste crawl time on non-relevant links.
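
To make the selection rule concrete, here is a rough sketch in plain Java. It does not use any Nutch API; the class name LinkSelector, the isRelevant predicate, and the example URLs are all hypothetical, and X is passed in as maxLinks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical helper, not part of Nutch: picks at most maxLinks outlinks,
// taking relevant links first and filling any remaining slots with others.
final class LinkSelector {

    static List<String> select(List<String> outlinks,
                               Predicate<String> isRelevant,
                               int maxLinks) {
        List<String> selected = new ArrayList<>();
        List<String> others = new ArrayList<>();

        // First pass: keep relevant links (up to maxLinks), set the rest aside.
        for (String url : outlinks) {
            if (isRelevant.test(url)) {
                if (selected.size() < maxLinks) {
                    selected.add(url);
                }
            } else {
                others.add(url);
            }
        }

        // Second pass: if X was not reached, top up with non-relevant links.
        for (String url : others) {
            if (selected.size() >= maxLinks) {
                break;
            }
            selected.add(url);
        }
        return selected;
    }

    public static void main(String[] args) {
        // Example URLs are made up for illustration.
        List<String> outlinks = List.of(
                "http://example.com/about",
                "http://example.com/products/1",
                "http://example.com/contact",
                "http://example.com/products/2");
        System.out.println(select(outlinks, u -> u.contains("/products/"), 3));
    }
}
```

Relevant links keep their page order and always come before the filler links, so truncating at X never drops a wanted link in favor of an unwanted one.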

Thanks,

Eric Osgood
---------------------------------------------
Cal Poly - Computer Engineering, Moon Valley Software
---------------------------------------------
eosg...@calpoly.edu, e...@lakemeadonline.com
---------------------------------------------
www.calpoly.edu/eosgood, www.lakemeadonline.com


On Oct 6, 2009, at 1:04 PM, Andrzej Bialecki wrote:

Eric Osgood wrote:
Is there a way to inspect the list of links that nutch finds per page and then at that point choose which links I want to include / exclude? that is the ideal remedy to my problem.

Yes, look at ParseOutputFormat; you can make this decision there. There are two standard extension points where you can hook in: URLFilters and ScoringFilters.

Please note that if you use URLFilters to filter out URLs too early, they will be rediscovered again and again. A better, though more complicated, way to handle this is to still include such links but give them a special flag (in metadata) that prevents fetching. This requires that you implement a custom scoring plugin.
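
The flagging idea can be illustrated with a minimal sketch in plain Java. This is only a model of the approach, not a Nutch plugin: it uses ordinary maps in place of the real link database and scoring plugin interfaces, and the class name NoFetchFlagSketch and the metadata key "sketch.nofetch" are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical model, not Nutch code: every discovered link is recorded,
// but unwanted links carry a metadata flag that keeps them off the fetch list.
final class NoFetchFlagSketch {

    // Assumed metadata key; a real plugin would choose its own name.
    static final String NO_FETCH_KEY = "sketch.nofetch";

    // Record all outlinks so they count as "known" and are not rediscovered
    // as new on every crawl; flag the ones we do not want to fetch.
    static Map<String, Map<String, String>> recordLinks(List<String> outlinks,
                                                        Predicate<String> isWanted) {
        Map<String, Map<String, String>> db = new LinkedHashMap<>();
        for (String url : outlinks) {
            Map<String, String> metadata = new HashMap<>();
            if (!isWanted.test(url)) {
                metadata.put(NO_FETCH_KEY, "true");
            }
            db.put(url, metadata);
        }
        return db;
    }

    // Build the fetch list, skipping any link that carries the flag.
    static List<String> fetchList(Map<String, Map<String, String>> db) {
        List<String> toFetch = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> entry : db.entrySet()) {
            if (!"true".equals(entry.getValue().get(NO_FETCH_KEY))) {
                toFetch.add(entry.getKey());
            }
        }
        return toFetch;
    }
}
```

The point of the design is that the flagged links stay in the database, so the decision not to fetch them is made once rather than every time the same URL is seen again.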


--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



