Stefan Groschupf wrote:
> I notice filtering urls is done in the output format until parsing.
> Wouldn't it be better to filter it until updating crawlDb?
"Until" == "during" ?
As you observed, doing it at this stage saves space in segment data, and
in consequence saves on processing time (no CPU/IO needed to process
useless data, throw away junk as soon as possible).
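
To make the trade-off concrete, here is a rough sketch (plain Python, not Nutch code). The reject pattern, the function names and the sample URLs are all made up for illustration. It only shows that filtering when the parse output is written stores fewer links in the segment, while the set of links that reaches the db is the same either way.

```python
import re

# Hypothetical reject rule, for illustration only (not a Nutch default).
REJECT = re.compile(r"\.(gif|jpg|png|css|js)$", re.IGNORECASE)


def accept(url):
    """Return True if the URL passes the (made-up) filter."""
    return REJECT.search(url) is None


def write_segment(outlinks, filter_early):
    """Return the links that would be stored in the segment data."""
    if filter_early:
        # Junk is dropped before it is ever written.
        return [u for u in outlinks if accept(u)]
    # Everything is written; junk is carried along until a later stage.
    return list(outlinks)


def update_db(segment_links):
    """Return the links that end up in the db (filtered here in any case)."""
    return sorted({u for u in segment_links if accept(u)})


outlinks = [
    "http://example.com/a.html",
    "http://example.com/logo.gif",
    "http://example.com/style.css",
    "http://example.com/b.html",
]

early = write_segment(outlinks, filter_early=True)
late = write_segment(outlinks, filter_early=False)

print(len(early), len(late))
print(update_db(early) == update_db(late))
```

The db ends up identical in both cases; the only difference is how much useless data sits in the segment and gets read again by later steps.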
> Sure it would require to have some more disk space but since parsing
> is done until fetching it may be improve fetching speed.
Parsing is not always done at fetching stage (Fetcher.parsing == false).
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com