Andrzej Bialecki wrote:
> Stefan Groschupf wrote:
>> I notice filtering urls is done in the output format until parsing.
>> Wouldn't it be better to filter it until updating crawlDb?
>
> "Until" == "during" ?
>
> As you observed, doing it at this stage saves space in segment data, and
> in consequence saves on processing time (no CPU/IO needed to process
> useless data, throw away junk as soon as possible).
I think it is better to not filter at parse time, but at db insert time.
This way if desired urls are accidentally filtered out then one only
has to re-update the db to include them rather than re-parse and re-update.
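
A rough sketch of the trade-off, in Python for brevity. This is not Nutch code: make_filter, parse and update_db are hypothetical stand-ins, and the example URLs are made up. It only illustrates that a link dropped at parse time cannot be recovered by a later db update, while a link kept in the segment can.

```python
import re

# Hypothetical stand-ins for illustration; none of these names are Nutch APIs.
def make_filter(pattern):
    """Return a predicate that accepts URLs matching the regex."""
    rx = re.compile(pattern)
    return lambda url: rx.search(url) is not None

def parse(pages, url_filter=None):
    """Build segment data: the outlinks found on each fetched page.
    If url_filter is given, rejected links are dropped from the segment."""
    return {page: [u for u in links if url_filter is None or url_filter(u)]
            for page, links in pages.items()}

def update_db(db, segment, url_filter=None):
    """Add the segment's outlinks to the db, optionally filtering on insert."""
    for links in segment.values():
        db.update(u for u in links if url_filter is None or url_filter(u))
    return db

pages = {"http://a.example/": ["http://a.example/x", "http://b.example/y"]}
too_strict = make_filter(r"^http://a\.example/")      # accidentally excludes b.example
corrected = make_filter(r"^http://[ab]\.example/")    # the filter we actually wanted

# Filter at parse time: b.example never reaches the segment, so applying the
# corrected filter at update time cannot bring it back without a re-parse.
seg_filtered = parse(pages, too_strict)
db_parse_time = update_db(set(), seg_filtered, corrected)

# Filter at db insert time: the segment keeps every link (so it is larger),
# and re-running the update with the corrected filter is enough.
seg_full = parse(pages)
db_update_time = update_db(set(), seg_full, corrected)
```

The unfiltered segment is larger, which is the cost Andrzej points out; what it buys is that fixing a filter only needs another update pass.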
Doug