Andrzej Bialecki wrote:
Stefan Groschupf wrote:

I notice filtering urls is done in the output format until parsing. Wouldn't it be better to filter it until updating crawlDb?


"Until" == "during" ?

As you observed, doing it at this stage saves space in segment data, and consequently saves processing time (no CPU/IO is spent on useless data; junk is thrown away as soon as possible).

I think it is better to filter at db insert time rather than at parse time. That way, if desired URLs are accidentally filtered out, one only has to re-update the db to include them, rather than re-parse and re-update.

Doug
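
To make the trade-off in this thread concrete, here is a toy sketch in Python. It is not Nutch code: `parse`, `update_db`, the sample pages, and both filters are hypothetical names invented for illustration. It only models the point that a segment filtered at parse time has already lost the rejected URLs, while a segment filtered at db update time can be re-updated with a corrected filter.

```python
def parse(pages, url_filter=None):
    """Toy parse step: collect every page's outlinks into a 'segment'.
    If url_filter is given, links are dropped here (filter at parse time)."""
    segment = []
    for links in pages.values():
        for url in links:
            if url_filter is None or url_filter(url):
                segment.append(url)
    return segment


def update_db(segment, url_filter=None):
    """Toy db update: build the set of known URLs from a segment.
    If url_filter is given, links are dropped here (filter at update time)."""
    return {url for url in segment if url_filter is None or url_filter(url)}


# Hypothetical crawl data: one page with two outlinks.
pages = {
    "http://a.example/": [
        "http://a.example/doc.pdf",
        "http://a.example/page.html",
    ]
}


def too_strict(url):
    # A filter that accidentally rejects wanted PDF links.
    return not url.endswith(".pdf")


def corrected(url):
    # The fixed filter accepts everything.
    return True


# Filter at parse time: the segment is smaller, but the PDF link is gone,
# so re-updating the db with the corrected filter cannot bring it back.
seg_filtered = parse(pages, too_strict)
db_parse_time = update_db(seg_filtered, corrected)

# Filter at update time: the segment keeps every link, so re-running only
# the update with the corrected filter recovers the PDF link.
seg_full = parse(pages)
db_before_fix = update_db(seg_full, too_strict)
db_after_fix = update_db(seg_full, corrected)
```

In this model the parse-time variant stores fewer entries in the segment (the space saving Andrzej mentions), and only the update-time variant recovers the wrongly rejected URL without re-parsing (the recovery Doug describes).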
