I notice filtering urls is done in the output format until parsing. Wouldn't it be better to filter it until updating crawlDb?

"Until" == "during" ?
Sorry, yes during!

As you observed, doing it at this stage saves space in segment data, and in consequence saves on processing time (no CPU/IO needed to process useless data, throw away junk as soon as possible).
Makes sense, thanks for the hint. I guess that now, with a published db filter tool for Nutch 0.7 and 0.8, people will be able to clean up their web and crawl databases.

Stefan
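
To illustrate the point about throwing away junk as soon as possible, here is a minimal sketch of filtering outlinks before they are written out, rather than later when the crawl database is updated. It does not use Nutch's actual classes: `EarlyUrlFilter`, `filterOutlinks`, and the `ACCEPT` rule are hypothetical names made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

public class EarlyUrlFilter {

    // Hypothetical acceptance rule: keep plain http URLs, drop images and query URLs.
    static final Predicate<String> ACCEPT = url ->
            url.startsWith("http://")
                    && !url.endsWith(".gif")
                    && !url.contains("?");

    // Applies the filter at parse-output time, so rejected URLs are never stored
    // and never have to be read again by later processing steps.
    static List<String> filterOutlinks(List<String> outlinks, Predicate<String> accept) {
        List<String> kept = new ArrayList<>();
        for (String url : outlinks) {
            if (accept.test(url)) {
                kept.add(url);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> parsed = Arrays.asList(
                "http://a.example/page.html",
                "http://a.example/logo.gif",
                "http://a.example/search?q=x",
                "ftp://a.example/file");
        List<String> stored = filterOutlinks(parsed, ACCEPT);
        System.out.println(stored.size() + " of " + parsed.size() + " outlinks kept");
    }
}
```

The later the same predicate is applied, the more rejected URLs have already been written and re-read along the way, which is the cost being discussed in this thread.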
