nutch trunk filtering URLs in invertlinks even if -noFilter is on?

Brian Whitman Sat, 22 Sep 2007 12:37:47 -0700

We recently upgraded from a late '06 nightly of nutch to trunk, and most things have been working faster and more stably.

However, there is one catch: we have a "readlinkdb" call in our crawl process, as we want to catalog links to a binary file type (say .png) that other programs of ours can try to download and parse.

We have .png in our url filters because we don't want nutch to try to download these files, but we do want the linkdb to note them.


In our old crawl script, we did:

bin/nutch invertlinks crawl/linkdb -dir crawl/segment -noFilter
bin/nutch readlinkdb crawl/linkdb -dump linkdb_dump

which worked fine, and there were many .png files in the dump. However, with trunk, this doesn't seem to be the case anymore. There are no .png files in the linkdb dump, only html (pretty much the only filetype we allow nutch to download).
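For anyone wanting to reproduce the check, here is a rough sketch of how the .png entries in the dump can be counted. It assumes the dump is a directory of plain-text files in which each target URL starts a line and is followed by whitespace and "Inlinks:"; the function name count_png_targets is made up for illustration and is not part of nutch.

```shell
# Count linkdb dump entries whose target URL ends in .png.
# Assumption: each entry starts with "<url><tab>Inlinks:" at the beginning
# of a line, and the inlink lines below it are indented.
count_png_targets() {
  dump_dir="$1"
  # -r: search every file under the dump directory; -h: omit file names
  grep -rhE '^[^[:space:]]+\.png[[:space:]]+Inlinks:' "$dump_dir" | wc -l | tr -d ' '
}

# Only run against the real dump if it exists in the current directory.
if [ -d linkdb_dump ]; then
  count_png_targets linkdb_dump
fi
```

With the old nightly this kind of count would be well above zero for us; with trunk it comes back as 0.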


Is this intended? Am I doing something wrong?
