We recently upgraded from a late '06 nightly build of Nutch to trunk, and most things have been faster and more stable.

However, there is one catch: our crawl process includes a "readlinkdb" call because we want to catalog links to a binary file type (say, .png) that other programs of ours can then try to download and parse.

We exclude .png in our URL filters because we don't want Nutch to try to download these files, but we do want the linkdb to record links to them.

In our old crawl script, we did:

bin/nutch invertlinks crawl/linkdb -dir crawl/segment -noFilter
bin/nutch readlinkdb crawl/linkdb -dump linkdb_dump

That worked fine, and there were many .png URLs in the dump. With trunk, however, this no longer seems to be the case: the linkdb dump contains no .png URLs, only HTML pages (pretty much the only file type we allow Nutch to download).
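
One rough way to check a dump for .png entries is sketched below. It assumes the dump directory holds plain-text part-* files (for example part-00000), which may differ depending on how the job is run; the function name count_png_links is purely illustrative and not part of Nutch.

```shell
# Hypothetical helper (not a Nutch command): count the lines in a linkdb
# dump that mention a .png URL.
# Assumption: the dump directory contains plain-text part-* files.
count_png_links() {
  dump_dir="$1"
  # Concatenate all part files and count matching lines, case-insensitively.
  # Prints 0 if nothing matches or if no part files are found.
  cat "$dump_dir"/part-* 2>/dev/null | grep -ci '\.png'
}
```

Running count_png_links linkdb_dump against the old and new dumps should make the difference easy to see.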

Is this intended? Am I doing something wrong?

