Following the instructions in the Nutch tutorial, I downloaded the DMOZ file content.rdf.u8, which is roughly 2GB and has 36.9M entries. I then ran the command to grab a subset of those URLs:

  bin/nutch org.apache.nutch.tools.DmozParser content.rdf.u8 -subset 5000 > dmoz/urls

According to the tutorial, the -subset 5000 option tells it to grab one out of every 5000 URLs. As such, I was expecting to see somewhere in the neighborhood of 7000 (36.9M/5000 = 7380) URLs in the dmoz/urls file. Instead, the file contained only 906 URLs.

For grins, I thought I'd try another run at it using -subset 7500. The result was 587 URLs. There is no information in the log files, fyi.

Comparing the numbers above, it's almost as if DmozParser is taking the subset value, multiplying it by a number slightly larger than 8, and then grabbing one out of every <subset>*8.2 lines:

  36.9M/(5000*8.2) = 900
  36.9M/(7500*8.2) = 600

Any hints? I sure hope I'm not being obtuse, but I've looked around a bit for more info, to no avail. I realize this seems like picking nits, because if I'm looking for 7000 URLs I can simply adjust the subset number using my math above, but I'd just like to make sure I understand things...

Also, as many know, the NutchTutorial page on the Nutch Wiki is not up to date with 0.8. Is there any chance that if I rewrite it and send the diffs to someone, they'll actually get applied to the Wiki? I'm also more than willing to change the page directly (it IS a wiki), but can't seem to figure out how! Again, could be the obtuse thing...

Best,
Andy
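
P.S. In case it helps anyone check my numbers, here is a minimal Python sketch of the arithmetic above. It only uses the figures quoted in this message (36.9M entries, and 906 and 587 URLs for -subset 5000 and 7500). The function names are my own, and it assumes nothing about how DmozParser actually selects URLs internally.

```python
# Figures quoted in this message; nothing here is measured independently.
TOTAL_ENTRIES = 36_900_000

# -subset value -> number of URLs actually written to dmoz/urls
observed = {5000: 906, 7500: 587}


def expected_urls(total, subset):
    """URLs expected if exactly one in every `subset` entries were kept."""
    return total / subset


def implied_factor(total, subset, actual):
    """How many times fewer URLs were written than the naive expectation."""
    return expected_urls(total, subset) / actual


for subset, actual in observed.items():
    print(
        f"-subset {subset}: expected ~{expected_urls(TOTAL_ENTRIES, subset):.0f}, "
        f"got {actual}, "
        f"shortfall factor ~{implied_factor(TOTAL_ENTRIES, subset, actual):.2f}"
    )
```

Both runs come out a bit more than 8 times short of what I expected, which is where my rough 8.2 multiplier comes from.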
