Aaron Cosand created NUTCH-2230:
-----------------------------------
Summary: Nutch doesn't index all URLs found
Key: NUTCH-2230
URL: https://issues.apache.org/jira/browse/NUTCH-2230
Project: Nutch
Issue Type: Bug
Components: generator
Affects Versions: 2.3.1
Environment: MongoDB 3.2 with the WiredTiger storage engine (other versions
are probably affected as well)
Reporter: Aaron Cosand
The initial query that the generator task runs against MongoDB does not force
ordering by _id. As a result, the ranges selected for the successive
map-reduce queries are incorrect. The successive queries do appear to run
in the correct order, since _id is always indexed, but they should also
specify a sort explicitly, because no particular order is guaranteed
otherwise. I didn't dig deep enough to see whether the root of the problem is
in Nutch or Gora, or whether it affects only MongoDB or could affect other
databases as well.
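
To illustrate the kind of failure described above, here is a toy model in plain Python. It is not Nutch or Gora code, and it assumes (hypothetically) that key ranges are derived from the first and last _id of each batch in the order the initial query returns them. The function names, batch size, and ids are made up for the example.

```python
def batch_ranges(ids, batch_size):
    """Split ids, in the order received, into batches and describe each
    batch by the inclusive range (first id, last id)."""
    ranges = []
    for start in range(0, len(ids), batch_size):
        batch = ids[start:start + batch_size]
        ranges.append((batch[0], batch[-1]))
    return ranges


def fetch_by_ranges(ids, ranges):
    """Simulate the follow-up range queries: collect every id that
    falls inside at least one (low, high) range."""
    found = set()
    for low, high in ranges:
        found.update(x for x in ids if low <= x <= high)
    return found


# Hypothetical ids in the arbitrary order an unsorted query might return.
unsorted_ids = [5, 1, 9, 3, 7, 2, 8, 4, 6]
# The same ids as an explicitly sorted query would return them.
sorted_ids = sorted(unsorted_ids)

missed_unsorted = set(unsorted_ids) - fetch_by_ranges(
    unsorted_ids, batch_ranges(unsorted_ids, 3))
missed_sorted = set(sorted_ids) - fetch_by_ranges(
    sorted_ids, batch_ranges(sorted_ids, 3))

print("missed without sort:", sorted(missed_unsorted))
print("missed with sort:", sorted(missed_sorted))
```

In this model the unsorted input yields ranges such as (3, 2) that match nothing, so some ids are never visited, while the sorted input covers every id. Whether the real generator derives its ranges this way is an assumption; the point is only that range selection is safe when the initial order is guaranteed.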