[
https://issues.apache.org/jira/browse/NUTCH-2230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208813#comment-15208813
]
Aaron Cosand commented on NUTCH-2230:
-------------------------------------
This appears to be an underlying problem with the way the MongoDB
implementation of the Gora query mechanism works. It seems to incorrectly
assume that documents are always returned in primary-key (_id) order, which
I believe is true for the MMAPv1 storage engine but not for others such as
WiredTiger.
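
To make the failure mode concrete, here is a toy model of it in plain
Python. It is not the actual Gora or Nutch code: the function names, the
chunk size, and the key values are all made up for illustration, and the
"store" is just a list. It only shows what happens when range boundaries
are picked from a result set that is assumed to be in _id order but is not.

```python
from bisect import bisect_left


def split_ranges(keys_in_returned_order, chunk_size):
    """Build [start, end) key ranges from every chunk_size-th key,
    trusting whatever order the store returned the keys in."""
    bounds = keys_in_returned_order[::chunk_size]
    ranges = list(zip(bounds, bounds[1:]))
    ranges.append((bounds[-1], None))  # last range is open-ended
    return ranges


def keys_covered(all_keys, ranges):
    """Simulate the follow-up range queries, which do scan in _id order."""
    ordered = sorted(all_keys)
    covered = set()
    for start, end in ranges:
        lo = bisect_left(ordered, start)
        hi = len(ordered) if end is None else bisect_left(ordered, end)
        covered.update(ordered[lo:hi])  # empty whenever start >= end
    return covered


keys = ["url%02d" % i for i in range(20)]  # hypothetical _id values

# Boundaries taken from a result explicitly sorted by _id: nothing is lost.
print(len(keys_covered(keys, split_ranges(sorted(keys), 5))))  # 20

# Boundaries taken from a result in some other order (reversed here):
# most ranges have start > end and match nothing, so keys are skipped.
print(len(keys_covered(keys, split_ranges(keys[::-1], 5))))  # 16
```

In this model the fix is simply to sort before choosing boundaries, which
corresponds to requesting an explicit sort on _id in the initial query
instead of relying on the storage engine's return order.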
> Nutch doesn't index all URLs found
> ----------------------------------
>
> Key: NUTCH-2230
> URL: https://issues.apache.org/jira/browse/NUTCH-2230
> Project: Nutch
> Issue Type: Bug
> Components: generator
> Affects Versions: 2.3.1
> Environment: MongoDB with WiredTiger storage engine (3.2 but probably
> affects other versions as well)
> Reporter: Aaron Cosand
>
> The initial query that the generator task runs against MongoDB does not
> force ordering by _id, which causes incorrect range selection for the
> successive map-reduce queries. The successive queries do appear to run in
> the correct order, since _id is always indexed, but they should also
> specify an explicit sort, because no particular order is guaranteed
> otherwise. I didn't dig deep enough to see whether the root of the problem
> is in Nutch or Gora, or whether it affects only MongoDB or other databases
> as well.