[ https://issues.apache.org/jira/browse/NUTCH-2230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208481#comment-15208481 ]
Aaron Cosand commented on NUTCH-2230:
-------------------------------------
The MongoDB implementation of GORA assumes that data will be received in sorted
order by the _id (primary key) field. On versions of MongoDB using the MMAPv1
storage engine this assumption holds, but with WiredTiger (and presumably
other storage engines) it does not. While the best fix is a correction to the
GORA MongoDB implementation, the modification below to
org.apache.nutch.storage.StorageUtils should cause current versions of MongoDB
to process the query using an index scan, so that the order of the data
matches the assumption that GORA makes. The single insertion is
'query.setStartKey("");'. The GORA MongoDB implementation converts this into
{_id:{$gte:""}}, which yields all records in the collection, in sorted order:

  public static <K, V> void initMapperJob(Job job,
      Collection<WebPage.Field> fields, Class<K> outKeyClass,
      Class<V> outValueClass,
      Class<? extends GoraMapper<String, WebPage, K, V>> mapperClass,
      Class<? extends Partitioner<K, V>> partitionerClass,
      Filter<String, WebPage> filter, boolean reuseObjects)
      throws ClassNotFoundException, IOException {
    DataStore<String, WebPage> store = createWebStore(job.getConfiguration(),
        String.class, WebPage.class);
    if (store == null)
      throw new RuntimeException("Could not create datastore");
    Query<String, WebPage> query = store.newQuery();
    query.setFields(toStringArray(fields));
    if (filter != null) {
      query.setFilter(filter);
    }
    query.setStartKey("");
    GoraMapper.initMapperJob(job, query, store, outKeyClass, outValueClass,
        mapperClass, partitionerClass, reuseObjects);
    GoraOutputFormat.setOutput(job, store, true);
  }

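
To illustrate why the scan order matters, here is a minimal, self-contained
sketch. It is not GORA or Nutch code: the class RangeSplitSketch and its
methods are hypothetical, and the splitting scheme (taking the first key of
every chunk of the scan as a range boundary) is only an assumption about the
general failure mode, not a description of what GORA actually does. It shows
that boundaries taken from a sorted scan tile the key space, while boundaries
taken from an unsorted scan miss some keys and return others twice.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RangeSplitSketch {

    // Hypothetical splitter: walks the scan in the order the store returned it
    // and records the first key of every chunk as a range boundary.
    static List<String> boundaries(List<String> scan, int chunkSize) {
        List<String> bounds = new ArrayList<>();
        for (int i = 0; i < scan.size(); i += chunkSize) {
            bounds.add(scan.get(i));
        }
        return bounds;
    }

    // Counts the keys that are returned by exactly one follow-up range query.
    // Range i is [bounds[i], bounds[i+1]); the last range has no upper bound.
    static int coveredExactlyOnce(List<String> scan, List<String> bounds) {
        int count = 0;
        for (String key : scan) {
            int hits = 0;
            for (int i = 0; i < bounds.size(); i++) {
                String start = bounds.get(i);
                String end = (i + 1 < bounds.size()) ? bounds.get(i + 1) : null;
                boolean atOrAfterStart = key.compareTo(start) >= 0;
                boolean beforeEnd = (end == null) || key.compareTo(end) < 0;
                if (atOrAfterStart && beforeEnd) {
                    hits++;
                }
            }
            if (hits == 1) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Same six keys, once in _id order and once in an arbitrary order.
        List<String> sorted = Arrays.asList("a", "b", "c", "d", "e", "f");
        List<String> unsorted = Arrays.asList("d", "a", "f", "b", "e", "c");

        System.out.println("sorted scan:   "
            + coveredExactlyOnce(sorted, boundaries(sorted, 2)) + " of 6 keys");
        System.out.println("unsorted scan: "
            + coveredExactlyOnce(unsorted, boundaries(unsorted, 2)) + " of 6 keys");
    }
}
```

With the sorted scan the boundaries are a, c, e and every key falls into
exactly one range. With the unsorted scan the boundaries are d, f, e: the
range [f, e) is empty, the keys a, b and c are never returned, and e is
returned twice.
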
> Nutch doesn't index all URLs found
> ----------------------------------
>
> Key: NUTCH-2230
> URL: https://issues.apache.org/jira/browse/NUTCH-2230
> Project: Nutch
> Issue Type: Bug
> Components: generator
> Affects Versions: 2.3.1
> Environment: MongoDB with WiredTiger storage engine (3.2 but probably
> affects other versions as well)
> Reporter: Aaron Cosand
>
> The initial query run by the generator task against MongoDB doesn't force
> ordering by _id. This causes an incorrect selection of ranges for the
> successive map-reduce queries. The successive queries do appear to be run
> in the correct order, since _id is always indexed, but they should also
> explicitly specify a sort, since a particular order is not guaranteed
> otherwise. I didn't dig deep enough to see whether the root of the problem
> is in Nutch or in GORA, or whether it only affects MongoDB or could affect
> other databases as well.