[jira] [Commented] (NUTCH-2230) Nutch doesn't index all URLs found

Aaron Cosand (JIRA) Wed, 23 Mar 2016 07:26:56 -0700

    [ 
https://issues.apache.org/jira/browse/NUTCH-2230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208481#comment-15208481
 ]


Aaron Cosand commented on NUTCH-2230:
-------------------------------------

The MongoDB implementation of GORA assumes that data will be returned sorted by
the _id (primary key) field.  On versions of MongoDB using the MMAPv1 storage
engine this assumption holds, but with WiredTiger (and presumably other storage
engines) it does not.  While the best fix is a correction to the GORA MongoDB
implementation, the modification to org.apache.nutch.storage.StorageUtils below
should cause current versions of MongoDB to process the query using an index
scan, so that the order of the data matches the assumptions GORA makes.  The
single insertion is 'query.setStartKey("");'.  The GORA MongoDB implementation
converts this into {_id:{$gte:""}}, which yields all records in the collection,
in sorted order:

  public static <K, V> void initMapperJob(Job job,
      Collection<WebPage.Field> fields, Class<K> outKeyClass,
      Class<V> outValueClass,
      Class<? extends GoraMapper<String, WebPage, K, V>> mapperClass,
      Class<? extends Partitioner<K, V>> partitionerClass,
      Filter<String, WebPage> filter, boolean reuseObjects)
      throws ClassNotFoundException, IOException {
    DataStore<String, WebPage> store = createWebStore(job.getConfiguration(),
        String.class, WebPage.class);
    if (store == null)
      throw new RuntimeException("Could not create datastore");
    Query<String, WebPage> query = store.newQuery();
    query.setFields(toStringArray(fields));
    if (filter != null) {
      query.setFilter(filter);
    }
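    // Workaround: an explicit (empty) start key makes the GORA MongoDB store
    // issue {_id: {$gte: ""}}, which is served by an index scan in _id order.
    query.setStartKey("");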
    GoraMapper.initMapperJob(job, query, store, outKeyClass, outValueClass,
        mapperClass, partitionerClass, reuseObjects);
    GoraOutputFormat.setOutput(job, store, true);
  }
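
To illustrate why the return order matters, here is a minimal, self-contained
sketch.  It is not GORA's actual code: the class SplitOrderDemo, its two
methods, and the "every Nth key becomes a split boundary" rule are all
assumptions made for the illustration.  It only shows that when range
boundaries are picked from keys in the order a cursor returns them, an unsorted
cursor produces ranges that skip some records and read others twice.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

public class SplitOrderDemo {

    /**
     * Hypothetical splitter: takes every step-th key, in the order the
     * cursor returned it, as the start of a new range.
     */
    public static List<String> pickBoundaries(List<String> keysInCursorOrder, int step) {
        List<String> boundaries = new ArrayList<>();
        for (int i = 0; i < keysInCursorOrder.size(); i += step) {
            boundaries.add(keysInCursorOrder.get(i));
        }
        return boundaries;
    }

    /**
     * Returns the distinct keys reached by the range queries
     * [b0, b1), [b1, b2), ... with the last range open-ended.
     */
    public static TreeSet<String> keysCovered(List<String> allKeys, List<String> boundaries) {
        TreeSet<String> sorted = new TreeSet<>(allKeys);
        TreeSet<String> covered = new TreeSet<>();
        for (int i = 0; i < boundaries.size(); i++) {
            String start = boundaries.get(i);
            if (i + 1 < boundaries.size()) {
                String end = boundaries.get(i + 1);
                // A range whose start is not below its end matches nothing.
                if (start.compareTo(end) < 0) {
                    covered.addAll(sorted.subSet(start, end));
                }
            } else {
                covered.addAll(sorted.tailSet(start));
            }
        }
        return covered;
    }

    public static void main(String[] args) {
        // Same six keys, once in _id order and once in an arbitrary order.
        List<String> sortedOrder = Arrays.asList("a", "b", "c", "d", "e", "f");
        List<String> naturalOrder = Arrays.asList("d", "a", "f", "c", "b", "e");

        System.out.println("sorted cursor covers:   "
            + keysCovered(sortedOrder, pickBoundaries(sortedOrder, 2)));
        System.out.println("unsorted cursor covers: "
            + keysCovered(naturalOrder, pickBoundaries(naturalOrder, 2)));
    }
}
```

With the sorted input the boundaries are a, c, e and every key falls into
exactly one range.  With the unsorted input the boundaries are d, f, b, so the
range [f, b) is empty, key "a" is never read, and "d" and "e" fall into two
ranges.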


> Nutch doesn't index all URLs found
> ----------------------------------
>
>                 Key: NUTCH-2230
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2230
>             Project: Nutch
>          Issue Type: Bug
>          Components: generator
>    Affects Versions: 2.3.1
>         Environment: MongoDB with WiredTiger storage engine (3.2 but probably 
> affects other versions as well)
>            Reporter: Aaron Cosand
>
> The initial query run by the generator task, against mongodb, doesn't force 
> ordering by _id.  This causes an incorrect selection of ranges for successive 
> map-reduce related queries.  The successive queries do appear to be getting 
> run in the correct order since _id is always indexed, but they should also 
> explicitly specify a sort, since you are not guaranteed a particular order 
> otherwise.  I didn't dig deep enough to see if the root of the problem is 
> with nutch or gora, and whether it only affected mongo or could affect other 
> databases as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
