[ https://issues.apache.org/jira/browse/NUTCH-739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714277#action_12714277 ]
Ken Krugler commented on NUTCH-739:
-----------------------------------

There's another approach that works well here, and that's to start up a thread that calls the Hadoop reporter while the optimize is happening. We ran into the same issue when optimizing large Lucene indexes from our Bixo IndexScheme tap for Cascading. You can find that code on GitHub, but the skeleton is to do something like this in the reducer's close() method, assuming you've stashed the Reporter from the reduce() call:

{code:java}
// Hadoop needs to know we're still working on it.
Thread reporterThread = new Thread() {
    @Override
    public void run() {
        while (!isInterrupted()) {
            reporter.progress();
            try {
                sleep(10 * 1000);
            } catch (InterruptedException e) {
                // Re-set the interrupt flag so the loop condition sees it and the thread exits.
                interrupt();
            }
        }
    }
};
reporterThread.start();

indexWriter.optimize();
// ... and other lengthy tasks here ...

reporterThread.interrupt();
{code}

> SolrDeleteDuplications too slow when using hadoop
> -------------------------------------------------
>
>                 Key: NUTCH-739
>                 URL: https://issues.apache.org/jira/browse/NUTCH-739
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.0.0
>        Environment: hadoop cluster with 3 nodes
>                      Map Task Capacity: 6
>                      Reduce Task Capacity: 6
>                      Indexer: one instance of solr server (on one of the slave nodes)
>            Reporter: Dmitry Lihachev
>             Fix For: 1.1
>
>         Attachments: NUTCH-739_remove_optimize_on_solr_dedup.patch
>
>
> In my environment I always get many warnings like this during the dedup step:
> {noformat}
> Task attempt_200905270022_0212_r_000003_0 failed to report status for 600 seconds. Killing!
> {noformat}
> Solr logs:
> {noformat}
> INFO: [] webapp=/solr path=/update params={wt=javabin&waitFlush=true&optimize=true&waitSearcher=true&maxSegments=1&version=2.2} status=0 QTime=173741
> May 27, 2009 10:29:27 AM org.apache.solr.update.processor.LogUpdateProcessor finish
> INFO: {optimize=} 0 173599
> May 27, 2009 10:29:27 AM org.apache.solr.core.SolrCore execute
> INFO: [] webapp=/solr path=/update params={wt=javabin&waitFlush=true&optimize=true&waitSearcher=true&maxSegments=1&version=2.2} status=0 QTime=173599
> May 27, 2009 10:29:27 AM org.apache.solr.search.SolrIndexSearcher close
> INFO: Closing searc...@2ad9ac58 main
> May 27, 2009 10:29:27 AM org.apache.solr.core.JmxMonitoredMap$SolrDynamicMBean getMBeanInfo
> WARNING: Could not getStatistics on info bean org.apache.solr.search.SolrIndexSearcher
> org.apache.lucene.store.AlreadyClosedException: this IndexReader is closed
> ....
> {noformat}
> So I think the problem is in the piece of code on line 301 of SolrDeleteDuplications (solr.optimize()): because there are several job tasks, each of them tries to optimize the Solr index before closing.
> The simplest way to avoid this bug is to remove that line and send an "<optimize/>" message directly to the Solr server after the dedup step.
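For reference, the fix proposed in the issue description (drop the per-task optimize and send a single "<optimize/>" to the Solr server once the dedup job has finished) could look roughly like the SolrJ sketch below. This is only an illustration, not the attached patch; the class name, the main() wrapper, and the default URL are assumptions, only the SolrJ calls themselves (CommonsHttpSolrServer, SolrServer.optimize()) are real API from the Solr 1.x client.

{code:java}
import java.io.IOException;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

// Hypothetical driver-side step, run once after the dedup job completes.
public class OptimizeAfterDedup {

  public static void main(String[] args) throws IOException, SolrServerException {
    // Hypothetical default; in practice this would be the configured Solr URL
    // (e.g. the solr.server.url property in Nutch).
    String solrUrl = args.length > 0 ? args[0] : "http://localhost:8983/solr";

    SolrServer solr = new CommonsHttpSolrServer(solrUrl);

    // One explicit optimize for the whole index, issued from the driver,
    // instead of one optimize per reduce task in close().
    solr.optimize();
  }
}
{code}

Doing the optimize once from the job driver means the expensive segment merge happens a single time, outside any reduce task, so no task attempt can hit the 600-second reporting timeout.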