[jira] [Comment Edited] (SOLR-7188) Run Data Import Handler processes in a SolrJ client

Tim Allison (JIRA) Wed, 04 Mar 2015 10:30:07 -0800

    [ 
https://issues.apache.org/jira/browse/SOLR-7188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14347319#comment-14347319
 ]


Tim Allison edited comment on SOLR-7188 at 3/4/15 6:29 PM:
-----------------------------------------------------------

I'm fairly new to Solr, but from experience on TIKA-1302, anything you can do 
to isolate Tika's JVM from everything else is a great idea.  No matter how 
hard we try over on Tika, parsers will misbehave: they run out of memory (OOM) 
and/or hang permanently.  This happens fairly rarely, but when it does, it's a 
showstopper for anything running in Tika's JVM.
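
To make the isolation idea concrete, here is a minimal sketch (not anything 
from the attached patch) of a parent that runs a command in a child process 
and kills it if it does not finish in time.  The class name IsolatedRunner 
and its Outcome enum are made up for illustration; only standard JDK calls 
are used.

```java
import java.io.File;
import java.io.IOException;
import java.util.List;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: runs a command in a child process under a watchdog timeout. */
final class IsolatedRunner {

    /** Result of one child run. */
    enum Outcome { OK, FAILED, TIMED_OUT }

    /** Starts the command, waits up to timeoutMillis, and force-kills it on timeout. */
    static Outcome run(List<String> command, long timeoutMillis) {
        Process child;
        try {
            child = new ProcessBuilder(command)
                    .redirectErrorStream(true)
                    // Discard output so a full pipe buffer cannot stall the child.
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                    .start();
        } catch (IOException e) {
            return Outcome.FAILED; // the child could not be started at all
        }
        try {
            if (!child.waitFor(timeoutMillis, TimeUnit.MILLISECONDS)) {
                child.destroyForcibly(); // a permanent hang ends here, not in the parent
                return Outcome.TIMED_OUT;
            }
            // A child JVM that dies from an uncaught OutOfMemoryError exits non-zero.
            return child.exitValue() == 0 ? Outcome.OK : Outcome.FAILED;
        } catch (InterruptedException e) {
            child.destroyForcibly();
            Thread.currentThread().interrupt();
            return Outcome.FAILED;
        }
    }

    /** Path to the java launcher of the currently running JVM. */
    static String javaBinary() {
        return System.getProperty("java.home") + File.separator + "bin" + File.separator + "java";
    }
}
```

In a real setup the command would be something like the java binary, an -Xmx 
heap cap, a classpath, and a small main class that does the Tika parse (that 
main class is assumed here, not shown).  The point is only that an OOM or a 
hang then costs one child process rather than the whole JVM.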

We recently added an EvilParser to our tika-parser test suite, and if you are 
interested in hardening DIH against these issues, that parser might be useful 
for testing. 

Another option would be to run tika-server (JAX-RS) and call out to it from 
DIH.  In the next few months, we plan to harden it so that it shuts down on 
an OOM or a permanent hang and a parent process restarts it.
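
As a rough sketch of what calling out to tika-server could look like from 
Java: the client below PUTs the raw document bytes to the server's /tika 
resource and asks for plain text back.  The class name TikaServerClient is 
made up, the base URL is whatever your server listens on (9998 is the usual 
default port), and this uses the JDK 11+ java.net.http client rather than 
anything in DIH.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

/** Hypothetical client that sends documents to a separately running tika-server. */
final class TikaServerClient {
    private final URI tikaEndpoint;
    private final HttpClient http = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(5))
            .build();

    TikaServerClient(String baseUrl) {
        String trimmed = baseUrl.endsWith("/")
                ? baseUrl.substring(0, baseUrl.length() - 1)
                : baseUrl;
        this.tikaEndpoint = URI.create(trimmed + "/tika");
    }

    /** Builds a PUT request carrying the document bytes, with a per-request timeout. */
    HttpRequest buildRequest(byte[] document, Duration timeout) {
        return HttpRequest.newBuilder(tikaEndpoint)
                .timeout(timeout) // a hung parser surfaces as a timeout in the caller
                .header("Accept", "text/plain")
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }

    /** Sends the document and returns the extracted text, or throws on failure or timeout. */
    String extractText(byte[] document, Duration timeout)
            throws IOException, InterruptedException {
        HttpResponse<String> response =
                http.send(buildRequest(document, timeout), HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("tika-server returned HTTP " + response.statusCode());
        }
        return response.body();
    }
}
```

With this arrangement a timeout or connection failure shows up as an 
IOException on the caller's side, which the caller can log and skip, while 
the server process is free to crash and be restarted independently.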


> Run Data Import Handler processes in a SolrJ client
> ---------------------------------------------------
>
>                 Key: SOLR-7188
>                 URL: https://issues.apache.org/jira/browse/SOLR-7188
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>            Reporter: Ted Sullivan
>         Attachments: SOLR-7188.patch, SOLR-7188.patch
>
>
> Adds a DataImportHandlerClient class that wraps an EmbeddedSolrServer and 
> adds a DIHCloudWriter implementation of DIHWriter that sends documents to a 
> remote SolrCloud cluster.  This enables existing DIH processes to run outside 
> of the Solr JVM, which should enable better scalability.
> The current architecture of DIH imposes several restrictions on scalability. 
> First, the DIH runs in the same process space as Solr itself and competes for 
> resources (CPU and memory) with normal Solr processes devoted to indexing and 
> querying. Second, the DIH cannot be multi-threaded, which means that 
> parallelizing it requires splitting the processing amongst nodes in a 
> SolrCloud cluster. Since the incoming data is sent through an 
> UpdateRequestProcessor chain (via the SolrWriter implementation of 
> DIHWriter), additional routing is done internally as the documents are 
> forwarded to the current shard leader nodes once the ID hash is computed. 
> This causes additional network traffic within the SolrCloud cluster. Scaling 
> the DIH is limited by the number of nodes in the cluster, and any heavy-duty 
> processing due to entity processors or transformation elements shares the 
> processing resources of Solr itself. This is known to be a source of 
> bottlenecks in Solr installations (SolrCloud or Master-Slave) that use DIH.
> The DataImportHandlerClient uses native DIH functionality (DataImporter, 
> etc.) but can be run externally to Solr. This means that as many processes as 
> are needed to achieve necessary performance at scale can be added and the 
> processing that occurs within the DataImportHandler is done outside of the 
> Solr JVM. The same benefits that accrue with multiple SolrJ clients can now 
> be realized with DIH without the necessity of porting code from DIH to a 
> SolrJ client.
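
For readers who want a feel for the external-writer idea in the description 
above, here is a simplified sketch.  It is not the DIHCloudWriter from the 
patch: BatchingWriter and its sink callback are hypothetical names, and the 
sink stands in for whatever actually ships a batch to the remote cluster 
(for example, a SolrJ client call).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

/** Hypothetical writer that buffers documents and hands them to a remote sink in batches. */
final class BatchingWriter implements AutoCloseable {
    private final int batchSize;
    private final Consumer<List<Map<String, Object>>> sink;
    private final List<Map<String, Object>> buffer = new ArrayList<>();

    BatchingWriter(int batchSize, Consumer<List<Map<String, Object>>> sink) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        this.batchSize = batchSize;
        this.sink = sink;
    }

    /** Buffers one document and flushes once the batch is full. */
    void write(Map<String, Object> doc) {
        buffer.add(doc);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    /** Sends any buffered documents to the sink as a single batch. */
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        sink.accept(new ArrayList<>(buffer)); // copy so the sink owns its own list
        buffer.clear();
    }

    /** Flushes the final partial batch. */
    @Override
    public void close() {
        flush();
    }
}
```

Because each such writer lives in its own client process, adding import 
capacity means starting more processes, and the entity processing and 
transformation work no longer competes with Solr for CPU and memory.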



