[
https://issues.apache.org/jira/browse/SOLR-7188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14367251#comment-14367251
]
Ted Sullivan commented on SOLR-7188:
------------------------------------
I have refactored the DIH code using [~noblepaul]'s AbstractionLayer interface
idea. All tests now pass. The DIH client code currently lives in a subdirectory,
org/apache/solr/handler/dataimport/client. It may make sense to move this
code to SolrJ for jar packaging etc., but that would probably require a new JIRA
ticket for the refactoring / abstraction layer piece. I am not sure what the best
strategy is at this point.
TODOs: One bit of code is still needed in the ClientAbstractionLayer to
reproduce the IndexSchema without requiring SolrConfig. This also needs a test
case, since the original test case does not exercise this code.
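
To make the abstraction layer idea concrete, here is a rough sketch of the
pattern, not the code in the patch. The interface methods, the class names
InSolrLayer and ClientLayer, and the helper countKnownFields are all
hypothetical stand-ins: the import code talks only to an interface, and the
client-side implementation builds its schema view from a plain field list
instead of going through SolrConfig.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class AbstractionLayerSketch {

    // Hypothetical interface: the only view of the host environment
    // that the import code is allowed to use.
    interface AbstractionLayer {
        String environmentName();
        boolean hasField(String fieldName);
    }

    // Stand-in for the in-Solr case, where the schema would come
    // from the running core (modelled here as a simple map).
    static final class InSolrLayer implements AbstractionLayer {
        private final Map<String, String> coreSchema;

        InSolrLayer(Map<String, String> coreSchema) {
            this.coreSchema = coreSchema;
        }

        public String environmentName() { return "in-solr"; }

        public boolean hasField(String fieldName) {
            return coreSchema.containsKey(fieldName);
        }
    }

    // Stand-in for the client case: the schema view is built from
    // name/type pairs supplied directly, with no SolrConfig involved.
    static final class ClientLayer implements AbstractionLayer {
        private final Map<String, String> fields = new LinkedHashMap<>();

        ClientLayer(String... namesAndTypes) {
            if (namesAndTypes.length % 2 != 0) {
                throw new IllegalArgumentException("expected name/type pairs");
            }
            for (int i = 0; i < namesAndTypes.length; i += 2) {
                fields.put(namesAndTypes[i], namesAndTypes[i + 1]);
            }
        }

        public String environmentName() { return "client"; }

        public boolean hasField(String fieldName) {
            return fields.containsKey(fieldName);
        }
    }

    // Import-side code depends only on the interface, so it runs
    // unchanged in either environment.
    static int countKnownFields(AbstractionLayer layer, Map<String, Object> row) {
        int known = 0;
        for (String name : row.keySet()) {
            if (layer.hasField(name)) {
                known++;
            }
        }
        return known;
    }

    public static void main(String[] args) {
        AbstractionLayer layer = new ClientLayer("id", "string", "title", "text");
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("id", "1");
        row.put("title", "example");
        row.put("not_in_schema", 42);
        System.out.println(layer.environmentName() + ": "
                + countKnownFields(layer, row) + " known fields");
    }
}
```

The point of the split is that a test for the client path only needs a
ClientLayer-style implementation and never has to start a core.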
> Run Data Import Handler processes in a SolrJ client
> ---------------------------------------------------
>
> Key: SOLR-7188
> URL: https://issues.apache.org/jira/browse/SOLR-7188
> Project: Solr
> Issue Type: Improvement
> Components: contrib - DataImportHandler
> Reporter: Ted Sullivan
> Attachments: IDEA-AS-CODE.patch, SOLR-7188.patch, SOLR-7188.patch
>
>
> Adds a DataImportHandlerClient class that wraps an EmbeddedSolrServer and
> adds a DIHCloudWriter implementation of DIHWriter that sends documents to a
> remote SolrCloud cluster. This enables existing DIH processes to run outside
> of the Solr JVM, which should enable better scalability.
> The current architecture of DIH imposes several restrictions on scalability.
> First, the DIH runs in the same process space as Solr itself and competes for
> resources (CPU and memory) with normal Solr processes devoted to indexing and
> querying. Second, the DIH cannot be multi-threaded, which means that
> parallelizing it requires splitting the processing amongst nodes in a
> SolrCloud cluster. Since the incoming data is sent through an
> UpdateRequestProcessor chain (via the SolrWriter implementation of
> DIHWriter), additional routing is done internally as the documents are
> forwarded to the current shard leader nodes once the ID hash is computed.
> This causes additional network traffic within the SolrCloud cluster. Scaling
> the DIH is limited by the number of nodes in the cluster and any heavy-duty
> processing due to entity processors or transformation elements shares the
> processing resources of Solr itself. This is known to be a source of
> bottlenecks in Solr installations (SolrCloud or Master-Slave) that use DIH.
> The DataImportHandlerClient uses native DIH functionality (DataImporter,
> etc.) but can be run externally to Solr. This means that as many processes as
> are needed to achieve necessary performance at scale can be added and the
> processing that occurs within the DataImportHandler is done outside of the
> Solr JVM. The same benefits that accrue with multiple SolrJ clients can now
> be realized with DIH without the necessity of porting code from DIH to a
> SolrJ client.
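
The routing cost described in the issue text can be illustrated with a small
sketch. It assumes a deliberately simplified hash (String.hashCode modulo the
shard count), which is not Solr's actual hash function or hash-range mapping.
The names shardFor and groupByShard are hypothetical. The idea is only that a
client that can compute the target shard from the document ID can batch
documents per shard and address each batch to that shard's leader, rather than
having nodes forward documents to one another.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ShardRoutingSketch {

    // Simplified stand-in for hash-based routing. This is NOT the hash
    // or range mapping Solr uses; it only shows the shape of the idea.
    static int shardFor(String docId, int numShards) {
        return Math.floorMod(docId.hashCode(), numShards);
    }

    // Group document IDs by target shard so that each batch could be
    // sent straight to the node responsible for that shard.
    static Map<Integer, List<String>> groupByShard(List<String> docIds, int numShards) {
        Map<Integer, List<String>> batches = new TreeMap<>();
        for (String id : docIds) {
            batches.computeIfAbsent(shardFor(id, numShards), k -> new ArrayList<>())
                   .add(id);
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList("doc-1", "doc-2", "doc-3", "doc-4", "doc-5");
        groupByShard(ids, 3).forEach((shard, batch) ->
                System.out.println("shard " + shard + " <- " + batch));
    }
}
```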