[jira] [Updated] (SOLR-6199) SolrJ, using SolrInputDocument methods, requires entire document to be loaded into memory

Karl Wright (JIRA) Wed, 25 Jun 2014 17:05:00 -0700

     [ 
https://issues.apache.org/jira/browse/SOLR-6199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Karl Wright updated SOLR-6199:
------------------------------

    Component/s: clients - java

> SolrJ, using SolrInputDocument methods, requires entire document to be loaded 
> into memory
> -----------------------------------------------------------------------------------------
>
>                 Key: SOLR-6199
>                 URL: https://issues.apache.org/jira/browse/SOLR-6199
>             Project: Solr
>          Issue Type: Bug
>          Components: clients - java
>    Affects Versions: 4.7.3
>            Reporter: Karl Wright
>
> ManifoldCF has historically used Solr's extracting update handler to 
> transmit binary documents to Solr.  Recently, we've added Tika processing 
> of binary documents, and wanted instead to send a character stream (of a 
> length not limited by ManifoldCF) to Solr as the primary content field.  
> Unfortunately, it appears that the SolrInputDocument metaphor for receiving 
> extracted content and metadata requires that all fields be completely 
> converted to String objects.  As a result, ManifoldCF will certainly run out 
> of memory at some point, when multiple ManifoldCF threads all try to convert 
> large documents to in-memory strings at the same time.
>
> I looked into what would be needed to add streaming support to UpdateRequest 
> and SolrInputDocument.  Basically, a legal option would be to set a field 
> value that would be a Reader or a Reader[].  It would be straightforward to 
> implement this, EXCEPT for the fact that SolrCloud apparently makes 
> UpdateRequest copies, and copying a Reader isn't going to work unless there's 
> a backing solid object somewhere.  Even then, I could have gotten this to 
> work by using a temporary file for large streams, but there's no signal from 
> SolrCloud when it is done with its copies of UpdateRequest, so there's no 
> place to free any backing storage.
>
> If anyone knows a good way to do non-extracting updates without loading 
> entire documents into memory, please let me know.
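
The temporary-file idea mentioned in the description can be sketched in plain
Java. This is only an illustration under stated assumptions, not SolrJ code:
the class SpooledReaderSource and its methods spool, openReader, isBacked and
readAll are hypothetical names, and nothing here touches SolrInputDocument or
UpdateRequest. It shows why a backing file makes a Reader "copyable", and why
something still has to call close() once every copy is finished, which is the
signal the description says SolrCloud does not provide.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/**
 * Hypothetical helper (not part of SolrJ): spools a one-shot Reader to a
 * temporary file so that several independent Readers can be opened over
 * the same content without holding it all in memory.
 */
final class SpooledReaderSource implements AutoCloseable {
    private final Path backingFile;

    private SpooledReaderSource(Path backingFile) {
        this.backingFile = backingFile;
    }

    /** Copies the stream to disk in fixed-size chunks; memory use stays bounded. */
    static SpooledReaderSource spool(Reader in) throws IOException {
        Path tmp = Files.createTempFile("spooled-field-", ".txt");
        try (Writer out = Files.newBufferedWriter(tmp, StandardCharsets.UTF_8)) {
            char[] buf = new char[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        } catch (IOException e) {
            Files.deleteIfExists(tmp); // do not leak the file on failure
            throw e;
        }
        return new SpooledReaderSource(tmp);
    }

    /** Each call returns a fresh Reader positioned at the start of the content. */
    Reader openReader() throws IOException {
        return Files.newBufferedReader(backingFile, StandardCharsets.UTF_8);
    }

    /** True while the temporary file still exists. */
    boolean isBacked() {
        return Files.exists(backingFile);
    }

    /** Frees the backing storage; the owner must know that all copies are done. */
    @Override
    public void close() throws IOException {
        Files.deleteIfExists(backingFile);
    }

    /** Demo-only helper: drains a Reader into a String (fine for tiny inputs). */
    static String readAll(Reader r) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        int n;
        while ((n = r.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        try (SpooledReaderSource src = spool(new StringReader("extracted text"))) {
            // Two "copies" of the same field value, read independently.
            try (Reader a = src.openReader(); Reader b = src.openReader()) {
                System.out.println(readAll(a).equals(readAll(b)));
            }
        } // close() runs here; without such a hook the temp file would linger
    }
}
```

In this sketch the try-with-resources block plays the role of the missing
"done with all copies" notification. Without an equivalent callback from the
code that makes the copies, there is no safe point at which to delete the file.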



--
This message was sent by Atlassian JIRA
(v6.2#6252)

