[jira] [Created] (SOLR-6199) SolrJ, using SolrInputDocument methods, requires entire document to be loaded into memory

Karl Wright (JIRA) Wed, 25 Jun 2014 01:01:18 -0700

Karl Wright created SOLR-6199:
---------------------------------

             Summary: SolrJ, using SolrInputDocument methods, requires entire document to be loaded into memory
                 Key: SOLR-6199
                 URL: https://issues.apache.org/jira/browse/SOLR-6199
             Project: Solr
          Issue Type: Bug
    Affects Versions: 4.7.3
            Reporter: Karl Wright



ManifoldCF has historically used Solr's extracting update handler to 
transmit binary documents to Solr.  Recently, we've added Tika processing 
of binary documents, and now want to send the extracted text to Solr as a 
character stream (of a length not limited by ManifoldCF) in a primary content 
field.  Unfortunately, it appears that the SolrInputDocument model for 
receiving extracted content and metadata requires that all fields be 
completely converted to String objects.  ManifoldCF is therefore certain to 
run out of memory at some point, when multiple ManifoldCF threads all try to 
convert large documents to in-memory strings at the same time.
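
To make the memory concern concrete, here is a minimal sketch in plain Java 
(no SolrJ calls) contrasting the two ways of handling extracted text.  The 
class and method names (FieldCopy, toStringValue, streamTo) are hypothetical 
and exist only for illustration; they are not part of SolrJ or ManifoldCF.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;

/** Hypothetical illustration only; not part of SolrJ or ManifoldCF. */
final class FieldCopy {
    private static final int BUFFER_CHARS = 8192;

    /**
     * Materializes the whole stream as a String, which is what a
     * String-valued field requires. Heap use grows with document size.
     */
    static String toStringValue(Reader in) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[BUFFER_CHARS];
        int n;
        while ((n = in.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    /**
     * Copies the stream through a fixed-size buffer. Heap use is bounded by
     * BUFFER_CHARS regardless of document size. Returns the chars copied.
     */
    static long streamTo(Reader in, Writer out) throws IOException {
        char[] buf = new char[BUFFER_CHARS];
        long total = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
            total += n;
        }
        return total;
    }
}
```

With many worker threads each calling something like toStringValue on a 
large document, peak heap use is roughly the sum of all document sizes in 
flight, whereas the streaming form needs only one buffer per thread.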

I looked into what would be needed to add streaming support to UpdateRequest 
and SolrInputDocument.  Basically, it would become legal to set a field value 
to a Reader or a Reader[].  This would be straightforward to implement, 
EXCEPT that SolrCloud apparently makes copies of UpdateRequest, and copying 
a Reader cannot work unless there is a re-readable backing object behind it.  
Even then, I could have made this work by spooling large streams to a 
temporary file, but SolrCloud gives no signal when it is done with its copies 
of UpdateRequest, so there is no point at which the backing storage can be 
freed.
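
The temporary-file idea can be sketched as follows, using only the Java 
standard library.  SpooledReaderSource is a hypothetical helper, not an 
existing SolrJ class, and the sketch assumes the text can be encoded as UTF-8 
(for example, no unpaired surrogates).  It also shows where the problem lies: 
someone has to call close() to delete the file, and that is the signal 
SolrCloud does not provide.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/**
 * Hypothetical helper (not part of SolrJ): spools a character stream to a
 * temporary file so that independent Readers can be opened on it repeatedly,
 * which is what copying a request would need.
 */
final class SpooledReaderSource implements AutoCloseable {
    private final Path backingFile;

    /** Drains the source through a fixed-size buffer into a temp file. */
    SpooledReaderSource(Reader source) throws IOException {
        backingFile = Files.createTempFile("spooled-field-", ".txt");
        try (Writer out = Files.newBufferedWriter(backingFile, StandardCharsets.UTF_8)) {
            char[] buffer = new char[8192];
            int n;
            while ((n = source.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        } catch (IOException e) {
            // Do not leave a partial file behind if spooling fails.
            Files.deleteIfExists(backingFile);
            throw e;
        }
    }

    /** Each call returns a fresh Reader positioned at the start of the content. */
    Reader openReader() throws IOException {
        return Files.newBufferedReader(backingFile, StandardCharsets.UTF_8);
    }

    /** Frees the backing storage; must be called once every copy is finished. */
    @Override
    public void close() throws IOException {
        Files.deleteIfExists(backingFile);
    }
}
```

Each copy of a request could call openReader() to get its own view of the 
content, but without a completion callback nothing knows when close() is safe 
to call.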

If anyone knows a good way to do non-extracting updates without loading entire 
documents into memory, please let me know.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

