Karl Wright created SOLR-6199:
---------------------------------
Summary: SolrJ, using SolrInputDocument methods, requires entire
document to be loaded into memory
Key: SOLR-6199
URL: https://issues.apache.org/jira/browse/SOLR-6199
Project: Solr
Issue Type: Bug
Affects Versions: 4.7.3
Reporter: Karl Wright
ManifoldCF has historically used Solr's extracting update handler to
transmit binary documents to Solr. Recently, we've added Tika processing of
binary documents, and would now like to send a character stream (whose length
ManifoldCF does not limit) to Solr as the primary content field.
Unfortunately, it appears that the SolrInputDocument model for receiving
extracted content and metadata requires every field to be fully converted
to a String object. ManifoldCF will therefore eventually run out of memory
when multiple ManifoldCF threads convert large documents to in-memory
strings at the same time.
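
To make the memory cost concrete, here is a minimal sketch of what a caller has to do today before handing content to SolrInputDocument. The class and method names (FieldMaterializer, drainToString) are hypothetical and are not part of SolrJ or ManifoldCF; only the JDK is used.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

// Hypothetical helper for illustration only; not part of SolrJ or ManifoldCF.
final class FieldMaterializer {
    private FieldMaterializer() {}

    /**
     * Drains the whole stream into a single String. The read buffer is small
     * and fixed, but the StringBuilder (and the String copied from it) grows
     * with the size of the document, so heap use is proportional to the
     * content length.
     */
    static String drainToString(Reader reader) {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[8192];
        try {
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }
}
```

The resulting String would then be passed as the field value, for example doc.addField("content", FieldMaterializer.drainToString(reader)), where doc is a SolrInputDocument and the field name "content" is an assumption. With several worker threads doing this at once, the per-document strings add up.
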
I looked into what would be needed to add streaming support to UpdateRequest
and SolrInputDocument. Basically, a field value would be allowed to be a Reader
or a Reader[]. This would be straightforward to implement, EXCEPT that
SolrCloud apparently makes copies of UpdateRequest, and a Reader cannot be
copied unless there is a re-readable backing object behind it. Even then, I
could have made this work by using a temporary file for large streams, but
SolrCloud gives no signal when it is done with its copies of UpdateRequest,
so there is no place to free the backing storage.
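
The temporary-file idea can be sketched roughly as follows. This is an illustration only, using nothing but the JDK: SpooledCharSource, spool and openReader are hypothetical names, and no such class exists in SolrJ. The one-shot Reader is copied to a temp file through a fixed-size buffer, after which each copy of a request could open its own fresh Reader.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical class for illustration only; nothing like it exists in SolrJ.
final class SpooledCharSource implements AutoCloseable {
    private final Path file;

    private SpooledCharSource(Path file) {
        this.file = file;
    }

    /** Copies a one-shot Reader to a temp file using a fixed-size buffer. */
    static SpooledCharSource spool(Reader in) {
        try {
            Path tmp = Files.createTempFile("solr-field-", ".txt");
            try (Writer out = Files.newBufferedWriter(tmp, StandardCharsets.UTF_8)) {
                char[] buf = new char[8192];
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
            } catch (IOException | RuntimeException e) {
                // Do not leave a partial file behind if spooling fails.
                Files.deleteIfExists(tmp);
                throw e;
            }
            return new SpooledCharSource(tmp);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Opens a new, independent Reader over the spooled content. */
    Reader openReader() {
        try {
            return Files.newBufferedReader(file, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Frees the backing storage; something has to know when this is safe. */
    @Override
    public void close() {
        try {
            Files.deleteIfExists(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The sketch makes the remaining gap visible: close() is the only place the temp file is freed, and with no notification that the last copy of an UpdateRequest is finished, there is no obviously safe moment to call it.
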
If anyone knows a good way to do non-extracting updates without loading entire
documents into memory, please let me know.