[ https://issues.apache.org/jira/browse/SOLR-11869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16330528#comment-16330528 ]
David Smiley commented on SOLR-11869:
-------------------------------------

If you wish to propose that Solr FieldType.createField and related plumbing work nicely with a Reader, then I think you should create an issue dedicated to that. Also keep in mind that such a field cannot be "stored", since at the Lucene level it is required to be fully materialized to a String or BytesRef. A further consequence of that is that atomic updates are not possible.

Another thing that could be considered is using a BytesRef as the stored value and wrapping a Reader around it for the Lucene Analyzer/TokenStream parts. You wouldn't be truly streaming, but the RAM requirements should drop roughly in half, since you would be working with UTF-8 (usually 1 byte per Unicode character) as opposed to a String (UTF-16, usually 2 bytes per character). This may have some gotchas, such as highlighting and stored-data retrieval, which anticipate a String from Lucene rather than raw bytes.

BTW, Lucene and Solr have code paths that recognize massive bytes <-> char[] conversions and avoid over-allocating arrays: they first do a preliminary pass over the data to count the Unicode chars, and so compute exactly how big the array on the other side needs to be.

> Remote streaming UpdateRequestProcessor
> ---------------------------------------
>
>                Key: SOLR-11869
>                URL: https://issues.apache.org/jira/browse/SOLR-11869
>            Project: Solr
>         Issue Type: Improvement
>     Security Level: Public (Default Security Level. Issues are Public)
>         Components: UpdateRequestProcessors
>           Reporter: Dirk Rudolph
>           Priority: Minor
>
> When indexing documents from content management systems (or digital asset management systems), the documents usually have fields for metadata given by an editor, and in the case of PDFs, DOCX, or other text formats they may also contain the binary content itself, which might be parsed to plain text using Tika. This is what the ExtractingRequestHandler currently supports.
> We are now facing situations where we index batches of documents using the UpdateRequestHandler and want to send the binary content of the documents mentioned above as part of a single request to the UpdateRequestHandler. As those documents might be of unknown size, and it is difficult to send streams along the wire with the javax.json APIs, I thought about sending the URL of the document itself, letting Solr fetch the document and have it parsed by Tika - using a RemoteStreamingUpdateRequestProcessor.
>
> Example:
> {code:json}
> {
>   "add": { "id": "doc1", "meta": "foo", "meta": "bar", "text": "Short text" },
>   "add": { "id": "doc2", "meta": "will become long", "text_ref": "http://..." }
> }
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
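[Editorial note] The BytesRef idea in the comment above - keep the stored value as UTF-8 bytes, wrap a Reader around it for analysis, and pre-count sizes before converting - can be sketched with plain JDK classes. This is a minimal illustration, not Lucene's actual code; the class and method names here are made up for the example, and the counting pass only mirrors the idea attributed to Lucene's conversion utilities.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class Utf8StoredValue {

    // Preliminary pass: count how many UTF-8 bytes a String needs,
    // without allocating a byte[] first. This mirrors the "count, then
    // allocate exactly" idea described in the comment.
    static int utf8Length(String s) {
        int bytes = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (cp < 0x80) bytes += 1;        // ASCII
            else if (cp < 0x800) bytes += 2;  // Latin, etc.
            else if (cp < 0x10000) bytes += 3;
            else bytes += 4;                  // supplementary planes
            i += Character.charCount(cp);
        }
        return bytes;
    }

    // Keep the document in RAM as UTF-8 bytes (half the size of char[]
    // for mostly-ASCII text) and hand the analysis chain a Reader view.
    static Reader readerOver(byte[] utf8) {
        return new InputStreamReader(new ByteArrayInputStream(utf8), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = "mostly ASCII body text";
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        int utf16Bytes = text.length() * 2; // a char[] costs 2 bytes per char
        System.out.println("UTF-8: " + utf8.length + " bytes, UTF-16: " + utf16Bytes + " bytes");
        System.out.println(utf8Length(text) == utf8.length); // true: counting pass matches real encoding
    }
}
```

For ASCII-heavy documents the byte form is half the size of the char form, which is the RAM saving the comment refers to; for CJK-heavy text UTF-8 can actually be larger, which is worth keeping in mind.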
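[Editorial note] The processor proposed in the issue description - resolve a `text_ref` URL into fetched, parsed text before indexing - could look roughly like the sketch below. This is not Solr API code: a real implementation would extend UpdateRequestProcessor, operate on SolrInputDocument, and parse the fetched bytes with Tika. Here the document is a plain Map and the fetch/parse step is an injected function, so the field-rewriting logic can be shown on its own; all names (`RemoteRefResolver`, `text_ref`, `text`) are assumptions taken from the example JSON.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class RemoteRefResolver {
    // Hypothetical field names, taken from the issue's example document
    static final String REF_FIELD = "text_ref";
    static final String TEXT_FIELD = "text";

    // If the document carries a URL reference, replace it with the
    // fetched (and, in a real processor, Tika-parsed) plain text.
    // Documents without a reference pass through untouched.
    static Map<String, Object> resolve(Map<String, Object> doc,
                                       Function<String, String> fetcher) {
        Object ref = doc.get(REF_FIELD);
        if (ref == null) {
            return doc;
        }
        Map<String, Object> out = new HashMap<>(doc);
        out.remove(REF_FIELD);
        out.put(TEXT_FIELD, fetcher.apply(ref.toString()));
        return out;
    }
}
```

Injecting the fetcher keeps the sketch testable with a stub; in Solr, the fetch step is also where the security concerns of remote streaming (which URLs the server may dereference) would have to be enforced.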