[
https://issues.apache.org/jira/browse/SOLR-11869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alexandre Rafalovitch closed SOLR-11869.
----------------------------------------
> Remote streaming UpdateRequestProcessor
> ---------------------------------------
>
> Key: SOLR-11869
> URL: https://issues.apache.org/jira/browse/SOLR-11869
> Project: Solr
> Issue Type: Improvement
> Security Level: Public(Default Security Level. Issues are Public)
> Components: UpdateRequestProcessors
> Reporter: Dirk Rudolph
> Priority: Minor
>
> Documents indexed from content management systems (or digital asset
> management systems) usually have metadata fields supplied by an editor. For
> PDF, DOCX, or other text formats they may also carry the binary content,
> which can be parsed to plain text using Tika. This is what the
> ExtractingRequestHandler currently supports.
> We are now facing situations where we index batches of documents using
> the UpdateRequestHandler and want to send the binary content of the documents
> mentioned above as part of a single request to the UpdateRequestHandler. As
> those documents might be of unknown size and it is difficult to send streams
> over the wire with javax.json APIs, I thought about sending the URL of the
> document instead, letting Solr fetch the document and parse it with Tika,
> using a RemoteStreamingUpdateRequestProcessor.
> Example:
> {code:json}
> {
> "add": { "id": "doc1", "meta": "foo", "meta": "bar", "text": "Short text" },
> "add": { "id": "doc2", "meta": "will become long", "text_ref": "http://..." }
> }
> {code}
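
The transformation proposed above can be sketched independently of Solr. The following is a minimal illustration, not the Solr UpdateRequestProcessor API: it assumes that any field named `<name>_ref` holds a URL (a convention inferred from the `text_ref` field in the example), and that a caller-supplied `fetch_text` function, which is hypothetical, stands in for "download the document and parse it with Tika".

```python
from typing import Any, Callable, Dict

# Assumed naming convention, inferred from "text_ref" in the example above.
REF_SUFFIX = "_ref"


def resolve_remote_fields(
    doc: Dict[str, Any], fetch_text: Callable[[str], str]
) -> Dict[str, Any]:
    """Return a copy of doc in which each '<name>_ref' field is replaced
    by a '<name>' field holding the text returned by fetch_text(url)."""
    resolved: Dict[str, Any] = {}
    for name, value in doc.items():
        if name.endswith(REF_SUFFIX) and len(name) > len(REF_SUFFIX):
            # Strip the suffix and store the fetched, parsed content.
            resolved[name[: -len(REF_SUFFIX)]] = fetch_text(value)
        else:
            # Ordinary fields pass through unchanged.
            resolved[name] = value
    return resolved


if __name__ == "__main__":
    # Stub fetcher so the sketch runs without network access or Tika.
    def stub_fetch(url: str) -> str:
        return "parsed content of " + url

    doc2 = {"id": "doc2", "meta": "will become long", "text_ref": "http://example.invalid/doc2.pdf"}
    print(resolve_remote_fields(doc2, stub_fetch))
```

Keeping the fetcher as a parameter means the remote access and parsing step can be swapped or stubbed out; a document that already carries an inline `text` field, like `doc1` above, passes through untouched.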