[ 
https://issues.apache.org/jira/browse/SOLR-11869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16330492#comment-16330492
 ] 

Dirk Rudolph commented on SOLR-11869:
-------------------------------------

Basically, it's easy to implement that kind of UpdateRequestProcessor with the 
way UpdateRequestHandler works at the moment. The only missing piece is 
preventing the entire document from being loaded into memory. It's 
DocumentBuilder#toDocument() that receives the already processed 
SolrInputDocument at the end of an UpdateProcessorChain. It then calls 
DocumentBuilder#addField(), which looks up the schema-defined field type and 
[creates a new field for each of 
the 
values|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/update/DocumentBuilder.java#L66].
 The FieldType#createField() method then does some processing on the value, but 
in the end it calls [Field#Field(String, String, 
IndexableFieldType)|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/schema/FieldType.java#L304],
 so the value ends up being a String and therefore I would have to read the 
entire document into memory. 
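To make the memory cost concrete, this is roughly what the String-only Field constructor forces on the caller today: the whole stream has to be drained into a buffer before indexing can start. This is a self-contained illustration using plain java.io, not the actual Solr code:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class MaterializeDemo {
    // Equivalent of what a String-only Field constructor demands:
    // the complete content is buffered in memory before indexing begins.
    static String materialize(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[8192];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // For a multi-gigabyte PDF extracted by Tika, this StringBuilder
        // would grow to the size of the whole extracted document.
        System.out.println(materialize(new StringReader("extracted text")));
    }
}
```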

Luckily, Field also has a constructor accepting a Reader. So, as a first step 
towards making it possible to read fields not only from in-memory data but also 
from streams, I would like to propose letting FieldType#createField() properly 
handle Reader instances as values, and then overriding how a Reader is consumed 
for TextField and maybe StringField (not sure whether it makes sense for others 
too). 
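A minimal sketch of the dispatch I have in mind. This is plain Java with placeholder names; the actual new Field(name, reader, type) call is only hinted at in comments, since the surrounding code would live inside FieldType#createField():

```java
import java.io.Reader;

// Hypothetical shape of the proposed check inside FieldType#createField():
// if the value is already a Reader, keep it streamed and hand it to Lucene's
// Field(String, Reader, IndexableFieldType) constructor; otherwise fall back
// to today's String-based path. All names below are illustrative only.
class CreateFieldDispatch {
    static String describe(Object value) {
        if (value instanceof Reader) {
            return "streamed";     // would become: new Field(name, (Reader) value, type)
        }
        return "materialized";     // today's path: new Field(name, value.toString(), type)
    }
}
```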

Thoughts?

> Remote streaming UpdateRequestProcessor
> ---------------------------------------
>
>                 Key: SOLR-11869
>                 URL: https://issues.apache.org/jira/browse/SOLR-11869
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: UpdateRequestProcessors
>            Reporter: Dirk Rudolph
>            Priority: Minor
>
> When indexing documents from content management systems (or digital asset 
> management systems), they usually have fields for metadata entered by an 
> editor, and, in the case of PDFs, DOCX or other text formats, they may also 
> contain the binary content itself, which can be parsed to plain text using 
> Tika. This is what's currently supported by the ExtractingRequestHandler. 
> We are now facing situations where we are indexing batches of documents using 
> the UpdateRequestHandler and want to send the binary content of the documents 
> mentioned above as part of a single request to the UpdateRequestHandler. As 
> those documents might be of unknown size and it's difficult to send streams 
> along the wire with the javax.json APIs, I thought about sending the URL of 
> the document instead, letting Solr fetch the document and having it parsed by 
> Tika - using a RemoteStreamingUpdateRequestProcessor.  
> Example:
> {code:json}
> { 
>  "add": { "id": "doc1", "meta": "foo", "meta": "bar", "text": "Short text" },
>  "add": { "id": "doc2", "meta": "will become long", "text_ref": "http://..." }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
