Erick, any reason you didn't post this as a comment to JIRA?

On Thu, Jan 18, 2018 at 10:58 AM Erick Erickson <erickerick...@gmail.com> wrote:
> Dirk:
>
> Just skimmed your first post. At a bit higher level, if you're running
> Tika on the Solr server, that usually doesn't scale well for two
> reasons:
> 1> it puts a lot of CPU-intensive work on the Solr box
> 2> Tika sometimes hits OOMs, loops and the like. It has to deal with a
> _ton_ of wonky implementations of ill-defined specs.
>
> I'm not quite sure if this is germane to your question, but if so and
> you can move your Tika processing off to an external client or service,
> that might be a better way to go...
>
> Best,
> Erick
>
> On Thu, Jan 18, 2018 at 6:15 AM, Dirk Rudolph (JIRA) <j...@apache.org>
> wrote:
> >
> > [ https://issues.apache.org/jira/browse/SOLR-11869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16330553#comment-16330553 ]
> >
> > Dirk Rudolph commented on SOLR-11869:
> > -------------------------------------
> >
> > I see. So I will start without worrying about whether the document is
> > fully read into memory or not.
> >
> > Anyway, would that kind of UpdateRequestProcessor be interesting for
> > Solr, or am I the only one facing that use case?
> >
> >> Remote streaming UpdateRequestProcessor
> >> ---------------------------------------
> >>
> >> Key: SOLR-11869
> >> URL: https://issues.apache.org/jira/browse/SOLR-11869
> >> Project: Solr
> >> Issue Type: Improvement
> >> Security Level: Public (Default Security Level. Issues are Public)
> >> Components: UpdateRequestProcessors
> >> Reporter: Dirk Rudolph
> >> Priority: Minor
> >>
> >> When indexing documents from content management systems (or digital
> >> asset management systems), they usually have metadata fields supplied by
> >> an editor and, in the case of PDFs, docx or any other text formats, may
> >> also contain the binary content, which might be parsed to plain text
> >> using Tika. This is what's currently supported by the
> >> ExtractingRequestHandler.
> >> We are now facing situations where we are indexing batches of documents
> >> using the UpdateRequestHandler and want to send the binary content of the
> >> documents mentioned above as part of the single request to the
> >> UpdateRequestHandler. As those documents might be of unknown size and it's
> >> difficult to send streams along the wire with javax.json APIs, I thought
> >> about sending the URL of the document itself, letting Solr fetch the
> >> document and have it parsed by Tika - using a
> >> RemoteStreamingUpdateRequestProcessor.
> >> Example:
> >> {code:json}
> >> {
> >>   "add": { "id": "doc1", "meta": "foo", "meta": "bar", "text": "Short text" },
> >>   "add": { "id": "doc2", "meta": "will become long", "text_ref": "http://..." }
> >> }
> >> {code}
> >
> >
> >
> > --
> > This message was sent by Atlassian JIRA
> > (v7.6.3#76005)
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: dev-h...@lucene.apache.org
> >

--
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book: http://www.solrenterprisesearchserver.com