Erick, any reason you didn't post this as a comment to JIRA?

On Thu, Jan 18, 2018 at 10:58 AM Erick Erickson <erickerick...@gmail.com>
wrote:

> Dirk:
>
> Just skimmed your first post. At a slightly higher level: if you're running
> Tika on the Solr server, that usually doesn't scale well, for two
> reasons:
> 1> it puts a lot of CPU-intensive work on the Solr box
> 2> Tika sometimes hits OOMs, loops and the like. It has to deal with a
> _ton_ of wonky implementations of ill-defined specs.
>
> I'm not quite sure if this is germane to your question, but if so, and
> you can move your Tika processing off to an external client or service,
> that might be a better way to go...
>
> Best,
> Erick
>
> On Thu, Jan 18, 2018 at 6:15 AM, Dirk Rudolph (JIRA) <j...@apache.org>
> wrote:
> >
> >     [ https://issues.apache.org/jira/browse/SOLR-11869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16330553#comment-16330553 ]
> >
> > Dirk Rudolph commented on SOLR-11869:
> > -------------------------------------
> >
> > I see. So I will start without worrying about whether the document is
> > fully read into memory or not.
> >
> > Anyway, would that kind of UpdateRequestProcessor be interesting for
> > Solr, or am I the only one facing that use case?
> >
> >> Remote streaming UpdateRequestProcessor
> >> ---------------------------------------
> >>
> >>                 Key: SOLR-11869
> >>                 URL: https://issues.apache.org/jira/browse/SOLR-11869
> >>             Project: Solr
> >>          Issue Type: Improvement
> >>      Security Level: Public(Default Security Level. Issues are Public)
> >>          Components: UpdateRequestProcessors
> >>            Reporter: Dirk Rudolph
> >>            Priority: Minor
> >>
> >> When indexing documents from content management systems (or digital
> >> asset management systems), they usually have metadata fields entered by an
> >> editor. In the case of PDFs, DOCX, or other text formats, they may also
> >> contain the binary content, which can be parsed to plain text
> >> using Tika. This is what is currently supported by the
> >> ExtractingRequestHandler.
> >> We are now facing situations where we index batches of documents
> >> using the UpdateRequestHandler and want to send the binary content of the
> >> documents mentioned above as part of a single request to the
> >> UpdateRequestHandler. As those documents might be of unknown size and it's
> >> difficult to send streams over the wire with the javax.json APIs, I thought
> >> about sending the URL of the document instead, letting Solr fetch the
> >> document and have it parsed by Tika, using a
> >> RemoteStreamingUpdateRequestProcessor.
> >> Example:
> >> {code:json}
> >> {
> >>  "add": { "id": "doc1", "meta": "foo", "meta": "bar", "text": "Short text" },
> >>  "add": { "id": "doc2", "meta": "will become long", "text_ref": "http://..." }
> >> }
> >> {code}
> >
> >
> >
> > --
> > This message was sent by Atlassian JIRA
> > (v7.6.3#76005)
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: dev-h...@lucene.apache.org
> >
>
>
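
A rough sketch of the client-side alternative Erick describes, applied to the
text_ref example from the issue: the indexing client resolves each reference
itself and sends only plain text to Solr. The names resolve_text_refs,
fetch_url and to_update_body, and the extract_text callback, are hypothetical
and not part of Solr or Tika. extract_text stands in for whatever Tika
integration is used, and the sketch assumes the multi-valued meta field is
sent as a JSON array.

```python
import json
import urllib.request
from typing import Any, Callable, Dict, List

Doc = Dict[str, Any]


def fetch_url(url: str, timeout: float = 30.0) -> bytes:
    """Download the referenced document as raw bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def resolve_text_refs(
    docs: List[Doc],
    fetch: Callable[[str], bytes],
    extract_text: Callable[[bytes], str],
    ref_field: str = "text_ref",
    text_field: str = "text",
) -> List[Doc]:
    """Return copies of docs in which ref_field is replaced by extracted text.

    extract_text is a placeholder for the real parser (for example a call
    to a Tika service); it is an assumption of this sketch.
    """
    resolved = []
    for doc in docs:
        doc = dict(doc)  # shallow copy, so the caller's input is not mutated
        url = doc.pop(ref_field, None)
        if url is not None:
            doc[text_field] = extract_text(fetch(url))
        resolved.append(doc)
    return resolved


def to_update_body(docs: List[Doc]) -> str:
    """Serialize the documents as a JSON array for the update request body."""
    return json.dumps(docs)


if __name__ == "__main__":
    batch = [
        {"id": "doc1", "meta": ["foo", "bar"], "text": "Short text"},
        {"id": "doc2", "meta": "will become long",
         "text_ref": "http://example.invalid/doc2"},
    ]
    # A fake fetcher and a trivial decoder keep the demo offline; in practice
    # pass fetch_url and a real text extractor instead.
    body = to_update_body(
        resolve_text_refs(
            batch,
            fetch=lambda url: b"long extracted text",
            extract_text=lambda raw: raw.decode("utf-8"),
        )
    )
    print(body)
```

This keeps both the fetching and the parsing off the Solr box, at the cost of
the client holding each extracted text in memory before it is sent.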

-- 
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
http://www.solrenterprisesearchserver.com
