Re: Solr Extracting request handler

Karl Wright Tue, 17 Jun 2014 18:56:29 -0700

Hi Abe-san,

Tika jars are not very big:


C:\wip\mcf\trunk\lib>dir tika*
 Volume in drive C has no label.
 Volume Serial Number is 002E-D1F0

 Directory of C:\wip\mcf\trunk\lib

06/05/2014  08:21 AM           493,374 tika-core.jar
06/05/2014  08:21 AM           523,677 tika-parsers.jar
               2 File(s)      1,017,051 bytes
               0 Dir(s)  140,792,315,904 bytes free

The entire lib directory is about 85 MB:

85,156,330 bytes

The built binary image is still about 185 MB, I believe, so I don't know
why you think it is over 1 GB.  Temporary class files?  I don't think we can
avoid those.
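
One way to check where the space actually goes is to total file sizes per
extension under the build tree. The following is only a rough sketch; the
function name size_by_extension is illustrative and not part of the
ManifoldCF build.

```python
import os
from collections import Counter


def size_by_extension(root):
    """Walk root and return the total size in bytes per file extension."""
    totals = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Files without an extension are grouped under "(none)".
            ext = os.path.splitext(name)[1].lower() or "(none)"
            try:
                totals[ext] += os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                # Skip entries that vanish or cannot be read (e.g. broken links).
                continue
    return totals
```

Calling size_by_extension on the build output directory and printing
most_common(10) on the result would show whether .class files or .jar
files dominate.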

I'd rather not make things more complicated than they need to be by adding
a new required service - even though it would fit naturally with the
connector arrangement.

Karl





On Tue, Jun 17, 2014 at 9:42 PM, Shinichiro Abe <[email protected]>
wrote:

> Hi Karl,
>
> Okay, I had assumed the Tika connector outputs files.
> If we post the character data and metadata obtained from Tika, the
> "/update/extract" handler can handle this (by providing the params
> literal.content=value&literal.metaField=foobar
> and using a NullInputStream for the binary data, as in CONNECTORS-936).
>
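
For reference, a minimal Python sketch of how such a request URL could be
assembled. The Solr base URL and the helper name build_extract_url are
hypothetical; only the "/update/extract" path and the literal.content and
literal.metaField parameters come from the message above. The request body
would be left empty, in the spirit of the NullInputStream approach.

```python
from urllib.parse import urlencode


def build_extract_url(base_url, literals, handler="/update/extract"):
    """Return an update URL that carries each field as a literal.<name> parameter."""
    params = [("literal." + name, value) for name, value in literals.items()]
    return base_url.rstrip("/") + handler + "?" + urlencode(params)


url = build_extract_url(
    "http://localhost:8983/solr/collection1",  # hypothetical Solr core URL
    {"content": "value", "metaField": "foobar"},
)
print(url)
```

Values are URL-encoded by urlencode, so long extracted text would make for a
very long query string; this sketch does not address that.
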
> BTW, the trunk build is now too big (1 GB+), maybe because the CloudSearch
> connector uses Tika jars.
> I think the Tika connector and the CloudSearch connector should extract
> text via tika-server [1], and MCF should not carry many Tika jars. What do
> you think?
>
> [1]
> http://wiki.apache.org/tika/TikaJAXRS
>
> Thanks,
> Shinichiro Abe
>
> On 2014/06/18, at 9:45, Karl Wright <[email protected]> wrote:
>
> > Hi Abe-san,
> >
> > It sounds like you might be thinking that transformation connectors are
> > like output connectors.  Just so we are clear, transformation connectors
> > in 1.7 receive a RepositoryDocument as input, and then pass a
> > RepositoryDocument on to the next connector in the chain.  So I don't
> > know why .xml files would be involved.  I'd expect the Tika connector to
> > read a binary file from one RepositoryDocument object and convert its
> > contents to another RepositoryDocument object which would have character
> > data and metadata only.  Would this work for your case, do you think?
> >
> > Karl
> >
> >
> >
> > On Tue, Jun 17, 2014 at 8:38 PM, Shinichiro Abe <[email protected]>
> > wrote:
> >
> >> Hi Karl,
> >>
> >> Yes. I thought the standard update handler met that requirement.
> >> For instance, the Tika extractor transformation connector would create
> >> two files:
> >> 1. addtoSolr.xml for add and update
> >> 2. deletetoSolr.xml for delete
> >> The File connector ingests these XML files, then the Solr connector posts
> >> them via the "/update" handler.
> >>
> >> In the Solr Connector, no update handler functionality other than the
> >> "/update" handler might be necessary.
> >>
> >> Thanks,
> >> Shinichiro Abe
> >>
> >> On 2014/06/18, at 8:02, Karl Wright <[email protected]> wrote:
> >>
> >>> Hi Abe-san,
> >>>
> >>> So just to be sure -- you believe that no changes at all are required
> >>> to the Solr Connector as it stands now, other than to use the update
> >>> handler rather than the /update/extract handler?
> >>>
> >>> Karl
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> On Tue, Jun 17, 2014 at 5:14 PM, Shinichiro Abe <[email protected]>
> >>> wrote:
> >>>
> >>>>> As for changing the Solr connector so that it doesn't go to the
> >>>>> extracting update handler
> >>>>
> >>>> I don't think the Solr connector needs to change with a new checkbox,
> >>>> because currently we can change "/update/extract" to "/update" in the
> >>>> 'Update Handler' field on the Paths tab of the Solr connector UI. I
> >>>> confirmed I could post CSV, JSON and XML files to Solr by changing that
> >>>> and using the File connector. So I wish we would allow the Tika
> >>>> extractor transformation connector to create XML files that Solr
> >>>> expects to see.
> >>>>
> >>>> Regards,
> >>>> Shinichiro Abe
> >>>>
> >>>>
> >>>> 2014-06-18 2:55 GMT+09:00 Karl Wright <[email protected]>:
> >>>>
> >>>>> The pipeline code itself is now "complete" in trunk.  Zaizi said
> >>>>> they'd contribute a Tika extractor transformation connector - and if
> >>>>> they don't get around to that in a month or so, I may take a crack at
> >>>>> it myself.
> >>>>>
> >>>>> As for changing the Solr connector so that it doesn't go to the
> >>>>> extracting update handler, it would be great if:
> >>>>> (1) Someone created a ticket for this, and
> >>>>> (2) A patch was provided that maintains backwards compatibility with
> >>>>> previous versions of the connector (so a checkbox would probably need
> >>>>> to go into the UI somewhere).  Do either of you want to start this
> >>>>> process?
> >>>>>
> >>>>> Thanks!
> >>>>> Karl
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Mon, Jun 16, 2014 at 12:37 PM, Karl Wright <[email protected]>
> >>>>> wrote:
> >>>>>
> >>>>>> Hi guys,
> >>>>>>
> >>>>>> You folks may not have looked at 1.7 yet, but it has a full
> >>>>>> pipeline, and is expected to have a Tika extractor as a
> >>>>>> transformation connector.
> >>>>>>
> >>>>>> Karl
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Mon, Jun 16, 2014 at 11:14 AM, Matteo Grolla <[email protected]>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Thanks Alessandro,
> >>>>>>>       that explains the situation clearly.
> >>>>>>> And I agree that sending all the metadata as GET parameters can be
> >>>>>>> problematic.
> >>>>>>>
> >>>>>>> Cheers
> >>>>>>>
> >>>>>>> --
> >>>>>>> Matteo Grolla
> >>>>>>> Sourcesense - making sense of Open Source
> >>>>>>> http://www.sourcesense.com
> >>>>>>>
> >>>>>>> On 16 Jun 2014, at 17:09, Alessandro Benedetti wrote:
> >>>>>>>
> >>>>>>>> Mmm, the point is that right now ManifoldCF has no extractors.
> >>>>>>>> The repository connectors extract the binary directly, and there
> >>>>>>>> is no "Extractor Processor" yet.
> >>>>>>>> But recently a pipeline processor architecture has been proposed
> >>>>>>>> (https://issues.apache.org/jira/browse/CONNECTORS-959),
> >>>>>>>> so an extractor could fit there.
> >>>>>>>>
> >>>>>>>> Cheers
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> 2014-06-16 15:59 GMT+01:00 Matteo Grolla <[email protected]>:
> >>>>>>>>
> >>>>>>>>> Since the Solr extracting request handler takes the binary and
> >>>>>>>>> extracts text, what is the point of not using the Manifold
> >>>>>>>>> extractor and sending text and binaries to Solr?
> >>>>>>>>> I mean, the end result is the same: Solr indexes text and stores
> >>>>>>>>> text.
> >>>>>>>>> So if Manifold supports text extraction, it seems to me this is
> >>>>>>>>> the place where it should be done.
> >>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>> Matteo Grolla
> >>>>>>>>> Sourcesense - making sense of Open Source
> >>>>>>>>> http://www.sourcesense.com
> >>>>>>>>>
> >>>>>>>>> On 16 Jun 2014, at 16:51, Antonio David Perez Morales wrote:
> >>>>>>>>>
> >>>>>>>>>> Hi Matteo
> >>>>>>>>>>
> >>>>>>>>>> Manifold already handles the extraction, but the only way to send
> >>>>>>>>>> binary content and document metadata to Solr is using the
> >>>>>>>>>> update/extract handler, where the metadata is sent as query
> >>>>>>>>>> parameters and the binary content is sent in the body of the
> >>>>>>>>>> requests, allowing Solr to use Tika to obtain the raw content to
> >>>>>>>>>> be stored in Solr.
> >>>>>>>>>>
> >>>>>>>>>> Regards
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Mon, Jun 16, 2014 at 4:35 PM, Matteo Grolla <[email protected]>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Hi,
> >>>>>>>>>>> During my first indexing I noticed that Manifold uses the Solr
> >>>>>>>>>>> extracting request handler to extract the content of an XML file.
> >>>>>>>>>>> For performance reasons it would be better if Manifold handled
> >>>>>>>>>>> the extraction, letting Solr do the search engine work.
> >>>>>>>>>>> Is this because of the connector design, the framework design, or
> >>>>>>>>>>> is it just still to be done?
> >>>>>>>>>>>
> >>>>>>>>>>> --
> >>>>>>>>>>> Matteo Grolla
> >>>>>>>>>>> Sourcesense - making sense of Open Source
> >>>>>>>>>>> http://www.sourcesense.com
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> --------------------------
> >>>>>>>>
> >>>>>>>> Benedetti Alessandro
> >>>>>>>> Visiting card : http://about.me/alessandro_benedetti
> >>>>>>>>
> >>>>>>>> "Tyger, tyger burning bright
> >>>>>>>> In the forests of the night,
> >>>>>>>> What immortal hand or eye
> >>>>>>>> Could frame thy fearful symmetry?"
> >>>>>>>>
> >>>>>>>> William Blake - Songs of Experience -1794 England
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> >>>> Shinichiro Abe
> >>>> 阿部 慎一朗
> >>>>
> >>
> >>
>
>
