Hi Karl,

Okay, I had assumed the Tika connector outputs files. If we post the character data and metadata obtained from Tika, the "/update/extract" handler can handle it: we provide params such as literal.content=value&literal.metaField=foobar and use a NullInputStream for the binary data, as in CONNECTORS-936.
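
A minimal sketch of the request described above, only to make the parameter layout concrete. It builds the URL and an empty body without sending anything. The host, core name and the literal.id field are assumptions for illustration; literal.content and literal.metaField are the example params from this message, and the empty body stands in for the NullInputStream.

```python
from urllib.parse import urlencode


def build_extract_request(solr_core_url, doc_id, content, metadata):
    """Return (url, body) for a POST to the "/update/extract" handler.

    All field values travel as literal.<field> query parameters; the body
    is empty because the text was already extracted before reaching Solr.
    """
    # literal.id is an assumption: it presumes the schema's unique key is "id".
    params = [("literal.id", doc_id), ("literal.content", content)]
    for field, value in metadata.items():
        params.append(("literal." + field, value))
    url = solr_core_url.rstrip("/") + "/update/extract?" + urlencode(params)
    body = b""  # empty stream in place of the binary document
    return url, body


# Hypothetical Solr location, not taken from the thread.
url, body = build_extract_request(
    "http://localhost:8983/solr/collection1",
    "doc-1",
    "value",
    {"metaField": "foobar"},
)
print(url)
```

Note that this keeps all content in the query string, so long documents would run into the GET-parameter size concern raised earlier in the thread.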

BTW, the trunk build size is now too big (1G+), maybe because the CloudSearch connector uses Tika jars. I think the Tika connector and the CloudSearch connector should extract text via tika-server [1], so that MCF does not need to ship many Tika jars. What do you think?

[1] http://wiki.apache.org/tika/TikaJAXRS

Thanks,
Shinichiro Abe

On 2014/06/18, at 9:45, Karl Wright <[email protected]> wrote:

> Hi Abe-san,
>
> It sounds like you might be thinking that transformation connectors are
> like output connectors. Just so we are clear, transformation connectors in
> 1.7 receive a RepositoryDocument as input, and then pass a
> RepositoryDocument on to the next connector in the chain. So I don't know
> why .xml files would be involved. I'd expect the Tika connector to read a
> binary file from one RepositoryDocument object and convert its contents to
> another RepositoryDocument object which would have character data and
> metadata only. Would this work for your case, do you think?
>
> Karl
>
> On Tue, Jun 17, 2014 at 8:38 PM, Shinichiro Abe <[email protected]>
> wrote:
>
>> Hi Karl,
>>
>> Yes. I thought the standard update handler met that requirement.
>> For instance, the Tika extractor transformation connector creates two
>> files:
>> 1. addtoSolr.xml for add and update
>> 2. deletetoSolr.xml for delete
>> The File connector ingests these XML files, then the Solr connector
>> posts them via the "/update" handler.
>>
>> In the Solr Connector, update-handler functions other than the
>> "/update" handler might not be necessary.
>>
>> Thanks,
>> Shinichiro Abe
>>
>> On 2014/06/18, at 8:02, Karl Wright <[email protected]> wrote:
>>
>>> Hi Abe-san,
>>>
>>> So just to be sure -- you believe that no changes at all are required to
>>> the Solr Connector as it stands now, other than to use the update handler
>>> rather than the /update/extract handler?
>>>
>>> Karl
>>>
>>> On Tue, Jun 17, 2014 at 5:14 PM, Shinichiro Abe <[email protected]>
>>> wrote:
>>>
>>>>> As for changing the Solr connector so that it doesn't go to the
>>>>> extracting update handler
>>>>
>>>> I don't think we need to change the Solr connector with a new checkbox,
>>>> because we can already change "/update/extract" to "/update" in the
>>>> 'Update Handler' field on the Paths tab of the Solr connector UI. I
>>>> confirmed that I could post CSV, JSON and XML files to Solr by changing
>>>> that and using the File connector. So I would like the Tika extractor
>>>> transformation connector to be allowed to create XML files that Solr
>>>> expects to see.
>>>>
>>>> Regards,
>>>> Shinichiro Abe
>>>>
>>>> 2014-06-18 2:55 GMT+09:00 Karl Wright <[email protected]>:
>>>>
>>>>> The pipeline code itself is now "complete" in trunk. Zaizi said they'd
>>>>> contribute a Tika extractor transformation connector - and if they
>>>>> don't get around to that in a month or so, I may take a crack at it
>>>>> myself.
>>>>>
>>>>> As for changing the Solr connector so that it doesn't go to the
>>>>> extracting update handler, it would be great if:
>>>>> (1) Someone created a ticket for this, and
>>>>> (2) A patch was provided that maintains backwards compatibility with
>>>>> previous versions of the connector (so a checkbox would probably need
>>>>> to go into the UI somewhere). Do either of you want to start this
>>>>> process?
>>>>>
>>>>> Thanks!
>>>>> Karl
>>>>>
>>>>> On Mon, Jun 16, 2014 at 12:37 PM, Karl Wright <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi guys,
>>>>>>
>>>>>> You folks may not have looked at 1.7 yet, but it has a full pipeline,
>>>>>> and is expected to have a Tika extractor as a transformation
>>>>>> connector.
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>> On Mon, Jun 16, 2014 at 11:14 AM, Matteo Grolla <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Thanks Alessandro,
>>>>>>> that explains the situation clearly.
>>>>>>> And I agree that sending all the metadata as GET parameters can be
>>>>>>> problematic.
>>>>>>>
>>>>>>> Cheers
>>>>>>>
>>>>>>> --
>>>>>>> Matteo Grolla
>>>>>>> Sourcesense - making sense of Open Source
>>>>>>> http://www.sourcesense.com
>>>>>>>
>>>>>>> On 16 Jun 2014, at 17:09, Alessandro Benedetti wrote:
>>>>>>>
>>>>>>>> Mmm, the point is that right now ManifoldCF has no extractors.
>>>>>>>> The repository connectors extract the binary directly and there is
>>>>>>>> no "Extractor Processor" yet.
>>>>>>>> But a pipeline processor architecture has recently been designed
>>>>>>>> (https://issues.apache.org/jira/browse/CONNECTORS-959),
>>>>>>>> so it can fit there.
>>>>>>>>
>>>>>>>> Cheers
>>>>>>>>
>>>>>>>> 2014-06-16 15:59 GMT+01:00 Matteo Grolla <[email protected]>:
>>>>>>>>
>>>>>>>>> Since the Solr extracting request handler takes the binary and
>>>>>>>>> extracts text, what is the point of not using the Manifold
>>>>>>>>> extractor and sending text and binaries to Solr?
>>>>>>>>> I mean, the end result is the same: Solr indexes text and stores
>>>>>>>>> text. So if Manifold supports text extraction, it seems to me this
>>>>>>>>> is the place where it should be done.
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Matteo Grolla
>>>>>>>>> Sourcesense - making sense of Open Source
>>>>>>>>> http://www.sourcesense.com
>>>>>>>>>
>>>>>>>>> On 16 Jun 2014, at 16:51, Antonio David Perez Morales wrote:
>>>>>>>>>
>>>>>>>>>> Hi Matteo
>>>>>>>>>>
>>>>>>>>>> Manifold already handles the extraction, but the only way to send
>>>>>>>>>> binary content and document metadata to Solr is using the
>>>>>>>>>> update/extract handler, where the metadata is sent as query
>>>>>>>>>> parameters and the binary content is sent in the body of the
>>>>>>>>>> requests, allowing Solr to use Tika to obtain the raw content to
>>>>>>>>>> be stored in Solr.
>>>>>>>>>>
>>>>>>>>>> Regards
>>>>>>>>>>
>>>>>>>>>> On Mon, Jun 16, 2014 at 4:35 PM, Matteo Grolla <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>> During my first indexing I noticed that Manifold uses the Solr
>>>>>>>>>>> extracting request handler to extract the content of an XML file.
>>>>>>>>>>> For performance reasons it would be better if Manifold handled
>>>>>>>>>>> the extraction, letting Solr act purely as the search engine.
>>>>>>>>>>> Is this because of the connector design, the framework design,
>>>>>>>>>>> or is it just still to be done?
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Matteo Grolla
>>>>>>>>>>> Sourcesense - making sense of Open Source
>>>>>>>>>>> http://www.sourcesense.com
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>>
>>>>>>>>>> ------------------------------
>>>>>>>>>> This message should be regarded as confidential. If you have
>>>>>>>>>> received this email in error please notify the sender and destroy
>>>>>>>>>> it immediately.
>>>>>>>>>> Statements of intent shall only become binding when confirmed in
>>>>>>>>>> hard copy by an authorised signatory.
>>>>>>>>>>
>>>>>>>>>> Zaizi Ltd is registered in England and Wales with the registration
>>>>>>>>>> number 6440931. The Registered Office is Brook House, 229
>>>>>>>>>> Shepherds Bush Road, London W6 7AN.
>>>>>>>>
>>>>>>>> --
>>>>>>>> --------------------------
>>>>>>>>
>>>>>>>> Benedetti Alessandro
>>>>>>>> Visiting card : http://about.me/alessandro_benedetti
>>>>>>>>
>>>>>>>> "Tyger, tyger burning bright
>>>>>>>> In the forests of the night,
>>>>>>>> What immortal hand or eye
>>>>>>>> Could frame thy fearful symmetry?"
>>>>>>>>
>>>>>>>> William Blake - Songs of Experience -1794 England
>>>>
>>>> --
>>>> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
>>>> Shinichiro Abe
>>>> 阿部 慎一朗
