Hi Karl,

> The entire lib directory is 85M:

You are correct. I'm sorry; the trunk size exceeded 1 GB only because I had run 'ant javadoc', so there is no problem.

> I'd rather not make things more complicated than they need to be by adding
> a new required service

OK, I understand.

Shinichiro Abe

On 2014/06/18, at 10:55, Karl Wright <[email protected]> wrote:

> Hi Abe-san,
>
> Tika jars are not very big:
>
> C:\wip\mcf\trunk\lib>dir tika*
>  Volume in drive C has no label.
>  Volume Serial Number is 002E-D1F0
>
>  Directory of C:\wip\mcf\trunk\lib
>
> 06/05/2014  08:21 AM           493,374 tika-core.jar
> 06/05/2014  08:21 AM           523,677 tika-parsers.jar
>                2 File(s)      1,017,051 bytes
>                0 Dir(s)  140,792,315,904 bytes free
>
> The entire lib directory is 85M:
>
> 85,156,330 bytes
>
> The built binary image is still about 185Mb, I believe.  So I don't know
> why you think it is >1Gb?  Temporary class files?  I don't think we can
> avoid those.
>
> I'd rather not make things more complicated than they need to be by adding
> a new required service - even though it would fit naturally with the
> connector arrangement.
>
> Karl
>
> On Tue, Jun 17, 2014 at 9:42 PM, Shinichiro Abe <[email protected]>
> wrote:
>
>> Hi Karl,
>>
>> Okay, I assumed the Tika connector outputs files.
>> If we post the character data and metadata obtained from Tika, the
>> "/update/extract" handler can handle this (by providing the params
>> literal.content=value&literal.metaField=foobar
>> and using a NullInputStream for the binary data, like CONNECTORS-936).
>>
>> BTW, the trunk build size is now too big (1G+). Maybe because the CloudSearch
>> connector uses Tika jars.
>> The Tika connector and the CloudSearch connector should extract text via
>> tika-server[1]
>> and MCF should not have many Tika jars, do you think?
>>
>> [1]
>> http://wiki.apache.org/tika/TikaJAXRS
>>
>> Thanks,
>> Shinichiro Abe
>>
>> On 2014/06/18, at 9:45, Karl Wright <[email protected]> wrote:
>>
>>> Hi Abe-san,
>>>
>>> It sounds like you might be thinking that transformation connectors are
>>> like output connectors.
>>> Just so we are clear, transformation connectors in
>>> 1.7 receive a RepositoryDocument as input, and then pass a
>>> RepositoryDocument on to the next connector in the chain.  So I don't know
>>> why .xml files would be involved.  I'd expect the Tika connector to read a
>>> binary file from one RepositoryDocument object and convert its contents to
>>> another RepositoryDocument object which would have character data and
>>> metadata only.  Would this work for your case, do you think?
>>>
>>> Karl
>>>
>>> On Tue, Jun 17, 2014 at 8:38 PM, Shinichiro Abe <[email protected]>
>>> wrote:
>>>
>>>> Hi Karl,
>>>>
>>>> Yes. I thought the standard update handler met that requirement.
>>>> For instance, the Tika extractor transformation connector creates two files:
>>>> 1. addtoSolr.xml for add and update
>>>> 2. deletetoSolr.xml for delete
>>>> The File connector ingests these XML files, then the Solr connector posts
>>>> these files via the "/update" handler.
>>>>
>>>> In the Solr connector, update-handler functionality other than the
>>>> "/update" handler might not be necessary.
>>>>
>>>> Thanks,
>>>> Shinichiro Abe
>>>>
>>>> On 2014/06/18, at 8:02, Karl Wright <[email protected]> wrote:
>>>>
>>>>> Hi Abe-san,
>>>>>
>>>>> So just to be sure -- you believe that no changes at all are required to
>>>>> the Solr Connector as it stands now, other than to use the update handler
>>>>> rather than the /update/extract handler?
>>>>>
>>>>> Karl
>>>>>
>>>>> On Tue, Jun 17, 2014 at 5:14 PM, Shinichiro Abe <[email protected]>
>>>>> wrote:
>>>>>
>>>>>>> As for changing the Solr connector so that it doesn't go to the extracting
>>>>>>> update handler
>>>>>>
>>>>>> I don't think we need to change the Solr connector with a new checkbox,
>>>>>> because currently we can change "/update/extract" into "/update" at 'Update
>>>>>> Handler' on the Paths tab in the Solr connector UI.
>>>>>> I confirmed I could post CSV,
>>>>>> JSON and XML files to Solr by changing that and using the File connector.
>>>>>> So I would like us to allow the Tika extractor transformation connector to
>>>>>> create the XML files that Solr expects to see.
>>>>>>
>>>>>> Regards,
>>>>>> Shinichiro Abe
>>>>>>
>>>>>> 2014-06-18 2:55 GMT+09:00 Karl Wright <[email protected]>:
>>>>>>
>>>>>>> The pipeline code itself is now "complete" in trunk.  Zaizi said they'd
>>>>>>> contribute a Tika extractor transformation connector - and if they don't
>>>>>>> get around to that in a month or so, I may take a crack at it myself.
>>>>>>>
>>>>>>> As for changing the Solr connector so that it doesn't go to the extracting
>>>>>>> update handler, it would be great if:
>>>>>>> (1) Someone created a ticket for this, and
>>>>>>> (2) A patch was provided that maintains backwards compatibility with
>>>>>>> previous versions of the connector (so a checkbox would probably need to go
>>>>>>> into the UI somewhere).  Do either of you want to start this process?
>>>>>>>
>>>>>>> Thanks!
>>>>>>> Karl
>>>>>>>
>>>>>>> On Mon, Jun 16, 2014 at 12:37 PM, Karl Wright <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi guys,
>>>>>>>>
>>>>>>>> You folks may not have looked at 1.7 yet, but it has a full pipeline, and
>>>>>>>> is expected to have a Tika extractor as a transformation connector.
>>>>>>>>
>>>>>>>> Karl
>>>>>>>>
>>>>>>>> On Mon, Jun 16, 2014 at 11:14 AM, Matteo Grolla <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Thanks Alessandro,
>>>>>>>>> that explains the situation clearly.
>>>>>>>>> And I agree that sending all the metadata as GET parameters can be
>>>>>>>>> problematic.
>>>>>>>>>
>>>>>>>>> Cheers
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Matteo Grolla
>>>>>>>>> Sourcesense - making sense of Open Source
>>>>>>>>> http://www.sourcesense.com
>>>>>>>>>
>>>>>>>>> On 16 Jun 2014, at 17:09, Alessandro Benedetti wrote:
>>>>>>>>>
>>>>>>>>>> mmmm the point is that right now ManifoldCF has no extractors.
>>>>>>>>>> The Repository connectors extract the binary directly and there is no
>>>>>>>>>> "Extractor Processor" yet.
>>>>>>>>>> But recently a pipeline processor architecture has been proposed (
>>>>>>>>>> https://issues.apache.org/jira/browse/CONNECTORS-959)
>>>>>>>>>> So it can fit there.
>>>>>>>>>>
>>>>>>>>>> Cheers
>>>>>>>>>>
>>>>>>>>>> 2014-06-16 15:59 GMT+01:00 Matteo Grolla <[email protected]>:
>>>>>>>>>>
>>>>>>>>>>> Since the Solr extracting request handler takes the binary and extracts
>>>>>>>>>>> text, what is the point of not using the Manifold extractor and sending
>>>>>>>>>>> text and binaries to Solr?
>>>>>>>>>>> I mean the end result is the same: Solr indexes text and stores text.
>>>>>>>>>>> So if Manifold supports text extraction, it seems to me this is the place
>>>>>>>>>>> where it should be done.
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Matteo Grolla
>>>>>>>>>>> Sourcesense - making sense of Open Source
>>>>>>>>>>> http://www.sourcesense.com
>>>>>>>>>>>
>>>>>>>>>>> On 16 Jun 2014, at 16:51, Antonio David Perez Morales
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Matteo
>>>>>>>>>>>>
>>>>>>>>>>>> Manifold already handles the extraction, but the only way to send binary
>>>>>>>>>>>> content and document metadata to Solr is using the update/extract handler,
>>>>>>>>>>>> where the metadata is sent as query parameters and the binary content is
>>>>>>>>>>>> sent in the body of the requests, allowing Solr to use Tika to obtain the
>>>>>>>>>>>> raw content to be stored in Solr.
>>>>>>>>>>>>
>>>>>>>>>>>> Regards
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Jun 16, 2014 at 4:35 PM, Matteo Grolla <[email protected]>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>> During my first indexing I noticed that Manifold uses the Solr extracting
>>>>>>>>>>>>> request handler to extract the content of an XML file.
>>>>>>>>>>>>> For performance reasons it would be better if Manifold handled the
>>>>>>>>>>>>> extraction, letting Solr act as the search engine.
>>>>>>>>>>>>> Is this because of the connector design, the framework design, or is it
>>>>>>>>>>>>> just still to be done?
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Matteo Grolla
>>>>>>>>>>>>> Sourcesense - making sense of Open Source
>>>>>>>>>>>>> http://www.sourcesense.com
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>>
>>>>>>>>>>>> ------------------------------
>>>>>>>>>>>> This message should be regarded as confidential.
>>>>>>>>>>>> If you have received this
>>>>>>>>>>>> email in error please notify the sender and destroy it immediately.
>>>>>>>>>>>> Statements of intent shall only become binding when confirmed in hard copy
>>>>>>>>>>>> by an authorised signatory.
>>>>>>>>>>>>
>>>>>>>>>>>> Zaizi Ltd is registered in England and Wales with the registration number
>>>>>>>>>>>> 6440931. The Registered Office is Brook House, 229 Shepherds Bush Road,
>>>>>>>>>>>> London W6 7AN.
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> --------------------------
>>>>>>>>>>
>>>>>>>>>> Benedetti Alessandro
>>>>>>>>>> Visiting card : http://about.me/alessandro_benedetti
>>>>>>>>>>
>>>>>>>>>> "Tyger, tyger burning bright
>>>>>>>>>> In the forests of the night,
>>>>>>>>>> What immortal hand or eye
>>>>>>>>>> Could frame thy fearful symmetry?"
>>>>>>>>>>
>>>>>>>>>> William Blake - Songs of Experience -1794 England
>>>>>>
>>>>>> --
>>>>>> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
>>>>>> Shinichiro Abe
>>>>>> 阿部 慎一朗
