But guys, why not simply switch to the classic SolrJ approach: build a SolrInputDocument and index it directly into the Solr server? Easy and straightforward!
By that point the RepositoryDocument will be nothing more than a Map of metadata names and values. The content will be one of those values, so I guess the conversion to a SolrInputDocument will be immediate.

Cheers

2014-06-18 3:26 GMT+01:00 Karl Wright <[email protected]>:

> Hi Abe-san,
>
> Near as I can tell, the major consumer of disk space is the Maven target
> directories. This is generating many tens of megabytes of temporary disk
> usage for every connector. Luckily if you use ant, this is not a problem.
>
> Karl
>
>
> On Tue, Jun 17, 2014 at 9:55 PM, Karl Wright <[email protected]> wrote:
>
> > Hi Abe-san,
> >
> > Tika jars are not very big:
> >
> > C:\wip\mcf\trunk\lib>dir tika*
> > Volume in drive C has no label.
> > Volume Serial Number is 002E-D1F0
> >
> > Directory of C:\wip\mcf\trunk\lib
> >
> > 06/05/2014 08:21 AM 493,374 tika-core.jar
> > 06/05/2014 08:21 AM 523,677 tika-parsers.jar
> > 2 File(s) 1,017,051 bytes
> > 0 Dir(s) 140,792,315,904 bytes free
> >
> > The entire lib directory is 85M:
> >
> > 85,156,330 bytes
> >
> > The built binary image is still about 185Mb, I believe. So I don't know
> > why you think it is >1Gb? Temporary class files? I don't think we can
> > avoid those.
> >
> > I'd rather not make things more complicated than they need to be by adding
> > a new required service - even though it would fit naturally with the
> > connector arrangement.
> >
> > Karl
> >
> >
> >
> >
> >
> > On Tue, Jun 17, 2014 at 9:42 PM, Shinichiro Abe <[email protected]> wrote:
> >
> >> Hi Karl,
> >>
> >> Okay, I assumed Tika connector outputs files.
> >> If we post character data metadata got from Tika, "/update/extract" handler
> >> can handle this(provides params:
> >> literal.content=value&literal.metaField=foobar
> >> with using NullInputStream for binary data like CONNECTORS-936).
> >>
> >> BTW, now trunk built size is too big(1G+). Maybe because CloudSearch
> >> connector uses Tika jars.
> >> Tika connector and CloudSearch connector should extract text via tika-server[1]
> >> and MCF should not have many Tika jars, do you think?
> >>
> >> [1]
> >> http://wiki.apache.org/tika/TikaJAXRS
> >>
> >> Thanks,
> >> Shinichiro Abe
> >>
> >> On 2014/06/18, at 9:45, Karl Wright <[email protected]> wrote:
> >>
> >> > Hi Abe-san,
> >> >
> >> > It sounds like you might be thinking that transformation connectors are
> >> > like output connectors. Just so we are clear, transformation connectors in
> >> > 1.7 receive a RepositoryDocument as input, and then pass a
> >> > RepositoryDocument on to the next connector in the chain. So I don't know
> >> > why .xml files would be involved. I'd expect the Tika connector to read a
> >> > binary file from one RepositoryDocument object and convert its contents to
> >> > another RepositoryDocument object which would have character data and
> >> > metadata only. Would this work for your case, do you think?
> >> >
> >> > Karl
> >> >
> >> >
> >> >
> >> > On Tue, Jun 17, 2014 at 8:38 PM, Shinichiro Abe <[email protected]> wrote:
> >> >
> >> >> Hi Karl,
> >> >>
> >> >> Yes. I thought the standard update handler met that requirement.
> >> >> For instance, Tika extractor transformation connector creates two files.
> >> >> 1. addtoSolr.xml for add and update
> >> >> 2. deletetoSolr.xml for delete
> >> >> File connector ingests these xml files, then Solr connector posts these
> >> >> files by "/update" handler.
> >> >>
> >> >> In the Solr Connector, other function as to update handler
> >> >> might not be necessary except for "/update" handler.
> >> >>
> >> >> Thanks,
> >> >> Shinichiro Abe
> >> >>
> >> >> On 2014/06/18, at 8:02, Karl Wright <[email protected]> wrote:
> >> >>
> >> >>> Hi Abe-san,
> >> >>>
> >> >>> So just to be sure -- you believe that no changes at all are required to
> >> >>> the Solr Connector as it stands now, other than to use the update handler
> >> >>> rather than the /update/extract handler?
> >> >>>
> >> >>> Karl
> >> >>>
> >> >>>
> >> >>>
> >> >>>
> >> >>>
> >> >>> On Tue, Jun 17, 2014 at 5:14 PM, Shinichiro Abe <[email protected]> wrote:
> >> >>>
> >> >>>>> As for changing the Solr connector so that it doesn't go to the extracting update handler
> >> >>>>
> >> >>>> I don't think it needs to change Solr connector with new checkbox because
> >> >>>> currently we can change "/update/extract" into "/update" at 'Update
> >> >>>> Handler' at Paths tab in Solr connector UI. I confirmed I could post CSV,
> >> >>>> JSON and XML files to Solr by changing that and using File connector. So I
> >> >>>> wish we allow Tika extractor transformation connector to create XML files
> >> >>>> that Solr expects to see.
> >> >>>>
> >> >>>> Regards,
> >> >>>> Shinichiro Abe
> >> >>>>
> >> >>>>
> >> >>>> 2014-06-18 2:55 GMT+09:00 Karl Wright <[email protected]>:
> >> >>>>
> >> >>>>> The pipeline code itself is now "complete" in trunk. Zaizi said they'd
> >> >>>>> contribute a Tika extractor transformation connector - and if they don't
> >> >>>>> get around to that in a month or so, I may take a crack at it myself.
> >> >>>>>
> >> >>>>> As for changing the Solr connector so that it doesn't go to the extracting
> >> >>>>> update handler, it would be great if:
> >> >>>>> (1) Someone created a ticket for this, and
> >> >>>>> (2) A patch was provided that maintains backwards compatibility with
> >> >>>>> previous versions of the connector (so a checkbox would probably need to go
> >> >>>>> into the UI somewhere). Do either of you want to start this process?
> >> >>>>>
> >> >>>>> Thanks!
> >> >>>>> Karl
> >> >>>>>
> >> >>>>>
> >> >>>>>
> >> >>>>> On Mon, Jun 16, 2014 at 12:37 PM, Karl Wright <[email protected]> wrote:
> >> >>>>>
> >> >>>>>> Hi guys,
> >> >>>>>>
> >> >>>>>> You folks may not have looked at 1.7 yet, but it has a full pipeline, and
> >> >>>>>> is expected to have a Tika extractor as a transformation connector.
> >> >>>>>>
> >> >>>>>> Karl
> >> >>>>>>
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> On Mon, Jun 16, 2014 at 11:14 AM, Matteo Grolla <[email protected]> wrote:
> >> >>>>>>
> >> >>>>>>> Thanks Alessandro,
> >> >>>>>>> that explains the situation clearly.
> >> >>>>>>> And I agree that sending all the metadata as get parameter can be
> >> >>>>>>> problematic
> >> >>>>>>>
> >> >>>>>>> Cheers
> >> >>>>>>>
> >> >>>>>>> --
> >> >>>>>>> Matteo Grolla
> >> >>>>>>> Sourcesense - making sense of Open Source
> >> >>>>>>> http://www.sourcesense.com
> >> >>>>>>>
> >> >>>>>>> On 16 Jun 2014, at 17:09, Alessandro Benedetti wrote:
> >> >>>>>>>
> >> >>>>>>>> mmmm the point is that right now ManifoldCF has no extractors.
> >> >>>>>>>> The Repository connectors extracts directly the binary and there is no
> >> >>>>>>>> "Extractor Processor" yet.
> >> >>>>>>>> But recently a pipe-line processor architecture has been thought
> >> >>>>>>>> (https://issues.apache.org/jira/browse/CONNECTORS-959)
> >> >>>>>>>> So can fit there.
> >> >>>>>>>>
> >> >>>>>>>> Cheers
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>> 2014-06-16 15:59 GMT+01:00 Matteo Grolla <[email protected]>:
> >> >>>>>>>>
> >> >>>>>>>>> Since Solr extracting request handler takes the binary and extracts text
> >> >>>>>>>>> what is the point of not using Manifold extractor and send text and
> >> >>>>>>>>> binaries to solr?
> >> >>>>>>>>> I mean the end result is the same solr indexes text and stores text
> >> >>>>>>>>> So if manifold supports text extraction it seems me this is the place
> >> >>>>>>>>> where it should be done
> >> >>>>>>>>>
> >> >>>>>>>>> --
> >> >>>>>>>>> Matteo Grolla
> >> >>>>>>>>> Sourcesense - making sense of Open Source
> >> >>>>>>>>> http://www.sourcesense.com
> >> >>>>>>>>>
> >> >>>>>>>>> On 16 Jun 2014, at 16:51, Antonio David Perez Morales wrote:
> >> >>>>>>>>>
> >> >>>>>>>>>> Hi Matteo
> >> >>>>>>>>>>
> >> >>>>>>>>>> Manifold already handles the extraction, but the only way to send binary
> >> >>>>>>>>>> content and document metadata to Solr is using the update/extract handler,
> >> >>>>>>>>>> where the metadata is sent as query parameters and the binary content is
> >> >>>>>>>>>> sent in the body of the requests, allowing Solr to use Tika to obtain the
> >> >>>>>>>>>> raw content to be stored in Solr.
> >> >>>>>>>>>>
> >> >>>>>>>>>> Regards
> >> >>>>>>>>>>
> >> >>>>>>>>>>
> >> >>>>>>>>>> On Mon, Jun 16, 2014 at 4:35 PM, Matteo Grolla <[email protected]> wrote:
> >> >>>>>>>>>>
> >> >>>>>>>>>>> Hi During my first indexing I noticed that manifold uses Solr extracting
> >> >>>>>>>>>>> request handler to extract the content of an xml file
> >> >>>>>>>>>>> For performance reasons it would be better if Manifold handled the
> >> >>>>>>>>>>> extraction letting Solr do the search engine
> >> >>>>>>>>>>> Is this because of the connector design, framework design or just to be
> >> >>>>>>>>>>> done?
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> --
> >> >>>>>>>>>>> Matteo Grolla
> >> >>>>>>>>>>> Sourcesense - making sense of Open Source
> >> >>>>>>>>>>> http://www.sourcesense.com
> >> >>>>>>>>>>>
> >> >>>>>>>>>>>
> >> >>>>>>>>>>
> >> >>>>>>>>>> --
> >> >>>>>>>>>>
> >> >>>>>>>>>> ------------------------------
> >> >>>>>>>>>> This message should be regarded as confidential. If you have received this
> >> >>>>>>>>>> email in error please notify the sender and destroy it immediately.
> >> >>>>>>>>>> Statements of intent shall only become binding when confirmed in hard copy
> >> >>>>>>>>>> by an authorised signatory.
> >> >>>>>>>>>>
> >> >>>>>>>>>> Zaizi Ltd is registered in England and Wales with the registration number
> >> >>>>>>>>>> 6440931. The Registered Office is Brook House, 229 Shepherds Bush Road,
> >> >>>>>>>>>> London W6 7AN.
> >> >>>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>> --
> >> >>>>>>>> --------------------------
> >> >>>>>>>>
> >> >>>>>>>> Benedetti Alessandro
> >> >>>>>>>> Visiting card : http://about.me/alessandro_benedetti
> >> >>>>>>>>
> >> >>>>>>>> "Tyger, tyger burning bright
> >> >>>>>>>> In the forests of the night,
> >> >>>>>>>> What immortal hand or eye
> >> >>>>>>>> Could frame thy fearful symmetry?"
> >> >>>>>>>>
> >> >>>>>>>> William Blake - Songs of Experience -1794 England
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>
> >> >>>>>
> >> >>>>
> >> >>>>
> >> >>>>
> >> >>>> --
> >> >>>> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> >> >>>> Shinichiro Abe
> >> >>>> 阿部 慎一朗
> >> >>>>
> >> >>
> >> >>
> >>
> >>
> >
>

--
--------------------------

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England
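
P.S. A rough sketch of the conversion suggested at the top of this message, under several assumptions. The document is modelled as a plain Map<String, List<String>> rather than the real RepositoryDocument class, the class and method names (MetadataToSolrXml, toAddXml, appendField, escapeXml) are made up for illustration, and the field names "id" and "content" are placeholders that depend on the Solr schema. To stay runnable without SolrJ on the classpath, it builds the XML add message that the plain "/update" handler accepts; with SolrJ, each name/value pair would instead become one SolrInputDocument.addField(name, value) call.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of ManifoldCF or SolrJ.
final class MetadataToSolrXml {

    // Escape the five XML special characters so a value is safe
    // inside element text or a double-quoted attribute.
    static String escapeXml(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (char c : s.toCharArray()) {
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                case '\'': sb.append("&apos;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    // Append one <field name="...">value</field> element.
    static void appendField(StringBuilder sb, String name, String value) {
        sb.append("<field name=\"").append(escapeXml(name)).append("\">")
          .append(escapeXml(value)).append("</field>");
    }

    // Build a Solr XML <add> message from an id, the already-extracted
    // text content, and a multi-valued metadata map.
    // "id" and "content" are assumed field names; adjust to the schema.
    static String toAddXml(String id, String content,
                           Map<String, List<String>> metadata) {
        StringBuilder sb = new StringBuilder("<add><doc>");
        appendField(sb, "id", id);
        appendField(sb, "content", content);
        for (Map.Entry<String, List<String>> e : metadata.entrySet()) {
            // A multi-valued field is simply repeated, one element per value.
            for (String value : e.getValue()) {
                appendField(sb, e.getKey(), value);
            }
        }
        return sb.append("</doc></add>").toString();
    }

    public static void main(String[] args) {
        Map<String, List<String>> metadata = new LinkedHashMap<>();
        metadata.put("author", Arrays.asList("Matteo", "Karl"));
        metadata.put("title", Arrays.asList("Pipelines & extractors"));
        System.out.println(
            toAddXml("doc-1", "Text already extracted upstream", metadata));
    }
}
```

The sketch only escapes XML special characters; it does not strip control characters that are illegal in XML 1.0, which text extracted from binary files can contain, so a real connector would need to filter those as well.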
