Hi Karl,
I am going ahead with modifying the Solr connector, introducing a new flag that will control the operating mode: 1) using the Extract Update handler (as it is right now), or 2) using SolrInputDocument and the classic SolrJ add.
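To make the two modes concrete, here is a minimal sketch (not the connector code). It assumes the document has already been reduced to a map of metadata names to values plus the extracted text. The names SolrModeSketch, Mode, toExtractParams, toDocumentFields and describeRequest are hypothetical, and the "content" field name is only an assumption borrowed from the literal.content example further down the thread. It uses plain Java collections instead of the real SolrJ classes so that it stays self-contained.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: none of these names exist in ManifoldCF or SolrJ.
final class SolrModeSketch {

    // Hypothetical flag, exposed in the UI as a checkbox like "keepAllMetadata".
    enum Mode { EXTRACT_HANDLER, SOLR_INPUT_DOCUMENT }

    // Mode 1: metadata travels as literal.<field>=<value> query parameters,
    // next to the binary body posted to the extracting update handler.
    static String toExtractParams(Map<String, String> metadata) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : metadata.entrySet()) {
            if (sb.length() > 0) {
                sb.append('&');
            }
            sb.append("literal.").append(encode(e.getKey()))
              .append('=').append(encode(e.getValue()));
        }
        return sb.toString();
    }

    // Mode 2: metadata and already-extracted text become plain fields of one
    // document. "content" is an assumed field name, not a confirmed one.
    static Map<String, Object> toDocumentFields(Map<String, String> metadata, String content) {
        Map<String, Object> fields = new LinkedHashMap<>(metadata);
        fields.put("content", content);
        return fields;
    }

    // Shows which handler path and payload each mode would lead to.
    static String describeRequest(Mode mode, Map<String, String> metadata, String content) {
        if (mode == Mode.EXTRACT_HANDLER) {
            return "/update/extract?" + toExtractParams(metadata);
        }
        return "/update " + toDocumentFields(metadata, content);
    }

    private static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a conforming JVM.
            throw new IllegalStateException(e);
        }
    }
}
```

In a real patch, mode 2 would presumably copy these fields into a SolrInputDocument and add it through SolrJ, while mode 1 would stay the default so that existing configurations keep working.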
I will introduce a checkbox for the flag, as we did for "keepAllMetadata".
Is there already an issue for that, Karl?
Let me know!

Cheers

2014-06-18 16:10 GMT+01:00 Karl Wright <[email protected]>:

> Hi Alessandro,
>
> The reason for backwards compatibility is obvious: people upgrade
> ManifoldCF all the time, and when they do it should not stop working for
> them.
>
> Putting Tika all the time in the pipeline is also not appropriate for other
> output connections. Even if you did it just for Solr, you'd then have to
> ensure that the Tika transformer was exactly compatible with Solr Cell,
> which I would be very uncomfortable with agreeing to.
>
> So let's presume that you'd do one of two things. Either:
>
> - Leave the existing Solr connector alone, and create a whole new Solr
>   connector designed to work with a Tika transformer, or
> - Modify the existing Solr connector so that it operates in two possible
>   modes, one of which supports the legacy model (the default), and one of
>   which supports your new model
>
> If this sounds overly burdensome, I'm sorry but it's necessary until MCF
> 2.0. For MCF 2.0, which I've begun to think about, we can dispense with
> backwards compatibility, including legacy tabs that have outlived their
> usefulness, etc. But that's not a 1.7 solution.
>
> Karl
>
> On Wed, Jun 18, 2014 at 10:16 AM, Alessandro Benedetti <
> [email protected]> wrote:
>
> > Hello Karl,
> > What I was thinking is:
> > assuming we have the Tika Connector, the responsibility to extract content
> > will pass from Solr to the Tika processor.
> >
> > So we can change the part in the Solr Connector that manages the building
> > of the request to send to the Extract update handler.
> > In particular, that part will change to the classic way: usually it's good to
> > build a SolrDocument in SolrJ and then add it to SolrServer.
> >
> > Why should we provide backward compatibility from the Solr Connector's point of view?
> > From the user's point of view, a Job will be selected with the Tika Connector
> > in the pipeline, so we are providing the same identical feature.
> > One way could be to put the Tika Processor Connector in the pipeline by
> > default, and someone will be able to deactivate it only if needed.
> >
> > Cheers
> >
> > 2014-06-18 14:32 GMT+01:00 Karl Wright <[email protected]>:
> >
> > > Hi Alessandro,
> > > What is your concrete proposal to change the Solr connector? Bear in mind
> > > that we do need to maintain backwards compatibility. If you list your
> > > specific changes, not in any huge detail, but with enough detail that we
> > > understand your proposal, that would help. What happens to the UI? What
> > > happens to the internals?
> > >
> > > Thanks,
> > > Karl
> > >
> > > On Wed, Jun 18, 2014 at 9:21 AM, Alessandro Benedetti <
> > > [email protected]> wrote:
> > >
> > > > But guys, why not simply switch to a classic SolrJ SolrDocument creation and
> > > > ingestion in the Solr Server? Easy and straightforward!
> > > >
> > > > In the end, at that point the RepositoryDocument will be only a Map of
> > > > metadata and values.
> > > > Content will be part of that, so I guess the conversion to a SolrDocument
> > > > will be immediate.
> > > >
> > > > Cheers
> > > >
> > > > 2014-06-18 3:26 GMT+01:00 Karl Wright <[email protected]>:
> > > >
> > > > > Hi Abe-san,
> > > > >
> > > > > Near as I can tell, the major consumer of disk space is the Maven target
> > > > > directories. This is generating many tens of megabytes of temporary disk
> > > > > usage for every connector. Luckily if you use ant, this is not a problem.
> > > > >
> > > > > Karl
> > > > >
> > > > > On Tue, Jun 17, 2014 at 9:55 PM, Karl Wright <[email protected]> wrote:
> > > > >
> > > > > > Hi Abe-san,
> > > > > >
> > > > > > Tika jars are not very big:
> > > > > >
> > > > > > C:\wip\mcf\trunk\lib>dir tika*
> > > > > >  Volume in drive C has no label.
> > > > > >  Volume Serial Number is 002E-D1F0
> > > > > >
> > > > > >  Directory of C:\wip\mcf\trunk\lib
> > > > > >
> > > > > > 06/05/2014  08:21 AM           493,374 tika-core.jar
> > > > > > 06/05/2014  08:21 AM           523,677 tika-parsers.jar
> > > > > >                2 File(s)      1,017,051 bytes
> > > > > >                0 Dir(s)  140,792,315,904 bytes free
> > > > > >
> > > > > > The entire lib directory is 85M:
> > > > > >
> > > > > > 85,156,330 bytes
> > > > > >
> > > > > > The built binary image is still about 185Mb, I believe. So I don't know
> > > > > > why you think it is >1Gb? Temporary class files? I don't think we can
> > > > > > avoid those.
> > > > > >
> > > > > > I'd rather not make things more complicated than they need to be by adding
> > > > > > a new required service - even though it would fit naturally with the
> > > > > > connector arrangement.
> > > > > >
> > > > > > Karl
> > > > > >
> > > > > > On Tue, Jun 17, 2014 at 9:42 PM, Shinichiro Abe <
> > > > > > [email protected]> wrote:
> > > > > >
> > > > > >> Hi Karl,
> > > > > >>
> > > > > >> Okay, I assumed Tika connector outputs files.
> > > > > >> If we post character data metadata got from Tika, "/update/extract"
> > > > > >> handler
> > > > > >> can handle this(provides params:
> > > > > >> literal.content=value&literal.metaField=foobar
> > > > > >> with using NullInputStream for binary data like CONNECTORS-936).
> > > > > >>
> > > > > >> BTW, now trunk built size is too big(1G+). Maybe because CloudSearch
> > > > > >> connector uses Tika jars.
> > > > > >> Tika connector and CloudSearch connector should extract text via > > > > > >> tika-server[1] > > > > > >> and MCF should not have many Tika jars, do you think? > > > > > >> > > > > > >> [1] > > > > > >> http://wiki.apache.org/tika/TikaJAXRS > > > > > >> > > > > > >> Thanks, > > > > > >> Shinichiro Abe > > > > > >> > > > > > >> On 2014/06/18, at 9:45, Karl Wright <[email protected]> wrote: > > > > > >> > > > > > >> > Hi Abe-san, > > > > > >> > > > > > > >> > It sounds like you might be thinking that transformation > > > connectors > > > > > are > > > > > >> > like output connectors. Just so we are clear, transformation > > > > > >> connectors in > > > > > >> > 1.7 receive a RepositoryDocument as input, and then pass a > > > > > >> > RepositoryDocument on to the next connector in the chain. So > I > > > > don't > > > > > >> know > > > > > >> > why .xml files would be involved. I'd expect the Tika > connector > > > to > > > > > >> read a > > > > > >> > binary file from one RepositoryDocument object and convert its > > > > > contents > > > > > >> to > > > > > >> > another RepositoryDocument object which would have character > > data > > > > and > > > > > >> > metadata only. Would this work for your case, do you think? > > > > > >> > > > > > > >> > Karl > > > > > >> > > > > > > >> > > > > > > >> > > > > > > >> > On Tue, Jun 17, 2014 at 8:38 PM, Shinichiro Abe < > > > > > >> [email protected]> > > > > > >> > wrote: > > > > > >> > > > > > > >> >> Hi Karl, > > > > > >> >> > > > > > >> >> Yes. I thought the standard update handler met that > > requirement. > > > > > >> >> For instance, Tika extractor transformation connector creates > > two > > > > > >> files. > > > > > >> >> 1. addtoSolr.xml for add and update > > > > > >> >> 2. deletetoSolr.xml for delete > > > > > >> >> File connector ingests these xml files, then Solr connector > > posts > > > > > these > > > > > >> >> files by "/update" handler. 
> > > > > >> >> > > > > > >> >> In the the Solr Connector, other function as to update > handler > > > > > >> >> might not be necessary except for "/update" handler. > > > > > >> >> > > > > > >> >> Thanks, > > > > > >> >> Shinichiro Abe > > > > > >> >> > > > > > >> >> On 2014/06/18, at 8:02, Karl Wright <[email protected]> > > wrote: > > > > > >> >> > > > > > >> >>> Hi Abe-san, > > > > > >> >>> > > > > > >> >>> So just to be sure -- you believe that no changes at all are > > > > > required > > > > > >> to > > > > > >> >>> the Solr Connector as it stands now, other than to use the > > > update > > > > > >> handler > > > > > >> >>> rather than the /update/extract handler? > > > > > >> >>> > > > > > >> >>> Karl > > > > > >> >>> > > > > > >> >>> > > > > > >> >>> > > > > > >> >>> > > > > > >> >>> > > > > > >> >>> On Tue, Jun 17, 2014 at 5:14 PM, Shinichiro Abe < > > > > > >> >> [email protected]> > > > > > >> >>> wrote: > > > > > >> >>> > > > > > >> >>>>> As for changing the Solr connector so that it doesn't go > to > > > the > > > > > >> >> extracting > > > > > >> >>>> update handler > > > > > >> >>>> > > > > > >> >>>> I don't think it needs to change Solr connector with new > > > checkbox > > > > > >> >> because > > > > > >> >>>> currently we can change "/update/extract" into "/update" at > > > > 'Update > > > > > >> >>>> Handler' at Paths tab in Solr connector UI. I confirmed I > > could > > > > > post > > > > > >> >> CSV, > > > > > >> >>>> JSON and XML files to Solr by changing that and using File > > > > > connector. > > > > > >> >> So I > > > > > >> >>>> wish we allow Tika extractor transformation connector to > > create > > > > XML > > > > > >> >> files > > > > > >> >>>> that Solr expects to see. 
> > > > > >> >>>> > > > > > >> >>>> Regards, > > > > > >> >>>> Shinichiro Abe > > > > > >> >>>> > > > > > >> >>>> > > > > > >> >>>> 2014-06-18 2:55 GMT+09:00 Karl Wright <[email protected] > >: > > > > > >> >>>> > > > > > >> >>>>> The pipeline code itself is now "complete" in trunk. > Zaizi > > > said > > > > > >> they'd > > > > > >> >>>>> contribute a Tika extractor transformation connector - and > > if > > > > they > > > > > >> >> don't > > > > > >> >>>>> get around to that in a month or so, I may take a crack at > > it > > > > > >> myself. > > > > > >> >>>>> > > > > > >> >>>>> As for changing the Solr connector so that it doesn't go > to > > > the > > > > > >> >>>> extracting > > > > > >> >>>>> update handler, it would be great if: > > > > > >> >>>>> (1) Someone created a ticket for this, and > > > > > >> >>>>> (2) A patch was provided that maintains backwards > > > compatibility > > > > > with > > > > > >> >>>>> previous versions of the connector (so a checkbox would > > > probably > > > > > >> need > > > > > >> >> to > > > > > >> >>>> go > > > > > >> >>>>> into the UI somewhere). Do either of you want to start > this > > > > > >> process? > > > > > >> >>>>> > > > > > >> >>>>> Thanks! > > > > > >> >>>>> Karl > > > > > >> >>>>> > > > > > >> >>>>> > > > > > >> >>>>> > > > > > >> >>>>> On Mon, Jun 16, 2014 at 12:37 PM, Karl Wright < > > > > [email protected] > > > > > > > > > > > >> >>>> wrote: > > > > > >> >>>>> > > > > > >> >>>>>> Hi guys, > > > > > >> >>>>>> > > > > > >> >>>>>> You folks may not have looked at 1.7 yet, but it has a > full > > > > > >> pipeline, > > > > > >> >>>> and > > > > > >> >>>>>> is expected to have a Tika extractor as a transformation > > > > > connector. 
> > > > > >> >>>>>> > > > > > >> >>>>>> Karl > > > > > >> >>>>>> > > > > > >> >>>>>> > > > > > >> >>>>>> > > > > > >> >>>>>> On Mon, Jun 16, 2014 at 11:14 AM, Matteo Grolla < > > > > > >> >>>>> [email protected]> > > > > > >> >>>>>> wrote: > > > > > >> >>>>>> > > > > > >> >>>>>>> Thanks Alessandro, > > > > > >> >>>>>>> that explains the situation clearly. > > > > > >> >>>>>>> And I agree that sending all the metadata as get > parameter > > > can > > > > > be > > > > > >> >>>>>>> problematic > > > > > >> >>>>>>> > > > > > >> >>>>>>> Cheers > > > > > >> >>>>>>> > > > > > >> >>>>>>> -- > > > > > >> >>>>>>> Matteo Grolla > > > > > >> >>>>>>> Sourcesense - making sense of Open Source > > > > > >> >>>>>>> http://www.sourcesense.com > > > > > >> >>>>>>> > > > > > >> >>>>>>> Il giorno 16/giu/2014, alle ore 17:09, Alessandro > > Benedetti > > > ha > > > > > >> >>>> scritto: > > > > > >> >>>>>>> > > > > > >> >>>>>>>> mmmm the point is that right now ManifoldCF has no > > > > extractors. > > > > > >> >>>>>>>> The Repository connectors extracts directly the binary > > and > > > > > there > > > > > >> is > > > > > >> >>>> no > > > > > >> >>>>>>>> "Extractor Processor" yet. > > > > > >> >>>>>>>> But recently a pipe-line processor architecture has > been > > > > > thought > > > > > >> ( > > > > > >> >>>>>>>> https://issues.apache.org/jira/browse/CONNECTORS-959) > > > > > >> >>>>>>>> So can fit there. > > > > > >> >>>>>>>> > > > > > >> >>>>>>>> Cheers > > > > > >> >>>>>>>> > > > > > >> >>>>>>>> > > > > > >> >>>>>>>> 2014-06-16 15:59 GMT+01:00 Matteo Grolla < > > > > > >> [email protected] > > > > > >> >>>>> : > > > > > >> >>>>>>>> > > > > > >> >>>>>>>>> Since Solr extracting request handler takes the binary > > and > > > > > >> extracts > > > > > >> >>>>>>> text > > > > > >> >>>>>>>>> what is the point of not using Manifold extractor and > > send > > > > > text > > > > > >> and > > > > > >> >>>>>>>>> binaries to solr? 
> > > > > >> >>>>>>>>> I mean the end result is the same solr indexes text > and > > > > stores > > > > > >> text > > > > > >> >>>>>>>>> So if manifold supports text extraction it seems me > this > > > is > > > > > the > > > > > >> >>>> place > > > > > >> >>>>>>>>> where it should be done > > > > > >> >>>>>>>>> > > > > > >> >>>>>>>>> -- > > > > > >> >>>>>>>>> Matteo Grolla > > > > > >> >>>>>>>>> Sourcesense - making sense of Open Source > > > > > >> >>>>>>>>> http://www.sourcesense.com > > > > > >> >>>>>>>>> > > > > > >> >>>>>>>>> Il giorno 16/giu/2014, alle ore 16:51, Antonio David > > Perez > > > > > >> Morales > > > > > >> >>>> ha > > > > > >> >>>>>>>>> scritto: > > > > > >> >>>>>>>>> > > > > > >> >>>>>>>>>> Hi Matteo > > > > > >> >>>>>>>>>> > > > > > >> >>>>>>>>>> Manifold already handles the extraction, but the only > > way > > > > to > > > > > >> send > > > > > >> >>>>>>> binary > > > > > >> >>>>>>>>>> content and document metadata to Solr is using the > > > > > >> update/extract > > > > > >> >>>>>>>>> handler, > > > > > >> >>>>>>>>>> where the metadata is sent as query parameters and > the > > > > binary > > > > > >> >>>>> content > > > > > >> >>>>>>> is > > > > > >> >>>>>>>>>> sent in the body of the requests, allowing Solr to > use > > > Tika > > > > > to > > > > > >> >>>>> obtain > > > > > >> >>>>>>> the > > > > > >> >>>>>>>>>> raw content to be stored in Solr. 
> > > > > >> >>>>>>>>>> > > > > > >> >>>>>>>>>> Regards > > > > > >> >>>>>>>>>> > > > > > >> >>>>>>>>>> > > > > > >> >>>>>>>>>> On Mon, Jun 16, 2014 at 4:35 PM, Matteo Grolla < > > > > > >> >>>>>>> [email protected] > > > > > >> >>>>>>>>>> > > > > > >> >>>>>>>>>> wrote: > > > > > >> >>>>>>>>>> > > > > > >> >>>>>>>>>>> Hi During my first indexing I noticed that manifold > > uses > > > > > Solr > > > > > >> >>>>>>> extracting > > > > > >> >>>>>>>>>>> request handler to extract the content of an xml > file > > > > > >> >>>>>>>>>>> For performance reasons it would be better if > Manifold > > > > > handled > > > > > >> >>>> the > > > > > >> >>>>>>>>>>> extraction letting Solr do the search engine > > > > > >> >>>>>>>>>>> Is this because of the connector design, framework > > > design > > > > or > > > > > >> just > > > > > >> >>>>> to > > > > > >> >>>>>>> be > > > > > >> >>>>>>>>>>> done? > > > > > >> >>>>>>>>>>> > > > > > >> >>>>>>>>>>> -- > > > > > >> >>>>>>>>>>> Matteo Grolla > > > > > >> >>>>>>>>>>> Sourcesense - making sense of Open Source > > > > > >> >>>>>>>>>>> http://www.sourcesense.com > > > > > >> >>>>>>>>>>> > > > > > >> >>>>>>>>>>> > > > > > >> >>>>>>>>>> > > > > > >> >>>>>>>>>> -- > > > > > >> >>>>>>>>>> > > > > > >> >>>>>>>>>> ------------------------------ > > > > > >> >>>>>>>>>> This message should be regarded as confidential. If > you > > > > have > > > > > >> >>>>> received > > > > > >> >>>>>>>>> this > > > > > >> >>>>>>>>>> email in error please notify the sender and destroy > it > > > > > >> >>>> immediately. > > > > > >> >>>>>>>>>> Statements of intent shall only become binding when > > > > confirmed > > > > > >> in > > > > > >> >>>>> hard > > > > > >> >>>>>>>>> copy > > > > > >> >>>>>>>>>> by an authorised signatory. > > > > > >> >>>>>>>>>> > > > > > >> >>>>>>>>>> Zaizi Ltd is registered in England and Wales with the > > > > > >> registration > > > > > >> >>>>>>> number > > > > > >> >>>>>>>>>> 6440931. 
The Registered Office is Brook House, 229 > > > > Shepherds > > > > > >> Bush > > > > > >> >>>>>>> Road, > > > > > >> >>>>>>>>>> London W6 7AN. > > > > > >> >>>>>>>>> > > > > > >> >>>>>>>>> > > > > > >> >>>>>>>> > > > > > >> >>>>>>>> > > > > > >> >>>>>>>> -- > > > > > >> >>>>>>>> -------------------------- > > > > > >> >>>>>>>> > > > > > >> >>>>>>>> Benedetti Alessandro > > > > > >> >>>>>>>> Visiting card : http://about.me/alessandro_benedetti > > > > > >> >>>>>>>> > > > > > >> >>>>>>>> "Tyger, tyger burning bright > > > > > >> >>>>>>>> In the forests of the night, > > > > > >> >>>>>>>> What immortal hand or eye > > > > > >> >>>>>>>> Could frame thy fearful symmetry?" > > > > > >> >>>>>>>> > > > > > >> >>>>>>>> William Blake - Songs of Experience -1794 England > > > > > >> >>>>>>> > > > > > >> >>>>>>> > > > > > >> >>>>>> > > > > > >> >>>>> > > > > > >> >>>> > > > > > >> >>>> > > > > > >> >>>> > > > > > >> >>>> -- > > > > > >> >>>> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - > > - - > > > > > >> >>>> Shinichiro Abe > > > > > >> >>>> 阿部 慎一朗 > > > > > >> >>>> > > > > > >> >> > > > > > >> >> > > > > > >> > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > -------------------------- > > > > > > > > Benedetti Alessandro > > > > Visiting card : http://about.me/alessandro_benedetti > > > > > > > > "Tyger, tyger burning bright > > > > In the forests of the night, > > > > What immortal hand or eye > > > > Could frame thy fearful symmetry?" > > > > > > > > William Blake - Songs of Experience -1794 England > > > > > > > > > > > > > > > -- > > -------------------------- > > > > Benedetti Alessandro > > Visiting card : http://about.me/alessandro_benedetti > > > > "Tyger, tyger burning bright > > In the forests of the night, > > What immortal hand or eye > > Could frame thy fearful symmetry?" 
> >
> > William Blake - Songs of Experience -1794 England
>

--
--------------------------

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England
