bq. I don't agree on this. Why is not appropriate for all the connectors ?

Some output connectors want the document in binary form -- e.g. the HDFS and FileSystem connectors, which don't deal with metadata at all. It's not clear whether the Tika transformer would preserve the binary stream, or would replace the binary stream with an extracted content stream. I'd kind-of expect the latter, but there are other ways to do it, of course. But it would certainly impact performance, so it should not be a requirement. Not only that, but there's no *reason* to make it a requirement, since you can very readily add it or remove it from the pipeline in the UI.
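To make the point about optional stages concrete, here is a minimal sketch in Python. It is not ManifoldCF code: `Doc`, `extract_text`, and `run_pipeline` are hypothetical names invented for illustration, and a plain UTF-8 decode stands in for what Tika would do. It assumes the "replace the binary stream with extracted content" behavior guessed at above.

```python
from dataclasses import dataclass, field, replace
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class Doc:
    """Hypothetical stand-in for a repository document (not the ManifoldCF class)."""
    binary: Optional[bytes]
    text: Optional[str] = None
    metadata: Dict[str, str] = field(default_factory=dict)


# A transformation stage takes a document and hands a document to the next stage.
Transformer = Callable[[Doc], Doc]


def extract_text(doc: Doc) -> Doc:
    """Stand-in extractor: swaps the binary stream for character data.

    Assumption: a real extractor would call Tika here; decoding UTF-8 is
    only a placeholder so the sketch runs on its own.
    """
    if doc.binary is None:
        return doc
    return replace(doc, binary=None, text=doc.binary.decode("utf-8", errors="replace"))


def run_pipeline(doc: Doc, stages: List[Transformer]) -> Doc:
    """Apply each configured stage in order; an empty list passes the document through."""
    for stage in stages:
        doc = stage(doc)
    return doc


original = Doc(binary=b"hello", metadata={"title": "greeting"})

# A file-system style output wants the untouched binary: no extractor in the pipeline.
for_filesystem = run_pipeline(original, [])

# A search-engine style output wants text: the extractor is added to the pipeline.
for_search = run_pipeline(original, [extract_text])
```

With the extractor left out, the output side still sees the original bytes; with it included, the bytes are gone and only text and metadata remain. That is why the stage works better as a per-job choice than as a fixed requirement.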
bq. So what is the problem of using Tika outside Solr?

We've seen a number of cases where Tika inside Solr does things based on (for instance) http headers that Solr receives. Abe-san had some difficulty with that a while back. We had to repeatedly fix things when we went to SolrJ to make sure various headers were compatible so that SolrCell worked the same. I'd rather not re-implement SolrCell precisely in ManifoldCF if I can help it.

bq. Solr Extract is using Tika under the hood, nothing more.

It's more complicated than that. Have a look at the code.

bq. probably a simple flag can fit to operate in one way or another.

I agree that that should be sufficient.

Karl

On Wed, Jun 18, 2014 at 11:35 AM, Alessandro Benedetti < [email protected]> wrote: > 2014-06-18 16:10 GMT+01:00 Karl Wright <[email protected]>: > > > Hi Alessandro, > > > > The reason for backwards compatibility is obvious: people upgrade > > ManifoldCF all the time, and when they do it should not stop working for > > them. > > > Ok i agree ! > > > > > Putting Tika all the time in the pipeline is also not appropriate for > other > > output connections. > > > I don't agree on this. Why is not appropriate for all the connectors ? > The conceptual responsibility of an output Connector should be to post a > RespositoryDocument to an output ( whatever we want) . > A RepositoryDocument is a map Field-> value. > The content is nothing than a one of these fields. > So I can not see why after we have a RepositoryDocument ( with content > extracted) , should not be possible to send it independently to any > OutputConnector. > > > > Even if you did it just for Solr, you'd then have to > > insure that the Tika transformer was exactly compatible with Solr Cell, > > which I would be very uncomfortable with agreeing to. > > > > So what is the problem of using Tika outside Solr? We will add the most > recent version of Tika, that will be gradually upgraded over time with the > platform. 
> > Solr Extract is using Tika under the hood, nothing more. > > > > > So let's presume that you'd do one of two things. Either: > > > > - Leave the existing Solr connector alone, and create a whole new Solr > > connector designed to work with a Tika transformer, or > > - Modify the existing Solr connector so that it operates in two possible > > modes, one of which supports the legacy model (the default), and one of > > which supports your new model > > > > probably a simple flag can fit to operate in one way or another. > > > > > If this sounds overly burdensome, I'm sorry but it's necessary until MCF > > 2.0. For MCF 2.0, which I've begun to think about, we can dispense with > > backwards compatibility, including legacy tabs that have outlived their > > usefulness, etc. But that's not a 1.7 solution. > > > > Karl > > > > Cheers > > > > > > > > > On Wed, Jun 18, 2014 at 10:16 AM, Alessandro Benedetti < > > [email protected]> wrote: > > > > > Hello Karl, > > > What i was thinking is: > > > assuming we have the Tika Connector, the responsibility to extract > > content > > > will pass from Solr to the Tika processor. > > > > > > So we can change the part in the Solr Connector that manages the > building > > > of the request to send to the Extract update handler. > > > Particularly that part will change in the classic way: usually it's > good > > to > > > build a SolrDocument in SolrJ and then add it to SolrServer. > > > > > > Why should we give retrocompatibility from Solr Connector point of > view ? > > > From the user point of view, a Job will be selected with the Tika > > Conenctor > > > in the pipeline, so we are providing the same identical feature. > > > One way can be to make the Tika Processor Connector by default in the > > > pipeline, and someone will be able to deactivate it only if needed. 
> > > > > > Cheers > > > > > > > > > > > > 2014-06-18 14:32 GMT+01:00 Karl Wright <[email protected]>: > > > > > > > Hi Alessandro, > > > > What is your concrete proposal to change the Solr connector? Bear in > > > mind > > > > that we do need to maintain backwards compatibility. If you list > your > > > > specific changes, not in any huge detail, but with enough detail that > > we > > > > understand your proposal, that would help. What happens to the UI? > > What > > > > happens to the internals? > > > > > > > > Thanks, > > > > Karl > > > > > > > > > > > > > > > > On Wed, Jun 18, 2014 at 9:21 AM, Alessandro Benedetti < > > > > [email protected]> wrote: > > > > > > > > > But guys, why not simply pass to a classic SolrJ SolrDocument > > creation > > > > and > > > > > ingestion in the Solr Server ? Easy and Straighforward ! > > > > > > > > > > In the end at that point the RepositoryDocument will me only a Map > of > > > > > metadata and values. > > > > > Content will be part of that, so I guess the conversion to a > > > SolrDocument > > > > > will be immediate. > > > > > > > > > > Cheers > > > > > > > > > > > > > > > 2014-06-18 3:26 GMT+01:00 Karl Wright <[email protected]>: > > > > > > > > > > > Hi Abe-san, > > > > > > > > > > > > Near as I can tell, the major consumer of disk space is the Maven > > > > target > > > > > > directories. This is generating many tens of megabytes of > > temporary > > > > disk > > > > > > usage for every connector. Luckily if you use ant, this is not a > > > > > problem. > > > > > > > > > > > > Karl > > > > > > > > > > > > > > > > > > On Tue, Jun 17, 2014 at 9:55 PM, Karl Wright <[email protected] > > > > > > wrote: > > > > > > > > > > > > > Hi Abe-san, > > > > > > > > > > > > > > Tika jars are not very big: > > > > > > > > > > > > > > C:\wip\mcf\trunk\lib>dir tika* > > > > > > > Volume in drive C has no label. 
> > > > > > > Volume Serial Number is 002E-D1F0 > > > > > > > > > > > > > > Directory of C:\wip\mcf\trunk\lib > > > > > > > > > > > > > > 06/05/2014 08:21 AM 493,374 tika-core.jar > > > > > > > 06/05/2014 08:21 AM 523,677 tika-parsers.jar > > > > > > > 2 File(s) 1,017,051 bytes > > > > > > > 0 Dir(s) 140,792,315,904 bytes free > > > > > > > > > > > > > > The entire lib directory is 85M: > > > > > > > > > > > > > > 85,156,330 bytes > > > > > > > > > > > > > > The built binary image is still about 185Mb, I believe. So I > > don't > > > > > know > > > > > > > why you think it is >1Gb? Temporary class files? I don't > think > > we > > > > can > > > > > > > avoid those. > > > > > > > > > > > > > > I'd rather not make things more complicated than they need to > be > > by > > > > > > adding > > > > > > > a new required service - even though it would fit naturally > with > > > the > > > > > > > connector arrangement. > > > > > > > > > > > > > > Karl > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Tue, Jun 17, 2014 at 9:42 PM, Shinichiro Abe < > > > > > > > [email protected]> wrote: > > > > > > > > > > > > > >> Hi Karl, > > > > > > >> > > > > > > >> Okay, I assumed Tika connector outputs files. > > > > > > >> If we post character data metadata got from Tika, > > > "/update/extract" > > > > > > >> handler > > > > > > >> can handle this(provides params: > > > > > > >> literal.content=value&literal.metaField=foobar > > > > > > >> with using NullInputStream for binary data like > CONNECTORS-936). > > > > > > >> > > > > > > >> BTW, now trunk built size is too big(1G+). Maybe because > > > CloudSearch > > > > > > >> connector uses Tika jars. > > > > > > >> Tika connector and CloudSearch connector should extract text > via > > > > > > >> tika-server[1] > > > > > > >> and MCF should not have many Tika jars, do you think? 
> > > > > > >> > > > > > > >> [1] > > > > > > >> http://wiki.apache.org/tika/TikaJAXRS > > > > > > >> > > > > > > >> Thanks, > > > > > > >> Shinichiro Abe > > > > > > >> > > > > > > >> On 2014/06/18, at 9:45, Karl Wright <[email protected]> > wrote: > > > > > > >> > > > > > > >> > Hi Abe-san, > > > > > > >> > > > > > > > >> > It sounds like you might be thinking that transformation > > > > connectors > > > > > > are > > > > > > >> > like output connectors. Just so we are clear, > transformation > > > > > > >> connectors in > > > > > > >> > 1.7 receive a RepositoryDocument as input, and then pass a > > > > > > >> > RepositoryDocument on to the next connector in the chain. > So > > I > > > > > don't > > > > > > >> know > > > > > > >> > why .xml files would be involved. I'd expect the Tika > > connector > > > > to > > > > > > >> read a > > > > > > >> > binary file from one RepositoryDocument object and convert > its > > > > > > contents > > > > > > >> to > > > > > > >> > another RepositoryDocument object which would have character > > > data > > > > > and > > > > > > >> > metadata only. Would this work for your case, do you think? > > > > > > >> > > > > > > > >> > Karl > > > > > > >> > > > > > > > >> > > > > > > > >> > > > > > > > >> > On Tue, Jun 17, 2014 at 8:38 PM, Shinichiro Abe < > > > > > > >> [email protected]> > > > > > > >> > wrote: > > > > > > >> > > > > > > > >> >> Hi Karl, > > > > > > >> >> > > > > > > >> >> Yes. I thought the standard update handler met that > > > requirement. > > > > > > >> >> For instance, Tika extractor transformation connector > creates > > > two > > > > > > >> files. > > > > > > >> >> 1. addtoSolr.xml for add and update > > > > > > >> >> 2. deletetoSolr.xml for delete > > > > > > >> >> File connector ingests these xml files, then Solr connector > > > posts > > > > > > these > > > > > > >> >> files by "/update" handler. 
> > > > > > >> >> > > > > > > >> >> In the the Solr Connector, other function as to update > > handler > > > > > > >> >> might not be necessary except for "/update" handler. > > > > > > >> >> > > > > > > >> >> Thanks, > > > > > > >> >> Shinichiro Abe > > > > > > >> >> > > > > > > >> >> On 2014/06/18, at 8:02, Karl Wright <[email protected]> > > > wrote: > > > > > > >> >> > > > > > > >> >>> Hi Abe-san, > > > > > > >> >>> > > > > > > >> >>> So just to be sure -- you believe that no changes at all > are > > > > > > required > > > > > > >> to > > > > > > >> >>> the Solr Connector as it stands now, other than to use the > > > > update > > > > > > >> handler > > > > > > >> >>> rather than the /update/extract handler? > > > > > > >> >>> > > > > > > >> >>> Karl > > > > > > >> >>> > > > > > > >> >>> > > > > > > >> >>> > > > > > > >> >>> > > > > > > >> >>> > > > > > > >> >>> On Tue, Jun 17, 2014 at 5:14 PM, Shinichiro Abe < > > > > > > >> >> [email protected]> > > > > > > >> >>> wrote: > > > > > > >> >>> > > > > > > >> >>>>> As for changing the Solr connector so that it doesn't go > > to > > > > the > > > > > > >> >> extracting > > > > > > >> >>>> update handler > > > > > > >> >>>> > > > > > > >> >>>> I don't think it needs to change Solr connector with new > > > > checkbox > > > > > > >> >> because > > > > > > >> >>>> currently we can change "/update/extract" into "/update" > at > > > > > 'Update > > > > > > >> >>>> Handler' at Paths tab in Solr connector UI. I confirmed I > > > could > > > > > > post > > > > > > >> >> CSV, > > > > > > >> >>>> JSON and XML files to Solr by changing that and using > File > > > > > > connector. > > > > > > >> >> So I > > > > > > >> >>>> wish we allow Tika extractor transformation connector to > > > create > > > > > XML > > > > > > >> >> files > > > > > > >> >>>> that Solr expects to see. 
> > > > > > >> >>>> > > > > > > >> >>>> Regards, > > > > > > >> >>>> Shinichiro Abe > > > > > > >> >>>> > > > > > > >> >>>> > > > > > > >> >>>> 2014-06-18 2:55 GMT+09:00 Karl Wright < > [email protected] > > >: > > > > > > >> >>>> > > > > > > >> >>>>> The pipeline code itself is now "complete" in trunk. > > Zaizi > > > > said > > > > > > >> they'd > > > > > > >> >>>>> contribute a Tika extractor transformation connector - > and > > > if > > > > > they > > > > > > >> >> don't > > > > > > >> >>>>> get around to that in a month or so, I may take a crack > at > > > it > > > > > > >> myself. > > > > > > >> >>>>> > > > > > > >> >>>>> As for changing the Solr connector so that it doesn't go > > to > > > > the > > > > > > >> >>>> extracting > > > > > > >> >>>>> update handler, it would be great if: > > > > > > >> >>>>> (1) Someone created a ticket for this, and > > > > > > >> >>>>> (2) A patch was provided that maintains backwards > > > > compatibility > > > > > > with > > > > > > >> >>>>> previous versions of the connector (so a checkbox would > > > > probably > > > > > > >> need > > > > > > >> >> to > > > > > > >> >>>> go > > > > > > >> >>>>> into the UI somewhere). Do either of you want to start > > this > > > > > > >> process? > > > > > > >> >>>>> > > > > > > >> >>>>> Thanks! > > > > > > >> >>>>> Karl > > > > > > >> >>>>> > > > > > > >> >>>>> > > > > > > >> >>>>> > > > > > > >> >>>>> On Mon, Jun 16, 2014 at 12:37 PM, Karl Wright < > > > > > [email protected] > > > > > > > > > > > > > >> >>>> wrote: > > > > > > >> >>>>> > > > > > > >> >>>>>> Hi guys, > > > > > > >> >>>>>> > > > > > > >> >>>>>> You folks may not have looked at 1.7 yet, but it has a > > full > > > > > > >> pipeline, > > > > > > >> >>>> and > > > > > > >> >>>>>> is expected to have a Tika extractor as a > transformation > > > > > > connector. 
> > > > > > >> >>>>>> > > > > > > >> >>>>>> Karl > > > > > > >> >>>>>> > > > > > > >> >>>>>> > > > > > > >> >>>>>> > > > > > > >> >>>>>> On Mon, Jun 16, 2014 at 11:14 AM, Matteo Grolla < > > > > > > >> >>>>> [email protected]> > > > > > > >> >>>>>> wrote: > > > > > > >> >>>>>> > > > > > > >> >>>>>>> Thanks Alessandro, > > > > > > >> >>>>>>> that explains the situation clearly. > > > > > > >> >>>>>>> And I agree that sending all the metadata as get > > parameter > > > > can > > > > > > be > > > > > > >> >>>>>>> problematic > > > > > > >> >>>>>>> > > > > > > >> >>>>>>> Cheers > > > > > > >> >>>>>>> > > > > > > >> >>>>>>> -- > > > > > > >> >>>>>>> Matteo Grolla > > > > > > >> >>>>>>> Sourcesense - making sense of Open Source > > > > > > >> >>>>>>> http://www.sourcesense.com > > > > > > >> >>>>>>> > > > > > > >> >>>>>>> Il giorno 16/giu/2014, alle ore 17:09, Alessandro > > > Benedetti > > > > ha > > > > > > >> >>>> scritto: > > > > > > >> >>>>>>> > > > > > > >> >>>>>>>> mmmm the point is that right now ManifoldCF has no > > > > > extractors. > > > > > > >> >>>>>>>> The Repository connectors extracts directly the > binary > > > and > > > > > > there > > > > > > >> is > > > > > > >> >>>> no > > > > > > >> >>>>>>>> "Extractor Processor" yet. > > > > > > >> >>>>>>>> But recently a pipe-line processor architecture has > > been > > > > > > thought > > > > > > >> ( > > > > > > >> >>>>>>>> https://issues.apache.org/jira/browse/CONNECTORS-959 > ) > > > > > > >> >>>>>>>> So can fit there. 
> > > > > > >> >>>>>>>> > > > > > > >> >>>>>>>> Cheers > > > > > > >> >>>>>>>> > > > > > > >> >>>>>>>> > > > > > > >> >>>>>>>> 2014-06-16 15:59 GMT+01:00 Matteo Grolla < > > > > > > >> [email protected] > > > > > > >> >>>>> : > > > > > > >> >>>>>>>> > > > > > > >> >>>>>>>>> Since Solr extracting request handler takes the > binary > > > and > > > > > > >> extracts > > > > > > >> >>>>>>> text > > > > > > >> >>>>>>>>> what is the point of not using Manifold extractor > and > > > send > > > > > > text > > > > > > >> and > > > > > > >> >>>>>>>>> binaries to solr? > > > > > > >> >>>>>>>>> I mean the end result is the same solr indexes text > > and > > > > > stores > > > > > > >> text > > > > > > >> >>>>>>>>> So if manifold supports text extraction it seems me > > this > > > > is > > > > > > the > > > > > > >> >>>> place > > > > > > >> >>>>>>>>> where it should be done > > > > > > >> >>>>>>>>> > > > > > > >> >>>>>>>>> -- > > > > > > >> >>>>>>>>> Matteo Grolla > > > > > > >> >>>>>>>>> Sourcesense - making sense of Open Source > > > > > > >> >>>>>>>>> http://www.sourcesense.com > > > > > > >> >>>>>>>>> > > > > > > >> >>>>>>>>> Il giorno 16/giu/2014, alle ore 16:51, Antonio David > > > Perez > > > > > > >> Morales > > > > > > >> >>>> ha > > > > > > >> >>>>>>>>> scritto: > > > > > > >> >>>>>>>>> > > > > > > >> >>>>>>>>>> Hi Matteo > > > > > > >> >>>>>>>>>> > > > > > > >> >>>>>>>>>> Manifold already handles the extraction, but the > only > > > way > > > > > to > > > > > > >> send > > > > > > >> >>>>>>> binary > > > > > > >> >>>>>>>>>> content and document metadata to Solr is using the > > > > > > >> update/extract > > > > > > >> >>>>>>>>> handler, > > > > > > >> >>>>>>>>>> where the metadata is sent as query parameters and > > the > > > > > binary > > > > > > >> >>>>> content > > > > > > >> >>>>>>> is > > > > > > >> >>>>>>>>>> sent in the body of the requests, allowing Solr to > > use > > > > Tika > > > > > > to > > > > > > >> >>>>> obtain > > > > > > >> >>>>>>> the > > > > > 
> >> >>>>>>>>>> raw content to be stored in Solr. > > > > > > >> >>>>>>>>>> > > > > > > >> >>>>>>>>>> Regards > > > > > > >> >>>>>>>>>> > > > > > > >> >>>>>>>>>> > > > > > > >> >>>>>>>>>> On Mon, Jun 16, 2014 at 4:35 PM, Matteo Grolla < > > > > > > >> >>>>>>> [email protected] > > > > > > >> >>>>>>>>>> > > > > > > >> >>>>>>>>>> wrote: > > > > > > >> >>>>>>>>>> > > > > > > >> >>>>>>>>>>> Hi During my first indexing I noticed that > manifold > > > uses > > > > > > Solr > > > > > > >> >>>>>>> extracting > > > > > > >> >>>>>>>>>>> request handler to extract the content of an xml > > file > > > > > > >> >>>>>>>>>>> For performance reasons it would be better if > > Manifold > > > > > > handled > > > > > > >> >>>> the > > > > > > >> >>>>>>>>>>> extraction letting Solr do the search engine > > > > > > >> >>>>>>>>>>> Is this because of the connector design, framework > > > > design > > > > > or > > > > > > >> just > > > > > > >> >>>>> to > > > > > > >> >>>>>>> be > > > > > > >> >>>>>>>>>>> done? > > > > > > >> >>>>>>>>>>> > > > > > > >> >>>>>>>>>>> -- > > > > > > >> >>>>>>>>>>> Matteo Grolla > > > > > > >> >>>>>>>>>>> Sourcesense - making sense of Open Source > > > > > > >> >>>>>>>>>>> http://www.sourcesense.com > > > > > > >> >>>>>>>>>>> > > > > > > >> >>>>>>>>>>> > > > > > > >> >>>>>>>>>> > > > > > > >> >>>>>>>>>> -- > > > > > > >> >>>>>>>>>> > > > > > > >> >>>>>>>>>> ------------------------------ > > > > > > >> >>>>>>>>>> This message should be regarded as confidential. If > > you > > > > > have > > > > > > >> >>>>> received > > > > > > >> >>>>>>>>> this > > > > > > >> >>>>>>>>>> email in error please notify the sender and destroy > > it > > > > > > >> >>>> immediately. > > > > > > >> >>>>>>>>>> Statements of intent shall only become binding when > > > > > confirmed > > > > > > >> in > > > > > > >> >>>>> hard > > > > > > >> >>>>>>>>> copy > > > > > > >> >>>>>>>>>> by an authorised signatory. 
> > > > > > >> >>>>>>>>>> > > > > > > >> >>>>>>>>>> Zaizi Ltd is registered in England and Wales with > the > > > > > > >> registration > > > > > > >> >>>>>>> number > > > > > > >> >>>>>>>>>> 6440931. The Registered Office is Brook House, 229 > > > > > Shepherds > > > > > > >> Bush > > > > > > >> >>>>>>> Road, > > > > > > >> >>>>>>>>>> London W6 7AN. > > > > > > >> >>>>>>>>> > > > > > > >> >>>>>>>>> > > > > > > >> >>>>>>>> > > > > > > >> >>>>>>>> > > > > > > >> >>>>>>>> -- > > > > > > >> >>>>>>>> -------------------------- > > > > > > >> >>>>>>>> > > > > > > >> >>>>>>>> Benedetti Alessandro > > > > > > >> >>>>>>>> Visiting card : http://about.me/alessandro_benedetti > > > > > > >> >>>>>>>> > > > > > > >> >>>>>>>> "Tyger, tyger burning bright > > > > > > >> >>>>>>>> In the forests of the night, > > > > > > >> >>>>>>>> What immortal hand or eye > > > > > > >> >>>>>>>> Could frame thy fearful symmetry?" > > > > > > >> >>>>>>>> > > > > > > >> >>>>>>>> William Blake - Songs of Experience -1794 England > > > > > > >> >>>>>>> > > > > > > >> >>>>>>> > > > > > > >> >>>>>> > > > > > > >> >>>>> > > > > > > >> >>>> > > > > > > >> >>>> > > > > > > >> >>>> > > > > > > >> >>>> -- > > > > > > >> >>>> - - - - - - - - - - - - - - - - - - - - - - - - - - - - > - - > > > - - > > > > > > >> >>>> Shinichiro Abe > > > > > > >> >>>> 阿部 慎一朗 > > > > > > >> >>>> > > > > > > >> >> > > > > > > >> >> > > > > > > >> > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > -------------------------- > > > > > > > > > > Benedetti Alessandro > > > > > Visiting card : http://about.me/alessandro_benedetti > > > > > > > > > > "Tyger, tyger burning bright > > > > > In the forests of the night, > > > > > What immortal hand or eye > > > > > Could frame thy fearful symmetry?" 
> > > > > > > > > > William Blake - Songs of Experience -1794 England > > > > > > > > > > > > > > > > > > > > > -- > > > -------------------------- > > > > > > Benedetti Alessandro > > > Visiting card : http://about.me/alessandro_benedetti > > > > > > "Tyger, tyger burning bright > > > In the forests of the night, > > > What immortal hand or eye > > > Could frame thy fearful symmetry?" > > > > > > William Blake - Songs of Experience -1794 England > > > > > > > > > -- > -------------------------- > > Benedetti Alessandro > Visiting card : http://about.me/alessandro_benedetti > > "Tyger, tyger burning bright > In the forests of the night, > What immortal hand or eye > Could frame thy fearful symmetry?" > > William Blake - Songs of Experience -1794 England >
