Re: Solr Extracting request handler

Shinichiro Abe Tue, 17 Jun 2014 18:43:29 -0700

Hi Karl,

Okay, I had assumed the Tika connector would output files.
If we post the character data and metadata obtained from Tika, the "/update/extract"
handler can handle this (by passing params such as
literal.content=value&literal.metaField=foobar and using a NullInputStream for
the binary data, as in CONNECTORS-936).
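
To illustrate the idea, here is a rough sketch (not the Solr connector's actual
code) of a request that carries already-extracted text and metadata as
literal.* parameters with an empty body. The Solr base URL, the "id" field and
the helper name build_extract_request are assumptions made for the example; the
empty body only stands in for the NullInputStream mentioned above.

```python
from urllib.parse import urlencode, parse_qs, urlsplit
from urllib.request import Request


def build_extract_request(solr_base, doc_id, content, metadata):
    """Build (but do not send) a POST to /update/extract.

    Extracted text and metadata travel as literal.<field> parameters;
    the request body is empty because there is no binary left to parse.
    """
    params = {"literal.id": doc_id, "literal.content": content}
    for name, value in metadata.items():
        params["literal." + name] = value
    url = solr_base.rstrip("/") + "/update/extract?" + urlencode(params)
    # Empty body: a stand-in for the NullInputStream used for binary data.
    return Request(url, data=b"", method="POST",
                   headers={"Content-Type": "application/octet-stream"})


# Hypothetical Solr location and field names, for illustration only.
req = build_extract_request("http://localhost:8983/solr/collection1",
                            "doc-1", "value", {"metaField": "foobar"})
query = parse_qs(urlsplit(req.full_url).query)
print(req.get_method(), urlsplit(req.full_url).path, sorted(query))
```

Since every field is sent in the query string, long extracted text makes for
very long URLs, which is the concern raised earlier in this thread about
sending metadata as GET parameters.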


BTW, the trunk build is now too big (1G+), maybe because the CloudSearch
connector uses Tika jars.
I think the Tika connector and the CloudSearch connector should extract text via
tika-server [1], so that MCF does not have to ship so many Tika jars. What do you think?

[1]
http://wiki.apache.org/tika/TikaJAXRS
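
A minimal sketch of what extraction through tika-server could look like from a
client, assuming a server on its default port 9998 and the PUT /tika resource
that returns plain text (to my understanding of the page above). The function
names and the opener parameter are hypothetical and only there for
illustration; this is not MCF code.

```python
from urllib.request import Request, urlopen


def build_tika_request(binary, tika_base="http://localhost:9998"):
    """Build a PUT request that asks tika-server for the plain text of a document."""
    return Request(tika_base.rstrip("/") + "/tika", data=binary, method="PUT",
                   headers={"Accept": "text/plain"})


def extract_text(binary, tika_base="http://localhost:9998", opener=urlopen):
    """Send the binary to tika-server and return the extracted text.

    `opener` is injectable so the function can be exercised without a
    running server; by default it performs a real HTTP call.
    """
    with opener(build_tika_request(binary, tika_base)) as resp:
        return resp.read().decode("utf-8")


# Usage, assuming a tika-server instance is running locally:
#   with open("sample.pdf", "rb") as f:
#       text = extract_text(f.read())
```

With this approach the connectors would only need an HTTP client, and the Tika
jars would live in the separate server process.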

Thanks,
Shinichiro Abe

On 2014/06/18, at 9:45, Karl Wright <[email protected]> wrote:

> Hi Abe-san,
> 
> It sounds like you might be thinking that transformation connectors are
> like output connectors.  Just so we are clear, transformation connectors in
> 1.7 receive a RepositoryDocument as input, and then pass a
> RepositoryDocument on to the next connector in the chain.  So I don't know
> why .xml files would be involved.  I'd expect the Tika connector to read a
> binary file from one RepositoryDocument object and convert its contents to
> another RepositoryDocument object which would have character data and
> metadata only.  Would this work for your case, do you think?
> 
> Karl
> 
> 
> 
> On Tue, Jun 17, 2014 at 8:38 PM, Shinichiro Abe <[email protected]>
> wrote:
> 
>> Hi Karl,
>> 
>> Yes. I thought the standard update handler met that requirement.
>> For instance, the Tika extractor transformation connector creates two files:
>> 1. addtoSolr.xml for add and update
>> 2. deletetoSolr.xml for delete
>> The File connector ingests these XML files, then the Solr connector posts
>> them via the "/update" handler.
>> 
>> In the Solr Connector, update handler functionality other than the
>> "/update" handler might not be necessary.
>> 
>> Thanks,
>> Shinichiro Abe
>> 
>> On 2014/06/18, at 8:02, Karl Wright <[email protected]> wrote:
>> 
>>> Hi Abe-san,
>>> 
>>> So just to be sure -- you believe that no changes at all are required to
>>> the Solr Connector as it stands now, other than to use the update handler
>>> rather than the /update/extract handler?
>>> 
>>> Karl
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On Tue, Jun 17, 2014 at 5:14 PM, Shinichiro Abe <[email protected]> wrote:
>>> 
>>>>> As for changing the Solr connector so that it doesn't go to the
>>>>> extracting update handler
>>>> 
>>>> I don't think we need to change the Solr connector by adding a new
>>>> checkbox, because currently we can change "/update/extract" to "/update"
>>>> in the 'Update Handler' field on the Paths tab of the Solr connector UI.
>>>> I confirmed I could post CSV, JSON and XML files to Solr by changing that
>>>> and using the File connector. So I wish we would allow the Tika extractor
>>>> transformation connector to create XML files that Solr expects to see.
>>>> 
>>>> Regards,
>>>> Shinichiro Abe
>>>> 
>>>> 
>>>> 2014-06-18 2:55 GMT+09:00 Karl Wright <[email protected]>:
>>>> 
>>>>> The pipeline code itself is now "complete" in trunk.  Zaizi said they'd
>>>>> contribute a Tika extractor transformation connector - and if they don't
>>>>> get around to that in a month or so, I may take a crack at it myself.
>>>>> 
>>>>> As for changing the Solr connector so that it doesn't go to the
>>>>> extracting update handler, it would be great if:
>>>>> (1) Someone created a ticket for this, and
>>>>> (2) A patch was provided that maintains backwards compatibility with
>>>>> previous versions of the connector (so a checkbox would probably need to
>>>>> go into the UI somewhere).  Do either of you want to start this process?
>>>>> 
>>>>> Thanks!
>>>>> Karl
>>>>> 
>>>>> 
>>>>> 
>>>>> On Mon, Jun 16, 2014 at 12:37 PM, Karl Wright <[email protected]> wrote:
>>>>> 
>>>>>> Hi guys,
>>>>>> 
>>>>>> You folks may not have looked at 1.7 yet, but it has a full pipeline,
>>>>>> and is expected to have a Tika extractor as a transformation connector.
>>>>>> 
>>>>>> Karl
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Mon, Jun 16, 2014 at 11:14 AM, Matteo Grolla <[email protected]> wrote:
>>>>>> 
>>>>>>> Thanks Alessandro,
>>>>>>>       that explains the situation clearly.
>>>>>>> And I agree that sending all the metadata as GET parameters can be
>>>>>>> problematic.
>>>>>>> 
>>>>>>> Cheers
>>>>>>> 
>>>>>>> --
>>>>>>> Matteo Grolla
>>>>>>> Sourcesense - making sense of Open Source
>>>>>>> http://www.sourcesense.com
>>>>>>> 
>>>>>>> On 16 Jun 2014, at 17:09, Alessandro Benedetti wrote:
>>>>>>> 
>>>>>>>> Hmm, the point is that right now ManifoldCF has no extractors.
>>>>>>>> The repository connectors extract the binary directly and there is
>>>>>>>> no "Extractor Processor" yet.
>>>>>>>> But recently a pipeline processor architecture has been designed
>>>>>>>> (https://issues.apache.org/jira/browse/CONNECTORS-959),
>>>>>>>> so an extractor could fit there.
>>>>>>>> 
>>>>>>>> Cheers
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 2014-06-16 15:59 GMT+01:00 Matteo Grolla <[email protected]>:
>>>>>>>> 
>>>>>>>>> Since the Solr extracting request handler takes the binary and
>>>>>>>>> extracts the text, what is the point of not using the Manifold
>>>>>>>>> extractor and sending text and binaries to Solr?
>>>>>>>>> I mean, the end result is the same: Solr indexes text and stores text.
>>>>>>>>> So if Manifold supports text extraction, it seems to me this is the
>>>>>>>>> place where it should be done.
>>>>>>>>> 
>>>>>>>>> --
>>>>>>>>> Matteo Grolla
>>>>>>>>> Sourcesense - making sense of Open Source
>>>>>>>>> http://www.sourcesense.com
>>>>>>>>> 
>>>>>>>>> On 16 Jun 2014, at 16:51, Antonio David Perez Morales wrote:
>>>>>>>>> 
>>>>>>>>>> Hi Matteo
>>>>>>>>>> 
>>>>>>>>>> Manifold already handles the extraction, but the only way to send
>>>>>>>>>> binary content and document metadata to Solr is using the
>>>>>>>>>> update/extract handler, where the metadata is sent as query
>>>>>>>>>> parameters and the binary content is sent in the body of the
>>>>>>>>>> request, allowing Solr to use Tika to obtain the raw content to be
>>>>>>>>>> stored in Solr.
>>>>>>>>>> 
>>>>>>>>>> Regards
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Mon, Jun 16, 2014 at 4:35 PM, Matteo Grolla <[email protected]> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Hi. During my first indexing I noticed that Manifold uses the Solr
>>>>>>>>>>> extracting request handler to extract the content of an XML file.
>>>>>>>>>>> For performance reasons it would be better if Manifold handled the
>>>>>>>>>>> extraction, letting Solr do the search engine work.
>>>>>>>>>>> Is this because of the connector design, the framework design, or
>>>>>>>>>>> is it just still to be done?
>>>>>>>>>>> 
>>>>>>>>>>> --
>>>>>>>>>>> Matteo Grolla
>>>>>>>>>>> Sourcesense - making sense of Open Source
>>>>>>>>>>> http://www.sourcesense.com
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> --
>>>>>>>> --------------------------
>>>>>>>> 
>>>>>>>> Benedetti Alessandro
>>>>>>>> Visiting card : http://about.me/alessandro_benedetti
>>>>>>>> 
>>>>>>>> "Tyger, tyger burning bright
>>>>>>>> In the forests of the night,
>>>>>>>> What immortal hand or eye
>>>>>>>> Could frame thy fearful symmetry?"
>>>>>>>> 
>>>>>>>> William Blake - Songs of Experience -1794 England
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> --
>>>> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
>>>> Shinichiro Abe
>>>> 阿部 慎一朗
>>>> 
>> 
>>
