Re: Solr Extracting request handler

Shinichiro Abe Tue, 17 Jun 2014 19:26:19 -0700

Hi Karl,

> The entire lib directory is 85M:
You are correct. I'm sorry, the trunk size exceeded 1 GB because I had run 'ant javadoc', so 
there is no problem.


> I'd rather not make things more complicated than they need to be by adding
> a new required service
Ok. I understand.

Shinichiro Abe

On 2014/06/18, at 10:55, Karl Wright <[email protected]> wrote:

> Hi Abe-san,
> 
> Tika jars are not very big:
> 
> C:\wip\mcf\trunk\lib>dir tika*
> Volume in drive C has no label.
> Volume Serial Number is 002E-D1F0
> 
> Directory of C:\wip\mcf\trunk\lib
> 
> 06/05/2014  08:21 AM           493,374 tika-core.jar
> 06/05/2014  08:21 AM           523,677 tika-parsers.jar
>               2 File(s)      1,017,051 bytes
>               0 Dir(s)  140,792,315,904 bytes free
> 
> The entire lib directory is 85M:
> 
> 85,156,330 bytes
> 
> The built binary image is still about 185MB, I believe.  So I don't know
> why you think it is >1GB?  Temporary class files?  I don't think we can
> avoid those.
> 
> I'd rather not make things more complicated than they need to be by adding
> a new required service - even though it would fit naturally with the
> connector arrangement.
> 
> Karl
> 
> 
> 
> 
> 
> On Tue, Jun 17, 2014 at 9:42 PM, Shinichiro Abe <[email protected]>
> wrote:
> 
>> Hi Karl,
>> 
>> Okay, I had assumed the Tika connector outputs files.
>> If we post the character data and metadata obtained from Tika, the "/update/extract" handler
>> can handle this (by providing params:
>> literal.content=value&literal.metaField=foobar
>> while using a NullInputStream for the binary data, as in CONNECTORS-936).
>> 
>> BTW, the trunk build size is now too big (1G+). Maybe that is because the CloudSearch
>> connector uses Tika jars.
>> Perhaps the Tika connector and the CloudSearch connector should extract text via
>> tika-server[1],
>> so that MCF does not need to ship many Tika jars. What do you think?
>> 
>> [1]
>> http://wiki.apache.org/tika/TikaJAXRS
>> 
>> Thanks,
>> Shinichiro Abe
>> 
>> On 2014/06/18, at 9:45, Karl Wright <[email protected]> wrote:
>> 
>>> Hi Abe-san,
>>> 
>>> It sounds like you might be thinking that transformation connectors are
>>> like output connectors.  Just so we are clear, transformation connectors in
>>> 1.7 receive a RepositoryDocument as input, and then pass a
>>> RepositoryDocument on to the next connector in the chain.  So I don't know
>>> why .xml files would be involved.  I'd expect the Tika connector to read a
>>> binary file from one RepositoryDocument object and convert its contents to
>>> another RepositoryDocument object which would have character data and
>>> metadata only.  Would this work for your case, do you think?
>>> 
>>> Karl
>>> 
>>> 
>>> 
>>> On Tue, Jun 17, 2014 at 8:38 PM, Shinichiro Abe <[email protected]> wrote:
>>> 
>>>> Hi Karl,
>>>> 
>>>> Yes. I thought the standard update handler met that requirement.
>>>> For instance, the Tika extractor transformation connector would create two files:
>>>> 1. addtoSolr.xml for add and update
>>>> 2. deletetoSolr.xml for delete
>>>> The File connector would ingest these XML files, and then the Solr connector
>>>> would post them via the "/update" handler.
>>>> 
>>>> In the Solr Connector, update handler functionality other than the
>>>> "/update" handler might then not be necessary.
>>>> 
>>>> Thanks,
>>>> Shinichiro Abe
>>>> 
>>>> On 2014/06/18, at 8:02, Karl Wright <[email protected]> wrote:
>>>> 
>>>>> Hi Abe-san,
>>>>> 
>>>>> So just to be sure -- you believe that no changes at all are required to
>>>>> the Solr Connector as it stands now, other than to use the update handler
>>>>> rather than the /update/extract handler?
>>>>> 
>>>>> Karl
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On Tue, Jun 17, 2014 at 5:14 PM, Shinichiro Abe <[email protected]> wrote:
>>>>> 
>>>>>>> As for changing the Solr connector so that it doesn't go to the extracting
>>>>>>> update handler
>>>>>> 
>>>>>> I don't think we need to change the Solr connector by adding a new checkbox,
>>>>>> because currently we can change "/update/extract" to "/update" in the 'Update
>>>>>> Handler' field on the Paths tab of the Solr connector UI. I confirmed I could
>>>>>> post CSV, JSON and XML files to Solr by changing that and using the File
>>>>>> connector. So I wish we would allow the Tika extractor transformation
>>>>>> connector to create XML files that Solr expects to see.
>>>>>> 
>>>>>> Regards,
>>>>>> Shinichiro Abe
>>>>>> 
>>>>>> 
>>>>>> 2014-06-18 2:55 GMT+09:00 Karl Wright <[email protected]>:
>>>>>> 
>>>>>>> The pipeline code itself is now "complete" in trunk.  Zaizi said they'd
>>>>>>> contribute a Tika extractor transformation connector - and if they don't
>>>>>>> get around to that in a month or so, I may take a crack at it myself.
>>>>>>> 
>>>>>>> As for changing the Solr connector so that it doesn't go to the extracting
>>>>>>> update handler, it would be great if:
>>>>>>> (1) Someone created a ticket for this, and
>>>>>>> (2) A patch was provided that maintains backwards compatibility with
>>>>>>> previous versions of the connector (so a checkbox would probably need to go
>>>>>>> into the UI somewhere).  Do either of you want to start this process?
>>>>>>> 
>>>>>>> Thanks!
>>>>>>> Karl
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Mon, Jun 16, 2014 at 12:37 PM, Karl Wright <[email protected]> wrote:
>>>>>>> 
>>>>>>>> Hi guys,
>>>>>>>> 
>>>>>>>> You folks may not have looked at 1.7 yet, but it has a full pipeline, and
>>>>>>>> is expected to have a Tika extractor as a transformation connector.
>>>>>>>> 
>>>>>>>> Karl
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Mon, Jun 16, 2014 at 11:14 AM, Matteo Grolla <[email protected]> wrote:
>>>>>>>> 
>>>>>>>>> Thanks Alessandro,
>>>>>>>>>      that explains the situation clearly.
>>>>>>>>> And I agree that sending all the metadata as GET parameters can be
>>>>>>>>> problematic.
>>>>>>>>> 
>>>>>>>>> Cheers
>>>>>>>>> 
>>>>>>>>> --
>>>>>>>>> Matteo Grolla
>>>>>>>>> Sourcesense - making sense of Open Source
>>>>>>>>> http://www.sourcesense.com
>>>>>>>>> 
>>>>>>>>> On 16 Jun 2014, at 17:09, Alessandro Benedetti wrote:
>>>>>>>>> 
>>>>>>>>>> Hmm, the point is that right now ManifoldCF has no extractors.
>>>>>>>>>> The Repository connectors extract the binary directly, and there is no
>>>>>>>>>> "Extractor Processor" yet.
>>>>>>>>>> But recently a pipeline processor architecture has been proposed (
>>>>>>>>>> https://issues.apache.org/jira/browse/CONNECTORS-959),
>>>>>>>>>> so an extractor could fit there.
>>>>>>>>>> 
>>>>>>>>>> Cheers
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 2014-06-16 15:59 GMT+01:00 Matteo Grolla <[email protected]>:
>>>>>>>>>> 
>>>>>>>>>>> Since the Solr extracting request handler takes the binary and extracts the
>>>>>>>>>>> text, what is the point of not using the Manifold extractor and sending text
>>>>>>>>>>> and binaries to Solr?
>>>>>>>>>>> I mean, the end result is the same: Solr indexes text and stores text.
>>>>>>>>>>> So if Manifold supports text extraction, it seems to me this is the place
>>>>>>>>>>> where it should be done.
>>>>>>>>>>> 
>>>>>>>>>>> --
>>>>>>>>>>> Matteo Grolla
>>>>>>>>>>> Sourcesense - making sense of Open Source
>>>>>>>>>>> http://www.sourcesense.com
>>>>>>>>>>> 
>>>>>>>>>>> On 16 Jun 2014, at 16:51, Antonio David Perez Morales wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Hi Matteo
>>>>>>>>>>>> 
>>>>>>>>>>>> Manifold already handles the extraction, but the only way to send binary
>>>>>>>>>>>> content and document metadata to Solr is using the update/extract handler,
>>>>>>>>>>>> where the metadata is sent as query parameters and the binary content is
>>>>>>>>>>>> sent in the body of the requests, allowing Solr to use Tika to obtain the
>>>>>>>>>>>> raw content to be stored in Solr.
>>>>>>>>>>>> 
>>>>>>>>>>>> Regards
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> On Mon, Jun 16, 2014 at 4:35 PM, Matteo Grolla <[email protected]> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Hi. During my first indexing I noticed that Manifold uses the Solr extracting
>>>>>>>>>>>>> request handler to extract the content of an XML file.
>>>>>>>>>>>>> For performance reasons it would be better if Manifold handled the
>>>>>>>>>>>>> extraction, letting Solr act as the search engine.
>>>>>>>>>>>>> Is this because of the connector design, the framework design, or is it
>>>>>>>>>>>>> just still to be done?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Matteo Grolla
>>>>>>>>>>>>> Sourcesense - making sense of Open Source
>>>>>>>>>>>>> http://www.sourcesense.com
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> --
>>>>>>>>>> --------------------------
>>>>>>>>>> 
>>>>>>>>>> Benedetti Alessandro
>>>>>>>>>> Visiting card : http://about.me/alessandro_benedetti
>>>>>>>>>> 
>>>>>>>>>> "Tyger, tyger burning bright
>>>>>>>>>> In the forests of the night,
>>>>>>>>>> What immortal hand or eye
>>>>>>>>>> Could frame thy fearful symmetry?"
>>>>>>>>>> 
>>>>>>>>>> William Blake - Songs of Experience -1794 England
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
>>>>>> Shinichiro Abe
>>>>>> 阿部 慎一朗
>>>>>> 
>>>> 
>>>> 
>> 
>>
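
For readers following the thread, the two Solr posting styles discussed above can be sketched side by side. This is only a minimal illustration and not ManifoldCF code: the Solr base URL, the core name, the field names, and both helper functions (extract_url and add_document_xml) are hypothetical. The first helper builds a request URL for the extracting handler with metadata passed as literal.* query parameters; the second builds an XML add document of the kind the plain "/update" handler accepts.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET


def extract_url(base, metadata):
    """Build an /update/extract URL with metadata as literal.* query parameters.

    `base` is a hypothetical Solr core URL; the document content itself would
    still have to be sent in the request body.
    """
    params = [("literal." + name, value) for name, value in metadata.items()]
    return base + "/update/extract?" + urlencode(params)


def add_document_xml(fields):
    """Build an XML <add> document for the plain /update handler.

    All values travel in the request body, so nothing is passed as a
    query parameter.
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", name=name)
        field.text = value
    return ET.tostring(add, encoding="unicode")


if __name__ == "__main__":
    base = "http://localhost:8983/solr/collection1"  # assumed URL and core name
    fields = {"id": "doc1", "content": "text extracted by Tika"}
    print(extract_url(base, fields))
    print(add_document_xml(fields))
```

With the second style, already extracted text and metadata go in the body of the request, which avoids the concern raised earlier in the thread about sending all metadata as GET parameters.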
