Hi Alessandro,
ideally, I think text extraction from rich documents should be
Manifold's responsibility, not Solr's.
So the ideal place to implement it would be the new document processing
pipeline (using Tika).
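
To make the difference concrete, here is a rough sketch in Python (standard
library only) of the two request shapes discussed in this thread. The core URL,
the function names and the sample field values are made up for illustration.
The literal.<field> parameters and the "content" field follow what Shinichiro
described for the extract handler, and I am assuming the plain update handler
is sent a JSON list of documents. The sketch only builds the requests and never
talks to Solr.

```python
import json
from urllib.parse import urlencode

# Hypothetical Solr core URL, for illustration only.
SOLR_BASE = "http://localhost:8983/solr/collection1"


def extract_request(metadata, binary):
    """Request shape for the extracting handler: metadata goes into
    literal.<field> query parameters, the raw binary goes into the body."""
    params = {"literal." + name: value for name, value in metadata.items()}
    url = SOLR_BASE + "/update/extract?" + urlencode(params)
    return url, binary


def update_request(metadata, text):
    """Request shape for the plain update handler once the text has been
    extracted upstream: one JSON document with metadata and content."""
    doc = dict(metadata)
    doc["content"] = text  # assumed field name for the extracted text
    url = SOLR_BASE + "/update"
    body = json.dumps([doc]).encode("utf-8")
    return url, body


metadata = {"id": "doc-1", "title": "Quarterly report"}
extract_url, extract_body = extract_request(metadata, b"%PDF-1.4 ...")
update_url, update_body = update_request(metadata, "text extracted by Tika")
```

With extraction done in the pipeline, the metadata travels in the request body
together with the extracted text instead of in the query string.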
--
Matteo Grolla
Sourcesense - making sense of Open Source
http://www.sourcesense.com
On 18 Jun 2014, at 16:16, Alessandro Benedetti wrote:
> Hello Karl,
> What I was thinking is:
> assuming we have the Tika Connector, the responsibility for extracting content
> will pass from Solr to the Tika processor.
>
> So we can change the part of the Solr Connector that builds
> the request sent to the Extract update handler.
> In particular, that part will change to the classic approach: usually it's good to
> build a SolrDocument in SolrJ and then add it to the SolrServer.
>
> Why should we provide backwards compatibility from the Solr Connector's point of view?
> From the user's point of view, a Job will be set up with the Tika Connector
> in the pipeline, so we are providing exactly the same feature.
> One option is to put the Tika Processor Connector in the pipeline by
> default, so that someone can deactivate it only if needed.
>
> Cheers
>
>
>
> 2014-06-18 14:32 GMT+01:00 Karl Wright <[email protected]>:
>
>> Hi Alessandro,
>> What is your concrete proposal to change the Solr connector? Bear in mind
>> that we do need to maintain backwards compatibility. If you list your
>> specific changes, not in any huge detail, but with enough detail that we
>> understand your proposal, that would help. What happens to the UI? What
>> happens to the internals?
>>
>> Thanks,
>> Karl
>>
>>
>>
>> On Wed, Jun 18, 2014 at 9:21 AM, Alessandro Benedetti <
>> [email protected]> wrote:
>>
>>> But guys, why not simply switch to classic SolrJ SolrDocument creation and
>>> ingestion into the Solr server? Easy and straightforward!
>>>
>>> In the end, at that point the RepositoryDocument will be only a Map of
>>> metadata and values.
>>> Content will be part of that, so I guess the conversion to a SolrDocument
>>> will be immediate.
>>>
>>> Cheers
>>>
>>>
>>> 2014-06-18 3:26 GMT+01:00 Karl Wright <[email protected]>:
>>>
>>>> Hi Abe-san,
>>>>
>>>> Near as I can tell, the major consumer of disk space is the Maven target
>>>> directories. These generate many tens of megabytes of temporary disk
>>>> usage for every connector. Luckily, if you use ant, this is not a problem.
>>>>
>>>> Karl
>>>>
>>>>
>>>> On Tue, Jun 17, 2014 at 9:55 PM, Karl Wright <[email protected]>
>> wrote:
>>>>
>>>>> Hi Abe-san,
>>>>>
>>>>> Tika jars are not very big:
>>>>>
>>>>> C:\wip\mcf\trunk\lib>dir tika*
>>>>> Volume in drive C has no label.
>>>>> Volume Serial Number is 002E-D1F0
>>>>>
>>>>> Directory of C:\wip\mcf\trunk\lib
>>>>>
>>>>> 06/05/2014 08:21 AM 493,374 tika-core.jar
>>>>> 06/05/2014 08:21 AM 523,677 tika-parsers.jar
>>>>> 2 File(s) 1,017,051 bytes
>>>>> 0 Dir(s) 140,792,315,904 bytes free
>>>>>
>>>>> The entire lib directory is 85M:
>>>>>
>>>>> 85,156,330 bytes
>>>>>
>>>>> The built binary image is still about 185 MB, I believe. So I don't know
>>>>> why you think it is >1 GB? Temporary class files? I don't think we can
>>>>> avoid those.
>>>>>
>>>>> I'd rather not make things more complicated than they need to be by adding
>>>>> a new required service - even though it would fit naturally with the
>>>>> connector arrangement.
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Jun 17, 2014 at 9:42 PM, Shinichiro Abe <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Hi Karl,
>>>>>>
>>>>>> Okay, I assumed the Tika connector outputs files.
>>>>>> If we post the character data and metadata obtained from Tika, the
>>>>>> "/update/extract" handler can handle this (by providing params such as
>>>>>> literal.content=value&literal.metaField=foobar
>>>>>> and using a NullInputStream for the binary data, as in CONNECTORS-936).
>>>>>>
>>>>>> BTW, the trunk build size is now too big (1G+). Maybe that is because the
>>>>>> CloudSearch connector uses Tika jars.
>>>>>> The Tika connector and CloudSearch connector should extract text via
>>>>>> tika-server[1]
>>>>>> and MCF should not carry many Tika jars, don't you think?
>>>>>>
>>>>>> [1]
>>>>>> http://wiki.apache.org/tika/TikaJAXRS
>>>>>>
>>>>>> Thanks,
>>>>>> Shinichiro Abe
>>>>>>
>>>>>> On 2014/06/18, at 9:45, Karl Wright <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Abe-san,
>>>>>>>
>>>>>>> It sounds like you might be thinking that transformation connectors are
>>>>>>> like output connectors. Just so we are clear, transformation connectors in
>>>>>>> 1.7 receive a RepositoryDocument as input, and then pass a
>>>>>>> RepositoryDocument on to the next connector in the chain. So I don't know
>>>>>>> why .xml files would be involved. I'd expect the Tika connector to read a
>>>>>>> binary file from one RepositoryDocument object and convert its contents to
>>>>>>> another RepositoryDocument object which would have character data and
>>>>>>> metadata only. Would this work for your case, do you think?
>>>>>>>
>>>>>>> Karl
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Jun 17, 2014 at 8:38 PM, Shinichiro Abe <
>>>>>> [email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Karl,
>>>>>>>>
>>>>>>>> Yes. I thought the standard update handler met that requirement.
>>>>>>>> For instance, the Tika extractor transformation connector creates two files:
>>>>>>>> 1. addtoSolr.xml for add and update
>>>>>>>> 2. deletetoSolr.xml for delete
>>>>>>>> The File connector ingests these XML files, then the Solr connector posts
>>>>>>>> them via the "/update" handler.
>>>>>>>>
>>>>>>>> In the Solr Connector, no update handler functionality other than
>>>>>>>> the "/update" handler might be necessary.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Shinichiro Abe
>>>>>>>>
>>>>>>>> On 2014/06/18, at 8:02, Karl Wright <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi Abe-san,
>>>>>>>>>
>>>>>>>>> So just to be sure -- you believe that no changes at all are required to
>>>>>>>>> the Solr Connector as it stands now, other than to use the update handler
>>>>>>>>> rather than the /update/extract handler?
>>>>>>>>>
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Jun 17, 2014 at 5:14 PM, Shinichiro Abe <
>>>>>>>> [email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>>> As for changing the Solr connector so that it doesn't go to
>> the
>>>>>>>> extracting
>>>>>>>>>> update handler
>>>>>>>>>>
>>>>>>>>>> I don't think we need to change the Solr connector with a new checkbox,
>>>>>>>>>> because currently we can change "/update/extract" into "/update" under
>>>>>>>>>> 'Update Handler' on the Paths tab in the Solr connector UI. I confirmed I
>>>>>>>>>> could post CSV, JSON and XML files to Solr by changing that and using the
>>>>>>>>>> File connector. So I wish we would allow the Tika extractor transformation
>>>>>>>>>> connector to create XML files that Solr expects to see.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Shinichiro Abe
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> 2014-06-18 2:55 GMT+09:00 Karl Wright <[email protected]>:
>>>>>>>>>>
>>>>>>>>>>> The pipeline code itself is now "complete" in trunk. Zaizi said they'd
>>>>>>>>>>> contribute a Tika extractor transformation connector - and if they don't
>>>>>>>>>>> get around to that in a month or so, I may take a crack at it myself.
>>>>>>>>>>>
>>>>>>>>>>> As for changing the Solr connector so that it doesn't go to the extracting
>>>>>>>>>>> update handler, it would be great if:
>>>>>>>>>>> (1) Someone created a ticket for this, and
>>>>>>>>>>> (2) A patch was provided that maintains backwards compatibility with
>>>>>>>>>>> previous versions of the connector (so a checkbox would probably need to go
>>>>>>>>>>> into the UI somewhere). Do either of you want to start this process?
>>>>>>>>>>>
>>>>>>>>>>> Thanks!
>>>>>>>>>>> Karl
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Jun 16, 2014 at 12:37 PM, Karl Wright <
>>> [email protected]
>>>>>
>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi guys,
>>>>>>>>>>>>
>>>>>>>>>>>> You folks may not have looked at 1.7 yet, but it has a full pipeline, and
>>>>>>>>>>>> is expected to have a Tika extractor as a transformation connector.
>>>>>>>>>>>>
>>>>>>>>>>>> Karl
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Jun 16, 2014 at 11:14 AM, Matteo Grolla <
>>>>>>>>>>> [email protected]>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks Alessandro,
>>>>>>>>>>>>> that explains the situation clearly.
>>>>>>>>>>>>> And I agree that sending all the metadata as GET parameters can be
>>>>>>>>>>>>> problematic.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Cheers
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 16 Jun 2014, at 17:09, Alessandro Benedetti wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Mmm, the point is that right now ManifoldCF has no extractors.
>>>>>>>>>>>>>> The Repository connectors extract the binary directly, and there is no
>>>>>>>>>>>>>> "Extractor Processor" yet.
>>>>>>>>>>>>>> But recently a pipeline processor architecture has been designed
>>>>>>>>>>>>>> (https://issues.apache.org/jira/browse/CONNECTORS-959),
>>>>>>>>>>>>>> so it can fit there.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Cheers
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2014-06-16 15:59 GMT+01:00 Matteo Grolla <
>>>>>> [email protected]
>>>>>>>>>>> :
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Since the Solr extracting request handler takes the binary and extracts text,
>>>>>>>>>>>>>>> what is the point of not using a Manifold extractor and sending text and
>>>>>>>>>>>>>>> binaries to Solr?
>>>>>>>>>>>>>>> I mean, the end result is the same: Solr indexes text and stores text.
>>>>>>>>>>>>>>> So if Manifold supports text extraction, it seems to me this is the place
>>>>>>>>>>>>>>> where it should be done.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 16 Jun 2014, at 16:51, Antonio David Perez Morales wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Matteo
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Manifold already handles the extraction, but the only way to send binary
>>>>>>>>>>>>>>>> content and document metadata to Solr is using the update/extract handler,
>>>>>>>>>>>>>>>> where the metadata is sent as query parameters and the binary content is
>>>>>>>>>>>>>>>> sent in the body of the request, allowing Solr to use Tika to obtain the
>>>>>>>>>>>>>>>> raw content to be stored in Solr.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Mon, Jun 16, 2014 at 4:35 PM, Matteo Grolla <
>>>>>>>>>>>>> [email protected]
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>> During my first indexing run I noticed that Manifold uses the Solr
>>>>>>>>>>>>>>>>> extracting request handler to extract the content of an XML file.
>>>>>>>>>>>>>>>>> For performance reasons it would be better if Manifold handled the
>>>>>>>>>>>>>>>>> extraction, letting Solr act purely as the search engine.
>>>>>>>>>>>>>>>>> Is this because of the connector design, the framework design, or is it
>>>>>>>>>>>>>>>>> just something still to be done?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> --------------------------
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Benedetti Alessandro
>>>>>>>>>>>>>> Visiting card : http://about.me/alessandro_benedetti
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> "Tyger, tyger burning bright
>>>>>>>>>>>>>> In the forests of the night,
>>>>>>>>>>>>>> What immortal hand or eye
>>>>>>>>>>>>>> Could frame thy fearful symmetry?"
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> William Blake - Songs of Experience -1794 England
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
>>>>>>>>>> Shinichiro Abe
>>>>>>>>>> 阿部 慎一朗
>>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>>
>>
>
>
>