Hi Karl,
I am going ahead with modifying the Solr connector, introducing a new flag
that will control the operating mode:
1) using the Extract Update handler (as it does right now)
2) using a SolrInputDocument and the classic SolrJ add.

I will expose the flag as a checkbox, as we did for "keepAllMetadata".
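
To make the two modes concrete, here is a rough sketch of how one flag could select between them. It is only an illustration in Python, not the connector code: the function names, the `use_extract_handler` flag, the `content` field name and the two handler paths are assumptions made for the example. In the connector itself, mode 2 would build a SolrInputDocument and add it through SolrJ.

```python
from urllib.parse import urlencode

# Assumed handler paths; in the connector these come from its configuration.
EXTRACT_HANDLER = "/update/extract"
UPDATE_HANDLER = "/update"


def build_extract_request(metadata, binary):
    """Mode 1 (legacy): metadata as literal.* query parameters, binary as body."""
    params = [
        ("literal." + name, value)
        for name, values in metadata.items()
        for value in values
    ]
    return {"path": EXTRACT_HANDLER + "?" + urlencode(params), "body": binary}


def build_document_request(metadata, text, content_field="content"):
    """Mode 2: one plain document holding metadata plus already-extracted text."""
    document = {name: list(values) for name, values in metadata.items()}
    document[content_field] = [text]  # hypothetical field name for the text
    return {"path": UPDATE_HANDLER, "body": document}


def build_request(use_extract_handler, metadata, binary=b"", text=""):
    """Choose the mode from the checkbox value; True keeps the legacy behaviour."""
    if use_extract_handler:
        return build_extract_request(metadata, binary)
    return build_document_request(metadata, text)
```

Defaulting the flag to the legacy mode would keep existing connections working after an upgrade, which is the backwards compatibility constraint discussed in this thread.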

Is there already an issue for that, Karl?
Let me know!

Cheers


2014-06-18 16:10 GMT+01:00 Karl Wright <[email protected]>:

> Hi Alessandro,
>
> The reason for backwards compatibility is obvious: people upgrade
> ManifoldCF all the time, and when they do it should not stop working for
> them.
>
> Putting Tika all the time in the pipeline is also not appropriate for other
> output connections.  Even if you did it just for Solr, you'd then have to
> ensure that the Tika transformer was exactly compatible with Solr Cell,
> which I would be very uncomfortable with agreeing to.
>
> So let's presume that you'd do one of two things.  Either:
>
> - Leave the existing Solr connector alone, and create a whole new Solr
> connector designed to work with a Tika transformer, or
> - Modify the existing Solr connector so that it operates in two possible
> modes, one of which supports the legacy model (the default), and one of
> which supports your new model
>
> If this sounds overly burdensome, I'm sorry but it's necessary until MCF
> 2.0.  For MCF 2.0, which I've begun to think about, we can dispense with
> backwards compatibility, including legacy tabs that have outlived their
> usefulness, etc.  But that's not a 1.7 solution.
>
> Karl
>
>
>
> On Wed, Jun 18, 2014 at 10:16 AM, Alessandro Benedetti <
> [email protected]> wrote:
>
> > Hello Karl,
> > What I was thinking is:
> > assuming we have the Tika connector, the responsibility for extracting
> > content will pass from Solr to the Tika processor.
> >
> > So we can change the part of the Solr connector that builds the request
> > sent to the Extract update handler.
> > In particular, that part would change to the classic approach: build a
> > SolrDocument in SolrJ and then add it to the SolrServer.
> >
> > Why should we provide backwards compatibility from the Solr connector's
> > point of view? From the user's point of view, a job will be selected with
> > the Tika connector in the pipeline, so we are providing the identical
> > feature.
> > One way would be to put the Tika processor connector in the pipeline by
> > default, so that someone can deactivate it only if needed.
> >
> > Cheers
> >
> >
> >
> > 2014-06-18 14:32 GMT+01:00 Karl Wright <[email protected]>:
> >
> > > Hi Alessandro,
> > > What is your concrete proposal to change the Solr connector?  Bear in
> > > mind that we do need to maintain backwards compatibility.  If you list
> > > your specific changes, not in any huge detail, but with enough detail
> > > that we understand your proposal, that would help.  What happens to the
> > > UI?  What happens to the internals?
> > >
> > > Thanks,
> > > Karl
> > >
> > >
> > >
> > > On Wed, Jun 18, 2014 at 9:21 AM, Alessandro Benedetti <
> > > [email protected]> wrote:
> > >
> > > > But guys, why not simply switch to classic SolrJ SolrDocument creation
> > > > and ingestion into the Solr server? Easy and straightforward!
> > > >
> > > > In the end, at that point the RepositoryDocument will be only a Map of
> > > > metadata and values.
> > > > Content will be part of that, so I guess the conversion to a
> > > > SolrDocument will be immediate.
> > > >
> > > > Cheers
> > > >
> > > >
> > > > 2014-06-18 3:26 GMT+01:00 Karl Wright <[email protected]>:
> > > >
> > > > > Hi Abe-san,
> > > > >
> > > > > Near as I can tell, the major consumer of disk space is the Maven
> > > > > target directories.  This is generating many tens of megabytes of
> > > > > temporary disk usage for every connector.  Luckily if you use ant,
> > > > > this is not a problem.
> > > > >
> > > > > Karl
> > > > >
> > > > >
> > > > > On Tue, Jun 17, 2014 at 9:55 PM, Karl Wright <[email protected]>
> > > wrote:
> > > > >
> > > > > > Hi Abe-san,
> > > > > >
> > > > > > Tika jars are not very big:
> > > > > >
> > > > > > C:\wip\mcf\trunk\lib>dir tika*
> > > > > >  Volume in drive C has no label.
> > > > > >  Volume Serial Number is 002E-D1F0
> > > > > >
> > > > > >  Directory of C:\wip\mcf\trunk\lib
> > > > > >
> > > > > > 06/05/2014  08:21 AM           493,374 tika-core.jar
> > > > > > 06/05/2014  08:21 AM           523,677 tika-parsers.jar
> > > > > >                2 File(s)      1,017,051 bytes
> > > > > >                0 Dir(s)  140,792,315,904 bytes free
> > > > > >
> > > > > > The entire lib directory is 85M:
> > > > > >
> > > > > > 85,156,330 bytes
> > > > > >
> > > > > > The built binary image is still about 185Mb, I believe.  So I
> > > > > > don't know why you think it is >1Gb?  Temporary class files?  I
> > > > > > don't think we can avoid those.
> > > > > >
> > > > > > I'd rather not make things more complicated than they need to be
> > > > > > by adding a new required service - even though it would fit
> > > > > > naturally with the connector arrangement.
> > > > > >
> > > > > > Karl
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Tue, Jun 17, 2014 at 9:42 PM, Shinichiro Abe <
> > > > > > [email protected]> wrote:
> > > > > >
> > > > > >> Hi Karl,
> > > > > >>
> > > > > > >> Okay, I assumed the Tika connector outputs files.
> > > > > > >> If we post the character data and metadata obtained from Tika, the
> > > > > > >> "/update/extract" handler can handle this (given the params
> > > > > > >> literal.content=value&literal.metaField=foobar
> > > > > > >> and a NullInputStream for the binary data, as in CONNECTORS-936).
> > > > > > >>
> > > > > > >> BTW, the trunk build size is now too big (1G+). Maybe because the
> > > > > > >> CloudSearch connector uses Tika jars.
> > > > > > >> The Tika connector and the CloudSearch connector should extract
> > > > > > >> text via tika-server[1], and MCF should not carry many Tika jars,
> > > > > > >> don't you think?
> > > > > >>
> > > > > >> [1]
> > > > > >> http://wiki.apache.org/tika/TikaJAXRS
> > > > > >>
> > > > > >> Thanks,
> > > > > >> Shinichiro Abe
> > > > > >>
> > > > > >> On 2014/06/18, at 9:45, Karl Wright <[email protected]> wrote:
> > > > > >>
> > > > > >> > Hi Abe-san,
> > > > > >> >
> > > > > > >> > It sounds like you might be thinking that transformation
> > > > > > >> > connectors are like output connectors.  Just so we are clear,
> > > > > > >> > transformation connectors in 1.7 receive a RepositoryDocument
> > > > > > >> > as input, and then pass a RepositoryDocument on to the next
> > > > > > >> > connector in the chain.  So I don't know why .xml files would
> > > > > > >> > be involved.  I'd expect the Tika connector to read a binary
> > > > > > >> > file from one RepositoryDocument object and convert its
> > > > > > >> > contents to another RepositoryDocument object which would have
> > > > > > >> > character data and metadata only.  Would this work for your
> > > > > > >> > case, do you think?
> > > > > >> >
> > > > > >> > Karl
> > > > > >> >
> > > > > >> >
> > > > > >> >
> > > > > >> > On Tue, Jun 17, 2014 at 8:38 PM, Shinichiro Abe <
> > > > > >> [email protected]>
> > > > > >> > wrote:
> > > > > >> >
> > > > > >> >> Hi Karl,
> > > > > >> >>
> > > > > > >> >> Yes. I thought the standard update handler met that requirement.
> > > > > > >> >> For instance, the Tika extractor transformation connector
> > > > > > >> >> creates two files:
> > > > > > >> >> 1. addtoSolr.xml for add and update
> > > > > > >> >> 2. deletetoSolr.xml for delete
> > > > > > >> >> The File connector ingests these XML files, then the Solr
> > > > > > >> >> connector posts them via the "/update" handler.
> > > > > > >> >>
> > > > > > >> >> In the Solr connector, no update handler functionality other
> > > > > > >> >> than the "/update" handler might be necessary.
> > > > > >> >>
> > > > > >> >> Thanks,
> > > > > >> >> Shinichiro Abe
> > > > > >> >>
> > > > > >> >> On 2014/06/18, at 8:02, Karl Wright <[email protected]>
> > wrote:
> > > > > >> >>
> > > > > >> >>> Hi Abe-san,
> > > > > >> >>>
> > > > > > >> >>> So just to be sure -- you believe that no changes at all are
> > > > > > >> >>> required to the Solr Connector as it stands now, other than to
> > > > > > >> >>> use the update handler rather than the /update/extract handler?
> > > > > >> >>>
> > > > > >> >>> Karl
> > > > > >> >>>
> > > > > >> >>>
> > > > > >> >>>
> > > > > >> >>>
> > > > > >> >>>
> > > > > >> >>> On Tue, Jun 17, 2014 at 5:14 PM, Shinichiro Abe <
> > > > > >> >> [email protected]>
> > > > > >> >>> wrote:
> > > > > >> >>>
> > > > > > >> >>>>> As for changing the Solr connector so that it doesn't go to
> > > > > > >> >>>>> the extracting update handler
> > > > > > >> >>>>
> > > > > > >> >>>> I don't think the Solr connector needs to change with a new
> > > > > > >> >>>> checkbox, because currently we can change "/update/extract"
> > > > > > >> >>>> into "/update" in 'Update Handler' on the Paths tab of the
> > > > > > >> >>>> Solr connector UI. I confirmed I could post CSV, JSON and XML
> > > > > > >> >>>> files to Solr by changing that and using the File connector.
> > > > > > >> >>>> So I wish we would allow the Tika extractor transformation
> > > > > > >> >>>> connector to create the XML files that Solr expects to see.
> > > > > >> >>>>
> > > > > >> >>>> Regards,
> > > > > >> >>>> Shinichiro Abe
> > > > > >> >>>>
> > > > > >> >>>>
> > > > > >> >>>> 2014-06-18 2:55 GMT+09:00 Karl Wright <[email protected]
> >:
> > > > > >> >>>>
> > > > > > >> >>>>> The pipeline code itself is now "complete" in trunk.  Zaizi
> > > > > > >> >>>>> said they'd contribute a Tika extractor transformation
> > > > > > >> >>>>> connector - and if they don't get around to that in a month
> > > > > > >> >>>>> or so, I may take a crack at it myself.
> > > > > > >> >>>>>
> > > > > > >> >>>>> As for changing the Solr connector so that it doesn't go to
> > > > > > >> >>>>> the extracting update handler, it would be great if:
> > > > > > >> >>>>> (1) Someone created a ticket for this, and
> > > > > > >> >>>>> (2) A patch was provided that maintains backwards
> > > > > > >> >>>>> compatibility with previous versions of the connector (so a
> > > > > > >> >>>>> checkbox would probably need to go into the UI somewhere).
> > > > > > >> >>>>> Do either of you want to start this process?
> > > > > >> >>>>>
> > > > > >> >>>>> Thanks!
> > > > > >> >>>>> Karl
> > > > > >> >>>>>
> > > > > >> >>>>>
> > > > > >> >>>>>
> > > > > >> >>>>> On Mon, Jun 16, 2014 at 12:37 PM, Karl Wright <
> > > > [email protected]
> > > > > >
> > > > > >> >>>> wrote:
> > > > > >> >>>>>
> > > > > >> >>>>>> Hi guys,
> > > > > >> >>>>>>
> > > > > > >> >>>>>> You folks may not have looked at 1.7 yet, but it has a full
> > > > > > >> >>>>>> pipeline, and is expected to have a Tika extractor as a
> > > > > > >> >>>>>> transformation connector.
> > > > > >> >>>>>>
> > > > > >> >>>>>> Karl
> > > > > >> >>>>>>
> > > > > >> >>>>>>
> > > > > >> >>>>>>
> > > > > >> >>>>>> On Mon, Jun 16, 2014 at 11:14 AM, Matteo Grolla <
> > > > > >> >>>>> [email protected]>
> > > > > >> >>>>>> wrote:
> > > > > >> >>>>>>
> > > > > >> >>>>>>> Thanks Alessandro,
> > > > > >> >>>>>>>       that explains the situation clearly.
> > > > > > >> >>>>>>> And I agree that sending all the metadata as GET parameters
> > > > > > >> >>>>>>> can be problematic.
> > > > > >> >>>>>>>
> > > > > >> >>>>>>> Cheers
> > > > > >> >>>>>>>
> > > > > >> >>>>>>> --
> > > > > >> >>>>>>> Matteo Grolla
> > > > > >> >>>>>>> Sourcesense - making sense of Open Source
> > > > > >> >>>>>>> http://www.sourcesense.com
> > > > > >> >>>>>>>
> > > > > > >> >>>>>>> On 16 Jun 2014, at 17:09, Alessandro Benedetti wrote:
> > > > > >> >>>>>>>
> > > > > > >> >>>>>>>> Hmm, the point is that right now ManifoldCF has no extractors.
> > > > > > >> >>>>>>>> The repository connectors extract the binary directly and
> > > > > > >> >>>>>>>> there is no "Extractor Processor" yet.
> > > > > > >> >>>>>>>> But recently a pipeline processor architecture has been
> > > > > > >> >>>>>>>> proposed
> > > > > > >> >>>>>>>> (https://issues.apache.org/jira/browse/CONNECTORS-959),
> > > > > > >> >>>>>>>> so it can fit there.
> > > > > >> >>>>>>>>
> > > > > >> >>>>>>>> Cheers
> > > > > >> >>>>>>>>
> > > > > >> >>>>>>>>
> > > > > >> >>>>>>>> 2014-06-16 15:59 GMT+01:00 Matteo Grolla <
> > > > > >> [email protected]
> > > > > >> >>>>> :
> > > > > >> >>>>>>>>
> > > > > > >> >>>>>>>>> Since the Solr extracting request handler takes the binary
> > > > > > >> >>>>>>>>> and extracts text, what is the point of not using the
> > > > > > >> >>>>>>>>> Manifold extractor and sending text and binaries to Solr?
> > > > > > >> >>>>>>>>> I mean the end result is the same: Solr indexes text and
> > > > > > >> >>>>>>>>> stores text.
> > > > > > >> >>>>>>>>> So if Manifold supports text extraction, it seems to me
> > > > > > >> >>>>>>>>> this is the place where it should be done.
> > > > > >> >>>>>>>>>
> > > > > >> >>>>>>>>> --
> > > > > >> >>>>>>>>> Matteo Grolla
> > > > > >> >>>>>>>>> Sourcesense - making sense of Open Source
> > > > > >> >>>>>>>>> http://www.sourcesense.com
> > > > > >> >>>>>>>>>
> > > > > > >> >>>>>>>>> On 16 Jun 2014, at 16:51, Antonio David Perez Morales wrote:
> > > > > >> >>>>>>>>>
> > > > > >> >>>>>>>>>> Hi Matteo
> > > > > >> >>>>>>>>>>
> > > > > > >> >>>>>>>>>> Manifold already handles the extraction, but the only way
> > > > > > >> >>>>>>>>>> to send binary content and document metadata to Solr is
> > > > > > >> >>>>>>>>>> using the update/extract handler, where the metadata is
> > > > > > >> >>>>>>>>>> sent as query parameters and the binary content is sent in
> > > > > > >> >>>>>>>>>> the body of the request, allowing Solr to use Tika to
> > > > > > >> >>>>>>>>>> obtain the raw content to be stored in Solr.
> > > > > >> >>>>>>>>>>
> > > > > >> >>>>>>>>>> Regards
> > > > > >> >>>>>>>>>>
> > > > > >> >>>>>>>>>>
> > > > > >> >>>>>>>>>> On Mon, Jun 16, 2014 at 4:35 PM, Matteo Grolla <
> > > > > >> >>>>>>> [email protected]
> > > > > >> >>>>>>>>>>
> > > > > >> >>>>>>>>>> wrote:
> > > > > >> >>>>>>>>>>
> > > > > > >> >>>>>>>>>>> Hi. During my first indexing I noticed that Manifold uses
> > > > > > >> >>>>>>>>>>> the Solr extracting request handler to extract the
> > > > > > >> >>>>>>>>>>> content of an XML file.
> > > > > > >> >>>>>>>>>>> For performance reasons it would be better if Manifold
> > > > > > >> >>>>>>>>>>> handled the extraction, letting Solr do the search engine
> > > > > > >> >>>>>>>>>>> work.
> > > > > > >> >>>>>>>>>>> Is this because of the connector design, the framework
> > > > > > >> >>>>>>>>>>> design, or is it just still to be done?
> > > > > >> >>>>>>>>>>>
> > > > > >> >>>>>>>>>>> --
> > > > > >> >>>>>>>>>>> Matteo Grolla
> > > > > >> >>>>>>>>>>> Sourcesense - making sense of Open Source
> > > > > >> >>>>>>>>>>> http://www.sourcesense.com
> > > > > >> >>>>>>>>>>>
> > > > > >> >>>>>>>>>>>
> > > > > >> >>>>>>>>>>
> > > > > >> >>>>>>>>>> --
> > > > > >> >>>>>>>>>>
> > > > > >> >>>>>>>>>> ------------------------------
> > > > > >> >>>>>>>>>> This message should be regarded as confidential. If
> you
> > > > have
> > > > > >> >>>>> received
> > > > > >> >>>>>>>>> this
> > > > > >> >>>>>>>>>> email in error please notify the sender and destroy
> it
> > > > > >> >>>> immediately.
> > > > > >> >>>>>>>>>> Statements of intent shall only become binding when
> > > > confirmed
> > > > > >> in
> > > > > >> >>>>> hard
> > > > > >> >>>>>>>>> copy
> > > > > >> >>>>>>>>>> by an authorised signatory.
> > > > > >> >>>>>>>>>>
> > > > > >> >>>>>>>>>> Zaizi Ltd is registered in England and Wales with the
> > > > > >> registration
> > > > > >> >>>>>>> number
> > > > > >> >>>>>>>>>> 6440931. The Registered Office is Brook House, 229
> > > > Shepherds
> > > > > >> Bush
> > > > > >> >>>>>>> Road,
> > > > > >> >>>>>>>>>> London W6 7AN.
> > > > > >> >>>>>>>>>
> > > > > >> >>>>>>>>>
> > > > > >> >>>>>>>>
> > > > > >> >>>>>>>>
> > > > > >> >>>>>>>> --
> > > > > >> >>>>>>>> --------------------------
> > > > > >> >>>>>>>>
> > > > > >> >>>>>>>> Benedetti Alessandro
> > > > > >> >>>>>>>> Visiting card : http://about.me/alessandro_benedetti
> > > > > >> >>>>>>>>
> > > > > >> >>>>>>>> "Tyger, tyger burning bright
> > > > > >> >>>>>>>> In the forests of the night,
> > > > > >> >>>>>>>> What immortal hand or eye
> > > > > >> >>>>>>>> Could frame thy fearful symmetry?"
> > > > > >> >>>>>>>>
> > > > > >> >>>>>>>> William Blake - Songs of Experience -1794 England
> > > > > >> >>>>>>>
> > > > > >> >>>>>>>
> > > > > >> >>>>>>
> > > > > >> >>>>>
> > > > > >> >>>>
> > > > > >> >>>>
> > > > > >> >>>>
> > > > > >> >>>> --
> > > > > >> >>>> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> > - -
> > > > > >> >>>> Shinichiro Abe
> > > > > >> >>>> 阿部 慎一朗
> > > > > >> >>>>
> > > > > >> >>
> > > > > >> >>
> > > > > >>
> > > > > >>
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > >
> > >
> >
> >
> >
> >
>



-- 
--------------------------

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England
