Hi Karl,
thanks a lot!
--
Matteo Grolla
Sourcesense - making sense of Open Source
http://www.sourcesense.com
On 3 Jul 2014, at 14:09, Karl Wright wrote:
> Hi Matteo,
>
> I can't see a way of supporting this model other than by extending the
> framework to support it. I've created a ticket (CONNECTORS-989). The
> analysis is included, and the costs are specified as well. Please comment in
> the ticket about this issue going forward.
>
> Thanks,
> Karl
>
>
>
>
>
> On Wed, Jul 2, 2014 at 5:22 PM, Matteo Grolla <[email protected]>
> wrote:
> Hi Karl,
> what I'm saying is that this connector doesn't need them,
> and if that means it can be implemented efficiently without modifying the
> MCF framework, that would be great.
>
> --
> Matteo Grolla
> Sourcesense - making sense of Open Source
> http://www.sourcesense.com
>
> On 2 Jul 2014, at 20:44, Karl Wright wrote:
>
>> Hi Matteo,
>>
>> You may not need document tracking and synchronization, but that is
>> MCF's major purpose, so any general feature would have to handle that
>> case.
>>
>> Karl
>>
>> Sent from my Windows Phone
>> From: Matteo Grolla
>> Sent: 7/2/2014 10:49 AM
>> To: [email protected]
>> Subject: Re: proposals for writing manifold connectors
>> Hi Karl,
>> one note (maybe obvious):
>> parent documents (Files) are not to be indexed in Solr; I'm only
>> interested in keeping track of them in log reports
>>
>>> What is not controversial is that the IProcessActivity.ingestDocument()
>> Tell me if I've understood why ingestDocument() must change:
>> in case of crashes/errors, when MCF restarts
>> it finds Docs to be processed in the queue, and it has to know the
>> corresponding File to resume processing
>>
>> If I had a hook at the beginning of a crawl that allowed me to
>> remove all Doc instances from the queue (but leave the File
>> instances),
>> then I could just recreate the Doc instances in the queue (Files are
>> reprocessed)
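A minimal sketch of this hook idea (the class, the hook method, and the `#`-based child-identifier convention are all made up for illustration; nothing here is actual ManifoldCF API):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the crawl-start hook idea: drop child Doc entries from the
// queue but keep the parent File entries, which then get reprocessed
// and recreate their Doc entries. All names here are hypothetical.
public class CrawlStartHookSketch {
    static List<String> queue = new ArrayList<>();

    // Hypothetical hook invoked once, before any document is processed.
    static void onCrawlStart() {
        // Child identifiers are assumed to be derived from the parent,
        // e.g. "FileA#Doc3", so they can be recognized and removed.
        queue.removeIf(id -> id.contains("#"));
    }

    public static void main(String[] args) {
        queue.add("FileA");
        queue.add("FileA#Doc1");
        queue.add("FileA#Doc2");
        queue.add("FileB");
        onCrawlStart();
        // Only the File entries survive; their Doc entries are recreated
        // when the Files are reparsed.
        System.out.println(queue); // [FileA, FileB]
    }
}
```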
>>
>>> But the basic problem is that, in order to be able to delete child
>> Am I misunderstanding, or are you thinking that
>> if I delete FileA then I must delete DocA1..DocAN?
>> Because I don't need that.
>>
>> thanks
>>
>>
>> --
>> Matteo Grolla
>> Sourcesense - making sense of Open Source
>> http://www.sourcesense.com
>>
>> On 2 Jul 2014, at 12:57, Karl Wright wrote:
>>
>>> Hi Matteo,
>>>
>>> I looked a bit into what it would require to be able to index multiple
>>> documents given a single parent document, WITHOUT the child documents
>>> hitting the ManifoldCF document queue.
>>>
>>> What is not controversial is that the IProcessActivity.ingestDocument()
>>> method would change so that *two* document identifiers were passed in. The
>>> first would be the parent document identifier, and the second would be a
>>> child document identifier, which you would make up presumably based on the
>>> parent as a starting point. This would be a requirement for the
>>> incremental indexer.
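A rough sketch of the signature change being described (the names, parameter order, and helper below are approximations of the idea, not the actual CONNECTORS-989 patch):

```java
// Sketch of the ingestDocument() change: the incremental indexer receives
// *two* identifiers, the parent (the fetched File) and a child identifier
// the connector invents. Hypothetical names, not real MCF API.
public class IngestSketch {
    interface ProcessActivity {
        // Proposed: parent identifier plus a connector-minted child identifier.
        void ingestDocument(String parentIdentifier, String childIdentifier,
                            String version, String data);
    }

    // One plausible way a connector could mint child identifiers,
    // using the parent identifier as a starting point.
    static String childId(String parentId, int n) {
        return parentId + "#doc" + n;
    }

    public static void main(String[] args) {
        ProcessActivity activity = (parent, child, version, data) ->
            System.out.println(parent + " -> " + child + " @" + version);
        for (int n = 1; n <= 3; n++) {
            activity.ingestDocument("FileA", childId("FileA", n),
                                    "2014-01-01", "body");
        }
    }
}
```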
>>>
>>> But the basic problem is that, in order to be able to delete child
>>> documents from the index without involving the repository connector, we
>>> need to relate parent documents with child documents in some way, inside
>>> the incremental indexer. There are a number of possible ways of doing
>>> this; the simplest would be to just add another column to the ingeststatus
>>> table which would allow the separation of parent and child document
>>> identifiers. However, the simple solution is not very good because it
>>> greatly exacerbates a problem which we already have in the incremental
>>> indexer: there are multiple copies of the document version string being
>>> kept, one for each record. Also, there is currently no logic at all in
>>> place to deal with the situation where the list of child documents shrinks;
>>> that logic would have to be worked out and there would need to be tracking
>>> to identify records that needed to go away as a result.
>>>
>>> In short, this would be a significant change -- which is OK, but before
>>> considering it I'd have to work it through carefully, and make sure we
>>> don't lose performance etc.
>>>
>>> Thanks,
>>> Karl
>>>
>>>
>>>
>>> On Tue, Jul 1, 2014 at 12:12 PM, Matteo Grolla <[email protected]>
>>> wrote:
>>>
>>>> Hi Karl,
>>>> first of all thanks
>>>>
>>>>> The reason MCF is not currently structured this way is because a decision
>>>> I think that in general the MCF design is sound and generic.
>>>> As a connector developer, I'd just like to have more flexibility in
>>>> particular situations.
>>>> Maybe what I'm searching for is already there, or wouldn't be disruptive to
>>>> introduce.
>>>> An email exchange doesn't make this kind of discussion easy.
>>>> To make precise proposals, I should probably take a detailed look at the
>>>> framework source code.
>>>>
>>>> --
>>>> Matteo Grolla
>>>> Sourcesense - making sense of Open Source
>>>> http://www.sourcesense.com
>>>>
>>>> On 1 Jul 2014, at 16:47, Karl Wright wrote:
>>>>
>>>>> Hi Matteo,
>>>>>
>>>>> Ok, from your description it sounds like what you primarily want is for
>>>> the
>>>>> processing of one document to generate N index entries, each with its own
>>>>> version string etc. This would never go near the queue since effectively
>>>>> you'd only be dealing with the large files there (FileA and FileB in your
>>>>> example). You are planning to get away with doing no incremental
>>>>> management because you will simply repeat yourself if something goes
>>>> wrong
>>>>> in the middle and document processing is not completed.
>>>>>
>>>>> The reason MCF is not currently structured this way is because a decision
>>>>> needs to be made *up front* whether to process the document or not, and
>>>>> that cannot be done in your model without actually fetching and
>>>> processing
>>>>> the large file. So it is in fact a chicken-and-egg problem. I will
>>>> think
>>>>> if I can see a solution to it but I've considered this in the past and
>>>> not
>>>>> found a good way to structure this kind of arrangement. Indeed,
>>>> carry-down
>>>>> was designed in part to solve this problem.
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Jul 1, 2014 at 10:36 AM, Matteo Grolla <[email protected]
>>>>>
>>>>> wrote:
>>>>>
>>>>>> Hi Karl
>>>>>>
>>>>>> I read the book and in general all the principles seem sound.
>>>>>> What I'm thinking is that for some specific connectors (that is, in
>>>>>> specific conditions) people may want to exploit that specificity by
>>>>>> taking different approaches.
>>>>>>
>>>>>>> I don't think database management is as difficult as you seem to think.
>>>>>>
>>>>>> Maybe I wasn't clear on this, but what I mean is this:
>>>>>> if I propose to my typical customer
>>>>>> a crawler that requires Postgres even for simple crawls, they'd probably
>>>>>> prefer to write a custom app for simple crawls.
>>>>>> If I could at least say that the db doesn't grow a lot, that would
>>>>>> mitigate the problem.
>>>>>> I don't know if I'm the only one with this problem.
>>>>>>
>>>>>>
>>>>>>> You have forgotten what happens when either errors occur during
>>>>>> processing,
>>>>>>> or the agents process working on your documents dies
>>>>>>
>>>>>> Let's say FileA
>>>>>> contains DocA1..DocA100
>>>>>>
>>>>>> As expressed in the comment on carry-down data:
>>>>>> if I have errors or the crawler dies while processing DocA50,
>>>>>> since I want FileA to be considered processed only when all its docs
>>>>>> have been processed,
>>>>>> at restart the system should:
>>>>>> reparse FileA
>>>>>> skip DocA1..DocA49 (if I'm handling versioning for them)
>>>>>> process DocA50..DocA100
>>>>>>
>>>>>> If there's a failure I have to reparse FileA, but I avoid storing 100
>>>>>> docs in the db.
>>>>>> For me that's good; failures are not so frequent.
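The restart behavior described above could be sketched like this (the `processed` map is a stand-in for whatever version tracking the connector keeps; all names are illustrative):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the restart behavior: FileA is reparsed in full, but docs
// whose version is already up to date are skipped, so only DocA50..DocA100
// do real work. Illustrative only; "processed" stands in for the
// connector's version tracking.
public class RestartSketch {
    static Map<String, String> processed = new HashMap<>(); // docId -> version

    // Reparse every doc in the file; returns how many were (re)indexed.
    static int reparse(String fileVersion, int totalDocs) {
        int reindexed = 0;
        for (int n = 1; n <= totalDocs; n++) {
            String docId = "DocA" + n;
            if (fileVersion.equals(processed.get(docId))) continue; // skip
            processed.put(docId, fileVersion); // (re)process and record
            reindexed++;
        }
        return reindexed;
    }

    public static void main(String[] args) {
        // First run died after DocA49.
        for (int n = 1; n <= 49; n++) processed.put("DocA" + n, "2014-01-01");
        // On restart FileA is reparsed; only DocA50..DocA100 are reindexed.
        System.out.println(reparse("2014-01-01", 100)); // 51
    }
}
```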
>>>>>>
>>>>>>> You are forgetting the fact that MCF is incremental.
>>>>>> Let's say:
>>>>>>
>>>>>> in the first crawl MCF processes
>>>>>> FileA dated 2014-01-01
>>>>>> containing Doc1..Doc10
>>>>>> all docs are versioned 2014-01-01
>>>>>>
>>>>>> in the second crawl
>>>>>> FileB dated 2014-01-02
>>>>>> containing Doc1..Doc5
>>>>>> all docs are versioned 2014-01-02
>>>>>>
>>>>>> so Doc1..Doc5 are overwritten with data from FileB;
>>>>>> I don't need carry-down data from the previous crawl
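This versioning scheme can be sketched as follows (illustrative only; the map stands in for the index, and the method names are invented):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the versioning scheme in the example above: every child doc
// inherits its file's date as its version string, so docs that reappear
// in a newer file simply get overwritten; no carry-down data is needed.
public class VersionSketch {
    static Map<String, String> index = new LinkedHashMap<>(); // docId -> version

    static void crawlFile(String fileDate, int firstDoc, int lastDoc) {
        for (int n = firstDoc; n <= lastDoc; n++) {
            index.put("Doc" + n, fileDate); // overwrites if already present
        }
    }

    public static void main(String[] args) {
        crawlFile("2014-01-01", 1, 10); // first crawl: FileA, Doc1..Doc10
        crawlFile("2014-01-02", 1, 5);  // second crawl: FileB, Doc1..Doc5
        System.out.println(index.get("Doc3")); // 2014-01-02 (from FileB)
        System.out.println(index.get("Doc7")); // 2014-01-01 (still from FileA)
    }
}
```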
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Matteo Grolla
>>>>>> Sourcesense - making sense of Open Source
>>>>>> http://www.sourcesense.com
>>>>>>
>>>>>> On 1 Jul 2014, at 15:04, Karl Wright wrote:
>>>>>>
>>>>>>> Hi Matteo,
>>>>>>>
>>>>>>> I don't think database management is as difficult as you seem to think.
>>>>>>> But more importantly, you seem to have issues with very basic aspects
>>>> of
>>>>>>> the ManifoldCF design. It may help to read the book (see
>>>>>>> https://manifoldcf.apache.org/en_US/books-and-presentations.html), to
>>>>>>> understand more completely where the design decisions came from.
>>>>>>>
>>>>>>> In short, ManifoldCF is built on top of a database for the following
>>>>>>> reasons:
>>>>>>> - resilience
>>>>>>> - restartability
>>>>>>> - control of memory footprint
>>>>>>> - synchronization
>>>>>>>
>>>>>>> It was, in fact, designed around the capabilities of databases (not
>>>> just
>>>>>>> postgresql, but all modern databases), so it is not surprising that it
>>>>>> uses
>>>>>>> the database for everything persistent. Incremental crawling
>>>>>> means
>>>>>>> that even more things need to be persistent in ManifoldCF than they
>>>> might
>>>>>>> in other crawling designs.
>>>>>>>
>>>>>>> So I suggest that you carefully read the first chapter of MCF in
>>>> Action,
>>>>>>> and consider the design point of this crawler carefully.
>>>>>>>
>>>>>>> As for your specific questions:
>>>>>>>
>>>>>>> 'I can't say: "before processing another file process all the docs from
>>>>>>> previous files"' - Yes, and the reason for that is because the MCF
>>>> queue
>>>>>> is
>>>>>>> built once again in the database. Documents that are to be processed
>>>>>> must
>>>>>>> be queried, and that query must be performant.
>>>>>>>
>>>>>>> 'do I need intra cluster synchronization to process the docs contained
>>>>>> in a
>>>>>>> file?
>>>>>>> If I state that the machine that processed the file is the one that
>>>>>>> processes the docs contained in it then I don't.'
>>>>>>>
>>>>>>> You have forgotten what happens when either errors occur during
>>>>>> processing,
>>>>>>> or the agents process working on your documents dies (because it got
>>>>>>> killed, say). Unless you want to lose all context, then you need
>>>>>>> persistent storage.
>>>>>>>
>>>>>>> 'If it's difficult to do without a db for carry down data I'd like that
>>>>>>> table to remain small, maybe empty it at the end of every crawl.
>>>>>>> How could I do that?'
>>>>>>>
>>>>>>> You are forgetting the fact that MCF is incremental. If you want it to
>>>>>> do
>>>>>>> the minimum work on a subsequent crawl, it has to keep track of what
>>>>>> inputs
>>>>>>> are around for each document to be processed.
>>>>>>>
>>>>>>> Karl
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Jul 1, 2014 at 8:48 AM, Matteo Grolla <
>>>> [email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Karl,
>>>>>>>> glad it's appreciated.
>>>>>>>>
>>>>>>>> concerning the answer:
>>>>>>>>
>>>>>>>> My intention is not to avoid using the db, just to limit its use to
>>>>>>>> what's strictly necessary.
>>>>>>>> And surely I don't want to find new ways to handle intra-cluster
>>>>>>>> communication.
>>>>>>>>
>>>>>>>> For me it would be OK to keep track of crawled docIds and versions in
>>>>>>>> the db; I'd just like to avoid putting carry-down data there.
>>>>>>>>
>>>>>>>> Motivations:
>>>>>>>> I recently worked on a couple of projects dealing with crawling data
>>>>>>>> from external data sources and importing it into Solr.
>>>>>>>> One was implemented as a Lucidworks Search connector;
>>>>>>>> the other was a custom crawling app.
>>>>>>>> In both cases the crawling part was simple: the data source was giving
>>>>>>>> me all the modified/deleted documents within a time interval.
>>>>>>>> The processing pipeline to enrich and transform the documents was more
>>>>>>>> involved.
>>>>>>>>
>>>>>>>> In cases like these I'd like to just focus on sizing the crawler and
>>>>>>>> Solr instances.
>>>>>>>> If I have to size a db, I'll have to deal with a DBA, and many
>>>>>>>> customers are not experienced with Postgres, so the MCF solution
>>>>>>>> becomes less appealing.
>>>>>>>> Even if I find a Postgres DBA, I'll have to deal with him for things
>>>>>>>> like performance problems, sizing…
>>>>>>>> All things I'd like to avoid if not strictly necessary.
>>>>>>>>
>>>>>>>> Please correct me if I'm wrong in what follows.
>>>>>>>> Why do I need carry-down data in the db?
>>>>>>>> Because I want bounded memory usage and have no control over the order
>>>>>>>> MCF follows in processing docs;
>>>>>>>> I can't say: "before processing another file, process all the docs
>>>>>>>> from previous files".
>>>>>>>> Do I need intra-cluster synchronization to process the docs contained
>>>>>>>> in a file?
>>>>>>>> If I state that the machine that processed the file is the one that
>>>>>>>> processes the docs contained in it, then I don't.
>>>>>>>>
>>>>>>>> What do you think?
>>>>>>>> If it's difficult to do without a db for carry-down data, I'd like
>>>>>>>> that table to remain small, maybe emptying it at the end of every
>>>>>>>> crawl. How could I do that?
>>>>>>>>
>>>>>>>> If I were to synthesize this mail in one sentence, I'd say:
>>>>>>>> "Given simple crawling requirements, I'd like to be able to implement
>>>>>>>> an MCF solution that is performant and simple to manage"
>>>>>>>>
>>>>>>>> thanks
>>>>>>>>
>>>>>>>> --
>>>>>>>> Matteo Grolla
>>>>>>>> Sourcesense - making sense of Open Source
>>>>>>>> http://www.sourcesense.com
>>>>>>>>
>>>>>>>> On 1 Jul 2014, at 12:16, Karl Wright wrote:
>>>>>>>>
>>>>>>>>> Hi Matteo,
>>>>>>>>>
>>>>>>>>> Thank you for the work you put in here.
>>>>>>>>>
>>>>>>>>> I have a response to one particular question buried deep in your
>>>> code:
>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> FIXME: carrydown values
>>>>>>>>> serializing the documents seems a quick way to reach the
>>>> goal
>>>>>>>>> what is the size limit for this data in the sql table?
>>>>>>>>>
>>>>>>>>> PROPOSAL
>>>>>>>>> What I'd really like is avoiding the use of a db table for
>>>>>>>>> carrydown data and keep them in memory
>>>>>>>>> something like:
>>>>>>>>> MCF starts processing File A
>>>>>>>>> docs A1, A2, ... AN are added to the queue
>>>>>>>>> MCF starts processing File B
>>>>>>>>> docs B1, B2, ... are added to the queue
>>>>>>>>> and so on...
>>>>>>>>> as soon as all docs A1..AN have been processed, A is
>>>>>>>>> considered processed
>>>>>>>>> in case of failure (manifold is restarted in the middle
>>>>>>>>> of a crawl)
>>>>>>>>> all files (A, B...) should be reprocessed
>>>>>>>>> the size of the queue should be bounded
>>>>>>>>>                          once filled MCF should stop processing files until
>>>>>>>>> more docs are processed
>>>>>>>>>
>>>>>>>>> MOTIVATION
>>>>>>>>> -I'd like to avoid putting pressure on the db if possible,
>>>>>>>>> so that it doesn't become a concern in production
>>>>>>>>> -performance
>>>>>>>>> <<<<<<
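The bounded-queue behavior in the proposal could be sketched with a standard blocking queue (purely illustrative; this is not how MCF's database-backed queue actually works, and the class below is hypothetical):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch of the proposal's bounded in-memory queue: the file parser
// blocks once the queue is full and resumes as docs are drained, so
// memory usage stays bounded without a db table for carry-down data.
public class BoundedQueueSketch {
    static int run(int total, int capacity) {
        BlockingQueue<String> docs = new ArrayBlockingQueue<>(capacity);

        // Parser thread: adds child docs; put() blocks when the queue is full.
        Thread parser = new Thread(() -> {
            try {
                for (int n = 1; n <= total; n++) docs.put("DocA" + n);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        parser.start();

        // Consumer: each take() frees a slot, letting the parser continue.
        int processed = 0;
        try {
            while (processed < total) {
                docs.take();
                processed++;
            }
            parser.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return processed;
    }

    public static void main(String[] args) {
        // 100 docs flow through a queue of size 10 without ever holding
        // more than 10 in memory at once.
        System.out.println("processed=" + run(100, 10)); // processed=100
    }
}
```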
>>>>>>>>>
>>>>>>>>> Carrydown is explicitly designed to use unlimited-length database
>>>>>>>>> fields. Your proposal would work OK only within a single cluster
>>>>>>>>> member; however, among multiple cluster
>>>>>>>>> members it could not work. The database is the ManifoldCF medium of
>>>>>>>>> choice for handling stateful information and for handling
>>>>>>>>> cross-cluster data requirements.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Jul 1, 2014 at 6:07 AM, Matteo Grolla <
>>>> [email protected]
>>>>>>>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>> I wrote a repository connector for crawling solrxml files
>>>>>>>>>>
>>>>>>>>>> https://github.com/matteogrolla/mcf-filesystem-xml-connector
>>>>>>>>>>
>>>>>>>>>> The work is based on the filesystem connector, but I made several
>>>>>>>>>> hopefully interesting changes which could be applied elsewhere.
>>>>>>>>>> I also have a couple of questions.
>>>>>>>>>> For details, see the README file.
>>>>>>>>>>
>>>>>>>>>> Matteo
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>>
>>>>
>>>>
>
>