Hi Karl,
one note (maybe obvious):
parent documents (Files) are not to be indexed in Solr; I'm only interested
in keeping track of them in log reports.
> What is not controversial is that the IProcessActivity.ingestDocument()
tell me if I've understood why ingestDocument must change:
in case of crashes/errors, when MCF restarts it finds Docs still to be
processed in the queue, and it has to know the corresponding File to
resume processing.
If I had a hook at the beginning of a crawl that allowed me to remove all
Doc instances from the queue (but leave the File instances), then I could
just recreate the Doc instances in the queue (Files are reprocessed) --
something like the sketch below.
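
A rough sketch of the kind of hook I mean (the interface and method are
hypothetical; nothing like this exists in MCF today):

    import org.apache.manifoldcf.core.interfaces.ManifoldCFException;

    // Hypothetical crawl-start hook: before a crawl begins, drop queued
    // child Doc identifiers but keep the parent File identifiers, so
    // interrupted Files are re-parsed and their Docs re-created.
    public interface IJobStartHook
    {
      void discardChildDocuments(Long jobID)
        throws ManifoldCFException;
    }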
> But the basic problem is that, in order to be able to delete child
Am I misunderstanding, or are you thinking that
if I delete FileA then I must delete DocA1..DocAN?
Because I don't need this.
thanks
--
Matteo Grolla
Sourcesense - making sense of Open Source
http://www.sourcesense.com
On 02 Jul 2014, at 12:57, Karl Wright wrote:
> Hi Matteo,
>
> I looked a bit into what it would require to be able to index multiple
> documents given a single parent document, WITHOUT the child documents
> hitting the ManifoldCF document queue.
>
> What is not controversial is that the IProcessActivity.ingestDocument()
> method would change so that *two* document identifiers were passed in. The
> first would be the parent document identifier, and the second would be a
> child document identifier, which you would make up presumably based on the
> parent as a starting point. This would be a requirement for the
> incremental indexer.
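>
> A sketch of what the changed method might look like (parameter names and
> the exact signature are illustrative only, not a committed API):
>
>     import org.apache.manifoldcf.core.interfaces.ManifoldCFException;
>     import org.apache.manifoldcf.agents.interfaces.RepositoryDocument;
>     import org.apache.manifoldcf.agents.interfaces.ServiceInterruption;
>
>     public interface IProcessActivity
>     {
>       // Proposed variant: parent id plus a connector-invented child id,
>       // e.g. childIdentifier = parentIdentifier + ":" + offsetInFile.
>       void ingestDocument(String parentIdentifier, String childIdentifier,
>           String version, String documentURI, RepositoryDocument data)
>         throws ManifoldCFException, ServiceInterruption;
>     }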
>
> But the basic problem is that, in order to be able to delete child
> documents from the index without involving the repository connector, we
> need to relate parent documents with child documents in some way, inside
> the incremental indexer. There are a number of possible ways of doing
> this; the simplest would be to just add another column to the ingeststatus
> table which would allow the separation of parent and child document
> identifiers. However, the simple solution is not very good because it
> greatly exacerbates a problem which we already have in the incremental
> indexer: there are multiple copies of the document version string being
> kept, one for each record. Also, there is currently no logic at all in
> place to deal with the situation where the list of child documents shrinks;
> that logic would have to be worked out and there would need to be tracking
> to identify records that needed to go away as a result.
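>
> In the abstract, the shrink bookkeeping is a set difference -- a minimal
> sketch, where the surrounding lookup and index-removal calls are
> hypothetical:
>
>     import java.util.Collection;
>     import java.util.HashSet;
>     import java.util.Set;
>
>     // When a parent is re-processed, children recorded on the previous
>     // crawl but not produced this time are what must be removed from
>     // the index.
>     Set<String> childrenToDelete(Collection<String> previouslyRecorded,
>                                  Collection<String> currentChildIds) {
>       Set<String> toDelete = new HashSet<>(previouslyRecorded);
>       toDelete.removeAll(currentChildIds);
>       return toDelete;   // caller deletes these from the index
>     }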
>
> In short, this would be a significant change -- which is OK, but before
> considering it I'd have to work it through carefully, and make sure we
> don't lose performance etc.
>
> Thanks,
> Karl
>
>
>
> On Tue, Jul 1, 2014 at 12:12 PM, Matteo Grolla <[email protected]>
> wrote:
>
>> Hi Karl,
>> first of all thanks
>>
>>> The reason MCF is not currently structured this way is because a decision
>> I think that in general the MCF design is sound and generic.
>> As a connector developer I'd just like to have more flexibility in
>> particular situations.
>> Maybe what I'm searching for is already there, or wouldn't be disruptive
>> to introduce.
>> An email exchange doesn't make this kind of discussion easy;
>> to make precise proposals I should probably take a detailed look at the
>> framework source code.
>>
>> --
>> Matteo Grolla
>> Sourcesense - making sense of Open Source
>> http://www.sourcesense.com
>>
>> On 01 Jul 2014, at 16:47, Karl Wright wrote:
>>
>>> Hi Matteo,
>>>
>>> Ok, from your description it sounds like what you primarily want is for
>>> the processing of one document to generate N index entries, each with its
>>> own version string etc.  This would never go near the queue since
>>> effectively you'd only be dealing with the large files there (FileA and
>>> FileB in your example).  You are planning to get away with doing no
>>> incremental management because you will simply repeat yourself if
>>> something goes wrong in the middle and document processing is not
>>> completed.
>>>
>>> The reason MCF is not currently structured this way is because a decision
>>> needs to be made *up front* whether to process the document or not, and
>>> that cannot be done in your model without actually fetching and processing
>>> the large file.  So it is in fact a chicken-and-egg problem.  I will think
>>> if I can see a solution to it but I've considered this in the past and not
>>> found a good way to structure this kind of arrangement.  Indeed,
>>> carry-down was designed in part to solve this problem.
>>>
>>> Karl
>>>
>>>
>>>
>>> On Tue, Jul 1, 2014 at 10:36 AM, Matteo Grolla <[email protected]>
>>> wrote:
>>>
>>>> Hi Karl
>>>>
>>>> I read the book, and in general all the principles seem sound.
>>>> What I'm thinking is that for some specific connectors (that is, in
>>>> specific conditions) people may want to exploit that specificity by
>>>> taking different approaches.
>>>>
>>>>> I don't think database management is as difficult as you seem to think.
>>>>
>>>> Maybe I wasn't clear on this, but what I mean is this:
>>>> if I propose to my typical customer a crawler that requires Postgres
>>>> even for simple crawls, they'd probably prefer to write a custom app
>>>> instead.
>>>> If I could at least say that the db doesn't grow a lot, that would
>>>> mitigate the problem.
>>>> I don't know if I'm the only one with this problem.
>>>>
>>>>
>>>>> You have forgotten what happens when either errors occur during
>>>>> processing, or the agents process working on your documents dies
>>>>
>>>> Let's say FileA contains DocA1..DocA100.
>>>>
>>>> As expressed in the comment on carry-down data:
>>>> if I have errors or the crawler dies while processing DocA50,
>>>> since I want FileA to be considered processed only when all its docs
>>>> have been processed, at restart the system should:
>>>> reparse FileA
>>>> skip DocA1..DocA49 (if I'm handling versioning for them)
>>>> process DocA50..DocA100
>>>>
>>>> If there's a failure I have to reparse FileA, but I avoid storing 100
>>>> docs in the db.
>>>> For me that's good; failures are not so frequent.
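>>>>
>>>> In code, the restart behaviour I have in mind is roughly this (a sketch;
>>>> ParsedDoc, parseFile, versionStore and ingest are hypothetical):
>>>>
>>>>     // Re-parse the whole file, but only re-ingest docs whose version
>>>>     // changed: DocA1..DocA49 get skipped, DocA50..DocA100 processed.
>>>>     void resumeFile(java.io.File fileA) {
>>>>       for (ParsedDoc doc : parseFile(fileA)) {
>>>>         String newVersion = doc.getDate();           // e.g. "2014-01-01"
>>>>         String oldVersion = versionStore.get(doc.getId());
>>>>         if (!newVersion.equals(oldVersion))
>>>>           ingest(doc, newVersion);
>>>>       }
>>>>     }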
>>>>
>>>>> You are forgetting the fact that MCF is incremental.
>>>> Let's say:
>>>>
>>>> in the first crawl MCF processes
>>>> FileA dated 2014-01-01
>>>> containing Doc1..Doc10
>>>> all docs are versioned 2014-01-01
>>>>
>>>> in the second crawl
>>>> FileB dated 2014-01-02
>>>> containing Doc1..Doc5
>>>> all docs are versioned 2014-01-02
>>>>
>>>> so Doc1..Doc5 are overwritten with data from fileB
>>>> I don't need carry down data from previous crawl
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Matteo Grolla
>>>> Sourcesense - making sense of Open Source
>>>> http://www.sourcesense.com
>>>>
>>>> On 01 Jul 2014, at 15:04, Karl Wright wrote:
>>>>
>>>>> Hi Matteo,
>>>>>
>>>>> I don't think database management is as difficult as you seem to think.
>>>>> But more importantly, you seem to have issues with very basic aspects
>>>>> of the ManifoldCF design.  It may help to read the book (see
>>>>> https://manifoldcf.apache.org/en_US/books-and-presentations.html), to
>>>>> understand more completely where the design decisions came from.
>>>>>
>>>>> In short, ManifoldCF is built on top of a database for the following
>>>>> reasons:
>>>>> - resilience
>>>>> - restartability
>>>>> - control of memory footprint
>>>>> - synchronization
>>>>>
>>>>> It was, in fact, designed around the capabilities of databases (not just
>>>>> PostgreSQL, but all modern databases), so it is not surprising that it
>>>>> uses the database for everything persistent at all.  Incremental
>>>>> crawling means that even more things need to be persistent in ManifoldCF
>>>>> than they might in other crawling designs.
>>>>>
>>>>> So I suggest that you carefully read the first chapter of MCF in Action,
>>>>> and consider the design point of this crawler carefully.
>>>>>
>>>>> As for your specific questions:
>>>>>
>>>>> 'I can't say: "before processing another file process all the docs from
>>>>> previous files"' - Yes, and the reason for that is that the MCF queue is
>>>>> built once again in the database.  Documents that are to be processed
>>>>> must be queried, and that query must be performant.
>>>>>
>>>>> 'do I need intra-cluster synchronization to process the docs contained
>>>>> in a file?
>>>>> If I state that the machine that processed the file is the one that
>>>>> processes the docs contained in it then I don't.'
>>>>>
>>>>> You have forgotten what happens when either errors occur during
>>>>> processing, or the agents process working on your documents dies
>>>>> (because it got killed, say).  Unless you want to lose all context, you
>>>>> need persistent storage.
>>>>>
>>>>> 'If it's difficult to do without a db for carry down data I'd like that
>>>>> table to remain small, maybe empty it at the end of every crawl.
>>>>> How could I do that?'
>>>>>
>>>>> You are forgetting the fact that MCF is incremental.  If you want it to
>>>>> do the minimum work on a subsequent crawl, it has to keep track of what
>>>>> inputs are around for each document to be processed.
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Jul 1, 2014 at 8:48 AM, Matteo Grolla <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi Karl,
>>>>>> glad it's appreciated.
>>>>>>
>>>>>> concerning the answer:
>>>>>>
>>>>>> My intention is not to avoid using the db, just to limit its use to
>>>>>> what is strictly necessary.
>>>>>> And surely I don't want to find new ways to handle intra-cluster
>>>>>> communication.
>>>>>>
>>>>>> For me it would be ok to keep track of crawled docIds and versions in
>>>>>> the db; I'd just like to avoid putting carry-down data there.
>>>>>>
>>>>>> Motivations:
>>>>>> I recently worked on a couple of projects dealing with crawling data
>>>>>> from external data sources and importing it into Solr.
>>>>>> One was implemented as a Lucidworks Search connector;
>>>>>> the other was a custom crawling app.
>>>>>> In both cases the crawling part was simple: the data source gave me
>>>>>> all the modified/deleted documents within a time interval.
>>>>>> The processing pipeline to enrich and transform the documents was more
>>>>>> involved.
>>>>>>
>>>>>> In cases like these I'd like to just focus on sizing the crawler and
>>>>>> Solr instances.
>>>>>> If I have to size a db I'll have to deal with a DBA, and many customers
>>>>>> are not experienced with Postgres, so the MCF solution becomes less
>>>>>> appealing.
>>>>>> Even if I find a Postgres DBA I'll have to deal with him for things
>>>>>> like performance problems, size...
>>>>>> All of which I'd like to avoid if not strictly necessary.
>>>>>>
>>>>>> please correct me if I'm wrong in what follows
>>>>>> Why do I need carry-down data in the db?
>>>>>> Because I want bounded memory usage and have no control over the order
>>>>>> MCF follows in processing docs.
>>>>>> I can't say: "before processing another file, process all the docs from
>>>>>> previous files".
>>>>>> Do I need intra-cluster synchronization to process the docs contained
>>>>>> in a file?
>>>>>> If I state that the machine that processed the file is the one that
>>>>>> processes the docs contained in it, then I don't.
>>>>>>
>>>>>> What do you think?
>>>>>> If it's difficult to do without a db for carry down data I'd like that
>>>>>> table to remain small, maybe empty it at the end of every crawl.
>>>>>> How could I do that?
>>>>>>
>>>>>> If I were to synthesize this mail in one sentence, I'd say:
>>>>>> "Given simple crawling requirements, I'd like to be able to implement
>>>>>> an MCF solution that is performant and simple to manage."
>>>>>>
>>>>>> thanks
>>>>>>
>>>>>> --
>>>>>> Matteo Grolla
>>>>>> Sourcesense - making sense of Open Source
>>>>>> http://www.sourcesense.com
>>>>>>
>>>>>> On 01 Jul 2014, at 12:16, Karl Wright wrote:
>>>>>>
>>>>>>> Hi Matteo,
>>>>>>>
>>>>>>> Thank you for the work you put in here.
>>>>>>>
>>>>>>> I have a response to one particular question buried deep in your code:
>>>>>>>
>>>>>>>>>>>>>
>>>>>>>
>>>>>>> FIXME: carrydown values
>>>>>>>              serializing the documents seems a quick way to reach the goal
>>>>>>> what is the size limit for this data in the sql table?
>>>>>>>
>>>>>>> PROPOSAL
>>>>>>> What I'd really like is avoiding the use of a db table for
>>>>>>> carrydown data and keep them in memory
>>>>>>> something like:
>>>>>>> MCF starts processing File A
>>>>>>> docs A1, A2, ... AN are added to the queue
>>>>>>> MCF starts processing File B
>>>>>>> docs B1, B2, ... are added to the queue
>>>>>>> and so on...
>>>>>>> as soon as all docs A1..AN have been processed, A is
>>>>>>> considered processed
>>>>>>> in case of failure (manifold is restarted in the middle
>>>>>>> of a crawl)
>>>>>>> all files (A, B...) should be reprocessed
>>>>>>> the size of the queue should be bounded
>>>>>>>                  once filled, MCF should stop processing files until
>>>>>>> more docs are processed
>>>>>>>
>>>>>>> MOTIVATION
>>>>>>> -I'd like to avoid putting pressure on the db if possible,
>>>>>>> so that it doesn't become a concern in production
>>>>>>> -performance
>>>>>>> <<<<<<
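>>>>>>>
>>>>>>> For concreteness, a single-JVM sketch of the proposal above (all
>>>>>>> names are illustrative; Doc, parse, index and markFileProcessed are
>>>>>>> hypothetical):
>>>>>>>
>>>>>>>     import java.util.List;
>>>>>>>     import java.util.concurrent.ArrayBlockingQueue;
>>>>>>>     import java.util.concurrent.BlockingQueue;
>>>>>>>     import java.util.concurrent.ConcurrentHashMap;
>>>>>>>     import java.util.concurrent.atomic.AtomicInteger;
>>>>>>>
>>>>>>>     class InMemoryCarrydown {   // hypothetical, one JVM only
>>>>>>>       // Bounded queue gives backpressure: put() blocks when full,
>>>>>>>       // so no more Files are parsed until queued Docs drain.
>>>>>>>       final BlockingQueue<Doc> queue = new ArrayBlockingQueue<>(1000);
>>>>>>>       final ConcurrentHashMap<String,AtomicInteger> pending =
>>>>>>>           new ConcurrentHashMap<>();
>>>>>>>
>>>>>>>       void processFile(java.io.File file) throws InterruptedException {
>>>>>>>         List<Doc> docs = parse(file);        // hypothetical parser
>>>>>>>         pending.put(file.getName(), new AtomicInteger(docs.size()));
>>>>>>>         for (Doc d : docs)
>>>>>>>           queue.put(d);                      // blocks when full
>>>>>>>       }
>>>>>>>
>>>>>>>       void workerLoop() throws InterruptedException {
>>>>>>>         while (true) {
>>>>>>>           Doc d = queue.take();
>>>>>>>           index(d);                          // hypothetical indexer
>>>>>>>           // A File counts as processed only when its last Doc is done.
>>>>>>>           if (pending.get(d.parentFileName()).decrementAndGet() == 0)
>>>>>>>             markFileProcessed(d.parentFileName());
>>>>>>>         }
>>>>>>>       }
>>>>>>>     }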
>>>>>>>
>>>>>>> Carrydown is explicitly designed to use unlimited-length database
>>>>>>> fields. Your proposal would work OK only within a single cluster
>>>>>>> member; however, among multiple cluster
>>>>>>> members it could not work. The database is the ManifoldCF medium of
>>>>>>> choice for handling stateful information and for handling
>>>>>>> cross-cluster data requirements.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Karl
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Jul 1, 2014 at 6:07 AM, Matteo Grolla <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>> I wrote a repository connector for crawling solrxml files
>>>>>>>>
>>>>>>>> https://github.com/matteogrolla/mcf-filesystem-xml-connector
>>>>>>>>
>>>>>>>> The work is based on the filesystem connector, but I made several
>>>>>>>> hopefully interesting changes which could be applied elsewhere.
>>>>>>>> I also have a couple of questions.
>>>>>>>> For details see the README file.
>>>>>>>>
>>>>>>>> Matteo
>>>>>>
>>>>>>
>>>>
>>>>
>>
>>