Hi Karl,
	what I'm saying is that this connector doesn't need them, 
and if that means it can be implemented efficiently without modifying the 
MCF framework, that would be great.

-- 
Matteo Grolla
Sourcesense - making sense of Open Source
http://www.sourcesense.com

On 2 Jul 2014, at 20:44, Karl Wright wrote:

> Hi Matteo,
> 
> You may not need document tracking and synchronization, but that is
> MCF's major purpose, so any general feature would have to handle that
> case.
> 
> Karl
> 
> Sent from my Windows Phone
> From: Matteo Grolla
> Sent: 7/2/2014 10:49 AM
> To: [email protected]
> Subject: Re: proposals for writing manifold connectors
> Hi Karl,
>       one note (maybe obvious):
> parent documents (Files) are not to be indexed in Solr; I'm only
> interested in keeping track of them in log reports
> 
>> What is not controversial is that the IProcessActivity.ingestDocument()
> Tell me if I've understood correctly why ingestDocument() must change:
> in case of crashes/errors, when MCF restarts it
> finds Docs to be processed in the queue and has to know the
> corresponding File to resume processing
> 
> If I had a hook at the beginning of a crawl that would allow me to
> remove all Doc instances from the queue (but leave the File
> instances),
> then I could just recreate the Doc instances in the queue (Files are
> reprocessed)
> 
>> But the basic problem is that, in order to be able to delete child
> Am I misunderstanding, or are you thinking that
> 	if I delete FileA then I must delete DocA1..DocAN?
> Because I don't need this.
> 
> thanks
> 
> 
> -- 
> Matteo Grolla
> Sourcesense - making sense of Open Source
> http://www.sourcesense.com
> 
> On 2 Jul 2014, at 12:57, Karl Wright wrote:
> 
>> Hi Matteo,
>> 
>> I looked a bit into what it would require to be able to index multiple
>> documents given a single parent document, WITHOUT the child documents
>> hitting the ManifoldCF document queue.
>> 
>> What is not controversial is that the IProcessActivity.ingestDocument()
>> method would change so that *two* document identifiers were passed in.  The
>> first would be the parent document identifier, and the second would be a
>> child document identifier, which you would make up presumably based on the
>> parent as a starting point.  This would be a requirement for the
>> incremental indexer.
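A minimal sketch of what such a parent/child ingest call might look like. All names here are hypothetical illustrations of the idea, not the real ManifoldCF Java API:

```python
# Hypothetical sketch: an incremental indexer keyed by a
# (parent id, child id) pair, as the proposed two-identifier
# ingestDocument() would require. Not the real ManifoldCF API.

class IncrementalIndexer:
    def __init__(self):
        # (parent_id, child_id) -> version string last indexed
        self.ingest_status = {}

    def ingest_document(self, parent_id, child_id, version, content):
        """Index `content` only if this (parent, child) pair is new
        or its version string has changed; return True if indexed."""
        key = (parent_id, child_id)
        if self.ingest_status.get(key) == version:
            return False  # unchanged since last crawl; skip
        self.ingest_status[key] = version
        # ... here the content would be sent to the output connector ...
        return True

indexer = IncrementalIndexer()
first = indexer.ingest_document("fileA", "fileA/doc1", "2014-01-01", "...")
again = indexer.ingest_document("fileA", "fileA/doc1", "2014-01-01", "...")
```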
>> 
>> But the basic problem is that, in order to be able to delete child
>> documents from the index without involving the repository connector, we
>> need to relate parent documents with child documents in some way, inside
>> the incremental indexer.  There are a number of possible ways of doing
>> this; the simplest would be to just add another column to the ingeststatus
>> table which would allow the separation of parent and child document
>> identifiers.  However, the simple solution is not very good because it
>> greatly exacerbates a problem which we already have in the incremental
>> indexer: there are multiple copies of the document version string being
>> kept, one for each record.  Also, there is currently no logic at all in
>> place to deal with the situation where the list of child documents shrinks;
>> that logic would have to be worked out and there would need to be tracking
>> to identify records that needed to go away as a result.
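The "list of child documents shrinks" case amounts to a set difference over the child identifiers recorded for a parent. A rough sketch, with hypothetical names, of the tracking logic that would have to be worked out:

```python
# Sketch: children recorded for a parent on an earlier crawl but absent
# from the current crawl must be deleted from the index, without
# involving the repository connector. Hypothetical helper, not MCF code.

def children_to_delete(previously_indexed, current_children):
    """Return child ids that were indexed before but are gone now."""
    return sorted(set(previously_indexed) - set(current_children))

# FileA used to yield doc1..doc10 but now yields only doc1..doc5:
old = [f"doc{i}" for i in range(1, 11)]
new = [f"doc{i}" for i in range(1, 6)]
to_delete = children_to_delete(old, new)
```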
>> 
>> In short, this would be a significant change -- which is OK, but before
>> considering it I'd have to work it through carefully, and make sure we
>> don't lose performance etc.
>> 
>> Thanks,
>> Karl
>> 
>> 
>> 
>> On Tue, Jul 1, 2014 at 12:12 PM, Matteo Grolla <[email protected]>
>> wrote:
>> 
>>> Hi Karl,
>>>       first of all thanks
>>> 
>>>> The reason MCF is not currently structured this way is because a decision
>>> I think that in general the MCF design is sound and generic.
>>> As a connector developer I'd just like to have more flexibility in
>>> particular situations.
>>> Maybe what I'm searching for is already there, or wouldn't be disruptive to
>>> introduce.
>>> A mail exchange doesn't make this kind of discussion easy;
>>> to make precise proposals I should probably take a detailed look at the
>>> framework source code.
>>> 
>>> --
>>> Matteo Grolla
>>> Sourcesense - making sense of Open Source
>>> http://www.sourcesense.com
>>> 
>>> On 1 Jul 2014, at 16:47, Karl Wright wrote:
>>> 
>>>> Hi Matteo,
>>>> 
>>>> Ok, from your description it sounds like what you primarily want is for
>>>> the processing of one document to generate N index entries, each with its
>>>> own version string etc.  This would never go near the queue since
>>>> effectively you'd only be dealing with the large files there (FileA and
>>>> FileB in your example).  You are planning to get away with doing no
>>>> incremental management because you will simply repeat yourself if
>>>> something goes wrong in the middle and document processing is not
>>>> completed.
>>>> 
>>>> The reason MCF is not currently structured this way is because a decision
>>>> needs to be made *up front* whether to process the document or not, and
>>>> that cannot be done in your model without actually fetching and
>>>> processing the large file.  So it is in fact a chicken-and-egg problem.
>>>> I will think if I can see a solution to it, but I've considered this in
>>>> the past and not found a good way to structure this kind of arrangement.
>>>> Indeed, carry-down was designed in part to solve this problem.
>>>> 
>>>> Karl
>>>> 
>>>> 
>>>> 
>>>> On Tue, Jul 1, 2014 at 10:36 AM, Matteo Grolla <[email protected]>
>>>> wrote:
>>>> 
>>>>> Hi Karl
>>>>> 
>>>>> I read the book and in general all the principles seem sound.
>>>>> What I'm thinking is that for some specific connectors (that is, in
>>>>> specific conditions) people may want to exploit that specificity by
>>>>> taking different approaches.
>>>>> 
>>>>>> I don't think database management is as difficult as you seem to think.
>>>>> 
>>>>> Maybe I wasn't clear on this, but what I mean is:
>>>>> if I propose to my typical customer
>>>>> a crawler that requires Postgres even for simple crawls, they'd probably
>>>>> prefer to write a custom app instead.
>>>>> If I could at least say that the db doesn't grow a lot, that would
>>>>> mitigate the problem.
>>>>> I don't know if I'm the only one with this problem.
>>>>> 
>>>>> 
>>>>>> You have forgotten what happens when either errors occur during
>>>>> processing,
>>>>>> or the agents process working on your documents dies
>>>>> 
>>>>> Let's say FileA
>>>>> contains DocA1..DocA100
>>>>> 
>>>>> As expressed in the comment on carry-down data:
>>>>> if I have errors or the crawler dies while processing DocA50,
>>>>> since I want FileA to be considered processed only when all its docs
>>>>> have been processed,
>>>>> at restart the system should:
>>>>> reparse FileA
>>>>> skip DocA1..DocA49 (if I'm handling versioning for them)
>>>>> process DocA50..DocA100
>>>>> 
>>>>> If there's a failure I have to reparse FileA, but I avoid storing 100
>>>>> docs in the db.
>>>>> For me that's good; failures are not so frequent.
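The reparse-and-skip behaviour described above can be sketched in a few lines; the names and data shapes are hypothetical, chosen only to illustrate the resume logic:

```python
# Sketch of the crash-recovery scenario: reparse the whole file, skip
# docs already indexed at the file's version, process the rest.
# All names are illustrative, not ManifoldCF code.

def resume_file(doc_ids, indexed_versions, file_version, process):
    """Replay a file's docs; skip those already at `file_version`."""
    for doc_id in doc_ids:
        if indexed_versions.get(doc_id) == file_version:
            continue  # DocA1..DocA49 in the example: already indexed
        process(doc_id)
        indexed_versions[doc_id] = file_version

# Simulate a crash that happened right after DocA49 was indexed:
already = {f"DocA{i}": "2014-01-01" for i in range(1, 50)}
processed = []
resume_file([f"DocA{i}" for i in range(1, 101)],
            already, "2014-01-01", processed.append)
```

FileA is read again in full, but only DocA50..DocA100 reach the index, which matches the "failures are not so frequent" trade-off.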
>>>>> 
>>>>>> You are forgetting the fact that MCF is incremental.
>>>>> Let's say:
>>>>> 
>>>>> in the first crawl MCF processes
>>>>> FileA dated 2014-01-01
>>>>> containing Doc1..Doc10
>>>>> all docs are versioned  2014-01-01
>>>>> 
>>>>> in the second crawl
>>>>> FileB dated 2014-01-02
>>>>> containing Doc1..Doc5
>>>>> all docs are versioned  2014-01-02
>>>>> 
>>>>> so Doc1..Doc5 are overwritten with data from FileB;
>>>>> I don't need carry-down data from the previous crawl
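The two-crawl example above reduces to a version-string comparison per doc; a minimal sketch under the same assumptions (hypothetical names, version strings that compare lexicographically as dates):

```python
# Sketch of the incremental-overwrite argument: each doc carries the
# version of the file that last supplied it, and a newer file simply
# overwrites, so no carry-down from the previous crawl is consulted.

index = {}  # doc_id -> version string of the file that last supplied it

def crawl(file_version, doc_ids):
    """Re-ingest every doc a file contains if the file is newer."""
    for doc_id in doc_ids:
        if index.get(doc_id, "") < file_version:
            index[doc_id] = file_version

crawl("2014-01-01", [f"Doc{i}" for i in range(1, 11)])  # FileA: Doc1..Doc10
crawl("2014-01-02", [f"Doc{i}" for i in range(1, 6)])   # FileB: Doc1..Doc5
# Doc1..Doc5 now carry 2014-01-02; Doc6..Doc10 keep 2014-01-01
```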
>>>>> 
>>>>> --
>>>>> Matteo Grolla
>>>>> Sourcesense - making sense of Open Source
>>>>> http://www.sourcesense.com
>>>>> 
>>>>> On 1 Jul 2014, at 15:04, Karl Wright wrote:
>>>>> 
>>>>>> Hi Matteo,
>>>>>> 
>>>>>> I don't think database management is as difficult as you seem to think.
>>>>>> But more importantly, you seem to have issues with very basic aspects
>>> of
>>>>>> the ManifoldCF design.  It may help to read the book (see
>>>>>> https://manifoldcf.apache.org/en_US/books-and-presentations.html), to
>>>>>> understand more completely where the design decisions came from.
>>>>>> 
>>>>>> In short, ManifoldCF is built on top of a database for the following
>>>>>> reasons:
>>>>>> - resilience
>>>>>> - restartability
>>>>>> - control of memory footprint
>>>>>> - synchronization
>>>>>> 
>>>>>> It was, in fact, designed around the capabilities of databases (not
>>> just
>>>>>> postgresql, but all modern databases), so it is not surprising that it
>>>>> uses
>>>>>> the database for everything persistent at all.  Incremental crawling
>>>>> means
>>>>>> that even more things need to be persistent in ManifoldCF than they
>>> might
>>>>>> in other crawling designs.
>>>>>> 
>>>>>> So I suggest that you carefully read the first chapter of MCF in
>>>>>> Action, and consider the design point of this crawler carefully.
>>>>>> 
>>>>>> As for your specific questions:
>>>>>> 
>>>>>> 'I can't say: "before processing another file process all the docs from
>>>>>> previous files"' - Yes, and the reason for that is because the MCF
>>>>>> queue is built once again in the database.  Documents that are to be
>>>>>> processed must be queried, and that query must be performant.
>>>>>> 
>>>>>> 'do I need intra cluster synchronization to process the docs contained
>>>>>> in a file?
>>>>>> If I state that the machine that processed the file is the one that
>>>>>> processes the docs contained in it then I don't.'
>>>>>> 
>>>>>> You have forgotten what happens when either errors occur during
>>>>>> processing, or the agents process working on your documents dies
>>>>>> (because it got killed, say).  Unless you want to lose all context,
>>>>>> then you need persistent storage.
>>>>>> 
>>>>>> 'If it's difficult to do without a db for carry down data I'd like that
>>>>>> table to remain small, maybe empty it at the end of every crawl.
>>>>>> How could I do that?'
>>>>>> 
>>>>>> You are forgetting the fact that MCF is incremental.  If you want it
>>>>>> to do the minimum work on a subsequent crawl, it has to keep track of
>>>>>> what inputs are around for each document to be processed.
>>>>>> 
>>>>>> Karl
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Tue, Jul 1, 2014 at 8:48 AM, Matteo Grolla <[email protected]>
>>>>>> wrote:
>>>>>> 
>>>>>>> Hi Karl,
>>>>>>>     glad it's appreciated.
>>>>>>> 
>>>>>>> concerning the answer:
>>>>>>> 
>>>>>>> My intention is not to avoid using the db, just to limit its use to
>>>>>>> what is strictly necessary.
>>>>>>> And I surely don't want to invent a new way to handle intra-cluster
>>>>>>> communication.
>>>>>>> 
>>>>>>> For me it would be OK to keep track of crawled docIds and versions in
>>>>>>> the db; I'd just like to avoid putting carry-down data there.
>>>>>>> 
>>>>>>> Motivations:
>>>>>>> I recently worked on a couple of projects dealing with crawling data
>>>>>>> from external data sources and importing it into Solr.
>>>>>>> One was implemented as a Lucidworks Search connector;
>>>>>>> the other was a custom crawling app.
>>>>>>> In both cases the crawling part was simple: the data source gave me
>>>>>>> all the modified/deleted documents within a time interval.
>>>>>>> The processing pipeline to enrich and transform the documents was more
>>>>>>> involved.
>>>>>>> 
>>>>>>> In cases like these I'd like to just focus on sizing the crawler and
>>>>>>> Solr instances.
>>>>>>> If I have to size a db I'll have to deal with a DBA, and many
>>>>>>> customers are not experienced with Postgres, so the MCF solution
>>>>>>> becomes less appealing.
>>>>>>> Even if I find a Postgres DBA, I'll have to deal with him for things
>>>>>>> like performance problems, size…
>>>>>>> All things I'd like to avoid if not strictly necessary.
>>>>>>> 
>>>>>>> Please correct me if I'm wrong in what follows.
>>>>>>> Why do I need carry-down data in the db?
>>>>>>> Because I want bounded memory usage and have no control over the order
>>>>>>> MCF follows in processing docs:
>>>>>>> I can't say "before processing another file, process all the docs from
>>>>>>> previous files".
>>>>>>> Do I need intra-cluster synchronization to process the docs contained
>>>>>>> in a file?
>>>>>>> If I state that the machine that processed the file is the one that
>>>>>>> processes the docs contained in it, then I don't.
>>>>>>> 
>>>>>>> What do you think?
>>>>>>> If it's difficult to do without a db for carry down data I'd like that
>>>>>>> table to remain small, maybe empty it at the end of every crawl.
>>>>>>> How could I do that?
>>>>>>> 
>>>>>>> If I were to synthesize this mail in one sentence I'd say:
>>>>>>> "Given simple crawling requirements, I'd like to be able to implement
>>>>>>> an MCF solution that is performant and simple to manage"
>>>>>>> 
>>>>>>> thanks
>>>>>>> 
>>>>>>> --
>>>>>>> Matteo Grolla
>>>>>>> Sourcesense - making sense of Open Source
>>>>>>> http://www.sourcesense.com
>>>>>>> 
>>>>>>> On 1 Jul 2014, at 12:16, Karl Wright wrote:
>>>>>>> 
>>>>>>>> Hi Matteo,
>>>>>>>> 
>>>>>>>> Thank you for the work you put in here.
>>>>>>>> 
>>>>>>>> I have a response to one particular question buried deep in your
>>> code:
>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>> 
>>>>>>>>      FIXME: carrydown values
>>>>>>>>        serializing the documents seems a quick way to reach the goal
>>>>>>>>        what is the size limit for this data in the sql table?
>>>>>>>> 
>>>>>>>>        PROPOSAL
>>>>>>>>        What I'd really like is to avoid the use of a db table for
>>>>>>>>        carrydown data and keep it in memory
>>>>>>>>        something like:
>>>>>>>>          MCF starts processing File A
>>>>>>>>          docs A1, A2, ... AN are added to the queue
>>>>>>>>          MCF starts processing File B
>>>>>>>>          docs B1, B2, ... are added to the queue
>>>>>>>>          and so on...
>>>>>>>>          as soon as all docs A1..AN have been processed, A is
>>>>>>>>          considered processed
>>>>>>>>          in case of failure (manifold is restarted in the middle
>>>>>>>>          of a crawl)
>>>>>>>>            all files (A, B...) should be reprocessed
>>>>>>>>          the size of the queue should be bounded
>>>>>>>>            once filled, MCF should stop processing files until
>>>>>>>>            more docs are processed
>>>>>>>> 
>>>>>>>>        MOTIVATION
>>>>>>>>        -I'd like to avoid putting pressure on the db if possible,
>>>>>>>>        so that it doesn't become a concern in production
>>>>>>>>        -performance
>>>>>>>> <<<<<<
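The bounded-queue behaviour in the PROPOSAL (stop parsing files until docs drain) is essentially a producer that blocks on a full in-memory queue. A minimal single-process sketch, with illustrative names, that deliberately ignores the multi-cluster and crash-recovery objections:

```python
# Sketch of the in-memory bounded queue idea from the PROPOSAL:
# the file parser (producer) blocks once the queue is full, so no
# more files are parsed until workers drain some docs.
import queue
import threading

doc_queue = queue.Queue(maxsize=8)  # bounded: put() blocks when full
processed = []

def worker():
    while True:
        doc = doc_queue.get()
        if doc is None:             # sentinel: shut the worker down
            break
        processed.append(doc)       # ... index the doc here ...

t = threading.Thread(target=worker)
t.start()

# Producer side: "MCF starts processing File A, docs A1..AN are added
# to the queue", then File B, and so on.
for f in ("A", "B"):
    for i in range(1, 101):
        doc_queue.put(f"Doc{f}{i}")  # blocks whenever 8 docs are pending
doc_queue.put(None)
t.join()
```

Nothing here is persistent, which is exactly the trade-off: a restart mid-crawl means every file is reprocessed from the start.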
>>>>>>>> 
>>>>>>>> Carrydown is explicitly designed to use unlimited-length database
>>>>>>>> fields.  Your proposal would work OK only within a single cluster
>>>>>>>> member; however, among multiple cluster members it could not work.
>>>>>>>> The database is the ManifoldCF medium of choice for handling
>>>>>>>> stateful information and for handling cross-cluster data
>>>>>>>> requirements.
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> Karl
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Tue, Jul 1, 2014 at 6:07 AM, Matteo Grolla <[email protected]>
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> Hi,
>>>>>>>>>    I wrote a repository connector for crawling solrxml files
>>>>>>>>> 
>>>>>>>>> https://github.com/matteogrolla/mcf-filesystem-xml-connector
>>>>>>>>> 
>>>>>>>>> The work is based on the filesystem connector, but I made several
>>>>>>>>> hopefully interesting changes which could be applied elsewhere.
>>>>>>>>> I also have a couple of questions.
>>>>>>>>> For details see the README file.
>>>>>>>>> Matteo
>>>>>>> 
>>>>>>> 
>>>>> 
>>>>> 
>>> 
>>> 
