Hi Matteo,

I can't see a way of supporting this model other than by extending the
framework. I've created a ticket (CONNECTORS-989); the analysis is included,
and the costs are specified as well. Please comment on this issue in the
ticket going forward.
Thanks,
Karl

On Wed, Jul 2, 2014 at 5:22 PM, Matteo Grolla <[email protected]> wrote:

> Hi Karl,
> what I'm saying is that this connector doesn't need them,
> and if this means that it can be implemented efficiently without modifying
> the MCF framework, that would be great.
>
> --
> Matteo Grolla
> Sourcesense - making sense of Open Source
> http://www.sourcesense.com
>
> On 02 Jul 2014, at 20:44, Karl Wright wrote:
>
> Hi Matteo,
>
> You may not need document tracking and synchronization, but that is
> mcf's major purpose, so any general feature would have to handle that
> case.
>
> Karl
>
> Sent from my Windows Phone
> From: Matteo Grolla
> Sent: 7/2/2014 10:49 AM
> To: [email protected]
> Subject: Re: proposals for writing manifold connectors
>
> Hi Karl,
> one note (maybe obvious):
> parent documents (Files) are not to be indexed in solr; I'm only
> interested in keeping track of them in log reports
>
> "What is not controversial is that the IProcessActivity.ingestDocument()"
>
> tell me if I've understood why ingestDocument must change:
> in case of crashes/errors, when mcf restarts
> it finds Docs to be processed in the queue, and it has to know the
> corresponding File to resume processing
>
> if I had a hook at the beginning of a crawl that would allow me to
> remove all Docs instances from the queue (but leave the File
> instances),
> then I could just recreate the Docs instances in the queue (Files are
> reprocessed)
>
> "But the basic problem is that, in order to be able to delete child"
>
> Am I misunderstanding, or are you thinking that
> if I delete FileA then I must delete DocA1..DocAN?
> Because I don't need this.
>
> thanks
>
> --
> Matteo Grolla
> Sourcesense - making sense of Open Source
> http://www.sourcesense.com
>
> On 02 Jul 2014, at 12:57, Karl Wright wrote:
>
> > Hi Matteo,
> >
> > I looked a bit into what it would require to be able to index multiple
> > documents given a single parent document, WITHOUT the child documents
> > hitting the ManifoldCF document queue.
> >
> > What is not controversial is that the IProcessActivity.ingestDocument()
> > method would change so that *two* document identifiers were passed in.
> > The first would be the parent document identifier, and the second would
> > be a child document identifier, which you would presumably make up based
> > on the parent as a starting point. This would be a requirement for the
> > incremental indexer.
> >
> > But the basic problem is that, in order to be able to delete child
> > documents from the index without involving the repository connector, we
> > need to relate parent documents with child documents in some way, inside
> > the incremental indexer. There are a number of possible ways of doing
> > this; the simplest would be to just add another column to the ingeststatus
> > table which would allow the separation of parent and child document
> > identifiers. However, the simple solution is not very good, because it
> > greatly exacerbates a problem which we already have in the incremental
> > indexer: there are multiple copies of the document version string being
> > kept, one for each record. Also, there is currently no logic at all in
> > place to deal with the situation where the list of child documents
> > shrinks; that logic would have to be worked out, and there would need to
> > be tracking to identify records that needed to go away as a result.
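For concreteness, here is a minimal sketch of the two-identifier ingest call
described above. This is not the actual IProcessActivity interface; the
method and parameter names are assumptions, and plain Java types stand in
for MCF's own (e.g. its RepositoryDocument class):

    // Hypothetical sketch only -- not the real IProcessActivity contract.
    public interface TwoIdentifierIngestSketch
    {
      // Roughly the current shape: one document identifier per index entry.
      void ingestDocument(String documentIdentifier, String version,
        String documentURI, Object documentData) throws Exception;

      // Proposed shape: the parent identifier ties the record to the MCF
      // queue; the child identifier -- made up by the connector, presumably
      // from the parent id (e.g. parentId + "#" + n) -- distinguishes the
      // N index entries derived from that parent, each with its own
      // version string.
      void ingestDocument(String parentDocumentIdentifier,
        String childDocumentIdentifier, String version,
        String documentURI, Object documentData) throws Exception;
    }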
> >
> > In short, this would be a significant change -- which is OK, but before
> > considering it I'd have to work it through carefully, and make sure we
> > don't lose performance etc.
> >
> > Thanks,
> > Karl
> >
> > On Tue, Jul 1, 2014 at 12:12 PM, Matteo Grolla <[email protected]>
> > wrote:
> >
> > Hi Karl,
> > first of all, thanks.
> >
> > "The reason MCF is not currently structured this way is because a decision"
> >
> > I think that in general the MCF design is sound and generic.
> > As a connector developer I'd just like to have more flexibility in
> > particular situations.
> > Maybe what I'm searching for is already there, or wouldn't be disruptive
> > to introduce.
> > Mail exchange doesn't make this kind of discussion easy.
> > To make precise proposals I should probably take a detailed look at the
> > framework source code.
> >
> > --
> > Matteo Grolla
> > Sourcesense - making sense of Open Source
> > http://www.sourcesense.com
> >
> > On 01 Jul 2014, at 16:47, Karl Wright wrote:
> >
> > Hi Matteo,
> >
> > Ok, from your description it sounds like what you primarily want is for
> > the processing of one document to generate N index entries, each with its
> > own version string etc. This would never go near the queue, since
> > effectively you'd only be dealing with the large files there (FileA and
> > FileB in your example). You are planning to get away with doing no
> > incremental management because you will simply repeat yourself if
> > something goes wrong in the middle and document processing is not
> > completed.
> >
> > The reason MCF is not currently structured this way is because a decision
> > needs to be made *up front* whether to process the document or not, and
> > that cannot be done in your model without actually fetching and
> > processing the large file. So it is in fact a chicken-and-egg problem.
> > I will think about whether I can see a solution to it, but I've
> > considered this in the past and not found a good way to structure this
> > kind of arrangement. Indeed, carry-down was designed in part to solve
> > this problem.
> >
> > Karl
> >
> > On Tue, Jul 1, 2014 at 10:36 AM, Matteo Grolla <[email protected]>
> > wrote:
> >
> > Hi Karl,
> >
> > I read the book, and in general all the principles seem sound.
> > What I'm thinking is that for some specific connectors (that is, in
> > specific conditions) people may want to exploit that specificity by
> > taking different approaches.
> >
> > "I don't think database management is as difficult as you seem to think."
> >
> > Maybe I wasn't clear on this, but what I mean is this:
> > if I propose to my typical customer
> > a crawler that requires postgres even for simple crawls, they'd probably
> > prefer to write a custom app for simple crawls.
> > If I could at least say that the db doesn't grow a lot, that would
> > mitigate the problem.
> > I don't know if I'm the only one with this problem.
> >
> > "You have forgotten what happens when either errors occur during
> > processing, or the agents process working on your documents dies"
> >
> > Let's say fileA
> > contains DocA1..DocA100
> >
> > As expressed in the comment on carry-down data:
> > if I have errors, or the crawler dies in the processing of DocA50,
> > since I want FileA to be considered processed only when all its docs have
> > been processed,
> > at restart the system should:
> > reparse FileA
> > skip DocA1..DocA49 (if I'm handling versioning for them)
> > process DocA50..DocA100
> >
> > If there's a failure I have to reparse FileA, but I avoid storing 100
> > docs in the db.
> > For me that's good; failures are not so frequent.
> >
> > "You are forgetting the fact that MCF is incremental."
> >
> > Let's say:
> >
> > in the first crawl MCF processes
> > FileA dated 2014-01-01
> > containing Doc1..Doc10
> > all docs are versioned 2014-01-01
> >
> > in the second crawl
> > FileB dated 2014-01-02
> > containing Doc1..Doc5
> > all docs are versioned 2014-01-02
> >
> > so Doc1..Doc5 are overwritten with data from FileB;
> > I don't need carry-down data from the previous crawl.
> >
> > --
> > Matteo Grolla
> > Sourcesense - making sense of Open Source
> > http://www.sourcesense.com
> >
> > On 01 Jul 2014, at 15:04, Karl Wright wrote:
> >
> > Hi Matteo,
> >
> > I don't think database management is as difficult as you seem to think.
> > But more importantly, you seem to have issues with very basic aspects of
> > the ManifoldCF design. It may help to read the book (see
> > https://manifoldcf.apache.org/en_US/books-and-presentations.html) to
> > understand more completely where the design decisions came from.
> >
> > In short, ManifoldCF is built on top of a database for the following
> > reasons:
> > - resilience
> > - restartability
> > - control of memory footprint
> > - synchronization
> >
> > It was, in fact, designed around the capabilities of databases (not just
> > postgresql, but all modern databases), so it is not surprising that it
> > uses the database for everything persistent at all. Incremental crawling
> > means that even more things need to be persistent in ManifoldCF than they
> > might in other crawling designs.
> >
> > So I suggest that you carefully read the first chapter of MCF in Action,
> > and consider the design point of this crawler carefully.
> >
> > As for your specific questions:
> >
> > 'I can't say: "before processing another file process all the docs from
> > previous files"' - Yes, and the reason for that is that the MCF queue is
> > built, once again, in the database. Documents that are to be processed
> > must be queried, and that query must be performant.
> >
> > 'do I need intra cluster synchronization to process the docs contained
> > in a file?
> > If I state that the machine that processed the file is the one that
> > processes the docs contained in it then I don't.'
> >
> > You have forgotten what happens when either errors occur during
> > processing, or the agents process working on your documents dies
> > (because it got killed, say). Unless you want to lose all context, you
> > need persistent storage.
> >
> > 'If it's difficult to do without a db for carry down data I'd like that
> > table to remain small, maybe empty it at the end of every crawl.
> > How could I do that?'
> >
> > You are forgetting the fact that MCF is incremental. If you want it to
> > do the minimum work on a subsequent crawl, it has to keep track of what
> > inputs are around for each document to be processed.
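As an illustration of the restart behavior Matteo describes above (always
reparse the parent file, skip children already indexed at the current
version, treat the parent as processed only when every child is done), here
is a self-contained sketch. All class and method names are hypothetical,
not part of the ManifoldCF API:

    import java.util.List;

    public class ResumableFileSketch
    {
      interface IndexState
      {
        // True if this child is already indexed at exactly this version.
        boolean isIndexedAt(String childId, String version);
        void markIndexed(String childId, String version);
      }

      interface Indexer
      {
        void ingest(String childId, String content, String version);
      }

      static class ChildDoc
      {
        final String id;
        final String content;
        ChildDoc(String id, String content) { this.id = id; this.content = content; }
      }

      // fileVersion would be e.g. the file's modification date, "2014-01-01".
      void processFile(List<ChildDoc> children, String fileVersion,
        IndexState state, Indexer indexer)
      {
        for (ChildDoc child : children)   // the whole file is always reparsed
        {
          if (state.isIndexedAt(child.id, fileVersion))
            continue;                     // e.g. skip DocA1..DocA49 after a crash
          indexer.ingest(child.id, child.content, fileVersion); // resume at DocA50
          state.markIndexed(child.id, fileVersion);
        }
        // Only here is the parent file considered fully processed; a crash
        // inside the loop means the file is reparsed on the next run -- the
        // trade-off accepted above to keep child docs out of the database.
      }
    }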
> >
> > Karl
> >
> > On Tue, Jul 1, 2014 at 8:48 AM, Matteo Grolla <[email protected]>
> > wrote:
> >
> > Hi Karl,
> > glad it's appreciated.
> >
> > Concerning the answer:
> >
> > My intention is not to avoid using the db, just to limit its use to what
> > is strictly necessary.
> > And surely I don't want to find a new way to handle intra-cluster
> > communication.
> >
> > For me it would be ok to keep track of crawled docIds and versions in the
> > db; I'd just like to avoid putting carry-down data there.
> >
> > Motivations:
> > I recently worked on a couple of projects dealing with crawling data from
> > external data sources and importing it into solr.
> > One was implemented as a Lucidworks Search connector.
> > The other was a custom crawling app.
> > In both cases the crawling part was simple; the data source was giving me
> > all the modified / deleted documents within a time interval.
> > The processing pipeline to enrich and transform the documents was more
> > involved.
> >
> > In cases like these I'd like to just focus on sizing the crawler and solr
> > instances.
> > If I have to size a db I'll have to deal with a dba, and many customers
> > are not experienced with Postgres, so the mcf solution becomes less
> > appealing.
> > Even if I find a postgres dba, I'll have to deal with him for things like
> > performance problems, size...
> > All things I'd like to avoid if not strictly necessary.
> >
> > Please correct me if I'm wrong in what follows.
> > Why do I need carry-down data in the db?
> > Because I want bounded memory usage and have no control over the order
> > mcf follows in processing docs;
> > I can't say: "before processing another file, process all the docs from
> > previous files".
> > Do I need intra-cluster synchronization to process the docs contained in
> > a file?
> > If I state that the machine that processed the file is the one that
> > processes the docs contained in it, then I don't.
> >
> > What do you think?
> > If it's difficult to do without a db for carry-down data, I'd like that
> > table to remain small; maybe empty it at the end of every crawl.
> > How could I do that?
> >
> > If I were to summarize this mail in one sentence, I'd say:
> > "Given simple crawling requirements, I'd like to be able to implement an
> > MCF solution that is performant and simple to manage."
> >
> > thanks
> >
> > --
> > Matteo Grolla
> > Sourcesense - making sense of Open Source
> > http://www.sourcesense.com
> >
> > On 01 Jul 2014, at 12:16, Karl Wright wrote:
> >
> > Hi Matteo,
> >
> > Thank you for the work you put in here.
> >
> > I have a response to one particular question buried deep in your code:
> >
> >     FIXME: carrydown values
> >     serializing the documents seems a quick way to reach the goal
> >     what is the size limit for this data in the sql table?
> >
> >     PROPOSAL
> >     What I'd really like is to avoid the use of a db table for
> >     carrydown data and keep it in memory
> >     something like:
> >     MCF starts processing File A
> >     docs A1, A2, ... AN are added to the queue
> >     MCF starts processing File B
> >     docs B1, B2, ... are added to the queue
> >     and so on...
> >     as soon as all docs A1..AN have been processed, A is
> >     considered processed
> >     in case of failure (manifold is restarted in the middle
> >     of a crawl)
> >     all files (A, B...)
> >     should be reprocessed
> >     the size of the queue should be bounded;
> >     once filled, MCF should stop processing files until
> >     more docs are processed
> >
> >     MOTIVATION
> >     - I'd like to avoid putting pressure on the db if possible,
> >       so that it doesn't become a concern in production
> >     - performance
> > <<<<<<
> >
> > Carrydown is explicitly designed to use unlimited-length database
> > fields. Your proposal would work OK only within a single cluster
> > member; however, among multiple cluster members it could not work.
> > The database is the ManifoldCF medium of choice for handling stateful
> > information and for handling cross-cluster data requirements.
> >
> > Thanks,
> > Karl
> >
> > On Tue, Jul 1, 2014 at 6:07 AM, Matteo Grolla <[email protected]>
> > wrote:
> >
> > Hi,
> > I wrote a repository connector for crawling solrxml files:
> >
> > https://github.com/matteogrolla/mcf-filesystem-xml-connector
> >
> > The work is based on the filesystem connector, but I made several
> > hopefully interesting changes which could be applied elsewhere.
> > I also have a couple of questions.
> > For details see the readme file.
> >
> > Matteo
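The bounded in-memory queue in the proposal quoted above could be sketched
with a standard Java BlockingQueue; the class name and capacity here are
hypothetical. put() blocks when the queue is full, so parsing of new files
pauses until workers drain pending docs, and memory stays bounded. Note
Karl's objection, though: this state lives in a single JVM, so unlike the
database-backed queue it cannot coordinate work among multiple cluster
members:

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class BoundedDocQueueSketch
    {
      private final BlockingQueue<String> docs = new ArrayBlockingQueue<>(1000);

      // Producer side: parsing File A offers DocA1..DocAN.
      public void addDoc(String docId) throws InterruptedException
      {
        docs.put(docId); // blocks once 1000 docs are pending
      }

      // Consumer side: worker threads on the same machine take docs to index.
      public String nextDoc() throws InterruptedException
      {
        return docs.take();
      }
    }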
