Hi Matteo,

I can't see a way of supporting this model other than by extending the
framework. I've created a ticket (CONNECTORS-989); the analysis is included,
and the costs are specified as well. Please comment on this issue in the
ticket going forward.
Thanks,
Karl

On Wed, Jul 2, 2014 at 5:22 PM, Matteo Grolla <[email protected]> wrote:

> Hi Karl,
> what I'm saying is that this connector doesn't need them,
> and if this means that it can be implemented efficiently without modifying
> the MCF framework, that would be great.
>
> --
> Matteo Grolla
> Sourcesense - making sense of Open Source
> http://www.sourcesense.com
>
> On 02 Jul 2014, at 20:44, Karl Wright wrote:
>
> Hi Matteo,
>
> You may not need document tracking and synchronization, but that is
> mcf's major purpose, so any general feature would have to handle that
> case.
>
> Karl
>
> Sent from my Windows Phone
> From: Matteo Grolla
> Sent: 7/2/2014 10:49 AM
> To: [email protected]
> Subject: Re: proposals for writing manifold connectors
>
> Hi Karl,
> one note (maybe obvious):
> parent documents (Files) are not to be indexed in solr; I'm only
> interested in keeping track of them in log reports
>
> "What is not controversial is that the IProcessActivity.ingestDocument()"
>
> tell me if I've understood why ingestDocument must change:
> in case of crashes/errors, when mcf restarts
> it finds Docs to be processed in the queue, and it has to know the
> corresponding File to resume processing
>
> if I had a hook at the beginning of a crawl that would allow me to
> remove all Docs instances from the queue (but leave the File
> instances),
> then I could just recreate the Docs instances in the queue (Files are
> reprocessed)
>
> "But the basic problem is that, in order to be able to delete child"
>
> Am I misunderstanding, or are you thinking that
> if I delete FileA then I must delete DocA1..DocAN?
> Because I don't need this.
>
> thanks
>
> --
> Matteo Grolla
> Sourcesense - making sense of Open Source
> http://www.sourcesense.com
>
> On 02 Jul 2014, at 12:57, Karl Wright wrote:
>
> > Hi Matteo,
> >
> > I looked a bit into what it would require to be able to index multiple
> > documents given a single parent document, WITHOUT the child documents
> > hitting the ManifoldCF document queue.
> >
> > What is not controversial is that the IProcessActivity.ingestDocument()
> > method would change so that *two* document identifiers were passed in.
> > The first would be the parent document identifier, and the second would
> > be a child document identifier, which you would presumably make up based
> > on the parent as a starting point. This would be a requirement for the
> > incremental indexer.
> >
> > But the basic problem is that, in order to be able to delete child
> > documents from the index without involving the repository connector, we
> > need to relate parent documents with child documents in some way, inside
> > the incremental indexer. There are a number of possible ways of doing
> > this; the simplest would be to just add another column to the ingeststatus
> > table which would allow the separation of parent and child document
> > identifiers. However, the simple solution is not very good, because it
> > greatly exacerbates a problem which we already have in the incremental
> > indexer: there are multiple copies of the document version string being
> > kept, one for each record. Also, there is currently no logic at all in
> > place to deal with the situation where the list of child documents
> > shrinks; that logic would have to be worked out, and there would need to
> > be tracking to identify records that needed to go away as a result.
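For concreteness, here is a minimal sketch of the two-identifier ingest call
described above. This is not the actual IProcessActivity interface; the
method and parameter names are assumptions, and plain Java types stand in
for MCF's own (e.g. its RepositoryDocument class):

    // Hypothetical sketch only -- not the real IProcessActivity contract.
    public interface TwoIdentifierIngestSketch
    {
      // Roughly the current shape: one document identifier per index entry.
      void ingestDocument(String documentIdentifier, String version,
        String documentURI, Object documentData) throws Exception;

      // Proposed shape: the parent identifier ties the record to the MCF
      // queue; the child identifier -- made up by the connector, presumably
      // from the parent id (e.g. parentId + "#" + n) -- distinguishes the
      // N index entries derived from that parent, each with its own
      // version string.
      void ingestDocument(String parentDocumentIdentifier,
        String childDocumentIdentifier, String version,
        String documentURI, Object documentData) throws Exception;
    }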
> >
> > In short, this would be a significant change -- which is OK, but before
> > considering it I'd have to work it through carefully, and make sure we
> > don't lose performance etc.
> >
> > Thanks,
> > Karl
> >
> > On Tue, Jul 1, 2014 at 12:12 PM, Matteo Grolla <[email protected]>
> > wrote:
> >
> > Hi Karl,
> > first of all, thanks.
> >
> > "The reason MCF is not currently structured this way is because a decision"
> >
> > I think that in general the MCF design is sound and generic.
> > As a connector developer I'd just like to have more flexibility in
> > particular situations.
> > Maybe what I'm searching for is already there, or wouldn't be disruptive
> > to introduce.
> > Mail exchange doesn't make this kind of discussion easy.
> > To make precise proposals I should probably take a detailed look at the
> > framework source code.
> >
> > --
> > Matteo Grolla
> > Sourcesense - making sense of Open Source
> > http://www.sourcesense.com
> >
> > On 01 Jul 2014, at 16:47, Karl Wright wrote:
> >
> > Hi Matteo,
> >
> > Ok, from your description it sounds like what you primarily want is for
> > the processing of one document to generate N index entries, each with its
> > own version string etc. This would never go near the queue, since
> > effectively you'd only be dealing with the large files there (FileA and
> > FileB in your example). You are planning to get away with doing no
> > incremental management because you will simply repeat yourself if
> > something goes wrong in the middle and document processing is not
> > completed.
> >
> > The reason MCF is not currently structured this way is because a decision
> > needs to be made *up front* whether to process the document or not, and
> > that cannot be done in your model without actually fetching and
> > processing the large file. So it is in fact a chicken-and-egg problem.
> > I will think about whether I can see a solution to it, but I've
> > considered this in the past and not found a good way to structure this
> > kind of arrangement. Indeed, carry-down was designed in part to solve
> > this problem.
> >
> > Karl
> >
> > On Tue, Jul 1, 2014 at 10:36 AM, Matteo Grolla <[email protected]>
> > wrote:
> >
> > Hi Karl,
> >
> > I read the book, and in general all the principles seem sound.
> > What I'm thinking is that for some specific connectors (that is, in
> > specific conditions) people may want to exploit that specificity by
> > taking different approaches.
> >
> > "I don't think database management is as difficult as you seem to think."
> >
> > Maybe I wasn't clear on this, but what I mean is this:
> > if I propose to my typical customer
> > a crawler that requires postgres even for simple crawls, they'd probably
> > prefer to write a custom app for simple crawls.
> > If I could at least say that the db doesn't grow a lot, that would
> > mitigate the problem.
> > I don't know if I'm the only one with this problem.
> >
> > "You have forgotten what happens when either errors occur during
> > processing, or the agents process working on your documents dies"
> >
> > Let's say fileA
> > contains DocA1..DocA100
> >
> > As expressed in the comment on carry-down data:
> > if I have errors, or the crawler dies in the processing of DocA50,
> > since I want FileA to be considered processed only when all its docs have
> > been processed,
> > at restart the system should:
> > reparse FileA
> > skip DocA1..DocA49 (if I'm handling versioning for them)
> > process DocA50..DocA100
> >
> > If there's a failure I have to reparse FileA, but I avoid storing 100
> > docs in the db.
> > For me that's good; failures are not so frequent.
> >
> > "You are forgetting the fact that MCF is incremental."
> >
> > Let's say:
> >
> > in the first crawl MCF processes
> > FileA dated 2014-01-01
> > containing Doc1..Doc10
> > all docs are versioned 2014-01-01
> >
> > in the second crawl
> > FileB dated 2014-01-02
> > containing Doc1..Doc5
> > all docs are versioned 2014-01-02
> >
> > so Doc1..Doc5 are overwritten with data from FileB;
> > I don't need carry-down data from the previous crawl.
> >
> > --
> > Matteo Grolla
> > Sourcesense - making sense of Open Source
> > http://www.sourcesense.com
> >
> > On 01 Jul 2014, at 15:04, Karl Wright wrote:
> >
> > Hi Matteo,
> >
> > I don't think database management is as difficult as you seem to think.
> > But more importantly, you seem to have issues with very basic aspects of
> > the ManifoldCF design. It may help to read the book (see
> > https://manifoldcf.apache.org/en_US/books-and-presentations.html) to
> > understand more completely where the design decisions came from.
> >
> > In short, ManifoldCF is built on top of a database for the following
> > reasons:
> > - resilience
> > - restartability
> > - control of memory footprint
> > - synchronization
> >
> > It was, in fact, designed around the capabilities of databases (not just
> > postgresql, but all modern databases), so it is not surprising that it
> > uses the database for everything persistent at all. Incremental crawling
> > means that even more things need to be persistent in ManifoldCF than they
> > might in other crawling designs.
> >
> > So I suggest that you carefully read the first chapter of MCF in Action,
> > and consider the design point of this crawler carefully.
> >
> > As for your specific questions:
> >
> > 'I can't say: "before processing another file process all the docs from
> > previous files"' - Yes, and the reason for that is that the MCF queue is
> > built, once again, in the database. Documents that are to be processed
> > must be queried, and that query must be performant.
> >
> > 'do I need intra cluster synchronization to process the docs contained
> > in a file?
> > If I state that the machine that processed the file is the one that
> > processes the docs contained in it then I don't.'
> >
> > You have forgotten what happens when either errors occur during
> > processing, or the agents process working on your documents dies
> > (because it got killed, say). Unless you want to lose all context, you
> > need persistent storage.
> >
> > 'If it's difficult to do without a db for carry down data I'd like that
> > table to remain small, maybe empty it at the end of every crawl.
> > How could I do that?'
> >
> > You are forgetting the fact that MCF is incremental. If you want it to
> > do the minimum work on a subsequent crawl, it has to keep track of what
> > inputs are around for each document to be processed.
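As an illustration of the restart behavior Matteo describes above (always
reparse the parent file, skip children already indexed at the current
version, treat the parent as processed only when every child is done), here
is a self-contained sketch. All class and method names are hypothetical,
not part of the ManifoldCF API:

    import java.util.List;

    public class ResumableFileSketch
    {
      interface IndexState
      {
        // True if this child is already indexed at exactly this version.
        boolean isIndexedAt(String childId, String version);
        void markIndexed(String childId, String version);
      }

      interface Indexer
      {
        void ingest(String childId, String content, String version);
      }

      static class ChildDoc
      {
        final String id;
        final String content;
        ChildDoc(String id, String content) { this.id = id; this.content = content; }
      }

      // fileVersion would be e.g. the file's modification date, "2014-01-01".
      void processFile(List<ChildDoc> children, String fileVersion,
        IndexState state, Indexer indexer)
      {
        for (ChildDoc child : children)   // the whole file is always reparsed
        {
          if (state.isIndexedAt(child.id, fileVersion))
            continue;                     // e.g. skip DocA1..DocA49 after a crash
          indexer.ingest(child.id, child.content, fileVersion); // resume at DocA50
          state.markIndexed(child.id, fileVersion);
        }
        // Only here is the parent file considered fully processed; a crash
        // inside the loop means the file is reparsed on the next run -- the
        // trade-off accepted above to keep child docs out of the database.
      }
    }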
> >
> > Karl
> >
> > On Tue, Jul 1, 2014 at 8:48 AM, Matteo Grolla <[email protected]>
> > wrote:
> >
> > Hi Karl,
> > glad it's appreciated.
> >
> > Concerning the answer:
> >
> > My intention is not to avoid using the db, just to limit its use to what
> > is strictly necessary.
> > And surely I don't want to find a new way to handle intra-cluster
> > communication.
> >
> > For me it would be ok to keep track of crawled docIds and versions in the
> > db; I'd just like to avoid putting carry-down data there.
> >
> > Motivations:
> > I recently worked on a couple of projects dealing with crawling data from
> > external data sources and importing it into solr.
> > One was implemented as a Lucidworks Search connector.
> > The other was a custom crawling app.
> > In both cases the crawling part was simple; the data source was giving me
> > all the modified / deleted documents within a time interval.
> > The processing pipeline to enrich and transform the documents was more
> > involved.
> >
> > In cases like these I'd like to just focus on sizing the crawler and solr
> > instances.
> > If I have to size a db I'll have to deal with a dba, and many customers
> > are not experienced with Postgres, so the mcf solution becomes less
> > appealing.
> > Even if I find a postgres dba, I'll have to deal with him for things like
> > performance problems, size...
> > All things I'd like to avoid if not strictly necessary.
> >
> > Please correct me if I'm wrong in what follows.
> > Why do I need carry-down data in the db?
> > Because I want bounded memory usage and have no control over the order
> > mcf follows in processing docs;
> > I can't say: "before processing another file, process all the docs from
> > previous files".
> > Do I need intra-cluster synchronization to process the docs contained in
> > a file?
> > If I state that the machine that processed the file is the one that
> > processes the docs contained in it, then I don't.
> >
> > What do you think?
> > If it's difficult to do without a db for carry-down data, I'd like that
> > table to remain small; maybe empty it at the end of every crawl.
> > How could I do that?
> >
> > If I were to summarize this mail in one sentence, I'd say:
> > "Given simple crawling requirements, I'd like to be able to implement an
> > MCF solution that is performant and simple to manage."
> >
> > thanks
> >
> > --
> > Matteo Grolla
> > Sourcesense - making sense of Open Source
> > http://www.sourcesense.com
> >
> > On 01 Jul 2014, at 12:16, Karl Wright wrote:
> >
> > Hi Matteo,
> >
> > Thank you for the work you put in here.
> >
> > I have a response to one particular question buried deep in your code:
> >
> >     FIXME: carrydown values
> >     serializing the documents seems a quick way to reach the goal
> >     what is the size limit for this data in the sql table?
> >
> >     PROPOSAL
> >     What I'd really like is to avoid the use of a db table for
> >     carrydown data and keep it in memory
> >     something like:
> >     MCF starts processing File A
> >     docs A1, A2, ... AN are added to the queue
> >     MCF starts processing File B
> >     docs B1, B2, ... are added to the queue
> >     and so on...
> >     as soon as all docs A1..AN have been processed, A is
> >     considered processed
> >     in case of failure (manifold is restarted in the middle
> >     of a crawl)
> >     all files (A, B...)
> >     should be reprocessed
> >     the size of the queue should be bounded;
> >     once filled, MCF should stop processing files until
> >     more docs are processed
> >
> >     MOTIVATION
> >     - I'd like to avoid putting pressure on the db if possible,
> >       so that it doesn't become a concern in production
> >     - performance
> > <<<<<<
> >
> > Carrydown is explicitly designed to use unlimited-length database
> > fields. Your proposal would work OK only within a single cluster
> > member; however, among multiple cluster members it could not work.
> > The database is the ManifoldCF medium of choice for handling stateful
> > information and for handling cross-cluster data requirements.
> >
> > Thanks,
> > Karl
> >
> > On Tue, Jul 1, 2014 at 6:07 AM, Matteo Grolla <[email protected]>
> > wrote:
> >
> > Hi,
> > I wrote a repository connector for crawling solrxml files:
> >
> > https://github.com/matteogrolla/mcf-filesystem-xml-connector
> >
> > The work is based on the filesystem connector, but I made several
> > hopefully interesting changes which could be applied elsewhere.
> > I also have a couple of questions.
> > For details see the readme file.
> >
> > Matteo
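The bounded in-memory queue in the proposal quoted above could be sketched
with a standard Java BlockingQueue; the class name and capacity here are
hypothetical. put() blocks when the queue is full, so parsing of new files
pauses until workers drain pending docs, and memory stays bounded. Note
Karl's objection, though: this state lives in a single JVM, so unlike the
database-backed queue it cannot coordinate work among multiple cluster
members:

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class BoundedDocQueueSketch
    {
      private final BlockingQueue<String> docs = new ArrayBlockingQueue<>(1000);

      // Producer side: parsing File A offers DocA1..DocAN.
      public void addDoc(String docId) throws InterruptedException
      {
        docs.put(docId); // blocks once 1000 docs are pending
      }

      // Consumer side: worker threads on the same machine take docs to index.
      public String nextDoc() throws InterruptedException
      {
        return docs.take();
      }
    }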
