Hi Matteo,

For what it is worth, I just committed these changes to trunk. Please check out the revised APIs.
Thanks!
Karl

On Tue, Jul 1, 2014 at 8:48 AM, Matteo Grolla <[email protected]> wrote:

> Hi Karl,
> glad it's appreciated.
>
> Concerning the answer:
>
> My intention is not to avoid using the db, just to limit its use to what
> is strictly necessary. And I certainly don't want to invent a new way to
> handle intra-cluster communication.
>
> It would be fine with me to keep track of crawled docIds and versions in
> the db; I'd just like to avoid putting carrydown data there.
>
> Motivations:
> I recently worked on a couple of projects that dealt with crawling data
> from external data sources and importing it into Solr. One was
> implemented as a Lucidworks Search connector; the other was a custom
> crawling app. In both cases the crawling part was simple: the data source
> gave me all the modified/deleted documents within a time interval. The
> processing pipeline that enriched and transformed the documents was more
> involved.
>
> In cases like these I'd like to focus only on sizing the crawler and Solr
> instances. If I also have to size a db, I'll have to deal with a dba, and
> many customers are not experienced with Postgres, so the mcf solution
> becomes less appealing. Even if I find a Postgres dba, I'll have to deal
> with him for things like performance problems, size... all things I'd
> like to avoid if not strictly necessary.
>
> Please correct me if I'm wrong in what follows.
> Why do I need carrydown data in the db? Because I want bounded memory
> usage and have no control over the order mcf follows in processing docs;
> I can't say "before processing another file, process all the docs from
> previous files".
> Do I need intra-cluster synchronization to process the docs contained in
> a file? If I state that the machine that processed the file is the one
> that processes the docs contained in it, then I don't.
>
> What do you think? If it's difficult to do without a db for carrydown
> data, I'd like that table to remain small, maybe emptying it at the end
> of every crawl. How could I do that?
>
> If I were to synthesize this mail in one sentence I'd say: "given simple
> crawling requirements, I'd like to be able to implement an MCF solution
> that is performant and simple to manage".
>
> Thanks
>
> --
> Matteo Grolla
> Sourcesense - making sense of Open Source
> http://www.sourcesense.com
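For concreteness, the in-memory scheme described in this mail (and spelled out in the PROPOSAL block quoted further down) might look roughly like the following. This is only an illustration of the idea, not MCF code; the class and every name in it are hypothetical:

    import java.util.Map;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.atomic.AtomicInteger;

    // Hypothetical sketch: keep carrydown data in memory, bound the queue,
    // and mark a file complete only when all of its child docs are done.
    public class InMemoryCarrydownQueue {

      // Bounded queue: put() blocks when full, so file processing pauses
      // until workers have drained enough child documents.
      private final BlockingQueue<ChildDoc> queue;

      // Outstanding child-document count per parent file.
      private final Map<String, AtomicInteger> pending = new ConcurrentHashMap<>();

      public InMemoryCarrydownQueue(int capacity) {
        this.queue = new LinkedBlockingQueue<>(capacity);
      }

      // Called while parsing file A: enqueue docs A1..AN with their carrydown data.
      public void addChild(String fileId, String docId, byte[] carrydown)
          throws InterruptedException {
        pending.computeIfAbsent(fileId, k -> new AtomicInteger()).incrementAndGet();
        queue.put(new ChildDoc(fileId, docId, carrydown)); // blocks when full
      }

      // Called by worker threads: next child document to process.
      public ChildDoc take() throws InterruptedException {
        return queue.take();
      }

      // Called after a child document has been processed; returns true when
      // the parent file has no children left and can be marked processed.
      public boolean childDone(String fileId) {
        return pending.get(fileId).decrementAndGet() == 0;
      }

      public static final class ChildDoc {
        final String fileId;
        final String docId;
        final byte[] carrydown;
        ChildDoc(String fileId, String docId, byte[] carrydown) {
          this.fileId = fileId;
          this.docId = docId;
          this.carrydown = carrydown;
        }
      }
    }

Because the pending counters live only in memory, a restart loses them; that is why all in-flight files (A, B, ...) would have to be reprocessed, and why, as Karl notes below, the scheme cannot span multiple cluster members.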
>
> On 01 Jul 2014, at 12:16, Karl Wright wrote:
>
> > Hi Matteo,
> >
> > Thank you for the work you put in here.
> >
> > I have a response to one particular question buried deep in your code:
> >
> > >>>>>>>
> > FIXME: carrydown values
> > Serializing the documents seems a quick way to reach the goal.
> > What is the size limit for this data in the sql table?
> >
> > PROPOSAL
> > What I'd really like is to avoid using a db table for carrydown data
> > and keep it in memory. Something like:
> > - MCF starts processing file A; docs A1, A2, ... AN are added to the
> >   queue
> > - MCF starts processing file B; docs B1, B2, ... are added to the
> >   queue, and so on...
> > - as soon as all docs A1..AN have been processed, A is considered
> >   processed
> > - in case of failure (manifold is restarted in the middle of a crawl),
> >   all files (A, B, ...) should be reprocessed
> > - the size of the queue should be bounded; once it fills up, MCF
> >   should stop processing files until more docs are processed
> >
> > MOTIVATION
> > - I'd like to avoid putting pressure on the db if possible, so that it
> >   doesn't become a concern in production
> > - performance
> > <<<<<<
> >
> > Carrydown is explicitly designed to use unlimited-length database
> > fields. Your proposal would work only within a single cluster member;
> > among multiple cluster members it could not. The database is the
> > ManifoldCF medium of choice for handling stateful information and for
> > handling cross-cluster data requirements.
> >
> > Thanks,
> > Karl
> >
> >
> > On Tue, Jul 1, 2014 at 6:07 AM, Matteo Grolla <[email protected]>
> > wrote:
> >
> >> Hi,
> >> I wrote a repository connector for crawling solrxml files:
> >>
> >> https://github.com/matteogrolla/mcf-filesystem-xml-connector
> >>
> >> The work is based on the filesystem connector, but I made several
> >> hopefully interesting changes which could be applied elsewhere. I
> >> also have a couple of questions. For details see the README file.
> >>
> >> Matteo
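For reference, the carrydown mechanism Karl describes is driven from a repository connector roughly as follows. This is a minimal sketch against the IProcessActivity carrydown methods: the wrapper class, the "solrxml_record" data name, and the relationship string are illustrative, and the exact signatures should be checked against the revised APIs mentioned at the top of this thread:

    import org.apache.manifoldcf.core.interfaces.ManifoldCFException;
    import org.apache.manifoldcf.crawler.interfaces.IProcessActivity;

    public class CarrydownSketch {
      // The relationship type is an arbitrary connector-defined string.
      private static final String RELATIONSHIP_CHILD = "child";

      // When a container file yields child documents, the connector attaches
      // carrydown data to each child reference; MCF persists it in the
      // carrydown table, using unlimited-length fields as Karl notes.
      public static void queueChild(IProcessActivity activities, String parentId,
          String childId, String serializedRecord) throws ManifoldCFException {
        activities.addDocumentReference(childId, parentId, RELATIONSHIP_CHILD,
            new String[] { "solrxml_record" },
            new Object[][] { { serializedRecord } });
      }

      // When the child document is later processed (possibly by another
      // cluster member), the carrydown values are read back from the db.
      public static String fetchRecord(IProcessActivity activities, String childId)
          throws ManifoldCFException {
        String[] values = activities.retrieveParentData(childId, "solrxml_record");
        return (values != null && values.length > 0) ? values[0] : null;
      }
    }

Going through addDocumentReference this way is what keeps carrydown consistent across restarts and cluster members, at the cost of the db traffic Matteo would like to avoid.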
