Re: proposals for writing manifold connectors

Karl Wright Sun, 20 Jul 2014 08:25:24 -0700

Hi Matteo,

For what it is worth, I just committed these changes to trunk.  Please
check out the revised APIs.


Thanks!
Karl



On Tue, Jul 1, 2014 at 8:48 AM, Matteo Grolla <[email protected]>
wrote:

> Hi Karl,
>         glad it's appreciated.
>
> concerning the answer:
>
> My intention is not to avoid using the db, just to limit its use to what is
> strictly necessary.
> And I certainly don't want to find a new way to handle intra-cluster
> communication.
>
> For me it would be ok to keep track of crawled docIds and versions in the
> db, I'd just like to avoid putting carry down data there.
>
> Motivations:
> I recently worked on a couple of projects that involved crawling data from
> external data sources and importing it into Solr.
> One was implemented as a Lucidworks Search connector.
> The other was a custom crawling app.
> In both cases the crawling part was simple: the data source gave me
> all the modified / deleted documents within a time interval.
> The processing pipeline to enrich and transform the documents was more
> involved.
>
> In cases like these I'd like to just focus on sizing the crawler and solr
> instances.
> If I have to size a db I'll have to deal with a DBA, and many customers are
> not experienced with Postgres, so the MCF solution becomes less appealing.
> Even if I find a Postgres DBA, I'll have to deal with him for things like
> performance problems, size…
> These are all things I'd like to avoid if not strictly necessary.
>
> Please correct me if I'm wrong in what follows.
> Why do I need carry down data in the db?
> Because I want bounded memory usage and have no control over the order MCF
> follows in processing docs.
> I can't say: "before processing another file, process all the docs from
> previous files".
> Do I need intra-cluster synchronization to process the docs contained in a
> file?
> If I state that the machine that processed the file is the one that
> processes the docs contained in it, then I don't.
>
> What do you think?
> If it's difficult to do without a db for carry down data, I'd like that
> table to remain small, maybe by emptying it at the end of every crawl.
> How could I do that?
>
> If I were to summarize this mail in one sentence I'd say:
> "Given simple crawling requirements, I'd like to be able to implement an MCF
> solution that is performant and simple to manage"
>
> thanks
>
> --
> Matteo Grolla
> Sourcesense - making sense of Open Source
> http://www.sourcesense.com
>
> On 1 Jul 2014, at 12:16, Karl Wright wrote:
>
> > Hi Matteo,
> >
> > Thank you for the work you put in here.
> >
> > I have a response to one particular question buried deep in your code:
> >
> > >>>>>>
> >
> >          FIXME: carrydown values
> >            serializing the documents seems a quick way to reach the goal
> >            what is the size limit for this data in the sql table?
> >
> >            PROPOSAL
> >            What I'd really like is to avoid using a db table for
> > carrydown data and keep it in memory
> >            something like:
> >              MCF starts processing File A
> >              docs A1, A2, ... AN are added to the queue
> >              MCF starts processing File B
> >              docs B1, B2, ... are added to the queue
> >              and so on...
> >              as soon as all docs A1..AN have been processed, A is
> > considered processed
> >              in case of failure (manifold is restarted in the middle
> > of a crawl)
> >                all files (A, B...) should be reprocessed
> >              the size of the queue should be bounded
> >                once filled MCF should stop processing files until
> > more docs are processed
> >
> >            MOTIVATION
> >            -I'd like to avoid putting pressure on the db if possible,
> > so that it doesn't become a concern in production
> >            -performance
> > <<<<<<
> >
> > Carrydown is explicitly designed to use unlimited-length database
> > fields.  Your proposal would work OK only within a single cluster
> > member; it could not work across multiple cluster members.  The
> > database is the ManifoldCF medium of choice for handling stateful
> > information and for handling cross-cluster data requirements.
> >
> > Thanks,
> > Karl
> >
> >
> >
> >
> > On Tue, Jul 1, 2014 at 6:07 AM, Matteo Grolla <[email protected]>
> > wrote:
> >
> >> Hi,
> >>        I wrote a repository connector for crawling solrxml files:
> >>
> >> https://github.com/matteogrolla/mcf-filesystem-xml-connector
> >>
> >> The work is based on the filesystem connector, but I made several
> >> hopefully interesting changes which could be applied elsewhere.
> >> I also have a couple of questions.
> >> For details see the README file.
> >>
> >> Matteo
>
>
