Hi Karl,
I'm glad it's appreciated.

Concerning your answer:

My intention is not to avoid using the db, just to limit its use to what is 
strictly necessary.
And I certainly don't want to invent a new way to handle intra-cluster communication.

For me it would be ok to keep track of crawled docIds and versions in the db; 
I'd just like to avoid putting carrydown data there.

Motivations:
I recently worked on a couple of projects that involved crawling data from 
external data sources and importing it into Solr.
One was implemented as a Lucidworks Search connector.
The other was a custom crawling app.
In both cases the crawling part was simple: the data source gave me all 
the modified / deleted documents within a time interval.
The processing pipeline to enrich and transform the documents was more involved.

In cases like these I'd like to focus just on sizing the crawler and Solr 
instances.
If I have to size a db I'll have to deal with a DBA, and many customers are not 
experienced with Postgres, so an MCF solution becomes less appealing.
Even if I find a Postgres DBA, I'll have to rely on them for things like 
performance problems, size…
These are all things I'd like to avoid if not strictly necessary.

Please correct me if I'm wrong in what follows.
Why do I need carrydown data in the db?
Because I want bounded memory usage and have no control over the order MCF 
follows in processing docs.
I can't say: "before processing another file, process all the docs from previous 
files".
Do I need intra-cluster synchronization to process the docs contained in a file?
If I state that the machine that processed the file is the one that processes 
the docs contained in it, then I don't.
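
To make the idea concrete, here is a minimal sketch of the kind of buffer I have 
in mind. It is plain Java, not MCF code: CarrydownBuffer, Child, offer, poll, 
closeFile and markProcessed are names made up for illustration, and it assumes 
that the node that reads a file is also the one that processes its docs.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Hypothetical sketch, not ManifoldCF code: a bounded in-memory buffer of
// child docs that also tracks when every doc of a file has been processed.
class CarrydownBuffer {

    // A child doc plus the data carried down from the file it came from.
    static final class Child {
        final String fileId;
        final String docId;
        final String carrydown;

        Child(String fileId, String docId, String carrydown) {
            this.fileId = fileId;
            this.docId = docId;
            this.carrydown = carrydown;
        }
    }

    private final int capacity;
    private final Queue<Child> queue = new ArrayDeque<>();
    // Docs per file that are queued or taken but not yet marked processed.
    private final Map<String, Integer> pending = new HashMap<>();
    // Files whose docs have all been offered (the reader reached end of file).
    private final Set<String> closed = new HashSet<>();

    CarrydownBuffer(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
    }

    // Returns false when the buffer is full; the file reader should then
    // pause and retry after some docs have been consumed.
    synchronized boolean offer(Child child) {
        if (queue.size() >= capacity) {
            return false;
        }
        queue.add(child);
        pending.merge(child.fileId, 1, Integer::sum);
        return true;
    }

    // Returns the next doc to process, or null when nothing is queued.
    synchronized Child poll() {
        return queue.poll();
    }

    // Called by the reader after the last doc of a file was offered.
    // Returns true if the file is already completely processed.
    synchronized boolean closeFile(String fileId) {
        if (!pending.containsKey(fileId)) {
            return true;
        }
        closed.add(fileId);
        return false;
    }

    // Called by a worker after a doc was processed. Returns true when this
    // was the last outstanding doc of a closed file.
    synchronized boolean markProcessed(Child child) {
        Integer left = pending.get(child.fileId);
        if (left == null) {
            throw new IllegalStateException("unknown doc: " + child.docId);
        }
        if (left > 1) {
            pending.put(child.fileId, left - 1);
            return false;
        }
        pending.remove(child.fileId);
        // Completion is only reported once the reader has closed the file.
        return closed.remove(child.fileId);
    }
}
```

The explicit closeFile step is there because, with a bounded queue, the workers 
can drain every queued doc of a file before the reader has finished offering the 
rest. Nothing here survives a restart, so after a failure all files would have 
to be reprocessed.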

What do you think?
If it's difficult to do without a db for carrydown data, I'd like that table to 
remain small, maybe by emptying it at the end of every crawl.
How could I do that?

If I were to summarize this mail in one sentence I'd say:
"Given simple crawling requirements, I'd like to be able to implement an MCF 
solution that is performant and simple to manage"

Thanks

-- 
Matteo Grolla
Sourcesense - making sense of Open Source
http://www.sourcesense.com

On 1 Jul 2014, at 12:16, Karl Wright wrote:

> Hi Matteo,
> 
> Thank you for the work you put in here.
> 
> I have a response to one particular question buried deep in your code:
> 
>>>>>>> 
> 
>          FIXME: carrydown values
>            serializing the documents seems a quick way to reach the goal
>            what is the size limit for this data in the sql table?
> 
>            PROPOSAL
>            What I'd really like is avoiding the use of a db table for
> carrydown data and keep them in memory
>            something like:
>              MCF starts processing File A
>              docs A1, A2, ... AN are added to the queue
>              MCF starts processing File B
>              docs B1, B2, ... are added to the queue
>              and so on...
>              as soon as all docs A1..AN have been processed, A is
> considered processed
>              in case of failure (manifold is restarted in the middle
> of a crawl)
>                all files (A, B...) should be reprocessed
>              the size of the queue should be bounded
>                once filled MCF should stop processing files until
> more docs are processed
> 
>            MOTIVATION
>            -I'd like to avoid putting pressure on the db if possible,
> so that it doesn't become a concern in production
>            -performance
> <<<<<<
> 
> Carrydown is explicitly designed to use unlimited-length database
> fields.  Your proposal would work OK only within a single cluster
> member; however, among multiple cluster
> members it could not work.  The database is the ManifoldCF medium of
> choice for handling stateful information and for handling
> cross-cluster data requirements.
> 
> Thanks,
> Karl
> 
> 
> 
> 
> On Tue, Jul 1, 2014 at 6:07 AM, Matteo Grolla <[email protected]>
> wrote:
> 
>> Hi,
>>        I wrote a repository connector for crawling solrxml files
>> 
>> https://github.com/matteogrolla/mcf-filesystem-xml-connector
>> 
>> The work is based on the filesystem connector but I made several hopefully
>> interesting changes which could be applied elsewhere.
>> I have also a couple of questions
>> For details see the read me file
>> 
>> Matteo
