Hi Matteo,
Thank you for the work you put in here.
I have a response to one particular question buried deep in your code:
>>>>>>
FIXME: carrydown values
serializing the documents seems like a quick way to reach the goal
what is the size limit for this data in the SQL table?
PROPOSAL
What I'd really like is to avoid using a db table for
carrydown data and keep it in memory
something like:
MCF starts processing File A
docs A1, A2, ... AN are added to the queue
MCF starts processing File B
docs B1, B2, ... are added to the queue
and so on...
as soon as all docs A1..AN have been processed, A is
considered processed
in case of failure (ManifoldCF is restarted in the middle
of a crawl)
all files (A, B...) should be reprocessed
the size of the queue should be bounded
once it is full, MCF should stop processing files until
more docs are processed
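The scheme above could be sketched with a bounded blocking queue; this is only an illustration under stated assumptions, and all class, method, and document names here are hypothetical, not part of the MCF codebase:

```java
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the proposed in-memory carrydown queue.
public class CarrydownQueueSketch {
  // Bounded queue: put() blocks when full, so the crawler stops
  // reading new files until workers drain some documents.
  private final BlockingQueue<String> docQueue;
  // Outstanding-document count per source file.
  private final Map<String, AtomicInteger> pending = new ConcurrentHashMap<>();

  public CarrydownQueueSketch(int capacity) {
    docQueue = new ArrayBlockingQueue<>(capacity);
  }

  // Called when file F yields documents F1..FN.
  public void addDocs(String file, String[] docs) throws InterruptedException {
    pending.put(file, new AtomicInteger(docs.length));
    for (String d : docs) {
      docQueue.put(file + "/" + d); // blocks while the queue is full
    }
  }

  // Worker takes one document; returns the file name if this was the
  // file's last outstanding document (the file is now fully processed),
  // or null otherwise.
  public String processOne() throws InterruptedException {
    String doc = docQueue.take();
    String file = doc.substring(0, doc.indexOf('/'));
    if (pending.get(file).decrementAndGet() == 0) {
      pending.remove(file);
      return file;
    }
    return null;
  }
}
```

Note that this state lives only in one JVM, which is exactly the limitation discussed below for multi-member clusters, and a restart loses the queue, so files would need to be re-crawled.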
MOTIVATION
- I'd like to avoid putting pressure on the db if possible,
so that it doesn't become a concern in production
- performance
<<<<<<
Carrydown is explicitly designed to use unlimited-length database
fields, so there is no size limit to worry about. Your proposal would
work only within a single cluster member; it could not work across
multiple cluster members. The database is ManifoldCF's medium of
choice for handling stateful information and cross-cluster data
requirements.
Thanks,
Karl
On Tue, Jul 1, 2014 at 6:07 AM, Matteo Grolla <[email protected]>
wrote:
> Hi,
> I wrote a repository connector for crawling solrxml files
>
> https://github.com/matteogrolla/mcf-filesystem-xml-connector
>
> The work is based on the filesystem connector, but I made several hopefully
> interesting changes that could be applied elsewhere.
> I also have a couple of questions.
> For details, see the readme file.
>
> Matteo