[ 
https://issues.apache.org/jira/browse/CONNECTORS-13?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13774407#comment-13774407
 ] 

Karl Wright commented on CONNECTORS-13:
---------------------------------------

Hi Graeme,

Today ManifoldCF relies in large part on the database for scalability and 
reliability.  Most database vendors offer versions of their product that 
provide some degree of both properties; I'm not sure where Postgres stands on 
this, but you can be sure they are working on it.  Also, key-value stores like 
Voldemort have this as their primary goal.  See CONNECTORS-286.  If you buy 
this direction of evolution, it's pointless for ManifoldCF to try to develop 
cross-cluster job queues, and all that entails, without first considering what 
the database can do for us.

As for scalability/replicability of ManifoldCF itself, this is less important, 
but even so there are issues.  For example, today the *Stuffer threads 
basically know they are unique; this permits them to avoid taking out huge 
numbers of locks on every iteration.  They therefore cannot be replicated at 
this time.  Solving this will need a good deal of experimentation and/or 
schema changes - or, possibly, outside locks that prevent more than one 
stuffing operation from proceeding at the same time.  Also, throttling is 
meaningless if there are multiple independent ManifoldCF instances in 
different processes that are unaware of each other.  To scale in the right 
way, throttling would therefore need to use the services of ILockManager 
rather than native synchronizers.  I don't know if either of these has been 
written up into a ticket yet.
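
To make the throttling point concrete, here is a rough sketch (not actual 
ManifoldCF code) of a throttle whose state lives behind a shared lock service 
rather than in a per-JVM synchronizer.  Everything in it is an assumption for 
illustration: SharedLockService is a hypothetical stand-in and does NOT 
reproduce the real ILockManager interface, and InMemoryLockService only 
simulates a cross-process service inside one JVM.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical cross-process lock/state service; not the real ILockManager API. */
interface SharedLockService {
    void lock(String key);
    void unlock(String key);
    long readCounter(String key);
    void writeCounter(String key, long value);
}

/** In-memory stand-in, for demonstration only. A real one would be backed by
 *  something every ManifoldCF process can reach. */
class InMemoryLockService implements SharedLockService {
    private final Map<String, ReentrantLock> locks = new HashMap<>();
    private final Map<String, Long> counters = new HashMap<>();

    private synchronized ReentrantLock lockFor(String key) {
        return locks.computeIfAbsent(key, k -> new ReentrantLock());
    }

    public void lock(String key) { lockFor(key).lock(); }
    public void unlock(String key) { lockFor(key).unlock(); }

    public synchronized long readCounter(String key) {
        return counters.getOrDefault(key, 0L);
    }

    public synchronized void writeCounter(String key, long value) {
        counters.put(key, value);
    }
}

/** Throttle whose in-flight count is held by the shared service, so every
 *  instance using the same key draws on one common budget. */
class SharedThrottle {
    private final SharedLockService service;
    private final String key;
    private final long maxInFlight;

    SharedThrottle(SharedLockService service, String key, long maxInFlight) {
        this.service = service;
        this.key = key;
        this.maxInFlight = maxInFlight;
    }

    /** Reserve one slot; returns false if the shared limit is already reached. */
    boolean tryAcquire() {
        service.lock(key);
        try {
            long inFlight = service.readCounter(key);
            if (inFlight >= maxInFlight) {
                return false;
            }
            service.writeCounter(key, inFlight + 1);
            return true;
        } finally {
            service.unlock(key);
        }
    }

    /** Give a slot back to the shared budget. */
    void release() {
        service.lock(key);
        try {
            long inFlight = service.readCounter(key);
            if (inFlight > 0) {
                service.writeCounter(key, inFlight - 1);
            }
        } finally {
            service.unlock(key);
        }
    }
}
```

Two SharedThrottle objects built on the same service and key behave as one 
throttle, which is the property a native synchronizer cannot give you across 
processes.  The same lock/unlock pair could, in principle, also serve as the 
"outside lock" around a stuffing pass, though whether that performs acceptably 
is exactly the kind of thing that would need experimentation.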

So CONNECTORS-13 is necessary but not sufficient for doing what you are 
looking for.

                
> We should move to eliminate process synchronization via shared file system, 
> and use a process/service instead
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: CONNECTORS-13
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-13
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: Framework core
>    Affects Versions: ManifoldCF 0.1, ManifoldCF 0.2
>            Reporter: Karl Wright
>             Fix For: ManifoldCF next
>
>
> The current implementation relies on the file system to synchronize activity 
> between various LCF processes.  This has several downsides: first, it is 
> possible to get the file system into a state that is corrupted (by killing 
> processes); second, this limits the future ability to spread crawler workload 
> over multiple machines.
> It should be reasonably straightforward, and probably more resilient, to 
> introduce a "synchronization process", which all other LCF processes talk to 
> in order to manage locks, shared data, and other synchronization activities.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
