Re: Fault tolerance and communication

Matthieu Morel Thu, 21 Mar 2013 02:25:54 -0700

Please ask user questions on the s4-user list (cc'ed), thanks!

On Mar 21, 2013, at 04:06 , Dingyu Yang wrote:


> Hi all,
> I am testing the fault tolerance section, but I cannot recover the state of
> the failed node.
> I have one adapter, one app node, and one standby node. Checkpointing
> runs with the base configuration, every 20 seconds.
> When the app node is stopped, the standby node acquires the task, but the
> state is not recovered.
> Could you check, or do I have to set some other configuration?

This looks like a configuration/environment issue. Which version are you using?
(S4 0.6 RC3 is recommended.)

If you use the file system checkpointing backend, make sure the checkpoint
files are accessible from the failover nodes.
You can also specify the directory that contains them, e.g.:
-p=s4.checkpointing.filesystem.storageRootPath=/path/to/shared-dir
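
A quick way to rule out the access problem is to run a small check on each
failover node against that directory. The following is only a sketch in Python,
not part of S4: the function name checkpoint_dir_usable is made up for
illustration, and it only tests that the current user can read and write the
directory, not that S4 itself finds its checkpoints there.

```python
import os
import tempfile


def checkpoint_dir_usable(root):
    """Return True if `root` is an existing directory we can read and write.

    Hypothetical helper for illustration; it is not an S4 API.
    """
    if not os.path.isdir(root):
        return False
    try:
        # Create and remove a probe file to exercise real write access;
        # permission bits alone can be misleading on network file systems.
        with tempfile.NamedTemporaryFile(dir=root, prefix=".probe-"):
            pass
    except OSError:
        return False
    # Listing and reading entries needs read + execute on the directory.
    return os.access(root, os.R_OK | os.X_OK)
```

Calling checkpoint_dir_usable("/path/to/shared-dir") on the app node and on the
standby node should return True on both; if it does not on the standby node,
the standby cannot read the checkpoints written by the failed node.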
 

> 
> Another problem concerns the communication between the adapter and the app.
> I ran the word count experiment on a 500 MB file with 80775764 words,
> with multiple nodes for the app partitions and one node for the adapter.
> With one adapter node and one app node, the adapter finishes sending all
> the words in 35 seconds.
> With one adapter node and two app nodes, it finishes in 61 seconds.
> With one adapter node and three app nodes, it finishes in 95 seconds.
> 
> The adapter runs on the same node with the same program in every case.
> The adapter time should stay the same or decrease as app nodes are added,
> since the processing capacity has increased.
> I don't know what the problem is.

S4 0.5 made some extra copies, so if you are using that version, that could
explain it.

The pattern is quite clear though (a roughly linear increase with the number
of nodes), so the issue should be easy to spot. It looks like a given operation
is repeated for each target node. Are you broadcasting to all nodes? Are the
events from the adapter keyed? Is there anything in your adapter app code or
adapter app graph that could cause this?


Regards,

Matthieu
