Hi all,

I am testing the fault tolerance feature, but I cannot recover the state of a failed node. My setup has one adapter node, one app node, and one standby node. Checkpointing uses the base config, set to 20 seconds. When the app node is stopped, the standby node acquires its task, but the state is not recovered. Could you check this, or do I need some other configuration?

Another problem concerns the communication between the adapter and the app. I ran the word count experiment on a 500 MB file containing 80775764 words, using one node for the adapter and multiple nodes for the app partitions. The time the adapter took to send all the words was:

- one adapter node and one app node: 35 seconds
- one adapter node and two app nodes: 61 seconds
- one adapter node and three app nodes: 95 seconds

The adapter ran on the same node with the same program in every run. I expected the adapter's time to stay the same or decrease as app nodes are added, since the processing capacity has increased, but it grows instead. I don't know what the problem is.

Thank you!
Dingyu
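
P.S. To make the slowdown easier to compare, here is a minimal sketch that turns the timings above into average adapter throughput. It only uses the word count and the three durations I measured; the names `TOTAL_WORDS`, `send_seconds`, and `throughput` are made up for this illustration and are not part of the platform.

```python
# Figures from the experiment: app node count -> seconds the adapter
# needed to send all the words.
TOTAL_WORDS = 80_775_764
send_seconds = {1: 35, 2: 61, 3: 95}


def throughput(words, seconds):
    """Average number of words sent per second."""
    return words / seconds


# Use the single-app-node run as the reference point.
baseline = throughput(TOTAL_WORDS, send_seconds[1])

for nodes, seconds in sorted(send_seconds.items()):
    rate = throughput(TOTAL_WORDS, seconds)
    print(f"{nodes} app node(s): {rate:,.0f} words/s "
          f"({rate / baseline:.2f}x the single-node rate)")
```

If adding app nodes helped, the ratio in the last column would stay at or above 1.00 instead of dropping.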
