Hi All,

Currently the Apex engine provides operator checkpointing in HDFS (with HDFS-backed StorageAgents, i.e. FSStorageAgent and AsyncFSStorageAgent).
We have observed that for applications with a large number of operator instances, HDFS checkpointing introduces latency in the DAG, which degrades overall application performance. To resolve this, we had to review all operators in the DAG and make a few of them stateless.

Since operator checkpointing is critical functionality of the Apex streaming platform for ensuring fault-tolerant behavior, the platform should also provide alternate StorageAgents that work seamlessly with large applications requiring exactly-once semantics.

HDFS read/write latency has a lower bound and doesn't improve beyond a certain point because of disk I/O and staging writes. Having an alternate strategy that checkpoints to a fault-tolerant, distributed in-memory grid would ensure that application stability and performance are not impacted.

I have developed an in-memory storage agent which I would like to contribute as an alternate StorageAgent for checkpointing.

Thanks,
Ashish
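
To make the idea concrete, here is a minimal sketch of what an in-memory checkpoint store could look like. It is not the implementation proposed for contribution. `CheckpointStore` and `InMemoryStore` are hypothetical names, and the interface is a local stand-in loosely modeled on a save/load/delete/list-windows contract rather than the actual Apex API. It also keeps everything in a single JVM map, so it is not fault tolerant by itself; in a real agent the map would be replaced by a replicated in-memory grid.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.Arrays;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ConcurrentNavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;

public class InMemoryCheckpointSketch {

    /** Hypothetical stand-in for a checkpoint storage contract (not the Apex interface). */
    interface CheckpointStore {
        void save(Object state, int operatorId, long windowId) throws IOException;
        Object load(int operatorId, long windowId) throws IOException;
        void delete(int operatorId, long windowId) throws IOException;
        long[] getWindowIds(int operatorId) throws IOException;
    }

    /** Keeps serialized operator state in memory, keyed by operator id and window id. */
    static class InMemoryStore implements CheckpointStore {
        // Assumption: a local map stands in for a distributed in-memory grid.
        private final ConcurrentMap<Integer, ConcurrentNavigableMap<Long, byte[]>> data =
                new ConcurrentHashMap<>();

        @Override
        public void save(Object state, int operatorId, long windowId) throws IOException {
            // Serialize first so later mutation of the operator does not alter the checkpoint.
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(state);
            }
            data.computeIfAbsent(operatorId, id -> new ConcurrentSkipListMap<>())
                .put(windowId, bytes.toByteArray());
        }

        @Override
        public Object load(int operatorId, long windowId) throws IOException {
            ConcurrentNavigableMap<Long, byte[]> windows = data.get(operatorId);
            byte[] raw = (windows == null) ? null : windows.get(windowId);
            if (raw == null) {
                throw new FileNotFoundException(
                        "No checkpoint for operator " + operatorId + ", window " + windowId);
            }
            try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(raw))) {
                return in.readObject();
            } catch (ClassNotFoundException e) {
                throw new IOException(e);
            }
        }

        @Override
        public void delete(int operatorId, long windowId) {
            ConcurrentNavigableMap<Long, byte[]> windows = data.get(operatorId);
            if (windows != null) {
                windows.remove(windowId);
            }
        }

        @Override
        public long[] getWindowIds(int operatorId) {
            ConcurrentNavigableMap<Long, byte[]> windows = data.get(operatorId);
            if (windows == null) {
                return new long[0];
            }
            // The skip-list map iterates keys in ascending order.
            return windows.keySet().stream().mapToLong(Long::longValue).toArray();
        }
    }

    public static void main(String[] args) throws IOException {
        InMemoryStore store = new InMemoryStore();
        store.save("state at window 10", 1, 10L);
        store.save("state at window 20", 1, 20L);
        System.out.println(store.load(1, 20L));
        System.out.println(Arrays.toString(store.getWindowIds(1)));
        store.delete(1, 10L); // older checkpoints can be purged once a newer one is committed
        System.out.println(Arrays.toString(store.getWindowIds(1)));
    }
}
```

The point of the sketch is only that save and load never touch disk, which is where the HDFS-backed agents spend their time. Replication, eviction, and recovery after a node failure are left out and would be the substantive part of a real agent.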
