Just to add, you can plug in your own storage agent using the attribute STORAGE_AGENT (https://www.datatorrent.com/docs/apidocs/com/datatorrent/api/Context.OperatorContext.html#STORAGE_AGENT).
Thanks
- Gaurav

> On Dec 2, 2015, at 9:51 AM, Gaurav Gupta <[email protected]> wrote:
>
> Ashish,
>
> You are right that exactly-once semantics can't be achieved through async FS
> writes.
> Did you try the new StorageAgent with your application? If yes, do you have any
> numbers to compare?
>
> Thanks
> - Gaurav
>
>> On Dec 2, 2015, at 9:33 AM, Ashish Tadose <[email protected]> wrote:
>>
>> The application uses a large number of in-memory dimension store partitions to
>> hold high-cardinality aggregated data, and many intermediate operators also
>> keep cached data for reference lookups, which is non-transient.
>>
>> The total number of application partitions was more than 1000, which means a lot
>> of operators to checkpoint and, in turn, a lot of frequent HDFS write, rename,
>> and delete operations, which became the bottleneck.
>>
>> The application requires exactly-once semantics with idempotent operators, which
>> I suppose cannot be achieved through async FS writes; please correct me if
>> I'm wrong here.
>>
>> Also, the application computes streaming aggregations of high-cardinality
>> incoming data streams, and the reference caches are updated frequently, so I'm
>> not sure how much incremental checkpointing will help here.
>>
>> Beyond this specific application, I strongly think it would be good to have a
>> StorageAgent backed by a distributed in-memory store as an alternative in the
>> platform.
>>
>> Ashish
>>
>>
>> On Wed, Dec 2, 2015 at 10:35 PM, Munagala Ramanath <[email protected]>
>> wrote:
>>
>>> Ashish,
>>>
>>> In the current release the HDFS writes are asynchronous, so I'm wondering if
>>> you could elaborate on how much latency you are observing both with and
>>> without checkpointing (i.e., after your changes to make operators stateless).
>>>
>>> Also, any information on how much non-transient data is being checkpointed in
>>> each operator would also be useful.
>>> There is an effort under way to implement
>>> incremental checkpointing, which should improve things when there is a lot of
>>> state but very little that changes from window to window.
>>>
>>> Ram
>>>
>>>
>>> On Wed, Dec 2, 2015 at 8:51 AM, Ashish Tadose <[email protected]>
>>> wrote:
>>>
>>>> Hi All,
>>>>
>>>> Currently the Apex engine provides operator checkpointing in HDFS (with the
>>>> HDFS-backed StorageAgents, i.e. FSStorageAgent and AsyncFSStorageAgent).
>>>>
>>>> We have observed that for applications with a large number of operator
>>>> instances, HDFS checkpointing introduces latency in the DAG, which degrades
>>>> overall application performance.
>>>> To resolve this we had to review all operators in the DAG and make a few
>>>> operators stateless.
>>>>
>>>> As operator checkpointing is critical functionality of the Apex streaming
>>>> platform for ensuring fault-tolerant behavior, the platform should also
>>>> provide alternate StorageAgents that work seamlessly with large applications
>>>> that require exactly-once semantics.
>>>>
>>>> HDFS read/write latency is limited and doesn't improve beyond a certain
>>>> point because of disk I/O and staged writes. Having an alternate strategy
>>>> that checkpoints into a fault-tolerant distributed in-memory grid would
>>>> ensure application stability and performance are not impacted.
>>>>
>>>> I have developed an in-memory storage agent which I would like to contribute
>>>> as an alternate StorageAgent for checkpointing.
>>>>
>>>> Thanks,
>>>> Ashish
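To make the proposal above concrete, here is a minimal, self-contained sketch of what an in-memory checkpoint store could look like. Note the `StorageAgent` interface below is declared locally so the example compiles on its own; it only mirrors the shape of Apex's `com.datatorrent.api.StorageAgent` (save/load/delete/getWindowIds). A real contribution would implement the platform's interface directly, back the map with a replicated in-memory grid rather than process-local memory, and be registered via the STORAGE_AGENT attribute mentioned at the top of the thread.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentSkipListMap;

// Local stand-in mirroring the shape of Apex's com.datatorrent.api.StorageAgent.
interface StorageAgent {
  void save(Object object, int operatorId, long windowId) throws IOException;
  Object load(int operatorId, long windowId) throws IOException;
  void delete(int operatorId, long windowId) throws IOException;
  long[] getWindowIds(int operatorId) throws IOException;
}

class InMemoryStorageAgent implements StorageAgent {
  // operatorId -> (windowId -> serialized checkpoint bytes), windows kept sorted
  private final Map<Integer, ConcurrentSkipListMap<Long, byte[]>> store =
      new ConcurrentHashMap<>();

  @Override
  public void save(Object object, int operatorId, long windowId) throws IOException {
    // Serialize eagerly so later mutations of the live object don't leak
    // into an already-taken checkpoint.
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
      oos.writeObject(object);
    }
    store.computeIfAbsent(operatorId, k -> new ConcurrentSkipListMap<>())
         .put(windowId, bos.toByteArray());
  }

  @Override
  public Object load(int operatorId, long windowId) throws IOException {
    Map<Long, byte[]> windows = store.get(operatorId);
    byte[] bytes = (windows == null) ? null : windows.get(windowId);
    if (bytes == null) {
      throw new IOException("No checkpoint for operator " + operatorId
          + ", window " + windowId);
    }
    try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
      return ois.readObject();
    } catch (ClassNotFoundException e) {
      throw new IOException(e);
    }
  }

  @Override
  public void delete(int operatorId, long windowId) {
    // Purging a committed window is a map remove, with none of the HDFS
    // rename/delete traffic the thread identifies as the bottleneck.
    Map<Long, byte[]> windows = store.get(operatorId);
    if (windows != null) {
      windows.remove(windowId);
    }
  }

  @Override
  public long[] getWindowIds(int operatorId) {
    ConcurrentSkipListMap<Long, byte[]> windows = store.get(operatorId);
    if (windows == null) {
      return new long[0];
    }
    return windows.keySet().stream().mapToLong(Long::longValue).toArray();
  }
}

public class InMemoryStorageAgentDemo {
  public static void main(String[] args) throws Exception {
    StorageAgent agent = new InMemoryStorageAgent();
    agent.save("window-1-state", 5, 1L);
    agent.save("window-2-state", 5, 2L);
    System.out.println(agent.load(5, 1L));            // window-1-state
    agent.delete(5, 1L);                              // purge committed window
    System.out.println(agent.getWindowIds(5).length); // 1
  }
}
```

The trade-off, of course, is durability: unlike the HDFS-backed agents, state held only in memory does not survive a grid node failure, which is why the proposal hinges on the backing store being a fault-tolerant (replicated) distributed grid.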
