Re: Refactoring of IdempotentStorageManager

Tushar Gosavi Tue, 17 Nov 2015 21:58:59 -0800

Hi Chandni,

Let me know if you need any help to extend HDHT WAL for this purpose. The
current functionality supports writing objects sequentially to a file and
reading sequentially from file. The serialization and deserialization needs
to be handled by upper level code.


Regards,
- Tushar.


On Wed, Nov 18, 2015 at 4:20 AM, Chandni Singh <[email protected]>
wrote:

> Hi,
>
> As part of creating a ManagedState, it was brought up that
> IdempotentStorageManager needs to be re-factored:
> 1. Re-name (have started a discussion about it in a separate thread).
>
> 2. It should be a layer over Write-ahead-log (WAL) abstraction which was
> created in HDHT.
>
> This discussion is about 2nd point.
>
> Currently IdempotentStorageManager is an abstraction above StorageAgent
> (which is in Apex core).
> IMO this doesn't need to change. We have integrated
> IdempotentStorageManager with various input/output operators in Malhar lib
> and this abstraction works well.
>
> However the change that I think we need to make (which could be later) is
> that IdempotentStorageManager.FSIdempotentStorageManager can use (contain)
> WAL to write state to files.
>
> Advantages of this approach:
> 1. Operators will not be needed to change because api of
> IdempotentStorageManager doesn't change. There are quite a few of them.
> 2. Parallel work is going on HDHT WAL which currently makes it very
> difficult to move stuff retaining attribution.
> 3. Once WAL abstraction is ready, FSIdempotentStorageManager can use it
> again without affecting the operators.
> 4. This is in-line with iterative development :)
>
> Chandni
>
> On Fri, Nov 13, 2015 at 7:32 PM, Chandni Singh <[email protected]>
> wrote:
>
> > Let me know if anyone want to collaborate with me on this.
> >
> > Thanks,
> > Chandni
> >
> > On Tue, Nov 10, 2015 at 6:18 PM, Chandni Singh <[email protected]>
> > wrote:
> >
> >> Have added some more details about a Bucket in the document. Have a
> look.
> >>
> >> On Sun, Nov 8, 2015 at 10:37 PM, Chandni Singh <[email protected]
> >
> >> wrote:
> >>
> >>> Forgot to attach the link.
> >>>
> >>>
> https://docs.google.com/document/d/1gRWN9ufKSZSZD0N-pthlhpC9TZ8KwJ6hJlAX6nxl5f8/edit#heading=h.wlc0p58uzygb
> >>>
> >>>
> >>> On Sun, Nov 8, 2015 at 10:36 PM, Chandni Singh <
> [email protected]>
> >>> wrote:
> >>>
> >>>> Hi,
> >>>> This contains the overview of large state management.
> >>>> Some parts need more description which I am working on but please free
> >>>> to go through it and any feedback is appreciated.
> >>>>
> >>>> Thanks,
> >>>> Chandni
> >>>>
> >>>>
> >>>> On Tue, Oct 20, 2015 at 8:31 AM, Pramod Immaneni <
> >>>> [email protected]> wrote:
> >>>>
> >>>>> This is a much needed component Chandni.
> >>>>>
> >>>>> The API for the cache will be important as users will be able to
> plugin
> >>>>> different implementations in future like those based off of popular
> >>>>> distributed in-memory caches. Ehcache is a popular cache mechanism
> and
> >>>>> API
> >>>>> that comes to bind. It comes bundled with a non-distributed
> >>>>> implementation
> >>>>> but there are commercial distributed implementations of it as well
> like
> >>>>> BigMemory.
> >>>>>
> >>>>> Given our needs for fault tolerance we may not be able to adopt the
> >>>>> ehcache
> >>>>> API as is but an extension of it might work. We would still provide a
> >>>>> default implementation but going off of a well recognized API will
> >>>>> facilitate development of other implementations in future based off
> of
> >>>>> popular implementations already available. We will need to
> investigate
> >>>>> if
> >>>>> we can use the API as is or with relatively straightforward
> extensions
> >>>>> which will be a positive for using it. But if the API turns out to be
> >>>>> significantly deviating from what we need then that would be a
> >>>>> negative.
> >>>>>
> >>>>> Also it would be great if we could support an iterator to scan all
> the
> >>>>> keys, lazy loading as needed, since this need comes up from time to
> >>>>> time in
> >>>>> different scenarios such as change data capture calculations.
> >>>>>
> >>>>> Thanks.
> >>>>>
> >>>>> On Mon, Oct 19, 2015 at 9:10 PM, Chandni Singh <
> >>>>> [email protected]>
> >>>>> wrote:
> >>>>>
> >>>>> > Hi All,
> >>>>> >
> >>>>> > While working on making the Join operator fault-tolerant, we
> >>>>> realized the
> >>>>> > need of a fault-tolerant Cache in Malhar library.
> >>>>> >
> >>>>> > This cache is useful for any operator which is state-full and
> stores
> >>>>> > key/values for a very long period (more than an hour).
> >>>>> >
> >>>>> > The problem with just having a non-transient HashMap for the cache
> >>>>> is that
> >>>>> > over a period of time this state will become so large that
> >>>>> checkpointing it
> >>>>> > will be very costly and will cause bigger issues.
> >>>>> >
> >>>>> > In order to address this we need to checkpoint the state
> >>>>> iteratively, i.e.,
> >>>>> > save the difference in state at every application window.
> >>>>> >
> >>>>> > This brings forward the following broad requirements for the cache:
> >>>>> > 1. The cache needs to have a max size and is backed by a
> filesystem.
> >>>>> >
> >>>>> > 2. When this threshold is reached, then adding more data to it
> >>>>> should evict
> >>>>> > older entries from memory.
> >>>>> >
> >>>>> > 3. To minimize cache misses, a block of data is loaded in memory.
> >>>>> >
> >>>>> > 4. A block or bucket to which a key belongs is provided by the user
> >>>>> > (operator in this case) as the information about closeness in keys
> >>>>> (that
> >>>>> > can potentially reduce future misses) is not known to the cache but
> >>>>> to the
> >>>>> > user.
> >>>>> >
> >>>>> > 5. lazy load the keys in case of operator failure
> >>>>> >
> >>>>> > 6. To offset the cost of loading a block of keys when there is a
> >>>>> miss,
> >>>>> > loading can be done asynchronously with a callback that indicates
> >>>>> when the
> >>>>> > key is available. This allows the operator to process other keys
> >>>>> which are
> >>>>> > in memory.
> >>>>> >
> >>>>> > 7. data that is spilled over needs to be purged when it is not
> needed
> >>>>> > anymore.
> >>>>> >
> >>>>> >
> >>>>> > In past we solved this problem with BucketManager which is not in
> >>>>> open
> >>>>> > source now and also there were some limitations with the bucket api
> >>>>> - the
> >>>>> > biggest one is that it doesn't allow to save multiple values for a
> >>>>> key.
> >>>>> >
> >>>>> > My plan is to create a similar solution as BucketManager in Malhar
> >>>>> with
> >>>>> > improved api.
> >>>>> > Also save the data on hdfs in TFile which provides better
> >>>>> performance when
> >>>>> > saving key/values.
> >>>>> >
> >>>>> > Thanks,
> >>>>> > Chandni
> >>>>> >
> >>>>>
> >>>>
> >>>>
> >>>
> >>
> >
>

Re: Refactoring of IdempotentStorageManager

Reply via email to