Forgot to mention: this would be a lazy deletion/purging strategy.

Thanks,
Tim

On Fri, Nov 27, 2015 at 10:26 PM, Timothy Farkas <[email protected]>
wrote:

> Hey Chandni,
>
> I was thinking about how to implement purging. Would it be possible to
> do it as follows:
>
>    - Keep a timestamp for each key in a bucket data file.
>    - When a bucket data file is updated, scan the timestamp for each key
> and remove expired keys from the new bucket data file.
>
> The con of this approach is that expired keys are only removed when their
> data file is updated.
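
The rewrite-time purge described above could be sketched roughly as below. The class name, the `Entry` layout, and the one-hour TTL are illustrative assumptions, not part of the actual proposal:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of lazy purging: each key in a bucket carries the
// timestamp of its last write, and expired keys are simply not copied
// forward when the bucket data file is rewritten.
public class LazyPurgeSketch {
    // Assumed expiry period; a real operator would make this configurable.
    public static final long TTL_MILLIS = 60 * 60 * 1000L;

    public static class Entry {
        public final byte[] value;
        public final long writeTime; // timestamp stored alongside the key

        public Entry(byte[] value, long writeTime) {
            this.value = value;
            this.writeTime = writeTime;
        }
    }

    // Rewrite a bucket: keep only entries still within the TTL. Expired keys
    // are therefore removed only when their bucket data file is updated,
    // which is exactly the con noted above.
    public static Map<String, Entry> rewriteBucket(Map<String, Entry> bucket, long now) {
        Map<String, Entry> fresh = new HashMap<>();
        for (Map.Entry<String, Entry> e : bucket.entrySet()) {
            if (now - e.getValue().writeTime <= TTL_MILLIS) {
                fresh.put(e.getKey(), e.getValue());
            }
        }
        return fresh;
    }
}
```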
>
> Thanks,
> Tim
>
> On Fri, Nov 13, 2015 at 7:32 PM, Chandni Singh <[email protected]>
> wrote:
>
>> Let me know if anyone wants to collaborate with me on this.
>>
>> Thanks,
>> Chandni
>>
>> On Tue, Nov 10, 2015 at 6:18 PM, Chandni Singh <[email protected]>
>> wrote:
>>
>> > I have added some more details about a Bucket in the document. Have a
>> > look.
>> >
>> > On Sun, Nov 8, 2015 at 10:37 PM, Chandni Singh <[email protected]>
>> > wrote:
>> >
>> >> Forgot to attach the link.
>> >>
>> >> https://docs.google.com/document/d/1gRWN9ufKSZSZD0N-pthlhpC9TZ8KwJ6hJlAX6nxl5f8/edit#heading=h.wlc0p58uzygb
>> >>
>> >>
>> >> On Sun, Nov 8, 2015 at 10:36 PM, Chandni Singh <[email protected]>
>> >> wrote:
>> >>
>> >>> Hi,
>> >>> This contains the overview of large state management.
>> >>> Some parts need more description, which I am working on, but please feel
>> >>> free to go through it; any feedback is appreciated.
>> >>>
>> >>> Thanks,
>> >>> Chandni
>> >>>
>> >>>
>> >>> On Tue, Oct 20, 2015 at 8:31 AM, Pramod Immaneni <[email protected]>
>> >>> wrote:
>> >>>
>> >>>> This is a much needed component Chandni.
>> >>>>
>> >>>> The API for the cache will be important, as users will be able to plug
>> >>>> in different implementations in the future, like those based off of
>> >>>> popular distributed in-memory caches. Ehcache is a popular cache
>> >>>> mechanism and API that comes to mind. It comes bundled with a
>> >>>> non-distributed implementation, but there are commercial distributed
>> >>>> implementations of it as well, like BigMemory.
>> >>>>
>> >>>> Given our needs for fault tolerance we may not be able to adopt the
>> >>>> Ehcache API as is, but an extension of it might work. We would still
>> >>>> provide a default implementation, but building on a well-recognized API
>> >>>> will make it easier to develop other implementations in the future
>> >>>> based on popular caches already available. We will need to investigate
>> >>>> whether we can use the API as is or with relatively straightforward
>> >>>> extensions, which would be a point in its favor; if the API deviates
>> >>>> significantly from what we need, that would be a negative.
>> >>>>
>> >>>> Also, it would be great if we could support an iterator to scan all
>> >>>> the keys, lazy loading as needed, since this need comes up from time to
>> >>>> time in different scenarios such as change data capture calculations.
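
Such a lazy key-scan iterator could look roughly like the sketch below; the bucket ids and the per-bucket key loader are hypothetical stand-ins for whatever the spill format actually exposes:

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Function;

// Illustrative sketch: iterate over all keys across spilled buckets,
// loading one bucket's key set at a time instead of pulling the whole
// key space into memory up front.
public class LazyKeyIterator implements Iterator<String> {
    private final List<Integer> bucketIds;                // ids of spilled buckets (assumed)
    private final Function<Integer, List<String>> loader; // loads one bucket's keys (assumed)
    private int nextBucket = 0;
    private Iterator<String> current = null;

    public LazyKeyIterator(List<Integer> bucketIds, Function<Integer, List<String>> loader) {
        this.bucketIds = bucketIds;
        this.loader = loader;
    }

    @Override
    public boolean hasNext() {
        // Advance to the next non-empty bucket, loading lazily on demand.
        while ((current == null || !current.hasNext()) && nextBucket < bucketIds.size()) {
            current = loader.apply(bucketIds.get(nextBucket++)).iterator();
        }
        return current != null && current.hasNext();
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return current.next();
    }
}
```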
>> >>>>
>> >>>> Thanks.
>> >>>>
>> >>>> On Mon, Oct 19, 2015 at 9:10 PM, Chandni Singh <[email protected]>
>> >>>> wrote:
>> >>>>
>> >>>> > Hi All,
>> >>>> >
>> >>>> > While working on making the Join operator fault-tolerant, we realized
>> >>>> > the need for a fault-tolerant cache in the Malhar library.
>> >>>> >
>> >>>> > This cache is useful for any operator that is stateful and stores
>> >>>> > key/values for a very long period (more than an hour).
>> >>>> >
>> >>>> > The problem with just having a non-transient HashMap for the cache
>> >>>> > is that over a period of time this state will become so large that
>> >>>> > checkpointing it will be very costly and will cause bigger issues.
>> >>>> >
>> >>>> > In order to address this we need to checkpoint the state
>> >>>> > incrementally, i.e., save the difference in state at every
>> >>>> > application window.
>> >>>> >
>> >>>> > This brings forward the following broad requirements for the cache:
>> >>>> > 1. The cache needs to have a max size and is backed by a filesystem.
>> >>>> >
>> >>>> > 2. When this threshold is reached, adding more data should evict
>> >>>> > older entries from memory.
>> >>>> >
>> >>>> > 3. To minimize cache misses, a block of data is loaded into memory.
>> >>>> >
>> >>>> > 4. The block or bucket to which a key belongs is provided by the user
>> >>>> > (the operator in this case), since information about closeness of
>> >>>> > keys (which can potentially reduce future misses) is known to the
>> >>>> > user, not to the cache.
>> >>>> >
>> >>>> > 5. Lazy-load the keys in case of operator failure.
>> >>>> >
>> >>>> > 6. To offset the cost of loading a block of keys when there is a
>> >>>> > miss, loading can be done asynchronously with a callback that
>> >>>> > indicates when the key is available. This allows the operator to
>> >>>> > process other keys which are in memory.
>> >>>> >
>> >>>> > 7. Data that is spilled over needs to be purged when it is not needed
>> >>>> > anymore.
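
Requirements 1, 2 and 6 could be sketched roughly as follows; the class name, the loader function, and the LRU eviction policy are illustrative assumptions, not the actual Malhar design:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

// Illustrative sketch: a size-bounded in-memory cache backed by a loader
// (standing in for the filesystem), evicting the least recently used entry
// on overflow and loading missing keys asynchronously. Single-threaded
// sketch only; a real implementation needs synchronization and checkpointing.
public class BoundedCacheSketch<K, V> {
    private final Function<K, V> backingLoad; // e.g. read from a bucket file (assumed)
    private final LinkedHashMap<K, V> memory;

    public BoundedCacheSketch(int maxSize, Function<K, V> backingLoad) {
        this.backingLoad = backingLoad;
        // Access-order LinkedHashMap gives simple LRU eviction (requirement 2).
        this.memory = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxSize; // requirement 1: bounded size
            }
        };
    }

    // Completes immediately on a hit, asynchronously on a miss (requirement 6),
    // so the operator can register a callback and keep processing keys that
    // are already in memory.
    public CompletableFuture<V> get(K key) {
        V hit = memory.get(key);
        if (hit != null) {
            return CompletableFuture.completedFuture(hit);
        }
        return CompletableFuture.supplyAsync(() -> {
            V loaded = backingLoad.apply(key);
            memory.put(key, loaded);
            return loaded;
        });
    }

    public void put(K key, V value) {
        memory.put(key, value);
    }

    public boolean containsInMemory(K key) {
        return memory.containsKey(key);
    }
}
```

The `removeEldestEntry` override is just the simplest way to get LRU semantics in plain Java; a production cache would also have to write evicted entries to the backing store before dropping them.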
>> >>>> >
>> >>>> >
>> >>>> > In the past we solved this problem with BucketManager, which is not
>> >>>> > open source now. There were also some limitations with the bucket
>> >>>> > API; the biggest one is that it doesn't allow saving multiple values
>> >>>> > for a key.
>> >>>> >
>> >>>> > My plan is to create a solution similar to BucketManager in Malhar
>> >>>> > with an improved API, and to save the data on HDFS in TFile, which
>> >>>> > provides better performance when saving key/values.
>> >>>> >
>> >>>> > Thanks,
>> >>>> > Chandni
>> >>>> >
>> >>>>
>> >>>
>> >>>
>> >>
>> >
>>
>
>
