Forgot to mention: this would be a lazy deletion/purging strategy.

Thanks,
Tim
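A rough sketch of the purge-on-rewrite idea from the quoted mail below, assuming each key in a bucket data file carries a write timestamp and entries expire after a fixed TTL. The class and method names here are illustrative only, not part of the actual BucketManager API:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of lazy purging: when a bucket data file is rewritten, entries whose
// timestamp has expired are simply not copied into the new file.
public class LazyPurgeSketch {

    // A value paired with the time it was written. Illustrative type, not
    // part of the real bucket file format.
    static class TimestampedValue {
        final byte[] value;
        final long writeTimeMillis;

        TimestampedValue(byte[] value, long writeTimeMillis) {
            this.value = value;
            this.writeTimeMillis = writeTimeMillis;
        }
    }

    // Called when a bucket data file is updated: scan the timestamp of each
    // key and carry forward only the entries still within the TTL window.
    static Map<String, TimestampedValue> rewriteBucket(
            Map<String, TimestampedValue> bucket, long nowMillis, long ttlMillis) {
        Map<String, TimestampedValue> fresh = new HashMap<>();
        for (Map.Entry<String, TimestampedValue> e : bucket.entrySet()) {
            if (nowMillis - e.getValue().writeTimeMillis <= ttlMillis) {
                fresh.put(e.getKey(), e.getValue());
            }
        }
        return fresh;
    }

    public static void main(String[] args) {
        Map<String, TimestampedValue> bucket = new HashMap<>();
        bucket.put("old", new TimestampedValue(new byte[]{1}, 0L));
        bucket.put("new", new TimestampedValue(new byte[]{2}, 9_000L));
        // Rewrite at t=10s with a 5s TTL: "old" expires, "new" survives.
        Map<String, TimestampedValue> fresh = rewriteBucket(bucket, 10_000L, 5_000L);
        System.out.println(fresh.keySet());
    }
}
```

As the mail notes, the trade-off is that expired keys linger until their data file happens to be rewritten.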
On Fri, Nov 27, 2015 at 10:26 PM, Timothy Farkas <[email protected]> wrote:

> Hey Chandni,
>
> I was thinking about how to implement purging. Would it be possible to
> implement purging as follows:
>
> - keep a timestamp for each key in a bucket data file
> - when a bucket data file is updated, scan the timestamp for each key and
>   remove expired keys from the new bucket data file
>
> The con of this approach is that expired keys are only removed when their
> data file is updated.
>
> Thanks,
> Tim
>
> On Fri, Nov 13, 2015 at 7:32 PM, Chandni Singh <[email protected]> wrote:
>
>> Let me know if anyone wants to collaborate with me on this.
>>
>> Thanks,
>> Chandni
>>
>> On Tue, Nov 10, 2015 at 6:18 PM, Chandni Singh <[email protected]> wrote:
>>
>>> I have added some more details about a Bucket in the document. Have a
>>> look.
>>>
>>> On Sun, Nov 8, 2015 at 10:37 PM, Chandni Singh <[email protected]> wrote:
>>>
>>>> Forgot to attach the link.
>>>>
>>>> https://docs.google.com/document/d/1gRWN9ufKSZSZD0N-pthlhpC9TZ8KwJ6hJlAX6nxl5f8/edit#heading=h.wlc0p58uzygb
>>>>
>>>> On Sun, Nov 8, 2015 at 10:36 PM, Chandni Singh <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>> This contains the overview of large state management. Some parts need
>>>>> more description, which I am working on, but please feel free to go
>>>>> through it; any feedback is appreciated.
>>>>>
>>>>> Thanks,
>>>>> Chandni
>>>>>
>>>>> On Tue, Oct 20, 2015 at 8:31 AM, Pramod Immaneni <[email protected]> wrote:
>>>>>
>>>>>> This is a much needed component, Chandni.
>>>>>>
>>>>>> The API for the cache will be important, as users will be able to plug
>>>>>> in different implementations in the future, such as those based on
>>>>>> popular distributed in-memory caches. Ehcache is a popular cache
>>>>>> mechanism and API that comes to mind. It comes bundled with a
>>>>>> non-distributed implementation, but there are commercial distributed
>>>>>> implementations of it as well, like BigMemory.
>>>>>>
>>>>>> Given our needs for fault tolerance, we may not be able to adopt the
>>>>>> Ehcache API as is, but an extension of it might work. We would still
>>>>>> provide a default implementation, but going off of a well-recognized
>>>>>> API will facilitate development of other implementations in the future
>>>>>> based on popular implementations already available. We will need to
>>>>>> investigate whether we can use the API as is or with relatively
>>>>>> straightforward extensions, which would be a positive for using it.
>>>>>> If the API turns out to deviate significantly from what we need, then
>>>>>> that would be a negative.
>>>>>>
>>>>>> Also, it would be great if we could support an iterator to scan all
>>>>>> the keys, lazily loading them as needed, since this need comes up from
>>>>>> time to time in different scenarios such as change data capture
>>>>>> calculations.
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>> On Mon, Oct 19, 2015 at 9:10 PM, Chandni Singh <[email protected]> wrote:
>>>>>>
>>>>>>> Hi All,
>>>>>>>
>>>>>>> While working on making the Join operator fault-tolerant, we realized
>>>>>>> the need for a fault-tolerant Cache in the Malhar library.
>>>>>>>
>>>>>>> This cache is useful for any operator which is stateful and stores
>>>>>>> key/values for a very long period (more than an hour).
>>>>>>>
>>>>>>> The problem with just having a non-transient HashMap for the cache is
>>>>>>> that over a period of time this state will become so large that
>>>>>>> checkpointing it will be very costly and will cause bigger issues.
>>>>>>>
>>>>>>> In order to address this, we need to checkpoint the state
>>>>>>> incrementally, i.e., save the difference in state at every
>>>>>>> application window.
>>>>>>>
>>>>>>> This brings forward the following broad requirements for the cache:
>>>>>>>
>>>>>>> 1. The cache needs to have a max size and is backed by a filesystem.
>>>>>>>
>>>>>>> 2. When this threshold is reached, adding more data to it should
>>>>>>> evict older entries from memory.
>>>>>>>
>>>>>>> 3. To minimize cache misses, a block of data is loaded into memory.
>>>>>>>
>>>>>>> 4. The block or bucket to which a key belongs is provided by the user
>>>>>>> (the operator in this case), as information about the closeness of
>>>>>>> keys (which can potentially reduce future misses) is not known to the
>>>>>>> cache but to the user.
>>>>>>>
>>>>>>> 5. Lazily load the keys in case of operator failure.
>>>>>>>
>>>>>>> 6. To offset the cost of loading a block of keys when there is a
>>>>>>> miss, loading can be done asynchronously with a callback that
>>>>>>> indicates when the key is available. This allows the operator to
>>>>>>> process other keys which are in memory.
>>>>>>>
>>>>>>> 7. Data that is spilled over needs to be purged when it is not needed
>>>>>>> anymore.
>>>>>>>
>>>>>>> In the past we solved this problem with BucketManager, which is not
>>>>>>> currently in open source, and there were also some limitations with
>>>>>>> the bucket API; the biggest one is that it doesn't allow saving
>>>>>>> multiple values for a key.
>>>>>>>
>>>>>>> My plan is to create a solution similar to BucketManager in Malhar,
>>>>>>> with an improved API. I also plan to save the data on HDFS in TFile,
>>>>>>> which provides better performance when saving key/values.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Chandni
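Requirements 1 and 2 in Chandni's list (a bounded in-memory cache that evicts older entries once a max size is reached) can be sketched with a `LinkedHashMap` in access order. This is only a sketch under assumed names; in the real component the evicted entry would be spilled to the filesystem rather than dropped, which the `lastEvicted` field stands in for here:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of a fixed-size, LRU-evicting in-memory cache (requirements 1-2).
public class BoundedCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxSize;
    private K lastEvicted;  // stands in for "spill this entry to a backing file"

    public BoundedCache(int maxSize) {
        super(16, 0.75f, true);  // accessOrder=true gives LRU iteration order
        this.maxSize = maxSize;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        if (size() > maxSize) {
            lastEvicted = eldest.getKey();  // real impl would write to filesystem here
            return true;
        }
        return false;
    }

    public K getLastEvicted() { return lastEvicted; }

    public static void main(String[] args) {
        BoundedCache<String, Integer> cache = new BoundedCache<>(2);
        cache.put("a", 1);
        cache.put("b", 2);
        cache.get("a");     // touch "a" so that "b" becomes the eldest entry
        cache.put("c", 3);  // exceeds max size: "b" is evicted
        System.out.println(cache.getLastEvicted());
        System.out.println(cache.keySet());
    }
}
```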

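Requirement 6 (asynchronous block loading with a callback when the key becomes available) could look roughly like the following. The class name, the `Map`-backed "store", and the single-threaded loader are all assumptions for illustration, not the real Malhar API:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Sketch of a cache get() that answers in-memory hits immediately and, on a
// miss, loads the key's whole bucket asynchronously, invoking a callback when
// the key is available so the operator can keep processing in-memory keys.
public class AsyncBucketCache {
    private final Map<String, String> memory = new HashMap<>();
    private final Map<Integer, Map<String, String>> backingStore;  // bucketId -> entries
    private final ExecutorService loader = Executors.newSingleThreadExecutor();

    AsyncBucketCache(Map<Integer, Map<String, String>> backingStore) {
        this.backingStore = backingStore;
    }

    // The caller supplies the bucket id, since only the operator knows which
    // keys are "close" to each other (requirement 4).
    void get(String key, int bucketId, Consumer<String> onAvailable) {
        String v;
        synchronized (memory) { v = memory.get(key); }
        if (v != null) {
            onAvailable.accept(v);  // hit: answer immediately on the caller's thread
            return;
        }
        loader.submit(() -> {       // miss: load the whole bucket off-thread
            Map<String, String> bucket = backingStore.getOrDefault(bucketId, Map.of());
            synchronized (memory) { memory.putAll(bucket); }
            onAvailable.accept(bucket.get(key));
        });
    }

    void shutdown() { loader.shutdown(); }

    public static void main(String[] args) throws Exception {
        Map<Integer, Map<String, String>> store = new HashMap<>();
        store.put(7, Map.of("k1", "v1"));
        AsyncBucketCache cache = new AsyncBucketCache(store);

        CompletableFuture<String> result = new CompletableFuture<>();
        cache.get("k1", 7, result::complete);  // miss: triggers async bucket load
        String v = result.get(5, TimeUnit.SECONDS);
        System.out.println(v);
        cache.shutdown();
    }
}
```

A real implementation would also need to bound `memory` (see the eviction sketch above the thread's requirements) and coordinate with checkpointing, which this sketch ignores.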