Let me know if anyone want to collaborate with me on this. Thanks, Chandni
On Tue, Nov 10, 2015 at 6:18 PM, Chandni Singh <[email protected]> wrote: > Have added some more details about a Bucket in the document. Have a look. > > On Sun, Nov 8, 2015 at 10:37 PM, Chandni Singh <[email protected]> > wrote: > >> Forgot to attach the link. >> >> https://docs.google.com/document/d/1gRWN9ufKSZSZD0N-pthlhpC9TZ8KwJ6hJlAX6nxl5f8/edit#heading=h.wlc0p58uzygb >> >> >> On Sun, Nov 8, 2015 at 10:36 PM, Chandni Singh <[email protected]> >> wrote: >> >>> Hi, >>> This contains the overview of large state management. >>> Some parts need more description which I am working on but please free >>> to go through it and any feedback is appreciated. >>> >>> Thanks, >>> Chandni >>> >>> >>> On Tue, Oct 20, 2015 at 8:31 AM, Pramod Immaneni <[email protected] >>> > wrote: >>> >>>> This is a much needed component Chandni. >>>> >>>> The API for the cache will be important as users will be able to plugin >>>> different implementations in future like those based off of popular >>>> distributed in-memory caches. Ehcache is a popular cache mechanism and >>>> API >>>> that comes to bind. It comes bundled with a non-distributed >>>> implementation >>>> but there are commercial distributed implementations of it as well like >>>> BigMemory. >>>> >>>> Given our needs for fault tolerance we may not be able to adopt the >>>> ehcache >>>> API as is but an extension of it might work. We would still provide a >>>> default implementation but going off of a well recognized API will >>>> facilitate development of other implementations in future based off of >>>> popular implementations already available. We will need to investigate >>>> if >>>> we can use the API as is or with relatively straightforward extensions >>>> which will be a positive for using it. But if the API turns out to be >>>> significantly deviating from what we need then that would be a negative. >>>> >>>> Also it would be great if we could support an iterator to scan all the >>>> keys, lazy loading as needed, since this need comes up from time to >>>> time in >>>> different scenarios such as change data capture calculations. >>>> >>>> Thanks. >>>> >>>> On Mon, Oct 19, 2015 at 9:10 PM, Chandni Singh <[email protected] >>>> > >>>> wrote: >>>> >>>> > Hi All, >>>> > >>>> > While working on making the Join operator fault-tolerant, we realized >>>> the >>>> > need of a fault-tolerant Cache in Malhar library. >>>> > >>>> > This cache is useful for any operator which is state-full and stores >>>> > key/values for a very long period (more than an hour). >>>> > >>>> > The problem with just having a non-transient HashMap for the cache is >>>> that >>>> > over a period of time this state will become so large that >>>> checkpointing it >>>> > will be very costly and will cause bigger issues. >>>> > >>>> > In order to address this we need to checkpoint the state iteratively, >>>> i.e., >>>> > save the difference in state at every application window. >>>> > >>>> > This brings forward the following broad requirements for the cache: >>>> > 1. The cache needs to have a max size and is backed by a filesystem. >>>> > >>>> > 2. When this threshold is reached, then adding more data to it should >>>> evict >>>> > older entries from memory. >>>> > >>>> > 3. To minimize cache misses, a block of data is loaded in memory. >>>> > >>>> > 4. A block or bucket to which a key belongs is provided by the user >>>> > (operator in this case) as the information about closeness in keys >>>> (that >>>> > can potentially reduce future misses) is not known to the cache but >>>> to the >>>> > user. >>>> > >>>> > 5. lazy load the keys in case of operator failure >>>> > >>>> > 6. To offset the cost of loading a block of keys when there is a miss, >>>> > loading can be done asynchronously with a callback that indicates >>>> when the >>>> > key is available. This allows the operator to process other keys >>>> which are >>>> > in memory. >>>> > >>>> > 7. data that is spilled over needs to be purged when it is not needed >>>> > anymore. >>>> > >>>> > >>>> > In past we solved this problem with BucketManager which is not in open >>>> > source now and also there were some limitations with the bucket api - >>>> the >>>> > biggest one is that it doesn't allow to save multiple values for a >>>> key. >>>> > >>>> > My plan is to create a similar solution as BucketManager in Malhar >>>> with >>>> > improved api. >>>> > Also save the data on hdfs in TFile which provides better performance >>>> when >>>> > saving key/values. >>>> > >>>> > Thanks, >>>> > Chandni >>>> > >>>> >>> >>> >> >
