Have added some more details about a Bucket in the document. Have a look. On Sun, Nov 8, 2015 at 10:37 PM, Chandni Singh <[email protected]> wrote:
> Forgot to attach the link. > > https://docs.google.com/document/d/1gRWN9ufKSZSZD0N-pthlhpC9TZ8KwJ6hJlAX6nxl5f8/edit#heading=h.wlc0p58uzygb > > > On Sun, Nov 8, 2015 at 10:36 PM, Chandni Singh <[email protected]> > wrote: > >> Hi, >> This contains the overview of large state management. >> Some parts need more description which I am working on but please free to >> go through it and any feedback is appreciated. >> >> Thanks, >> Chandni >> >> >> On Tue, Oct 20, 2015 at 8:31 AM, Pramod Immaneni <[email protected]> >> wrote: >> >>> This is a much needed component Chandni. >>> >>> The API for the cache will be important as users will be able to plugin >>> different implementations in future like those based off of popular >>> distributed in-memory caches. Ehcache is a popular cache mechanism and >>> API >>> that comes to bind. It comes bundled with a non-distributed >>> implementation >>> but there are commercial distributed implementations of it as well like >>> BigMemory. >>> >>> Given our needs for fault tolerance we may not be able to adopt the >>> ehcache >>> API as is but an extension of it might work. We would still provide a >>> default implementation but going off of a well recognized API will >>> facilitate development of other implementations in future based off of >>> popular implementations already available. We will need to investigate if >>> we can use the API as is or with relatively straightforward extensions >>> which will be a positive for using it. But if the API turns out to be >>> significantly deviating from what we need then that would be a negative. >>> >>> Also it would be great if we could support an iterator to scan all the >>> keys, lazy loading as needed, since this need comes up from time to time >>> in >>> different scenarios such as change data capture calculations. >>> >>> Thanks. >>> >>> On Mon, Oct 19, 2015 at 9:10 PM, Chandni Singh <[email protected]> >>> wrote: >>> >>> > Hi All, >>> > >>> > While working on making the Join operator fault-tolerant, we realized >>> the >>> > need of a fault-tolerant Cache in Malhar library. >>> > >>> > This cache is useful for any operator which is state-full and stores >>> > key/values for a very long period (more than an hour). >>> > >>> > The problem with just having a non-transient HashMap for the cache is >>> that >>> > over a period of time this state will become so large that >>> checkpointing it >>> > will be very costly and will cause bigger issues. >>> > >>> > In order to address this we need to checkpoint the state iteratively, >>> i.e., >>> > save the difference in state at every application window. >>> > >>> > This brings forward the following broad requirements for the cache: >>> > 1. The cache needs to have a max size and is backed by a filesystem. >>> > >>> > 2. When this threshold is reached, then adding more data to it should >>> evict >>> > older entries from memory. >>> > >>> > 3. To minimize cache misses, a block of data is loaded in memory. >>> > >>> > 4. A block or bucket to which a key belongs is provided by the user >>> > (operator in this case) as the information about closeness in keys >>> (that >>> > can potentially reduce future misses) is not known to the cache but to >>> the >>> > user. >>> > >>> > 5. lazy load the keys in case of operator failure >>> > >>> > 6. To offset the cost of loading a block of keys when there is a miss, >>> > loading can be done asynchronously with a callback that indicates when >>> the >>> > key is available. This allows the operator to process other keys which >>> are >>> > in memory. >>> > >>> > 7. data that is spilled over needs to be purged when it is not needed >>> > anymore. >>> > >>> > >>> > In past we solved this problem with BucketManager which is not in open >>> > source now and also there were some limitations with the bucket api - >>> the >>> > biggest one is that it doesn't allow to save multiple values for a key. >>> > >>> > My plan is to create a similar solution as BucketManager in Malhar with >>> > improved api. >>> > Also save the data on hdfs in TFile which provides better performance >>> when >>> > saving key/values. >>> > >>> > Thanks, >>> > Chandni >>> > >>> >> >> >
