Hi Tushar, We need to move WAL from HDHT to Malhar/lib. The recent changes that you are making makes it generic and I mentioned there how it can be used internally for Idempotency.
Since you have worked mostly on the HDHT WAL, it will be great if you can move it to Malhar/lib preserving the history and make necessary changes to HDHT. There is already a ticket for it https://malhar.atlassian.net/browse/APEX-99 Let me know if you will like to take this up. Thanks, Chandni On Tue, Nov 17, 2015 at 9:57 PM, Tushar Gosavi <[email protected]> wrote: > Hi Chandni, > > Let me know if you need any help to extend HDHT WAL for this purpose. The > current functionality supports writing objects sequentially to a file and > reading sequentially from file. The serialization and deserialization needs > to be handled by upper level code. > > Regards, > - Tushar. > > > On Wed, Nov 18, 2015 at 4:20 AM, Chandni Singh <[email protected]> > wrote: > > > Hi, > > > > As part of creating a ManagedState, it was brought up that > > IdempotentStorageManager needs to be re-factored: > > 1. Re-name (have started a discussion about it in a separate thread). > > > > 2. It should be a layer over Write-ahead-log (WAL) abstraction which was > > created in HDHT. > > > > This discussion is about 2nd point. > > > > Currently IdempotentStorageManager is an abstraction above StorageAgent > > (which is in Apex core). > > IMO this doesn't need to change. We have integrated > > IdempotentStorageManager with various input/output operators in Malhar > lib > > and this abstraction works well. > > > > However the change that I think we need to make (which could be later) is > > that IdempotentStorageManager.FSIdempotentStorageManager can use > (contain) > > WAL to write state to files. > > > > Advantages of this approach: > > 1. Operators will not be needed to change because api of > > IdempotentStorageManager doesn't change. There are quite a few of them. > > 2. Parallel work is going on HDHT WAL which currently makes it very > > difficult to move stuff retaining attribution. > > 3. Once WAL abstraction is ready, FSIdempotentStorageManager can use it > > again without affecting the operators. > > 4. This is in-line with iterative development :) > > > > Chandni > > > > On Fri, Nov 13, 2015 at 7:32 PM, Chandni Singh <[email protected]> > > wrote: > > > > > Let me know if anyone want to collaborate with me on this. > > > > > > Thanks, > > > Chandni > > > > > > On Tue, Nov 10, 2015 at 6:18 PM, Chandni Singh < > [email protected]> > > > wrote: > > > > > >> Have added some more details about a Bucket in the document. Have a > > look. > > >> > > >> On Sun, Nov 8, 2015 at 10:37 PM, Chandni Singh < > [email protected] > > > > > >> wrote: > > >> > > >>> Forgot to attach the link. > > >>> > > >>> > > > https://docs.google.com/document/d/1gRWN9ufKSZSZD0N-pthlhpC9TZ8KwJ6hJlAX6nxl5f8/edit#heading=h.wlc0p58uzygb > > >>> > > >>> > > >>> On Sun, Nov 8, 2015 at 10:36 PM, Chandni Singh < > > [email protected]> > > >>> wrote: > > >>> > > >>>> Hi, > > >>>> This contains the overview of large state management. > > >>>> Some parts need more description which I am working on but please > free > > >>>> to go through it and any feedback is appreciated. > > >>>> > > >>>> Thanks, > > >>>> Chandni > > >>>> > > >>>> > > >>>> On Tue, Oct 20, 2015 at 8:31 AM, Pramod Immaneni < > > >>>> [email protected]> wrote: > > >>>> > > >>>>> This is a much needed component Chandni. > > >>>>> > > >>>>> The API for the cache will be important as users will be able to > > plugin > > >>>>> different implementations in future like those based off of popular > > >>>>> distributed in-memory caches. Ehcache is a popular cache mechanism > > and > > >>>>> API > > >>>>> that comes to bind. It comes bundled with a non-distributed > > >>>>> implementation > > >>>>> but there are commercial distributed implementations of it as well > > like > > >>>>> BigMemory. > > >>>>> > > >>>>> Given our needs for fault tolerance we may not be able to adopt the > > >>>>> ehcache > > >>>>> API as is but an extension of it might work. We would still > provide a > > >>>>> default implementation but going off of a well recognized API will > > >>>>> facilitate development of other implementations in future based off > > of > > >>>>> popular implementations already available. We will need to > > investigate > > >>>>> if > > >>>>> we can use the API as is or with relatively straightforward > > extensions > > >>>>> which will be a positive for using it. But if the API turns out to > be > > >>>>> significantly deviating from what we need then that would be a > > >>>>> negative. > > >>>>> > > >>>>> Also it would be great if we could support an iterator to scan all > > the > > >>>>> keys, lazy loading as needed, since this need comes up from time to > > >>>>> time in > > >>>>> different scenarios such as change data capture calculations. > > >>>>> > > >>>>> Thanks. > > >>>>> > > >>>>> On Mon, Oct 19, 2015 at 9:10 PM, Chandni Singh < > > >>>>> [email protected]> > > >>>>> wrote: > > >>>>> > > >>>>> > Hi All, > > >>>>> > > > >>>>> > While working on making the Join operator fault-tolerant, we > > >>>>> realized the > > >>>>> > need of a fault-tolerant Cache in Malhar library. > > >>>>> > > > >>>>> > This cache is useful for any operator which is state-full and > > stores > > >>>>> > key/values for a very long period (more than an hour). > > >>>>> > > > >>>>> > The problem with just having a non-transient HashMap for the > cache > > >>>>> is that > > >>>>> > over a period of time this state will become so large that > > >>>>> checkpointing it > > >>>>> > will be very costly and will cause bigger issues. > > >>>>> > > > >>>>> > In order to address this we need to checkpoint the state > > >>>>> iteratively, i.e., > > >>>>> > save the difference in state at every application window. > > >>>>> > > > >>>>> > This brings forward the following broad requirements for the > cache: > > >>>>> > 1. The cache needs to have a max size and is backed by a > > filesystem. > > >>>>> > > > >>>>> > 2. When this threshold is reached, then adding more data to it > > >>>>> should evict > > >>>>> > older entries from memory. > > >>>>> > > > >>>>> > 3. To minimize cache misses, a block of data is loaded in memory. > > >>>>> > > > >>>>> > 4. A block or bucket to which a key belongs is provided by the > user > > >>>>> > (operator in this case) as the information about closeness in > keys > > >>>>> (that > > >>>>> > can potentially reduce future misses) is not known to the cache > but > > >>>>> to the > > >>>>> > user. > > >>>>> > > > >>>>> > 5. lazy load the keys in case of operator failure > > >>>>> > > > >>>>> > 6. To offset the cost of loading a block of keys when there is a > > >>>>> miss, > > >>>>> > loading can be done asynchronously with a callback that indicates > > >>>>> when the > > >>>>> > key is available. This allows the operator to process other keys > > >>>>> which are > > >>>>> > in memory. > > >>>>> > > > >>>>> > 7. data that is spilled over needs to be purged when it is not > > needed > > >>>>> > anymore. > > >>>>> > > > >>>>> > > > >>>>> > In past we solved this problem with BucketManager which is not in > > >>>>> open > > >>>>> > source now and also there were some limitations with the bucket > api > > >>>>> - the > > >>>>> > biggest one is that it doesn't allow to save multiple values for > a > > >>>>> key. > > >>>>> > > > >>>>> > My plan is to create a similar solution as BucketManager in > Malhar > > >>>>> with > > >>>>> > improved api. > > >>>>> > Also save the data on hdfs in TFile which provides better > > >>>>> performance when > > >>>>> > saving key/values. > > >>>>> > > > >>>>> > Thanks, > > >>>>> > Chandni > > >>>>> > > > >>>>> > > >>>> > > >>>> > > >>> > > >> > > > > > >
