Hi All,

While working on making the Join operator fault-tolerant, we realized the
need for a fault-tolerant cache in the Malhar library.

This cache is useful for any stateful operator that stores
key/values for a very long period (more than an hour).

The problem with just having a non-transient HashMap for the cache is that
over a period of time this state will become so large that checkpointing it
will be very costly and will cause bigger issues.

In order to address this we need to checkpoint the state incrementally, i.e.,
save only the difference in state at every application window.
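
To make the idea concrete, here is a rough sketch of tracking only the
per-window difference. The class and method names (DeltaTrackingState,
endWindow) are hypothetical and not part of Malhar; deletions are not
tracked, and how the returned delta gets persisted is left out.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: keeps the full state in memory but remembers
// which entries changed since the last application window boundary.
class DeltaTrackingState<K, V> {
    private final Map<K, V> state = new HashMap<>();
    // Entries written during the current window only.
    private Map<K, V> delta = new HashMap<>();

    void put(K key, V value) {
        state.put(key, value);
        delta.put(key, value);
    }

    V get(K key) {
        return state.get(key);
    }

    // Returns the changes made in this window and starts a fresh delta.
    // The caller would persist the returned map instead of the whole state.
    Map<K, V> endWindow() {
        Map<K, V> toPersist = delta;
        delta = new HashMap<>();
        return toPersist;
    }
}
```

With this, the cost of saving state at a window boundary depends on how
many keys changed in that window, not on the total number of keys.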

This brings forward the following broad requirements for the cache:
1. The cache has a maximum in-memory size and is backed by a filesystem.

2. When this limit is reached, adding more data to the cache evicts
older entries from memory.

3. To minimize cache misses, data is loaded into memory a block at a time.

4. A block or bucket to which a key belongs is provided by the user
(operator in this case) as the information about closeness in keys (that
can potentially reduce future misses) is not known to the cache but to the
user.

5. Keys are loaded lazily in case of operator failure.

6. To offset the cost of loading a block of keys when there is a miss,
loading can be done asynchronously with a callback that indicates when the
key is available. This allows the operator to process other keys which are
in memory.

7. Data that is spilled over needs to be purged when it is no longer
needed.
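
To illustrate how these requirements could fit together, here is a rough
sketch. It is not the BucketManager API or a proposed Malhar API: every
name in it (BucketStore, InMemoryStore, BucketedCache, getAsync, purge)
is hypothetical. It assumes the limit is counted in buckets, that the
least recently used bucket is the one evicted, and it uses an in-memory
map as a stand-in for the filesystem.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical abstraction over the filesystem that backs the cache.
interface BucketStore<K, V> {
    Map<K, V> load(long bucketId);
    void save(long bucketId, Map<K, V> entries);
    void delete(long bucketId);
}

// Stand-in for a real filesystem, so the sketch runs on its own.
class InMemoryStore<K, V> implements BucketStore<K, V> {
    private final Map<Long, Map<K, V>> files = new ConcurrentHashMap<>();
    public Map<K, V> load(long bucketId) { return files.getOrDefault(bucketId, new HashMap<>()); }
    public void save(long bucketId, Map<K, V> entries) { files.put(bucketId, new HashMap<>(entries)); }
    public void delete(long bucketId) { files.remove(bucketId); }
}

class BucketedCache<K, V> {
    private final int maxBuckets;
    private final Function<K, Long> bucketOf; // supplied by the operator (req. 4)
    private final BucketStore<K, V> store;
    // Access-ordered: iteration starts at the least recently used bucket.
    private final LinkedHashMap<Long, Map<K, V>> buckets = new LinkedHashMap<>(16, 0.75f, true);

    BucketedCache(int maxBuckets, Function<K, Long> bucketOf, BucketStore<K, V> store) {
        this.maxBuckets = maxBuckets;
        this.bucketOf = bucketOf;
        this.store = store;
    }

    synchronized void put(K key, V value) {
        long id = bucketOf.apply(key);
        Map<K, V> bucket = buckets.get(id);
        if (bucket == null) {
            // Bring in any spilled entries first so a later save does not lose them.
            bucket = new HashMap<>(store.load(id));
            buckets.put(id, bucket);
        }
        bucket.put(key, value);
        evictIfNeeded();
    }

    // On a miss the whole bucket is loaded off the caller's thread (req. 3 and 6).
    CompletableFuture<V> getAsync(K key) {
        long id = bucketOf.apply(key);
        synchronized (this) {
            Map<K, V> cached = buckets.get(id);
            if (cached != null) {
                return CompletableFuture.completedFuture(cached.get(key));
            }
        }
        return CompletableFuture.supplyAsync(() -> {
            Map<K, V> loaded = store.load(id);
            synchronized (this) {
                Map<K, V> bucket = buckets.get(id);
                if (bucket == null) {
                    bucket = new HashMap<>(loaded);
                    buckets.put(id, bucket);
                    evictIfNeeded();
                }
                return bucket.get(key);
            }
        });
    }

    // Drops a bucket from memory and from the backing store (req. 7).
    synchronized void purge(long bucketId) {
        buckets.remove(bucketId);
        store.delete(bucketId);
    }

    // Spills least recently used buckets until the limit holds (req. 1 and 2).
    // Must be called while holding the lock on this cache.
    private void evictIfNeeded() {
        Iterator<Map.Entry<Long, Map<K, V>>> it = buckets.entrySet().iterator();
        while (buckets.size() > maxBuckets && it.hasNext()) {
            Map.Entry<Long, Map<K, V>> eldest = it.next();
            store.save(eldest.getKey(), eldest.getValue());
            it.remove();
        }
    }
}
```

An operator could then call cache.getAsync(key).thenAccept(...) and keep
processing keys that are already in memory while the bucket loads.
Nothing is read back eagerly after a restart in this sketch; buckets are
only loaded when a key in them is first touched (req. 5).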


In the past we solved this problem with BucketManager, which is not
currently open source, and the bucket API also had some limitations. The
biggest one is that it doesn't allow saving multiple values for a key.

My plan is to create a solution similar to BucketManager in Malhar, with
an improved API.
I also plan to save the data on HDFS in TFile format, which provides better
performance when saving key/values.

Thanks,
Chandni
