Chandni,

The approach that we discussed was to organize buckets (not keys) by time, which allows you to discard the entire file on purge.
In addition, when you use a block-indexed file format, there is no need to load the entire file to retrieve a single key.

Thomas

On Sat, Nov 28, 2015 at 9:42 AM, Chandni Singh <[email protected]> wrote:

> Another approach is to treat an entry in the bucket data file as:
> <time><key><value>
>
> Time can be extracted from the tuple (or the windowId can be used).
> With this approach purging can be simple. For each bucket data file we
> check the last entry (since data is sorted by time in the bucket data file)
> and delete the file if that entry is expired.
>
> Writing to the bucket data file can also be simple. We never update the
> value of a key in place; we always append a new entry for the key when its
> value changes. Con: multiple entries per key.
> If the tuples are not out of order, we may never have to re-write a bucket
> data file that is complete.
>
> Reading is a problem here. The whole bucket needs to be de-serialized to
> find a key, since data is no longer sorted by key on disk. If the query for
> a key specifies a time range, that read can be optimized.
>
> With Tim's approach, purging can be triggered asynchronously at regular
> intervals; it may even delete a data file which hasn't been updated for
> some time and whose latest entry is expired.
> Even though writes may not be that complicated with this approach, updating
> a value when its length changes (for example, in a join operation the value
> is a growing list) may result in many small stray files.
>
> Chandni
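The layout Chandni describes can be sketched in a few lines of Java. This is a minimal, hypothetical illustration (the class and method names are assumptions, not actual Apex APIs): every put appends a <time><key><value> entry, the purge check only inspects the last entry, and a read has to scan the whole bucket unless a time range narrows it.

import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the <time><key><value> bucket layout; not Apex code.
public class AppendOnlyBucket
{
  static class Entry
  {
    final long time;
    final String key;
    final String value;

    Entry(long time, String key, String value)
    {
      this.time = time;
      this.key = key;
      this.value = value;
    }
  }

  // Entries are kept in arrival order; with in-order tuples this is also time order.
  private final List<Entry> entries = new ArrayList<>();

  // A value change is recorded as a new entry; earlier entries are never rewritten.
  public void put(long time, String key, String value)
  {
    entries.add(new Entry(time, key, value));
  }

  // Reading is the weak spot: the whole bucket is scanned because entries are
  // not sorted by key. A time range from the query narrows the scan.
  public String get(String key, long startTime, long endTime)
  {
    String latest = null;
    for (Entry e : entries) {
      if (e.key.equals(key) && e.time >= startTime && e.time <= endTime) {
        latest = e.value; // later entries win
      }
    }
    return latest;
  }

  // Purge check: since entries are time ordered, an expired last entry means
  // the whole bucket is expired and its file can be deleted.
  public boolean isExpired(long expiryTime)
  {
    return !entries.isEmpty() && entries.get(entries.size() - 1).time < expiryTime;
  }
}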

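For contrast, a similarly hypothetical sketch of the time-organized buckets Thomas describes: the file is chosen from the tuple's time rather than its key, so a purge deletes whole files whose time span has expired and never rewrites live ones. The bucket span and naming are assumptions, and the block index that avoids loading an entire file for a single key is not shown.

import java.io.File;
import java.io.IOException;
import java.util.Iterator;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of time-bucketed files; names are illustrative, not Apex code.
public class TimeBucketedStore
{
  private final File dir;
  private final long bucketSpanMillis;
  // bucket start time -> file holding every key written during that span
  private final TreeMap<Long, File> buckets = new TreeMap<>();

  public TimeBucketedStore(File dir, long bucketSpanMillis)
  {
    this.dir = dir;
    this.bucketSpanMillis = bucketSpanMillis;
  }

  // The file is picked from the tuple time alone; keys play no part in it.
  public File bucketFor(long tupleTimeMillis) throws IOException
  {
    long bucketStart = tupleTimeMillis - (tupleTimeMillis % bucketSpanMillis);
    File file = buckets.get(bucketStart);
    if (file == null) {
      file = new File(dir, "bucket-" + bucketStart + ".data");
      file.createNewFile();
      buckets.put(bucketStart, file);
    }
    return file;
  }

  // Purge: every bucket whose span ended before the expiry point is dropped
  // as a whole file; no per-key rewrite is needed.
  public void purge(long expiryTimeMillis)
  {
    Iterator<Map.Entry<Long, File>> it = buckets.entrySet().iterator();
    while (it.hasNext()) {
      Map.Entry<Long, File> e = it.next();
      if (e.getKey() + bucketSpanMillis <= expiryTimeMillis) {
        e.getValue().delete();
        it.remove();
      } else {
        break; // TreeMap iterates in time order, so later buckets are still live
      }
    }
  }
}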