Another approach is to treat an entry in the bucket data file as: <time><key><value>
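
As a rough, self-contained sketch of this layout and of the purge/read behaviour discussed below (all class and method names here are hypothetical, and the length-prefixed binary encoding is just one possible choice, not an actual implementation):

import java.io.*;
import java.util.*;

class TimeKeyValueFile {

    // Append one <time><key><value> entry; values are never updated in place.
    static void append(DataOutputStream out, long time, byte[] key, byte[] value)
            throws IOException {
        out.writeLong(time);        // event time (or window id)
        out.writeInt(key.length);   // length-prefixed key
        out.write(key);
        out.writeInt(value.length); // length-prefixed value
        out.write(value);
    }

    // Purge check: appends arrive in time order, so the last entry carries the
    // newest time; if even that is expired, the whole file can be deleted.
    static boolean isExpired(File bucketFile, long purgeBefore) throws IOException {
        long lastTime = Long.MIN_VALUE;
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(bucketFile)))) {
            while (in.available() > 0) {
                lastTime = in.readLong();
                in.skipBytes(in.readInt()); // skip key
                in.skipBytes(in.readInt()); // skip value
            }
        }
        return lastTime < purgeBefore;
    }

    // Read: data is not sorted by key, so a lookup is a full scan; the newest
    // entry for the key wins. A time range on the query lets us skip entries.
    static byte[] get(File bucketFile, byte[] wantedKey, long fromTime, long toTime)
            throws IOException {
        byte[] result = null;
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(bucketFile)))) {
            while (in.available() > 0) {
                long time = in.readLong();
                byte[] key = new byte[in.readInt()];
                in.readFully(key);
                byte[] value = new byte[in.readInt()];
                in.readFully(value);
                if (time >= fromTime && time <= toTime && Arrays.equals(key, wantedKey)) {
                    result = value; // later entries override earlier ones
                }
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("bucket", ".data");
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(f)))) {
            append(out, 100L, "k1".getBytes(), "v1".getBytes());
            append(out, 200L, "k1".getBytes(), "v2".getBytes()); // value changed: new entry
        }
        System.out.println(new String(get(f, "k1".getBytes(), 0L, 300L))); // prints v2
        System.out.println(isExpired(f, 150L)); // false: last entry (t=200) not yet expired
        f.delete();
    }
}

The main point the sketch tries to capture: appends keep the file sorted by time, so expiry of the whole file can be decided from its last entry, while a key lookup has to scan every entry.
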
Time can be extracted from the tuple (or the windowId can be used). With this approach purging can be simple: since entries are appended in time order, each bucket data file is sorted by time, so for each file we check the last entry and delete the file if that entry is expired.

Writing to a bucket data file can also be simple. We never update the value of a key; we always append a new entry for the key when its value changes. The con is multiple entries per key. If the tuples are not out of order, we may never have to re-write a bucket data file once it is complete.

Reading is a problem here. The whole bucket needs to be de-serialized to find a key, since the data is no longer sorted by key on disk. If the query for a key specifies a time range, that read can be optimized.

With Tim's approach, purging can be triggered asynchronously at regular intervals and may even delete a data file that hasn't been updated for some time and whose latest entry is expired. Even though writes may not be that complicated with that approach either, updating values whose length changes (for example, in a join operation a value is a growing list) may result in many small stray files.

Chandni
