Chandni,

The approach that we discussed was to organize buckets (not keys) by time,
which allows you to discard the entire file on purge.

In addition, when you use a block-indexed file format, there is no need to
load the entire file to retrieve a single key.
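
To make the first part concrete, here is a rough sketch of time-organized
bucket files and the purge step (class and method names are made up, this is
not the actual implementation, and the block index is left out):

import java.io.File;

public class TimeBucketedStore
{
  private final File baseDir;            // directory holding one file per time bucket
  private final long bucketSpanMillis;   // width of one time bucket, e.g. one hour

  public TimeBucketedStore(File baseDir, long bucketSpanMillis)
  {
    this.baseDir = baseDir;
    this.bucketSpanMillis = bucketSpanMillis;
  }

  // All keys that arrive in the same time window go to the same bucket file.
  File fileFor(long eventTimeMillis)
  {
    long bucketStart = (eventTimeMillis / bucketSpanMillis) * bucketSpanMillis;
    return new File(baseDir, "bucket_" + bucketStart);
  }

  // Purge simply drops every bucket file whose window ended before the expiry
  // point; no file needs to be read or rewritten. Assumes only bucket files
  // live in baseDir.
  void purgeOlderThan(long expiryMillis)
  {
    File[] files = baseDir.listFiles();
    if (files == null) {
      return;
    }
    for (File f : files) {
      long bucketStart = Long.parseLong(f.getName().substring("bucket_".length()));
      if (bucketStart + bucketSpanMillis <= expiryMillis) {
        f.delete();
      }
    }
  }
}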

Thomas



On Sat, Nov 28, 2015 at 9:42 AM, Chandni Singh <[email protected]>
wrote:

> Another approach is to treat an entry in the bucket data file as:
> <time><key><value>
>
> Time can be extracted from the tuple (or the windowId can be used).
> With this approach, purging can be simple: for each bucket data file we
> check the last entry (since data is sorted in the bucket data file) and
> delete the file if that entry is expired.
>
> Writing to the bucket data file can be simple. We will not update the
> value of a key, but will always add a new entry for the key when its
> value changes. Con: multiple entries per key.
> If the tuples are not out of order, then we may never have to re-write a
> bucket data file that is complete.
>
> Reading is a problem here. The whole bucket needs to be de-serialized to
> find a key, since the data is no longer sorted by key on disk. If the
> query for a key specifies a time range, then that read can be optimized.
>
>
> With Tim's approach, purging can be triggered asynchronously at regular
> intervals and can even delete a data file which hasn't been updated for
> some time and whose latest entry is expired.
> Even though the writes may not be that complicated with this approach,
> updating values when the length of a value changes (for example, in a join
> operation a value is a growing list) may result in many small stray files.
>
> Chandni
>
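
For reference, a rough sketch of the <time><key><value> entry layout and the
last-entry purge check described in the quoted mail (names are made up, not
the actual implementation):

import java.util.List;

// One appended entry in a bucket data file.
class Entry
{
  long time;      // event time extracted from the tuple (or the windowId)
  byte[] key;
  byte[] value;   // a changed value is simply appended as a new entry
}

class BucketDataFilePurge
{
  // Entries are appended in time order, so the last entry carries the latest
  // time. If even that entry is expired, the whole file can be deleted
  // without rewriting it.
  static boolean canPurge(List<Entry> entries, long expiryTime)
  {
    return entries.isEmpty() || entries.get(entries.size() - 1).time < expiryTime;
  }
}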
