Matt Corgan created HBASE-6093:
----------------------------------

             Summary: Flatten timestamps during flush and compaction
                 Key: HBASE-6093
                 URL: https://issues.apache.org/jira/browse/HBASE-6093
             Project: HBase
          Issue Type: New Feature
          Components: io, performance, regionserver
            Reporter: Matt Corgan
            Priority: Minor


Many applications run with maxVersions=1 and do not care about timestamps, or 
they store a single timestamp per row as a normal KeyValue rather than relying 
on per-cell timestamps.

DataBlockEncoders like those in HBASE-4218 and HBASE-4676 often encode each 
timestamp as a diff from the previous timestamp or from the minimum timestamp 
in the block.  If all timestamps in a block are the same, they compress to 
roughly 8 bytes or fewer in total per block.  This can be a 10% to 25% space 
savings for some schemas, and the savings are realized both on disk and in the 
block cache.
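
As a rough illustration of why identical timestamps nearly vanish under a 
diff-from-minimum scheme, here is a minimal Java sketch.  It is not the encoder 
from HBASE-4218 or HBASE-4676: the class TimestampDiffSketch, its methods, and 
the fixed-width cost model (8 bytes for the block minimum plus a fixed number 
of bytes per diff) are assumptions made only for illustration.

```java
import java.util.Arrays;

// Hypothetical cost model, not an actual HBase DataBlockEncoder.
final class TimestampDiffSketch {

    /** Bytes needed to hold the non-negative value v (0 bytes when v == 0). */
    static int bytesNeeded(long v) {
        return (64 - Long.numberOfLeadingZeros(v) + 7) / 8;
    }

    /**
     * Assumed layout: the block minimum is stored once as an 8-byte long,
     * then every cell stores (timestamp - min) at a fixed width chosen
     * from the largest diff in the block.
     */
    static long encodedTimestampBytes(long[] timestamps) {
        long min = Arrays.stream(timestamps).min().orElse(0L);
        long maxDiff = 0;
        for (long ts : timestamps) {
            maxDiff = Math.max(maxDiff, ts - min);
        }
        int width = bytesNeeded(maxDiff);
        return 8L + (long) width * timestamps.length;
    }
}
```

Under this model, a block whose timestamps are all equal has a largest diff of 
0, so the per-cell width is 0 and only the 8-byte minimum remains, regardless 
of how many cells the block holds.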

We could add a ColumnFamily setting flattenTimestamps=[true/false].  If true, 
then every timestamp is rewritten during a flush/compaction to the value of 
currentTimeMillis() taken at the start of that flush/compaction.  If all 
timestamps in a file are made identical, the encoder can eliminate them.
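
The proposed behavior for the simple case could be sketched as follows.  This 
is only an illustration under stated assumptions: Cell is a hypothetical 
stand-in for a KeyValue, and flatten() with its flushStartTime parameter is an 
invented helper, not an existing HBase API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the proposed flattenTimestamps setting.
final class FlattenTimestampsSketch {

    /** Simplified, immutable stand-in for a KeyValue (not the HBase class). */
    static final class Cell {
        final String row;
        final String qualifier;
        final long timestamp;
        final String value;

        Cell(String row, String qualifier, long timestamp, String value) {
            this.row = row;
            this.qualifier = qualifier;
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    /**
     * Returns the cells to write to the new file.  When flattenTimestamps is
     * true, every cell is copied with its timestamp replaced by
     * flushStartTime; otherwise the cells are passed through unchanged.
     */
    static List<Cell> flatten(List<Cell> cells, boolean flattenTimestamps,
                              long flushStartTime) {
        if (!flattenTimestamps) {
            return new ArrayList<>(cells);
        }
        List<Cell> out = new ArrayList<>(cells.size());
        for (Cell c : cells) {
            out.add(new Cell(c.row, c.qualifier, flushStartTime, c.value));
        }
        return out;
    }
}
```

A caller would capture System.currentTimeMillis() once, before the 
flush/compaction begins, and pass that single value as flushStartTime so that 
every cell in the resulting file shares it.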

The simplest use case is probably one where all inserts are type=Put, there 
are no overwrites, and there are no deletes.  As use cases get more complex, 
so does the implementation.

For example, what happens when there is a Put and a Delete of the same cell in 
the same memstore?  Maybe for a flush at t=flushStartTime, the Put gets 
timestamp=t, and the Delete gets timestamp=t+1.  Or maybe HBASE-4241 could take 
care of this problem.
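
The t / t+1 idea above can be sketched as below.  The enum, class, and method 
names are hypothetical, and the sketch covers only the case described here, 
where the Delete is meant to remain newer than the Put it masks.  It does not 
handle a Put issued after the Delete, and it says nothing about what 
HBASE-4241 would do.

```java
// Hypothetical sketch of assigning flattened timestamps by cell type.
final class PutDeleteFlattenSketch {

    enum Type { PUT, DELETE }

    /**
     * Puts receive the flush start time t; Deletes receive t + 1 so that a
     * Delete still carries a newer timestamp than a Put of the same cell
     * after both have been flattened.
     */
    static long flattenedTimestamp(Type type, long flushStartTime) {
        return type == Type.DELETE ? flushStartTime + 1 : flushStartTime;
    }
}
```

Note that with this scheme a file containing both types holds two distinct 
timestamps (t and t+1) rather than one, so the diffs are no longer all zero, 
though they remain very small.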

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
