[ https://issues.apache.org/jira/browse/HBASE-2265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838382#action_12838382 ]
Todd Lipcon commented on HBASE-2265:
------------------------------------

bq. With a big spread of timestamps and keys, we wouldn't get much of an optimization

Exactly. If users are writing out of order, they cannot take advantage of the optimization of culling older storage. As you mentioned, bloom filters help here. For users who are writing in order, the performance should be identical to today's. I think this is exactly what we want.

bq. for a complete column family get, we'll have to touch every file, every time. This is because you are never sure if the next file contains another key/value for the result. A bloom filter would help here

Yep, and this is exactly what I would expect. Why should a column family get _not_ touch all of the files?

bq. However, during a compaction, this information is collapsed, and we end up with the duplicate key/values sitting next to each other. We might be able to cause/create an invariant that during compaction the 'newer' one comes first

It's probably worth getting consensus, but I think it would be acceptable behavior to retain only the keyval from the newest storage when the timestamps are equal. That is, if I write A:ts=1, B:ts=2, C:ts=3, D:ts=3, E:ts=3, and want to retain "latest 3", I'd end up getting writes A, B, and E.

bq. Generally the ideal solution would involve no change to the KeyValue serialization format

I agree, and I think this can be done using only the existing metadata fields, without any per-keyvalue change.

> HFile and Memstore should maintain minimum and maximum timestamps
> -----------------------------------------------------------------
>
>                 Key: HBASE-2265
>                 URL: https://issues.apache.org/jira/browse/HBASE-2265
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: regionserver
>            Reporter: Todd Lipcon
>
> In order to fix HBASE-1485 and HBASE-29, it would be very helpful to have
> HFile and Memstore track their maximum and minimum timestamps. This has the
> following nice properties:
> - for a straight Get, if an entry has already been found with timestamp
> X, and X >= HFile.maxTimestamp, the HFile doesn't need to be checked. Thus,
> the current fast behavior of get can be maintained for those who use strictly
> increasing timestamps, while those who sometimes write out-of-order still
> get "correct" behavior.
> - for a scan, the "latest timestamp" of the storage can be used to decide
> which cell wins, even if the timestamps of the cells are equal. In essence,
> rather than comparing timestamps, you are able to compare tuples of
> (row timestamp, storage.max_timestamp)
> - in general, min_timestamp(storage A) >= max_timestamp(storage B) if storage
> A was flushed after storage B.
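
To make the Get check and the tuple comparison above concrete, here is a rough sketch in Java. It is not taken from the HBase code base: the class TimestampRangeSketch, the Version type, and the methods canSkipFile and latest are all hypothetical names invented for illustration, and it assumes each cell can be tagged with the maximum timestamp of the storage it was read from. It also reproduces the A/B/C/D/E example from the comment, assuming C, D, and E landed in storages with increasing maximum timestamps.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch; none of these names exist in HBase.
final class TimestampRangeSketch {

    // One cell version, tagged with the max timestamp of the storage
    // (HFile or Memstore) it came from. The tagging is an assumption.
    static final class Version {
        final String value;
        final long timestamp;
        final long storageMaxTimestamp;

        Version(String value, long timestamp, long storageMaxTimestamp) {
            this.value = value;
            this.timestamp = timestamp;
            this.storageMaxTimestamp = storageMaxTimestamp;
        }
    }

    // Get: if an entry with timestamp X was already found and
    // X >= file max timestamp, the file cannot hold a newer entry.
    static boolean canSkipFile(long foundTimestamp, long fileMaxTimestamp) {
        return foundTimestamp >= fileMaxTimestamp;
    }

    // Newest first, comparing (cell timestamp, storage max timestamp) tuples.
    static final Comparator<Version> NEWEST_FIRST =
        Comparator.comparingLong((Version v) -> v.timestamp)
                  .thenComparingLong(v -> v.storageMaxTimestamp)
                  .reversed();

    // Keep at most maxVersions distinct timestamps; when timestamps are
    // equal, only the version from the newest storage is retained.
    static List<String> latest(List<Version> versions, int maxVersions) {
        List<Version> sorted = new ArrayList<>(versions);
        sorted.sort(NEWEST_FIRST);
        List<String> kept = new ArrayList<>();
        long lastTimestamp = 0;
        for (Version v : sorted) {
            if (!kept.isEmpty() && v.timestamp == lastTimestamp) {
                continue; // same timestamp, older storage: dropped
            }
            if (kept.size() == maxVersions) {
                break;
            }
            kept.add(v.value);
            lastTimestamp = v.timestamp;
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Version> writes = Arrays.asList(
            new Version("A", 1, 1), new Version("B", 2, 2),
            new Version("C", 3, 3), new Version("D", 3, 4),
            new Version("E", 3, 5));
        System.out.println(latest(writes, 3)); // prints [E, B, A]
    }
}
```

Note that two storages can share the same maximum timestamp, in which case the tuple comparison alone does not break the tie; a real implementation would presumably need a further tiebreaker such as flush order.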