Good question, Gabriel. I believe that the deleted cells are cleaned
up after a second major compaction with the KEEP_DELETED_CELLS option
enabled. Lars H. implemented this option, so he can comment more, but
AFAIK he couldn't figure out how to get them to be collected on the
first major compaction. IMHO, this seems like a bug (but what do I
know, I'm not an HBase committer :-) ).

KEEP_DELETED_CELLS is required for flashback or point-in-time
queries. IMHO, without this option, HBase doesn't really work
correctly. Though you might argue "we never do that" and turn it off,
under the covers Phoenix is doing point-in-time queries. If you have a
query that starts at t1 and runs until t5, it won't see data inserted
after t1. Say a delete was done on a row at t2. Without
KEEP_DELETED_CELLS being true, your query could potentially see the
effect of this delete, even though it happened after t1.
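
To make the t1/t2 scenario concrete, here's a toy Java model. It is
not HBase or Phoenix code: the Cell class and the majorCompact and
readAsOf methods are names made up for illustration. It assumes that a
delete marker masks every put at or before its timestamp, and that a
major compaction without KEEP_DELETED_CELLS discards both the marker
and the puts it masks:

```java
import java.util.ArrayList;
import java.util.List;

public class KeepDeletedCellsSketch {
    // Toy cell (hypothetical): a put carries a value, a delete marker carries null.
    static final class Cell {
        final long ts;
        final String value;
        Cell(long ts, String value) { this.ts = ts; this.value = value; }
        boolean isDelete() { return value == null; }
    }

    // Toy major compaction. With keepDeleted, everything survives. Without it,
    // delete markers are dropped along with every put at or before the newest marker.
    static List<Cell> majorCompact(List<Cell> cells, boolean keepDeleted) {
        if (keepDeleted) {
            return new ArrayList<>(cells);
        }
        long newestDelete = Long.MIN_VALUE;
        for (Cell c : cells) {
            if (c.isDelete()) {
                newestDelete = Math.max(newestDelete, c.ts);
            }
        }
        List<Cell> out = new ArrayList<>();
        for (Cell c : cells) {
            if (!c.isDelete() && c.ts > newestDelete) {
                out.add(c);
            }
        }
        return out;
    }

    // Point-in-time read: the newest cell with ts <= asOf decides the result;
    // anything written after asOf is ignored.
    static String readAsOf(List<Cell> cells, long asOf) {
        Cell newest = null;
        for (Cell c : cells) {
            if (c.ts <= asOf && (newest == null || c.ts > newest.ts)) {
                newest = c;
            }
        }
        return (newest == null || newest.isDelete()) ? null : newest.value;
    }

    public static void main(String[] args) {
        // Put at t0, delete at t2; the query reads as of t1.
        List<Cell> row = List.of(new Cell(0, "v0"), new Cell(2, null));
        System.out.println(readAsOf(majorCompact(row, true), 1));  // prints v0
        System.out.println(readAsOf(majorCompact(row, false), 1)); // prints null
    }
}
```

In this model the read as of t1 only returns the row when the deleted
cells are kept; once they are compacted away, the delete at t2 leaks
into a query that logically started before it.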

Perhaps the MVCC used by HBase should (does?) take care of this
automatically without us setting a max on the scan time range, but I'm
not sure. If it does, then we likely wouldn't need this to be the
default. We'd need to test this with the new ChunkedResultIterator as
well.

Maybe file a JIRA for further investigation?

Thanks,
James

On Wed, Jul 23, 2014 at 7:09 AM, Gabriel Reid <gabriel.r...@gmail.com> wrote:
> Hi,
>
> I noticed that HColumnDescriptor.KEEP_DELETED_CELLS is enabled by
> default on new Phoenix tables. This seems like a bit of an unexpected
> default, as it means (at least as far as I understand it) that data
> deleted with delete statements will never actually be cleared, even
> after a major compaction.
>
> Can anyone let me know what the reasoning is behind this? Any
> functional requirement within Phoenix that makes use of this default
> property (i.e. if I disable it in my DDL, is there anything that we
> know won't work then)? And then going further, is this something we
> definitely want to keep as a default?
>
> Thanks,
>
> Gabriel
