[ 
https://issues.apache.org/jira/browse/PHOENIX-5604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16989101#comment-16989101
 ] 

Geoffrey Jacoby commented on PHOENIX-5604:
------------------------------------------

[~kozdemir] - I didn't check the possibility that the mutation durability flag 
might not be honored down the line. Will do so. (Though if it's not honored, we 
might as well remove the skip-WAL flag anyway.)
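
As a point of reference, "honoring the durability flag" can be shown in a minimal toy model (hypothetical Python, not the actual Phoenix/HBase write path): the region appends an edit to the WAL only when the mutation's durability is not SKIP_WAL. If some lower layer ignored the flag, the edit would land in the WAL regardless.

```python
from enum import Enum

class Durability(Enum):
    # Mirrors the idea of HBase's Durability options, heavily simplified.
    USE_DEFAULT = 0
    SKIP_WAL = 1

class Region:
    """Toy region: applies a mutation to the store and, unless the
    mutation asks to skip it, records the edit in the WAL."""
    def __init__(self):
        self.store = {}   # in-memory "table"
        self.wal = []     # list of (row, value) edits

    def put(self, row, value, durability=Durability.USE_DEFAULT):
        self.store[row] = value
        if durability is not Durability.SKIP_WAL:  # the flag is honored here
            self.wal.append((row, value))

region = Region()
region.put("r1", "v1")                                  # WAL'd
region.put("r2", "v2", durability=Durability.SKIP_WAL)  # not WAL'd
```

Both rows are visible in the store, but only "r1" ever reaches the WAL.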

To your other point, when I say out of sync, I am referring not to "what would 
clients see if they queried" but to the actual state of the table on disk. 
Replication in HBase is eventually consistent, but it promises that state will 
converge and that it will keep trying forever to do so. Anything that can cause 
a permanent divergence (if no one queries to trigger the read repair) violates 
that promise.

While IndexScrutiny would trigger read repair, any HBase-level utility such as 
VerifyReplication (or more sophisticated tools that compare checksums or table 
snapshots) would show consistency errors between the primary and secondary 
clusters. 
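
To make the divergence concrete, here is a toy sketch (hypothetical Python; the names are illustrative, not VerifyReplication's actual implementation) of WAL-tailing replication: an edit that skips the WAL never reaches the secondary, and a row-by-row comparison flags it from then on.

```python
primary_store, primary_wal, secondary_store = {}, [], {}

def put(row, value, skip_wal=False):
    """Apply a write on the primary; record it in the WAL unless skipped."""
    primary_store[row] = value
    if not skip_wal:
        primary_wal.append((row, value))

def replicate():
    """Toy replication: the secondary only ever sees edits that reached
    the primary's WAL, mirroring HBase's WAL-tailing replication."""
    for row, value in primary_wal:
        secondary_store[row] = value

put("r1", "v1")                       # normal write
put("r2", "repaired", skip_wal=True)  # e.g. a skip-WAL read repair

replicate()  # replication keeps running, but it can only ship WAL entries

# VerifyReplication-style comparison: rows present on only one side, or
# differing in value, are consistency errors.
errors = {r for r in primary_store.keys() | secondary_store.keys()
          if primary_store.get(r) != secondary_store.get(r)}
print(errors)  # {'r2'} -- the skip-WAL edit never reaches the secondary
```

No amount of re-running `replicate()` heals "r2": the edit simply is not in the WAL, so the replica diverges permanently unless something else rewrites the row.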

> Index rebuilds and read repairs should not skip WAL
> ---------------------------------------------------
>
>                 Key: PHOENIX-5604
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-5604
>             Project: Phoenix
>          Issue Type: Bug
>            Reporter: Geoffrey Jacoby
>            Assignee: Geoffrey Jacoby
>            Priority: Major
>         Attachments: PHOENIX-5604-4.x-HBase-1.5.patch
>
>
> Currently both Index read repairs and IndexTool build/rebuilds in the new 
> design continue to skip the WAL, following the same pattern the old Indexer 
> used. However, there are key differences between the old and new logic that 
> make this no longer the correct choice.
> First, recall that all HBase replication is based on tailing the WAL, and 
> that any transaction that skips the WAL doesn't get replicated. 
> In the old logic, the data table write (and WAL append) would be accompanied 
> by an IndexedKeyValue which would contain enough information to reconstitute 
> the index edit in the event of a failure before the index edit could be 
> committed. So skipping the WAL during recovery was _potentially_ OK, because 
> writing to the WAL would be redundant locally. (But that still seems wrong to 
> me in a case with replication, since I don't believe IndexedKeyValues are 
> replicated, given that they use the "magic" METAFAMILY cf.)  
> In the new logic, on a normal write, we write to the index first (which will 
> go into a WAL), then the data table (into a potentially different RS's WAL), 
> and lastly the verified flag flip into the Index, into the original index 
> write's WAL. If something goes wrong with stage 2 or 3, read repair will fix 
> it, but if the repair action – whether a put or delete – doesn't go into the 
> WAL, a DR buddy of the index will be out of sync. 
> This is even more important for an async initial build of an index, where, if 
> I understand correctly, there is no WAL append for the index write at all in 
> the current UngroupedAggregateRegionObserver rebuild logic. The same would be 
> true for a rebuild of a new-style index after non-Phoenix corruption (such as 
> at the HDFS or raw HBase level). 
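
The three-stage write and the skip-WAL repair hazard described in the issue above can be sketched as a toy model (hypothetical Python with simplified single-row semantics; not the actual IndexRegionObserver code):

```python
# Toy model of the three-stage index write: (1) unverified index row,
# (2) data table row, (3) flip index row to verified. Each stage that
# goes through the WAL is replicated; a repair that skips the WAL is not.
index, data, index_wal, data_wal = {}, {}, [], []

def wal_put(store, wal, row, value, skip_wal=False):
    store[row] = value
    if not skip_wal:
        wal.append((row, value))

def indexed_write(row, value, fail_after_stage=3):
    wal_put(index, index_wal, row, (value, "unverified"))    # stage 1
    if fail_after_stage >= 2:
        wal_put(data, data_wal, row, value)                  # stage 2
    if fail_after_stage >= 3:
        wal_put(index, index_wal, row, (value, "verified"))  # stage 3

def read_repair(row, skip_wal):
    value, state = index[row]
    if state == "unverified":
        if row in data:  # data write succeeded: finish the verified flip
            wal_put(index, index_wal, row, (value, "verified"), skip_wal)
        else:            # orphan index row: delete it
            del index[row]
            if not skip_wal:
                index_wal.append((row, None))

indexed_write("r1", "v1", fail_after_stage=2)  # crash before stage 3
read_repair("r1", skip_wal=True)
# The index is repaired locally, but the repair never hit index_wal, so a
# WAL-tailing replica of the index still holds only the unverified row.
```

With `skip_wal=False` the repair would append the verified flip to `index_wal` and the replica would converge; with `skip_wal=True` it never does.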



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
