[ 
https://issues.apache.org/jira/browse/PHOENIX-4997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16668108#comment-16668108
 ] 

Karan Mehta commented on PHOENIX-4997:
--------------------------------------

Adding a few more details here for visibility.

MapReduceParallelScanGrouper.java has a method called
`getRegionLocationsFromManifest()`, which reads snapshotManifest files and
determines the regions for the snapshot. However, these regions may not be
sorted by startKey. The consumer of this method, `getParallelScans()`, expects
them to be sorted. If they are not sorted, the result may be incorrect or
overlapping scan boundaries.

The PR also fixes the issue by first sorting the HRegionInfo entries by
startKey before returning the list of all HRegionLocation.
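
To illustrate the ordering requirement, here is a minimal sketch of sorting
regions by start key. It is not the code from the patch: `SortRegionsSketch`,
`Region` and `sortByStartKey` are hypothetical stand-ins (a real implementation
would work on HRegionInfo/HRegionLocation), and it assumes Java 9+ for
`Arrays.compareUnsigned`. Row keys are compared as unsigned bytes, so the empty
start key of the first region sorts first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of the sorting step; not the actual Phoenix patch.
class SortRegionsSketch {

    // Simplified placeholder for HRegionInfo: only the key range matters here.
    static final class Region {
        final byte[] startKey;
        final byte[] endKey;

        Region(byte[] startKey, byte[] endKey) {
            this.startKey = startKey;
            this.endKey = endKey;
        }
    }

    // Order regions lexicographically by start key, treating bytes as unsigned.
    static final Comparator<Region> BY_START_KEY =
        (a, b) -> Arrays.compareUnsigned(a.startKey, b.startKey);

    // Returns a sorted copy so the caller's list is left untouched.
    static List<Region> sortByStartKey(List<Region> regions) {
        List<Region> sorted = new ArrayList<>(regions);
        sorted.sort(BY_START_KEY);
        return sorted;
    }
}
```

With the regions in start-key order, each region's end key lines up with the
next region's start key, which is the ordering `getParallelScans()` is
described as expecting above.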

> Phoenix MR on snapshots can produce duplicate rows
> --------------------------------------------------
>
>                 Key: PHOENIX-4997
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-4997
>             Project: Phoenix
>          Issue Type: Bug
>            Reporter: Karan Mehta
>            Assignee: Karan Mehta
>            Priority: Major
>         Attachments: PHOENIX-4997.master.001.patch
>
>
> Phoenix MR over snapshots uses the TableSnapshotResultIterator and
> SnapshotScanner classes for iterating/scanning over snapshots. They were
> copied from the HBase classes TableSnapshotScanner and
> ClientSideRegionScanner and modified according to Phoenix
> requirements. This decision was taken because some of the fields of these
> classes were private, so it was not possible to reuse them. HBASE-8369 is
> the main Jira.
> The framework had a bug which was fixed as part of HBASE-16011. However, the
> fix was not ported to Phoenix, and hence Phoenix MR over snapshots still
> has it. This Jira is to fix that issue.
> FYI [~akshita.malhotra] 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
