Github user JamesRTaylor commented on a diff in the pull request:
https://github.com/apache/phoenix/pull/8#discussion_r16794301
--- Diff:
phoenix-core/src/main/java/org/apache/phoenix/iterate/DefaultParallelIteratorRegionSplitter.java
---
@@ -138,14 +146,10 @@ public boolean apply(HRegionLocation location) {
// split each region in s splits such that:
// s = max(x) where s * x < t
//
- // The idea is to align splits with region boundaries. If rows are not evenly
- // distributed across regions, using this scheme compensates for regions that
- // have more rows than others, by applying tighter splits and therefore spawning
- // off more scans over the overloaded regions.
- int splitsPerRegion = getSplitsPerRegion(regions.size());
// Create a multi-map of ServerName to List<KeyRange> which we'll use to round robin from to ensure
// that we keep each region server busy for each query.
- ListMultimap<HRegionLocation,KeyRange> keyRangesPerRegion = ArrayListMultimap.create(regions.size(),regions.size() * splitsPerRegion);;
+ int splitsPerRegion = getSplitsPerRegion(regions.size());
+ ListMultimap<HRegionLocation,KeyRange> keyRangesPerRegion = ArrayListMultimap.create(regions.size(),regions.size() * splitsPerRegion);
if (splitsPerRegion == 1) {
for (HRegionLocation region : regions) {
--- End diff --
Here's what I think we should do here:
- Store guideposts per column family. It's probably easiest if the PK is of
the following form:
<cf varchar not null><guidepost varbinary null>. I'm not sure there's any
value in using a VARBINARY ARRAY. We should just make sure that we can delete
the old guideposts and add the new ones easily.
- Here, you'd still loop through the regions as above, but you'd first get
all guideposts for the column families involved in the query. Let's take the
simple case where there's only one. In that case, you'd intersect all the
region boundaries with the guideposts (this will be a bit easier if the
guideposts are already sorted). The set of intersections is what gets
returned here.
- For the multi-column family case, I think we want to do the same
processing as above per column family and then we'll coalesce any overlapping
ranges.
- We have the intersect and coalesce methods you'll need in our KeyRange
class, so the code should be relatively small.
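
To make the single column family case concrete, here is a minimal, self-contained sketch. It is not the Phoenix implementation: `Range` is a hypothetical stand-in for KeyRange, int keys stand in for row-key byte arrays, and `coalesce` assumes that only strictly overlapping ranges are merged, which may differ from what the real KeyRange methods do.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class GuidepostSplitSketch {

    /** Hypothetical half-open key range [start, end); not Phoenix's KeyRange. */
    static final class Range {
        final int start, end;
        Range(int start, int end) { this.start = start; this.end = end; }
        @Override public String toString() { return "[" + start + "," + end + ")"; }
    }

    /** Splits each region at every guidepost that falls strictly inside it. */
    static List<Range> splitByGuideposts(List<Range> regions, List<Integer> sortedGuideposts) {
        List<Range> result = new ArrayList<>();
        for (Range region : regions) {
            int lower = region.start;
            for (int gp : sortedGuideposts) {
                if (gp <= lower) continue;    // at or before the current lower bound
                if (gp >= region.end) break;  // sorted input: no later guidepost can fall inside
                result.add(new Range(lower, gp));
                lower = gp;
            }
            result.add(new Range(lower, region.end)); // remainder of the region
        }
        return result;
    }

    /** Merges strictly overlapping ranges (assumed semantics); touching ranges stay separate. */
    static List<Range> coalesce(List<Range> ranges) {
        List<Range> sorted = new ArrayList<>(ranges);
        sorted.sort(Comparator.comparingInt((Range r) -> r.start).thenComparingInt(r -> r.end));
        List<Range> result = new ArrayList<>();
        for (Range r : sorted) {
            if (!result.isEmpty() && r.start < result.get(result.size() - 1).end) {
                Range last = result.remove(result.size() - 1);
                result.add(new Range(last.start, Math.max(last.end, r.end)));
            } else {
                result.add(r);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Range> regions = Arrays.asList(new Range(0, 10), new Range(10, 20));
        // prints [[0,3), [3,7), [7,10), [10,15), [15,20)]
        System.out.println(splitByGuideposts(regions, Arrays.asList(3, 7, 10, 15)));
        // prints [[0,6), [8,10)]
        System.out.println(coalesce(Arrays.asList(new Range(2, 6), new Range(0, 4), new Range(8, 10))));
    }
}
```

Because the guideposts are sorted, the inner loop can stop at the first guidepost past the region's end, which is the reason pre-sorted guideposts make the intersection simpler. How the per-column-family results should be combined before coalescing is left open here.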