[GitHub] accumulo pull request #275: ACCUMULO-4667 Reworked the LocalityGroupIterator...

keith-turner Fri, 30 Jun 2017 12:14:15 -0700

Github user keith-turner commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/275#discussion_r125107098
  
    --- Diff: core/src/main/java/org/apache/accumulo/core/iterators/system/LocalityGroupIterator.java ---
    @@ -97,75 +133,116 @@ public static final int seek(HeapIterator hiter, LocalityGroup[] groups, Set<Byt
         else
           cfSet = Collections.emptySet();
     
    -    for (LocalityGroup lgr : groups) {
    -      // when include is set to true it means this locality groups contains
    -      // wanted column families
    -      boolean include = false;
    +    // determine the set of groups to use
    +    Collection<LocalityGroup> groupsToUse = Collections.EMPTY_LIST;
     
    -      if (cfSet.size() == 0) {
    -        include = !inclusive;
    -      } else if (lgr.isDefaultLocalityGroup && lgr.columnFamilies == null) {
    -        // do not know what column families are in the default locality group,
    -        // only know what column families are not in it
    +    // if no column families specified, then include all groups unless !inclusive
    +    if (cfSet.size() == 0) {
    +      if (!inclusive) {
    +        groupsToUse = groups.groups;
    +      }
    +    } else {
    +      groupsToUse = new HashSet<LocalityGroup>();
     
    +      // do not know what column families are in the default locality group,
    +      // only know what column families are not in it
    +      if (groups.defaultGroup != null) {
             if (inclusive) {
    -          if (!nonDefaultColumnFamilies.containsAll(cfSet)) {
    +          if (!groups.groupByCf.keySet().containsAll(cfSet)) {
                 // default LG may contain wanted and unwanted column families
    -            include = true;
    +            groupsToUse.add(groups.defaultGroup);
               }// else - everything wanted is in other locality groups, so nothing to do
             } else {
    -          // must include, if all excluded column families are in other locality groups
    -          // then there are not unwanted column families in default LG
    -          include = true;
    +          // must include the default group as it may include cfs not in our cfSet
    +          groupsToUse.add(groups.defaultGroup);
    +        }
    +      }
    +
    +      /*
    +       * Need to consider the following cases for inclusive and exclusive (lgcf:locality group column family set, cf:column family set) lgcf and cf are disjoint
    +       * lgcf and cf are the same cf contains lgcf lgcf contains cf lgccf and cf intersect but neither is a subset of the other
    +       */
    +      if (!inclusive) {
    +        for (Entry<ByteSequence,LocalityGroup> entry : groups.groupByCf.entrySet()) {
    +          if (!cfSet.contains(entry.getKey())) {
    +            groupsToUse.add(entry.getValue());
    +          }
    +        }
    +      } else if (groups.groupByCf.size() <= cfSet.size()) {
    +        for (Entry<ByteSequence,LocalityGroup> entry : groups.groupByCf.entrySet()) {
    +          if (cfSet.contains(entry.getKey())) {
    +            groupsToUse.add(entry.getValue());
    +          }
             }
           } else {
    -        /*
    -         * Need to consider the following cases for inclusive and exclusive (lgcf:locality group column family set, cf:column family set) lgcf and cf are
    -         * disjoint lgcf and cf are the same cf contains lgcf lgcf contains cf lgccf and cf intersect but neither is a subset of the other
    -         */
    -
    -        for (Entry<ByteSequence,MutableLong> entry : lgr.columnFamilies.entrySet())
    -          if (entry.getValue().longValue() > 0)
    -            if (cfSet.contains(entry.getKey())) {
    -              if (inclusive)
    -                include = true;
    -            } else if (!inclusive) {
    -              include = true;
    -            }
    +        for (ByteSequence cf : cfSet) {
    +          LocalityGroup group = groups.groupByCf.get(cf);
    +          if (group != null) {
    +            groupsToUse.add(group);
    +          }
    +        }
           }
    +    }
     
    -      if (include) {
    -        lgr.getIterator().seek(range, EMPTY_CF_SET, false);
    -        hiter.addSource(lgr.getIterator());
    -        numLGSeeked++;
    -      }// every column family is excluded, zero count, or not present
    +    for (LocalityGroup lgr : groupsToUse) {
    +      lgr.getIterator().seek(range, EMPTY_CF_SET, false);
    +      hiter.addSource(lgr.getIterator());
    +      numLGSeeked++;
    +    }
    +
    +    if (used != null) {
    +      used.addAll(groupsToUse);
         }
     
         return numLGSeeked;
       }
     
       @Override
       public void seek(Range range, Collection<ByteSequence> columnFamilies, boolean inclusive) throws IOException {
    -    seek(this, groups, nonDefaultColumnFamilies, range, columnFamilies, inclusive);
    +    Set<ByteSequence> cfSet;
    +    if (columnFamilies.size() > 0)
    +      cfSet = new HashSet<>(columnFamilies);
    --- End diff --
    
    Always doing this copy is annoying. I wonder whether its cost is made up
    by the savings from checking against the last families. I suspect it may
    be, but I'm not sure. I think using the same set of columns for each seek
    is common, so it's nice to reuse the same decision. In the use case where
    seek is repeatedly called with a different set of families, this copy and
    the later equality check would be a lot of overhead with no gain. However,
    I think this is an exotic use case??
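
To make the trade-off concrete, here is a minimal sketch of the "remember the last families and reuse the decision" idea. It is not the code in this pull request: `SeekDecisionCache`, `groupsFor`, `chooseGroups`, and the use of `String` in place of `ByteSequence` are all hypothetical stand-ins chosen for illustration.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: class, field, and method names are illustrative only
// and are not taken from LocalityGroupIterator.
final class SeekDecisionCache {
  private Set<String> lastCfSet;   // families seen by the previous seek
  private boolean lastInclusive;   // inclusive flag of the previous seek
  private Set<String> lastGroups;  // decision computed for the previous seek
  int recomputations = 0;          // how often the decision had to be rebuilt

  Set<String> groupsFor(Collection<String> columnFamilies, boolean inclusive) {
    // The defensive copy under discussion: paid on every call.
    Set<String> cfSet = new HashSet<>(columnFamilies);
    // The equality check: reuse the previous decision when nothing changed.
    if (lastGroups != null && inclusive == lastInclusive && cfSet.equals(lastCfSet)) {
      return lastGroups;
    }
    recomputations++;
    lastCfSet = cfSet;
    lastInclusive = inclusive;
    lastGroups = chooseGroups(cfSet, inclusive);
    return lastGroups;
  }

  // Stand-in for the real group selection; it only tags each family.
  private static Set<String> chooseGroups(Set<String> cfSet, boolean inclusive) {
    Set<String> groups = new HashSet<>();
    for (String cf : cfSet) {
      groups.add((inclusive ? "use-" : "skip-") + cf);
    }
    return groups;
  }
}
```

With repeated identical seeks the copy and comparison buy a skipped recomputation; with a different family set on every call they are pure overhead, which is the exotic case mentioned above.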
    
    I like Guava's `cfSet = ImmutableSet.copyOf(columnFamilies)` because if
    the input is already an ImmutableSet it just returns it. This would be
    beneficial if code calling seek passed in an ImmutableSet: we could avoid
    the copy without worrying about correctness. Another nice thing about
    using an immutable set is that the equality check below will be super
    quick (because it's the same instance, it can check for reference
    equality). Comparing two different hash set instances that have the same
    content requires examining the content (I looked and it calls
    `containsAll()`).
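
The difference between the two equality paths can be observed with the JDK alone. The sketch below uses a hypothetical `CountingSet` helper (not part of Accumulo or Guava) that counts `contains` calls. It relies on `HashSet` inheriting `equals` from `AbstractSet`, which returns early when both references are the same object and otherwise falls through to `containsAll`.

```java
import java.util.Collection;
import java.util.HashSet;

// Hypothetical helper for illustration: a HashSet that counts contains() calls
// so the work done by equals() becomes visible.
final class CountingSet extends HashSet<String> {
  private static final long serialVersionUID = 1L;
  int containsCalls = 0;

  CountingSet(Collection<String> elements) {
    super(elements); // populates via add(), which does not call contains()
  }

  @Override
  public boolean contains(Object o) {
    containsCalls++;
    return super.contains(o);
  }
}
```

Comparing a `CountingSet` with itself leaves `containsCalls` at zero, while comparing it with a different `HashSet` holding the same elements calls `contains` once per element.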
    
    I am thinking we should change this to `cfSet =
    ImmutableSet.copyOf(columnFamilies)` and then have a follow-on issue to
    make the Accumulo code that constructs these sets in the tablet server use
    `ImmutableSet`.



