gsmiller commented on a change in pull request #509: URL: https://github.com/apache/lucene/pull/509#discussion_r774170371
########## File path: lucene/facet/src/java/org/apache/lucene/facet/FacetsConfig.java ########## @@ -471,19 +471,24 @@ private void indexDrillDownTerms( private void processSSDVFacetFields( Map<String, List<SortedSetDocValuesFacetField>> byField, Document doc) { + Set<String> addedPaths = new HashSet<>(); Review comment: So you shouldn't need to actually de-dupe. `SortedSetDocValues` will do this for you (i.e., it will only add each unique value one to each document when it goes to actually index it, even if you've added it multiple times). I'd suggest delegating to SSDV to do this de-duping and just keep this code simple. ########## File path: lucene/facet/src/java/org/apache/lucene/facet/sortedset/DefaultSortedSetDocValuesReaderState.java ########## @@ -79,37 +86,78 @@ public DefaultSortedSetDocValuesReaderState(IndexReader reader, String field) th } valueCount = (int) dv.getValueCount(); - // TODO: we can make this more efficient if eg we can be - // "involved" when OrdinalMap is being created? Ie see - // each term/ord it's assigning as it goes... - String lastDim = null; - int startOrd = -1; + // keeps track of last path with length i where the path has an unfilled sibling + Map<Integer, OrdAndComponent> pathsWithUnfilledSiblings = new HashMap<>(); + + BytesRef nextTerm = null; + String[] nextComponents = null; - // TODO: this approach can work for full hierarchy?; - // TaxoReader can't do this since ords are not in - // "sorted order" ... but we should generalize this to - // support arbitrary hierarchy: for (int ord = 0; ord < valueCount; ord++) { - final BytesRef term = dv.lookupOrd(ord); - String[] components = FacetsConfig.stringToPath(term.utf8ToString()); - if (components.length != 2) { - throw new IllegalArgumentException( - "this class can only handle 2 level hierarchy (dim/value); got: " - + Arrays.toString(components) - + " " - + term.utf8ToString()); + if (children == null) { + children = new FixedBitSet(valueCount); + siblings = new int[valueCount]; + } + + String[] components; + + if (nextTerm == null) { + BytesRef term = dv.lookupOrd(ord); + components = FacetsConfig.stringToPath(term.utf8ToString()); + } else { + components = nextComponents; + } + + if (components.length == 1) { + // current ord is a dim + if (dims == null) { + dims = new ArrayList<>(); + } + dims.add(new DimAndOrd(components[0], ord)); } - if (!components[0].equals(lastDim)) { - if (lastDim != null) { - prefixToOrdRange.put(lastDim, new OrdRange(startOrd, ord - 1)); + + if (pathsWithUnfilledSiblings.containsKey(components.length - 1)) { Review comment: I think you could solve this with a stack instead of a set. Each time the path length increases, you could put an entry on the stack containing the last path at that length. Then whenever the path length decreases from one ordinal to the next, you could pop the stack and examine the entries to see if you need to "connect" the siblings. Would that make sense? I might sketch out a suggested algorithm if I have some. Maybe this approach is easier to read but a stack is immediately what my brain jumped to when I thought about this problem. Thoughts? ########## File path: lucene/facet/src/java/org/apache/lucene/facet/sortedset/DefaultSortedSetDocValuesReaderState.java ########## @@ -79,37 +86,78 @@ public DefaultSortedSetDocValuesReaderState(IndexReader reader, String field) th } valueCount = (int) dv.getValueCount(); - // TODO: we can make this more efficient if eg we can be - // "involved" when OrdinalMap is being created? Ie see - // each term/ord it's assigning as it goes... - String lastDim = null; - int startOrd = -1; + // keeps track of last path with length i where the path has an unfilled sibling + Map<Integer, OrdAndComponent> pathsWithUnfilledSiblings = new HashMap<>(); + + BytesRef nextTerm = null; + String[] nextComponents = null; - // TODO: this approach can work for full hierarchy?; - // TaxoReader can't do this since ords are not in - // "sorted order" ... but we should generalize this to - // support arbitrary hierarchy: for (int ord = 0; ord < valueCount; ord++) { - final BytesRef term = dv.lookupOrd(ord); - String[] components = FacetsConfig.stringToPath(term.utf8ToString()); - if (components.length != 2) { - throw new IllegalArgumentException( - "this class can only handle 2 level hierarchy (dim/value); got: " - + Arrays.toString(components) - + " " - + term.utf8ToString()); + if (children == null) { + children = new FixedBitSet(valueCount); + siblings = new int[valueCount]; + } + + String[] components; + + if (nextTerm == null) { + BytesRef term = dv.lookupOrd(ord); + components = FacetsConfig.stringToPath(term.utf8ToString()); + } else { + components = nextComponents; + } + + if (components.length == 1) { + // current ord is a dim + if (dims == null) { + dims = new ArrayList<>(); + } + dims.add(new DimAndOrd(components[0], ord)); } - if (!components[0].equals(lastDim)) { - if (lastDim != null) { - prefixToOrdRange.put(lastDim, new OrdRange(startOrd, ord - 1)); + + if (pathsWithUnfilledSiblings.containsKey(components.length - 1)) { + OrdAndComponent possibleSibling = pathsWithUnfilledSiblings.get(components.length - 1); + boolean isSibling = true; + for (int i = 0; i < possibleSibling.component.length - 1; i++) { + if (!possibleSibling.component[i].equals(components[i])) { + isSibling = false; + break; + } + } + if (isSibling) { + siblings[possibleSibling.ord] = ord; + } else { + siblings[possibleSibling.ord] = INVALID_ORDINAL; } - startOrd = ord; - lastDim = components[0]; + pathsWithUnfilledSiblings.remove(components.length - 1); + } + + if (ord + 1 == valueCount) { + // last value, cannot have children or links to more siblings + siblings[ord] = INVALID_ORDINAL; + break; + } + + nextTerm = dv.lookupOrd(ord + 1); + nextComponents = FacetsConfig.stringToPath(nextTerm.utf8ToString()); + + if (components.length < nextComponents.length) { + // next ord must be a direct child of current ord + children.set(ord); Review comment: Oh, I like this! I was originally thinking you'd need an `int[]` to track the first child for each ordinal (much like taxonomy faceting does) but it's a smart observation that the first child will always have to be one value greater than the parent ordinal for the SSDV case. Nice way to save some memory! Very cool :) ########## File path: lucene/facet/src/java/org/apache/lucene/facet/FacetsConfig.java ########## @@ -471,19 +471,24 @@ private void indexDrillDownTerms( private void processSSDVFacetFields( Map<String, List<SortedSetDocValuesFacetField>> byField, Document doc) { + Set<String> addedPaths = new HashSet<>(); + for (Map.Entry<String, List<SortedSetDocValuesFacetField>> ent : byField.entrySet()) { String indexFieldName = ent.getKey(); for (SortedSetDocValuesFacetField facetField : ent.getValue()) { - FacetLabel facetLabel = new FacetLabel(facetField.dim, facetField.label); - String fullPath = pathToString(facetLabel.components, facetLabel.length); - - // For facet counts: - doc.add(new SortedSetDocValuesField(indexFieldName, new BytesRef(fullPath))); - - // For drill-down: - indexDrillDownTerms(doc, indexFieldName, getDimConfig(facetField.dim), facetLabel); + FacetLabel facetLabel = new FacetLabel(facetField.dim, facetField.path); + for (int i = 0; i < facetLabel.length; i++) { Review comment: Is the only reason for indexing all the ancestor paths to make sure counts for multi-valued docs work correctly? ########## File path: lucene/facet/src/java/org/apache/lucene/facet/sortedset/DefaultSortedSetDocValuesReaderState.java ########## @@ -79,37 +86,78 @@ public DefaultSortedSetDocValuesReaderState(IndexReader reader, String field) th } valueCount = (int) dv.getValueCount(); - // TODO: we can make this more efficient if eg we can be - // "involved" when OrdinalMap is being created? Ie see - // each term/ord it's assigning as it goes... - String lastDim = null; - int startOrd = -1; + // keeps track of last path with length i where the path has an unfilled sibling + Map<Integer, OrdAndComponent> pathsWithUnfilledSiblings = new HashMap<>(); + + BytesRef nextTerm = null; + String[] nextComponents = null; - // TODO: this approach can work for full hierarchy?; - // TaxoReader can't do this since ords are not in - // "sorted order" ... but we should generalize this to - // support arbitrary hierarchy: for (int ord = 0; ord < valueCount; ord++) { - final BytesRef term = dv.lookupOrd(ord); - String[] components = FacetsConfig.stringToPath(term.utf8ToString()); - if (components.length != 2) { - throw new IllegalArgumentException( - "this class can only handle 2 level hierarchy (dim/value); got: " - + Arrays.toString(components) - + " " - + term.utf8ToString()); + if (children == null) { + children = new FixedBitSet(valueCount); + siblings = new int[valueCount]; + } + + String[] components; + + if (nextTerm == null) { + BytesRef term = dv.lookupOrd(ord); + components = FacetsConfig.stringToPath(term.utf8ToString()); + } else { + components = nextComponents; + } + + if (components.length == 1) { + // current ord is a dim + if (dims == null) { + dims = new ArrayList<>(); + } + dims.add(new DimAndOrd(components[0], ord)); } - if (!components[0].equals(lastDim)) { - if (lastDim != null) { - prefixToOrdRange.put(lastDim, new OrdRange(startOrd, ord - 1)); + + if (pathsWithUnfilledSiblings.containsKey(components.length - 1)) { + OrdAndComponent possibleSibling = pathsWithUnfilledSiblings.get(components.length - 1); + boolean isSibling = true; + for (int i = 0; i < possibleSibling.component.length - 1; i++) { + if (!possibleSibling.component[i].equals(components[i])) { + isSibling = false; + break; + } + } + if (isSibling) { + siblings[possibleSibling.ord] = ord; + } else { + siblings[possibleSibling.ord] = INVALID_ORDINAL; } - startOrd = ord; - lastDim = components[0]; + pathsWithUnfilledSiblings.remove(components.length - 1); + } + + if (ord + 1 == valueCount) { + // last value, cannot have children or links to more siblings + siblings[ord] = INVALID_ORDINAL; + break; + } + + nextTerm = dv.lookupOrd(ord + 1); + nextComponents = FacetsConfig.stringToPath(nextTerm.utf8ToString()); + + if (components.length < nextComponents.length) { + // next ord must be a direct child of current ord Review comment: This is only true because all ancestry paths are explicitly indexed right? Otherwise it could be possible that the next value is actually a "grandchild" (or even further descended)? Might be worth mentioning this in the comments. ########## File path: lucene/facet/src/java/org/apache/lucene/facet/sortedset/DefaultSortedSetDocValuesReaderState.java ########## @@ -46,13 +48,18 @@ private final String field; private final int valueCount; + // denotes whether next ord is child or not + private FixedBitSet children; Review comment: This comment and variable name were a bit confusing to me. From looking through the code, it seems like this bitset is really storing whether-or-not a given ordinal has at least one child right? So a set bit means "yes, there is at least one child" and then it's implied that the first child is the next ordinal? To get the rest of the children, you have to start traversing the siblings? Maybe a different name like `hasChildren` might be helpful along with a comment that's a little more descriptive? ########## File path: lucene/facet/src/java/org/apache/lucene/facet/sortedset/DefaultSortedSetDocValuesReaderState.java ########## @@ -79,37 +86,78 @@ public DefaultSortedSetDocValuesReaderState(IndexReader reader, String field) th } valueCount = (int) dv.getValueCount(); - // TODO: we can make this more efficient if eg we can be - // "involved" when OrdinalMap is being created? Ie see - // each term/ord it's assigning as it goes... - String lastDim = null; - int startOrd = -1; + // keeps track of last path with length i where the path has an unfilled sibling + Map<Integer, OrdAndComponent> pathsWithUnfilledSiblings = new HashMap<>(); + + BytesRef nextTerm = null; + String[] nextComponents = null; - // TODO: this approach can work for full hierarchy?; - // TaxoReader can't do this since ords are not in - // "sorted order" ... but we should generalize this to - // support arbitrary hierarchy: for (int ord = 0; ord < valueCount; ord++) { - final BytesRef term = dv.lookupOrd(ord); - String[] components = FacetsConfig.stringToPath(term.utf8ToString()); - if (components.length != 2) { - throw new IllegalArgumentException( - "this class can only handle 2 level hierarchy (dim/value); got: " - + Arrays.toString(components) - + " " - + term.utf8ToString()); + if (children == null) { + children = new FixedBitSet(valueCount); + siblings = new int[valueCount]; + } + + String[] components; + + if (nextTerm == null) { + BytesRef term = dv.lookupOrd(ord); + components = FacetsConfig.stringToPath(term.utf8ToString()); + } else { + components = nextComponents; + } + + if (components.length == 1) { + // current ord is a dim + if (dims == null) { + dims = new ArrayList<>(); + } + dims.add(new DimAndOrd(components[0], ord)); } - if (!components[0].equals(lastDim)) { - if (lastDim != null) { - prefixToOrdRange.put(lastDim, new OrdRange(startOrd, ord - 1)); + + if (pathsWithUnfilledSiblings.containsKey(components.length - 1)) { + OrdAndComponent possibleSibling = pathsWithUnfilledSiblings.get(components.length - 1); + boolean isSibling = true; + for (int i = 0; i < possibleSibling.component.length - 1; i++) { + if (!possibleSibling.component[i].equals(components[i])) { Review comment: We generally use `== false` instead of `!` in this codebase to improve readability. So in this case, it would be good to change to `possibleSibling.component[i].equals(components[i]) == false` ########## File path: lucene/facet/src/java/org/apache/lucene/facet/sortedset/SortedSetDocValuesReaderState.java ########## @@ -37,19 +38,102 @@ public abstract class SortedSetDocValuesReaderState implements Accountable { /** - * Holds start/end range of ords, which maps to one dimension (someday we may generalize it to map - * to hierarchies within one dimension). + * Class to store ranges of facet label as well as references to children of that label, key'd by + * label string */ - public static final class OrdRange { + public static final class HierarchicalOrdRange { Review comment: As I mentioned elsewhere, I'm not convinced we need to maintain this tree structure, allocating all these separate maps, requiring them to be ordered, etc. Can we not maintain a single map of "path" -> ordinal range? As far as I can tell, when we need to look up the ordinal range for a value we already have a full path and should just be able to do a simple key/value lookup in a flat map to obtain the range for that path right? I think a single map without all this tree structure might be simpler and less costly w.r.t. allocating a bunch of maps, etc. What do you think? Am I maybe overlooking some fundamental reason we need to maintain this tree? ########## File path: lucene/facet/src/java/org/apache/lucene/facet/sortedset/DefaultSortedSetDocValuesReaderState.java ########## @@ -222,4 +258,59 @@ public IndexReader getReader() { public int getSize() { return valueCount; } + + @Override + public Iterable<Integer> childOrds(int pathOrd) { Review comment: What about `PrimitiveIterator.OfInt` here instead? It would be nice to avoid boxing all those ints. ########## File path: lucene/facet/src/java/org/apache/lucene/facet/sortedset/SortedSetDocValuesFacetCounts.java ########## @@ -95,29 +92,31 @@ public FacetResult getTopChildren(int topN, String dim, String... path) throws I if (topN <= 0) { throw new IllegalArgumentException("topN must be > 0 (got: " + topN + ")"); } - if (path.length > 0) { - throw new IllegalArgumentException("path should be 0 length"); - } - OrdRange ordRange = state.getOrdRange(dim); - if (ordRange == null) { - return null; // means dimension was never indexed + + int pathOrd = (int) dv.lookupTerm(new BytesRef(FacetsConfig.pathToString(dim, path))); + if (pathOrd == -1) { + // path was never indexed + return null; } - return getDim(dim, ordRange, topN); + + Iterable<Integer> childOrds = state.childOrds(pathOrd); + + return getDim(dim, path, pathOrd, childOrds, topN); } - private FacetResult getDim(String dim, OrdRange ordRange, int topN) throws IOException { + private FacetResult getDim( + String dim, String[] path, int pathOrd, Iterable<Integer> childOrds, int topN) + throws IOException { TopOrdAndIntQueue q = null; int bottomCount = 0; - int dimCount = 0; Review comment: I like that this also fixes the bug tracked in LUCENE-9952. If you end up needing to plumb the facet configuration in here for back-compat (I think you might), then it would be good to only do this in cases where it's necessary. Taxo faceting is a good example of this. The code is a little burdensome due to having slightly different aggregation implementations depending on how the dim is configured, but it lets the index stay as small as possible. So in this particular case, if a user has single-valued dims that aren't hierarchical, we hopefully won't need to index the dims explicitly and can continue to aggregate them with this logic as before. ########## File path: lucene/facet/src/java/org/apache/lucene/facet/sortedset/DefaultSortedSetDocValuesReaderState.java ########## @@ -222,4 +258,59 @@ public IndexReader getReader() { public int getSize() { return valueCount; } + + @Override + public Iterable<Integer> childOrds(int pathOrd) { + return () -> + new Iterator<>() { + + boolean atStart = true; + int currentOrd = pathOrd; + + @Override + public boolean hasNext() { + if (atStart) { + if (currentOrd < 0 || currentOrd >= children.length()) { + return false; + } + return children.get(currentOrd); + } else { + return siblings[currentOrd] != INVALID_ORDINAL; + } + } + + @Override + public Integer next() { + if (atStart) { + if (currentOrd < 0 || currentOrd >= children.length()) { + return INVALID_ORDINAL; + } + atStart = false; + if (children.get(currentOrd)) { + return ++currentOrd; Review comment: A lot of people find this notation a bit difficult to read. We usually do something like: ``` currentOrd++; return currentOrd; ``` ########## File path: lucene/facet/src/java/org/apache/lucene/facet/sortedset/SortedSetDocValuesFacetCounts.java ########## @@ -317,26 +317,23 @@ public Number getSpecificValue(String dim, String... path) throws IOException { public List<FacetResult> getAllDims(int topN) throws IOException { List<FacetResult> results = new ArrayList<>(); - for (Map.Entry<String, OrdRange> ent : state.getPrefixToOrdRange().entrySet()) { - FacetResult fr = getDim(ent.getKey(), ent.getValue(), topN); + for (DimAndOrd dim : state.getDims()) { + Iterable<Integer> childOrds = state.childOrds(dim.ord); + FacetResult fr = getDim(dim.dim, new String[0], dim.ord, childOrds, topN); if (fr != null) { results.add(fr); } } // Sort by highest count: - Collections.sort( - results, - new Comparator<FacetResult>() { - @Override - public int compare(FacetResult a, FacetResult b) { - if (a.value.intValue() > b.value.intValue()) { - return -1; - } else if (b.value.intValue() > a.value.intValue()) { - return 1; - } else { - return a.dim.compareTo(b.dim); - } + results.sort( + (a, b) -> { Review comment: I've seen past feedback to generally avoid lambdas in favor of anonymous classes. I think @rmuir has brought this up in the past. I'm no expert on lambdas but I think there's a class generation cost that ends up getting paid at runtime? Maybe Robert can chime in with thoughts on lambdas, but I tend to avoid them in this codebase. ########## File path: lucene/facet/src/java/org/apache/lucene/facet/sortedset/SortedSetDocValuesReaderState.java ########## @@ -36,23 +35,21 @@ */ public abstract class SortedSetDocValuesReaderState implements Accountable { - /** - * Holds start/end range of ords, which maps to one dimension (someday we may generalize it to map - * to hierarchies within one dimension). - */ - public static final class OrdRange { - /** Start of range, inclusive: */ - public final int start; - /** End of range, inclusive: */ - public final int end; + /** Holder class for a dimension along with it's corresponding ordinal */ + public static class DimAndOrd { Review comment: I'd make this `final` ########## File path: lucene/facet/src/java/org/apache/lucene/facet/sortedset/DefaultSortedSetDocValuesReaderState.java ########## @@ -46,13 +48,18 @@ private final String field; private final int valueCount; + // denotes whether next ord is child or not + private FixedBitSet children; + // maps an ord to its first sibling + private int[] siblings; Review comment: This could get fairly expensive with high-cardinality dimensions. I'd at least add a note/TODO here to see if there's a more efficient way to go about doing this to avoid the memory consumption. Also, I know you're still thinking about backwards-compatibility, but it would be really nice to not incur this memory overhead for non-hierarchical cases to avoid existing users all the sudden seeing this nasty increase in memory usage with SSDV faceting. ########## File path: lucene/facet/src/java/org/apache/lucene/facet/sortedset/DefaultSortedSetDocValuesReaderState.java ########## @@ -46,13 +48,18 @@ private final String field; private final int valueCount; + // denotes whether next ord is child or not + private FixedBitSet children; Review comment: I'd be in favor of initializing `children`, `siblings` and `dims` up front and making them `final` instead of lazily initializing them below. The only case where you're saving initialization is if there are no actual values in the SSDV field. I suspect that can't even happen, but if it can, seems incredibly rare (and useless?). ########## File path: lucene/facet/src/java/org/apache/lucene/facet/sortedset/SortedSetDocValuesReaderState.java ########## @@ -36,23 +35,21 @@ */ public abstract class SortedSetDocValuesReaderState implements Accountable { - /** - * Holds start/end range of ords, which maps to one dimension (someday we may generalize it to map - * to hierarchies within one dimension). - */ - public static final class OrdRange { - /** Start of range, inclusive: */ - public final int start; - /** End of range, inclusive: */ - public final int end; + /** Holder class for a dimension along with it's corresponding ordinal */ + public static class DimAndOrd { + String dim; + int ord; - /** Start and end are inclusive. */ - public OrdRange(int start, int end) { - this.start = start; - this.end = end; + /** sole constructor */ + public DimAndOrd(String dim, int ord) { + this.dim = dim; + this.ord = ord; } } + /** Invalid ordinal const */ + public static int INVALID_ORDINAL = -1; Review comment: What about reusing `SortedSetDocValues#NO_MORE_ORDS` for this purpose instead of defining a new constant? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org