[ https://issues.apache.org/jira/browse/LUCENE-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Martijn van Groningen updated LUCENE-7304:
------------------------------------------
    Attachment: LUCENE-7304.patch

Updated the patch. Added more tests and cleaned up a bit.

To reiterate what this patch does: this query uses both an indexed field and a doc values field. The doc values field is used when {{DocIdSetIterator#advance(...)}} is invoked, to figure out what the first child of a parent is and then instruct the child iterator to advance to that first child. The doc values field serves roughly the same purpose as the {{BitSet}} does for {{ToParentBlockJoinQuery}}. The indexed field is used for normal forward advancing ({{DocIdSetIterator#nextDoc()}}).

I'm still unsure whether this query should also use a doc values field for forward advancing. Each child would then store the offset to the next child. The last child's offset would be zero, meaning the parent is the next document. I think the upside of using only doc values fields is that it becomes easier to validate that the docid block structure is correct.

> Doc values based block join implementation
> ------------------------------------------
>
>                 Key: LUCENE-7304
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7304
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Martijn van Groningen
>            Priority: Minor
>         Attachments: LUCENE-5092-20140313.patch, LUCENE-7304-20160531.patch,
> LUCENE-7304-20160606.patch, LUCENE_7304.patch, LUCENE_7304.patch,
> LUCENE-7304.patch, LUCENE-7304.patch
>
>
> At query time the block join relies on a bitset for finding the previous
> parent doc while advancing the doc id iterator. On large indices these
> bitsets can consume large amounts of JVM heap space. Also, because of how
> these bitsets are typically populated, the 'FixedBitSet' implementation is
> used.
> The idea I had was to replace the bitset usage with a numeric doc values
> field that stores offsets. Each child doc stores how many docids it is away
> from its parent doc, and each parent stores how many docids it is away from
> its first child. At query time this information can be used to perform the
> block join.
> I think another benefit of this approach is that external tools can now
> easily determine whether a doc is part of a block of documents, and perhaps
> this also helps index time sorting?
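
For illustration only, the offset scheme described in the issue (each child stores its distance to its parent, each parent stores its distance to its first child) could be sketched as below. This is not code from the patch: the class {{BlockOffsetsSketch}} and its methods are hypothetical names, a plain {{long[]}} stands in for the numeric doc values field, and the sketch assumes each block is laid out with the children first and the parent as the last document of the block.

```java
/** Hypothetical helper illustrating the offset scheme; not part of Lucene or the patch. */
final class BlockOffsetsSketch {

    private BlockOffsetsSketch() {}

    /**
     * Builds the per-document offsets for consecutive blocks.
     * Assumption: each block is laid out as child, child, ..., parent.
     *
     * @param childrenPerBlock number of children in each block, in index order
     * @return one offset per docid (the array stands in for a numeric doc values field)
     */
    static long[] encode(int... childrenPerBlock) {
        int total = 0;
        for (int n : childrenPerBlock) {
            total += n + 1; // n children plus one parent
        }
        long[] offsets = new long[total];
        int doc = 0;
        for (int n : childrenPerBlock) {
            int parent = doc + n;
            for (; doc < parent; doc++) {
                offsets[doc] = parent - doc; // child: distance forward to its parent
            }
            offsets[parent] = n; // parent: distance back to its first child
            doc = parent + 1;
        }
        return offsets;
    }

    /** Returns the parent docid of a child by adding the child's stored offset. */
    static int parentOf(long[] offsets, int childDoc) {
        return childDoc + (int) offsets[childDoc];
    }

    /** Returns the first child docid of a parent by subtracting the parent's stored offset. */
    static int firstChildOf(long[] offsets, int parentDoc) {
        return parentDoc - (int) offsets[parentDoc];
    }
}
```

With two blocks of 2 and 3 children, {{BlockOffsetsSketch.encode(2, 3)}} covers docids 0 to 6, where docids 2 and 6 are the parents; {{firstChildOf}} on docid 6 and {{parentOf}} on docid 4 then resolve within the second block without consulting a bitset. Note that the sketch does not say how a reader tells a parent offset from a child offset; the caller is assumed to know which kind of document it is looking at.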