[jira] [Updated] (LUCENE-7304) Doc values based block join implementation
[ https://issues.apache.org/jira/browse/LUCENE-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Martijn van Groningen updated LUCENE-7304:
------------------------------------------
    Attachment: LUCENE-7304.patch

Updated the patch. Added more tests and cleaned up a bit.

To reiterate what this patch does: the query uses both an indexed field and a doc values field. The doc values field is used when {{DocIdSetIterator#advance(...)}} is invoked, to figure out which document is the first child of a parent and then instruct the child iterator to advance to that first child. The doc values field serves roughly the same purpose as the {{BitSet}} does for {{ToParentBlockJoinQuery}}. The indexed field is used for normal forward advancing ({{DocIdSetIterator#nextDoc()}}).

I'm still unsure whether this query should also use a doc values field for forward advancing. Each child would then store the offset to the next child. The last child's offset would be zero, meaning the parent is the next document. I think the upside of using only doc values fields is that it is easier to validate that the docid block structure is correct.

> Doc values based block join implementation
> ------------------------------------------
>
>                 Key: LUCENE-7304
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7304
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Martijn van Groningen
>            Priority: Minor
>         Attachments: LUCENE-5092-20140313.patch, LUCENE-7304-20160531.patch, LUCENE-7304-20160606.patch, LUCENE_7304.patch, LUCENE_7304.patch, LUCENE-7304.patch, LUCENE-7304.patch
>
> At query time the block join relies on a bitset for finding the previous parent doc while advancing the doc id iterator. On large indices these bitsets can consume large amounts of JVM heap space. Also, typically due to the nature of how these bitsets are set, the 'FixedBitSet' implementation is used.
> The idea I had was to replace the bitset usage with a numeric doc values field that stores offsets. Each child doc stores how many docids it is from its parent doc and each parent stores how many docids it is apart from its first child. At query time this information can be used to perform the block join.
> I think another benefit of this approach is that external tools can now easily determine whether a doc is part of a block of documents, and perhaps this also helps index time sorting?

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
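The two uses of offsets discussed in the comment above (jumping from a parent to its first child, and the still-undecided idea of stepping from child to child) can be sketched with plain arrays standing in for the doc values fields. This is only an illustration under stated assumptions, not code from the patch: the class `OffsetBlockWalk`, the two separate arrays, the children-before-parent layout, and the convention that a parent without children stores 0 are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; plain arrays stand in for numeric doc values fields. */
final class OffsetBlockWalk {

    /**
     * Collects the children of parentDoc.
     * firstChildOffset[parent] = parent docid minus first child docid (0 = no children, assumed).
     * nextChildOffset[child]   = next child docid minus this child docid (0 = last child).
     */
    static List<Integer> childrenOf(int parentDoc, long[] firstChildOffset, long[] nextChildOffset) {
        List<Integer> children = new ArrayList<>();
        long toFirst = firstChildOffset[parentDoc];
        if (toFirst == 0) {
            return children; // assumed convention: a parent without children stores 0
        }
        int child = parentDoc - (int) toFirst; // the doc an advance(...) call would jump to
        while (true) {
            children.add(child);
            long toNext = nextChildOffset[child];
            if (toNext == 0) {
                break; // last child: the parent is the next document
            }
            child += (int) toNext; // forward step driven by the extra offset
        }
        return children;
    }

    public static void main(String[] args) {
        // Two blocks, children first: docs 0-2 with parent 3, docs 4-5 with parent 6.
        long[] firstChildOffset = {0, 0, 0, 3, 0, 0, 2};
        long[] nextChildOffset = {1, 1, 0, 0, 1, 0, 0};
        System.out.println(childrenOf(3, firstChildOffset, nextChildOffset));
        System.out.println(childrenOf(6, firstChildOffset, nextChildOffset));
    }
}
```

With every step expressed as an offset, a structural check reduces to following the chain and confirming that it ends directly before the parent, which may be what makes validation easier in the doc-values-only variant.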
[jira] [Updated] (LUCENE-7304) Doc values based block join implementation
[ https://issues.apache.org/jira/browse/LUCENE-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Martijn van Groningen updated LUCENE-7304:
------------------------------------------
    Attachment: LUCENE-7304.patch

It has been a while, but I had some time to get back to this. Updated the patch to account for all changes that have happened so far in master (iterator-based doc values API, two-phase query execution and score supplier).

I ran the same performance test as before and, due to doc values compression, the offset field now takes 337387 bytes instead of the previous 839592 bytes, which is good!

I'm still thinking about other ways of encoding the block of documents. Right now each parent document has a doc values field with the offset to its first child docid. Instead, child documents could have a doc values field with the offset to their parent docid. That way the parent doc can be indexed before the child docs.
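The alternative encoding mentioned above (children store the offset to their parent, with the parent indexed first) could look roughly like the following. This is a hedged sketch only: `ParentFirstOffsets`, the array representation, and the choice of 0 as the parent's own value are assumptions made for illustration and do not come from the patch.

```java
/** Hypothetical sketch of a parent-first block where each child stores its distance to the parent. */
final class ParentFirstOffsets {

    /** Offsets for one block laid out as: parent, child 1, ..., child n. The parent stores 0 (assumed). */
    static long[] offsetsForBlock(int numChildren) {
        long[] offsets = new long[numChildren + 1];
        for (int i = 1; i <= numChildren; i++) {
            offsets[i] = i; // child i sits i docids after its parent
        }
        return offsets;
    }

    /** Resolves the parent docid of any doc from the per-doc offsets. */
    static int parentOf(int doc, long[] offsetByDoc) {
        return doc - (int) offsetByDoc[doc];
    }

    public static void main(String[] args) {
        // Two blocks back to back: parent 0 with children 1-3, parent 4 with child 5.
        long[] offsetByDoc = {0, 1, 2, 3, 0, 1};
        System.out.println(parentOf(3, offsetByDoc));
        System.out.println(parentOf(5, offsetByDoc));
    }
}
```

In this layout a child's parent is found with a single subtraction, and the parent can be written before the number of children is known.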
[jira] [Updated] (LUCENE-7304) Doc values based block join implementation
[ https://issues.apache.org/jira/browse/LUCENE-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Martijn van Groningen updated LUCENE-7304:
------------------------------------------
    Attachment: LUCENE_7304.patch

Changed the block join query to only require that parent docs store how far away their first child doc is (in docids). This reduces the amount of information that has to be stored in the doc values offset field, and these parent offsets compress better than the previous offset values (which were composed of more information).

I tested this patch on a test data set (https://archive.org/download/stackexchange/english.stackexchange.com.7z). I extracted the questions, answers and comments and indexed each question with its answers and related comments as a hierarchical block of documents. In total 745252 docs were indexed. The size of the doc values offset field was 839592 bytes.

After that I ran a query that selects all questions that have answers with comments (questions -> answers -> comments) with both the current block join and the doc values block join. The current block join used 186768 bytes of JVM heap for bitsets and the doc values block join used 1132 bytes of JVM heap for references to the offset doc values field. So the doc values approach used roughly 4.5 times more RAM in total (assuming the OS caches the offset field), but its JVM memory footprint was roughly 165 times smaller.
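The two ratios quoted above follow directly from the reported byte counts. The snippet below simply redoes that arithmetic; the class and method names are hypothetical, and the numbers are the ones given in the comment (839592 bytes for the offset field, 1132 bytes of heap for the doc values join, 186768 bytes of heap for the bitsets).

```java
import java.util.Locale;

/** Recomputes the memory ratios from the byte counts reported in the comment. */
final class BlockJoinMemoryRatios {

    /** Total RAM of the doc values approach (field assumed cached by the OS, plus heap) over bitset heap. */
    static double totalRamRatio(long offsetFieldBytes, long docValuesHeapBytes, long bitsetHeapBytes) {
        return (double) (offsetFieldBytes + docValuesHeapBytes) / bitsetHeapBytes;
    }

    /** How many times smaller the JVM heap footprint of the doc values approach is. */
    static double heapReduction(long bitsetHeapBytes, long docValuesHeapBytes) {
        return (double) bitsetHeapBytes / docValuesHeapBytes;
    }

    public static void main(String[] args) {
        double total = totalRamRatio(839592L, 1132L, 186768L);
        double heap = heapReduction(186768L, 1132L);
        System.out.println(String.format(Locale.ROOT, "total RAM ratio: %.2f", total));
        System.out.println(String.format(Locale.ROOT, "heap reduction: %.2f", heap));
    }
}
```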
[jira] [Updated] (LUCENE-7304) Doc values based block join implementation
[ https://issues.apache.org/jira/browse/LUCENE-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paul Elschot updated LUCENE-7304:
---------------------------------
    Attachment: LUCENE-7304-20160606.patch

Patch of 6 June 2016.

This is the EliasFano code from LUCENE-5627 put into core. It has EliasFanoSequence implemented as EliasFanoBytes and as EliasFanoLongs, and an encoder and a decoder for these. The EliasFanoDocIdSet uses an EliasFanoLongs except when it is dense; in that case it uses a FixedBitSet.

I added a getBitSet() method to this EliasFanoDocIdSet. I also added the test cases from LUCENE-5627, but I did not add a test for the getBitSet() method yet. It works as a DocIdSet, so using it as a BitSet should be no problem.

EliasFanoDocIdSet could also be implemented on EliasFanoBytes, and it should be doable to put that in an index. In LUCENE-5627, EliasFanoBytes is already used as a Payload.
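For readers unfamiliar with the encoding, the following is a minimal textbook-style sketch of Elias-Fano coding of a sorted docid list. It is not the code from the patch or from LUCENE-5627: the class `EliasFanoSketch` is hypothetical, it uses `java.util.BitSet` for the high bits, and it keeps the low bits in an unpacked `int[]` for readability, whereas a real encoder would pack them.

```java
import java.util.Arrays;
import java.util.BitSet;

/** Hypothetical, simplified Elias-Fano encoder/decoder for sorted non-negative ints. */
final class EliasFanoSketch {
    private final int size;
    private final int numLowBits;
    private final int[] lowBits;   // low part of each value, unpacked for clarity
    private final BitSet highBits; // high parts, unary coded: bit (high + i) is set for value i

    EliasFanoSketch(int[] sortedValues, int upperBound) {
        size = sortedValues.length;
        int ratio = size == 0 ? 0 : upperBound / size;
        // floor(log2(upperBound / size)), or 0 when the list is dense
        numLowBits = ratio >= 1 ? 31 - Integer.numberOfLeadingZeros(ratio) : 0;
        int lowMask = (1 << numLowBits) - 1;
        lowBits = new int[size];
        highBits = new BitSet();
        for (int i = 0; i < size; i++) {
            int value = sortedValues[i];
            lowBits[i] = value & lowMask;
            highBits.set((value >>> numLowBits) + i);
        }
    }

    int numLowBits() {
        return numLowBits;
    }

    /** Rebuilds the original values by pairing the i-th set high bit with the i-th low part. */
    int[] decode() {
        int[] out = new int[size];
        int i = 0;
        for (int pos = highBits.nextSetBit(0); pos >= 0; pos = highBits.nextSetBit(pos + 1)) {
            out[i] = ((pos - i) << numLowBits) | lowBits[i];
            i++;
        }
        return out;
    }

    public static void main(String[] args) {
        int[] docIds = {3, 7, 8, 20, 41};
        EliasFanoSketch encoded = new EliasFanoSketch(docIds, 41);
        System.out.println(Arrays.toString(encoded.decode()));
    }
}
```

When the values are dense, the number of low bits drops to zero and the structure degenerates into a plain bit vector, which is consistent with falling back to a FixedBitSet in that case.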
[jira] [Updated] (LUCENE-7304) Doc values based block join implementation
[ https://issues.apache.org/jira/browse/LUCENE-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paul Elschot updated LUCENE-7304:
---------------------------------
    Attachment: LUCENE-7304-20160531.patch

Patch of 31 May 2016.

Adds DocBlocksIterator and uses it in ToChildBlockJoinQuery only. This is mostly an update of LUCENE-5092 to today, except that it does not include the ToParentBlockJoinQuery yet. To my surprise this compiles, but I did not run the tests in the join module.

This is only meant to show a possible direction; BitSetProducer in the join queries may also need to be replaced by a DocBlocksIteratorProducer.
[jira] [Updated] (LUCENE-7304) Doc values based block join implementation
[ https://issues.apache.org/jira/browse/LUCENE-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paul Elschot updated LUCENE-7304:
---------------------------------
    Attachment: LUCENE-5092-20140313.patch

Patch for LUCENE-5092 against trunk of 13 March 2014. Among other things, this adds the method advanceToJustBefore() to EliasFanoDocIdSet.
[jira] [Updated] (LUCENE-7304) Doc values based block join implementation
[ https://issues.apache.org/jira/browse/LUCENE-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Martijn van Groningen updated LUCENE-7304:
------------------------------------------
    Attachment: LUCENE_7304.patch

Attached a working version of a doc values based block join query. The application storing the docs is responsible for adding the numeric doc values field with the right offsets.
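As a rough illustration of what "the right offsets" could mean under the scheme in the issue description (each child stores its distance to the parent, each parent stores its distance to the first child), here is a small sketch. It assumes the usual block join layout with the children first and the parent as the last document of the block; the class `ChildrenFirstOffsets` and the array representation are hypothetical and not taken from the patch.

```java
import java.util.Arrays;

/** Hypothetical sketch: offsets an application might compute for one block before indexing it. */
final class ChildrenFirstOffsets {

    /**
     * Offsets for a block laid out as: child 0, ..., child n-1, parent.
     * Child i stores its distance to the parent; the parent stores its distance to the first child.
     */
    static long[] offsetsForBlock(int numChildren) {
        long[] offsets = new long[numChildren + 1];
        for (int i = 0; i < numChildren; i++) {
            offsets[i] = numChildren - i; // docids between this child and its parent
        }
        offsets[numChildren] = numChildren; // docids between the parent and its first child
        return offsets;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(offsetsForBlock(3)));
    }
}
```

Each value would then be attached to the corresponding document as a numeric doc values field before the block is added to the index.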