[jira] [Updated] (LUCENE-7304) Doc values based block join implementation

2017-06-22 Thread Martijn van Groningen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martijn van Groningen updated LUCENE-7304:
--
Attachment: LUCENE-7304.patch

Updated the patch. Added a more tests and cleaned up a bit.

To re-iterate what this patch does, this query uses both an indexed field and a 
doc values field. The doc values field is used when 
{{DocIdSetIterator#advance(...)}} is invoked to figure out what the first child 
is of a parent and then instruct the child iterator to advance to that first 
child. The doc values field has kind of the same purpose what the {{BitSet}} 
does for {{ToParentBlockJoinQuery}} query. The indexed field is used for normal 
forward advancing ({{DocIdSetIterator#nextDoc()}}).

I'm still unsure if this query should also use a doc values field for forward 
advancing. Each child would then store the offset to the next child. The last 
child's offset would be zero, meaning the parent is the next document. I think 
the upside with only using doc values fields is that validating that the docid 
block structure is correct is easier.

> Doc values based block join implementation
> --
>
> Key: LUCENE-7304
> URL: https://issues.apache.org/jira/browse/LUCENE-7304
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Martijn van Groningen
>Priority: Minor
> Attachments: LUCENE-5092-20140313.patch, LUCENE-7304-20160531.patch, 
> LUCENE-7304-20160606.patch, LUCENE_7304.patch, LUCENE_7304.patch, 
> LUCENE-7304.patch, LUCENE-7304.patch
>
>
> At query time the block join relies on a bitset for finding the previous 
> parent doc during advancing the doc id iterator. On large indices these 
> bitsets can consume large amounts of jvm heap space.  Also typically due the 
> nature how these bitsets are set, the 'FixedBitSet' implementation is used.
> The idea I had was to replace the bitset usage by a numeric doc values field 
> that stores offsets. Each child doc stores how many docids it is from its 
> parent doc and each parent stores how many docids it is apart from its first 
> child. At query time this information can be used to perform the block join.
> I think another benefit of this approach is that external tools can now 
> easily determine if a doc is part of a block of documents and perhaps this 
> also helps index time sorting?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-7304) Doc values based block join implementation

2017-05-24 Thread Martijn van Groningen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martijn van Groningen updated LUCENE-7304:
--
Attachment: LUCENE-7304.patch

It has been a while, but I had some time to get back to this. Updated patch to 
all changes that have happened so far in master (iterator based doc values api, 
two phase query execution and score supplier).

I ran the same performance test as before and due to doc values compression, 
the offset field now takes 337387 bytes instead of 839592 bytes before, which 
is good!

I'm still thinking about other ways of encoding the block of documents. Right 
now the parent document have a doc values field with the offset to the first 
child docid. Instead child documents can have a doc values field with the 
offset to its parent docid. That way parent doc can be indexed first before the 
child docs.



> Doc values based block join implementation
> --
>
> Key: LUCENE-7304
> URL: https://issues.apache.org/jira/browse/LUCENE-7304
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Martijn van Groningen
>Priority: Minor
> Attachments: LUCENE-5092-20140313.patch, LUCENE-7304-20160531.patch, 
> LUCENE-7304-20160606.patch, LUCENE_7304.patch, LUCENE_7304.patch, 
> LUCENE-7304.patch
>
>
> At query time the block join relies on a bitset for finding the previous 
> parent doc during advancing the doc id iterator. On large indices these 
> bitsets can consume large amounts of jvm heap space.  Also typically due the 
> nature how these bitsets are set, the 'FixedBitSet' implementation is used.
> The idea I had was to replace the bitset usage by a numeric doc values field 
> that stores offsets. Each child doc stores how many docids it is from its 
> parent doc and each parent stores how many docids it is apart from its first 
> child. At query time this information can be used to perform the block join.
> I think another benefit of this approach is that external tools can now 
> easily determine if a doc is part of a block of documents and perhaps this 
> also helps index time sorting?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-7304) Doc values based block join implementation

2016-06-07 Thread Martijn van Groningen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martijn van Groningen updated LUCENE-7304:
--
Attachment: LUCENE_7304.patch

Changed the block join query to only require that parent docs store how far 
away there first child doc is (in docids).

The reduces the amount of information required to be stored in the doc values 
offset field and these offsets for the parents compress better the offset 
values before (which was composed out of more information).

I tested this patch out on a test data set 
(https://archive.org/download/stackexchange/english.stackexchange.com.7z). I 
extracted the questions, answers and comment and indexed each question with its 
answers and related comments as a hierarchical block of documents. In total 
745252 docs were indexed. The size of the doc values offset field was 839592 
bytes. 

After that I ran a query that selects all questions that have answers with 
comments (questions -> answers -> comments) for both the current block join and 
doc value block join. The the block join used 186768 bytes of jvm heap for 
bitsets and the doc values block join used 1132 bytes of jvm heap for 
references to the offset doc values field. 

So with the doc values approach, in total used roughly 4.5 times more RAM 
(assuming OS caches offset field), and the jvm memory footprint was roughly 165 
times smaller. 

> Doc values based block join implementation
> --
>
> Key: LUCENE-7304
> URL: https://issues.apache.org/jira/browse/LUCENE-7304
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Martijn van Groningen
>Priority: Minor
> Attachments: LUCENE-5092-20140313.patch, LUCENE-7304-20160531.patch, 
> LUCENE-7304-20160606.patch, LUCENE_7304.patch, LUCENE_7304.patch
>
>
> At query time the block join relies on a bitset for finding the previous 
> parent doc during advancing the doc id iterator. On large indices these 
> bitsets can consume large amounts of jvm heap space.  Also typically due the 
> nature how these bitsets are set, the 'FixedBitSet' implementation is used.
> The idea I had was to replace the bitset usage by a numeric doc values field 
> that stores offsets. Each child doc stores how many docids it is from its 
> parent doc and each parent stores how many docids it is apart from its first 
> child. At query time this information can be used to perform the block join.
> I think another benefit of this approach is that external tools can now 
> easily determine if a doc is part of a block of documents and perhaps this 
> also helps index time sorting?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-7304) Doc values based block join implementation

2016-06-06 Thread Paul Elschot (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Elschot updated LUCENE-7304:
-
Attachment: LUCENE-7304-20160606.patch

Patch of 6 June 2016.
This is the EliasFano code from  LUCENE-5627 put into core.

This has EliasFanoSequence implemented as EliasFanoBytes and as EliasFanoLongs, 
and an encoder and a decoder for these.

The EliasFanoDocIdSet uses an EliasFanoLongs except when it is dense, in that 
case it uses a FixedBitSet.

I added a getBitSet() method in this EliasFanoDocIdSet.

I also added the test cases from LUCENE-5627, but I did not add a test for the 
getBitSet() method yet. It works as a DocIdSet, so as a BitSet should be no 
problem.

EliasFanoDocIdSet could also be implemented on EliasFanoBytes, and it should be 
doable to put that in an index. At LUCENE-5627 EliasFanoBytes is used as a 
Payload already.


> Doc values based block join implementation
> --
>
> Key: LUCENE-7304
> URL: https://issues.apache.org/jira/browse/LUCENE-7304
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Martijn van Groningen
>Priority: Minor
> Attachments: LUCENE-5092-20140313.patch, LUCENE-7304-20160531.patch, 
> LUCENE-7304-20160606.patch, LUCENE_7304.patch
>
>
> At query time the block join relies on a bitset for finding the previous 
> parent doc during advancing the doc id iterator. On large indices these 
> bitsets can consume large amounts of jvm heap space.  Also typically due the 
> nature how these bitsets are set, the 'FixedBitSet' implementation is used.
> The idea I had was to replace the bitset usage by a numeric doc values field 
> that stores offsets. Each child doc stores how many docids it is from its 
> parent doc and each parent stores how many docids it is apart from its first 
> child. At query time this information can be used to perform the block join.
> I think another benefit of this approach is that external tools can now 
> easily determine if a doc is part of a block of documents and perhaps this 
> also helps index time sorting?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-7304) Doc values based block join implementation

2016-05-31 Thread Paul Elschot (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Elschot updated LUCENE-7304:
-
Attachment: LUCENE-7304-20160531.patch

Patch of 31 May 2016.
Adds DocBlocksIterator and uses it in ToChildBlockJoinQuery only.

This is mostly an update of LUCENE-5092 to today, except that it does not 
include the ToParentBlockJoinQuery yet.

To my surprise this compiles, but I did not run the tests in the join module.

This is only to show a possible direction, BitSetProducer in the join queries 
may also need to be replaced by a DocBlocksIteratorProducer.


> Doc values based block join implementation
> --
>
> Key: LUCENE-7304
> URL: https://issues.apache.org/jira/browse/LUCENE-7304
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Martijn van Groningen
>Priority: Minor
> Attachments: LUCENE-5092-20140313.patch, LUCENE-7304-20160531.patch, 
> LUCENE_7304.patch
>
>
> At query time the block join relies on a bitset for finding the previous 
> parent doc during advancing the doc id iterator. On large indices these 
> bitsets can consume large amounts of jvm heap space.  Also typically due the 
> nature how these bitsets are set, the 'FixedBitSet' implementation is used.
> The idea I had was to replace the bitset usage by a numeric doc values field 
> that stores offsets. Each child doc stores how many docids it is from its 
> parent doc and each parent stores how many docids it is apart from its first 
> child. At query time this information can be used to perform the block join.
> I think another benefit of this approach is that external tools can now 
> easily determine if a doc is part of a block of documents and perhaps this 
> also helps index time sorting?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-7304) Doc values based block join implementation

2016-05-30 Thread Paul Elschot (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Elschot updated LUCENE-7304:
-
Attachment: LUCENE-5092-20140313.patch

Patch for LUCENE-5092 against trunk of 13 March 2014.
A.o. this adds method advanceToJustBefore() in EliasFanoDocIdSet.

> Doc values based block join implementation
> --
>
> Key: LUCENE-7304
> URL: https://issues.apache.org/jira/browse/LUCENE-7304
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Martijn van Groningen
>Priority: Minor
> Attachments: LUCENE-5092-20140313.patch, LUCENE_7304.patch
>
>
> At query time the block join relies on a bitset for finding the previous 
> parent doc during advancing the doc id iterator. On large indices these 
> bitsets can consume large amounts of jvm heap space.  Also typically due the 
> nature how these bitsets are set, the 'FixedBitSet' implementation is used.
> The idea I had was to replace the bitset usage by a numeric doc values field 
> that stores offsets. Each child doc stores how many docids it is from its 
> parent doc and each parent stores how many docids it is apart from its first 
> child. At query time this information can be used to perform the block join.
> I think another benefit of this approach is that external tools can now 
> easily determine if a doc is part of a block of documents and perhaps this 
> also helps index time sorting?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-7304) Doc values based block join implementation

2016-05-26 Thread Martijn van Groningen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martijn van Groningen updated LUCENE-7304:
--
Attachment: LUCENE_7304.patch

Attached a working version of a doc values based block join query. 
The app storing docs is responsible for adding the numeric doc values field 
with the right offsets.

> Doc values based block join implementation
> --
>
> Key: LUCENE-7304
> URL: https://issues.apache.org/jira/browse/LUCENE-7304
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Martijn van Groningen
>Priority: Minor
> Attachments: LUCENE_7304.patch
>
>
> At query time the block join relies on a bitset for finding the previous 
> parent doc during advancing the doc id iterator. On large indices these 
> bitsets can consume large amounts of jvm heap space.  Also typically due the 
> nature how these bitsets are set, the 'FixedBitSet' implementation is used.
> The idea I had was to replace the bitset usage by a numeric doc values field 
> that stores offsets. Each child doc stores how many docids it is from its 
> parent doc and each parent stores how many docids it is apart from its first 
> child. At query time this information can be used to perform the block join.
> I think another benefit of this approach is that external tools can now 
> easily determine if a doc is part of a block of documents and perhaps this 
> also helps index time sorting?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org