[jira] [Commented] (LUCENE-7304) Doc values based block join implementation

Paul Elschot (JIRA) Sat, 28 May 2016 12:12:40 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15305550#comment-15305550
 ]


Paul Elschot commented on LUCENE-7304:
--------------------------------------

bq. ... go backwards ...  less than one bit per doc ...

Maybe it is time to have another look at EliasFanoDocIdSet, see LUCENE-6484.
I don't think it will really fit doc values, though: for block joins this 
needs one set per segment.

I still have an EliasFanoDocIdSet that could be used for block joins, see 
LUCENE-5092.
If there is interest in that, please let me know; the github pull requests 
from that time did not survive the move to git.

See also these graphs on performance: 
http://people.apache.org/~jpountz/doc_id_sets.html
Unfortunately RoaringDocIdSet is not shown there; I'd expect it to be 
(easily made) bidirectional, too.



> Doc values based block join implementation
> ------------------------------------------
>
>                 Key: LUCENE-7304
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7304
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Martijn van Groningen
>            Priority: Minor
>         Attachments: LUCENE_7304.patch
>
>
> At query time the block join relies on a bitset for finding the previous 
> parent doc while advancing the doc id iterator. On large indices these 
> bitsets can consume large amounts of JVM heap space. Also, typically due to 
> the way these bitsets are populated, the 'FixedBitSet' implementation is used.
> The idea I had was to replace the bitset usage by a numeric doc values field 
> that stores offsets. Each child doc stores how many docids it is from its 
> parent doc and each parent stores how many docids it is apart from its first 
> child. At query time this information can be used to perform the block join.
> I think another benefit of this approach is that external tools can now 
> easily determine if a doc is part of a block of documents and perhaps this 
> also helps index time sorting?
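
To make the offset idea in the description concrete, here is a minimal sketch. 
It is not the code from LUCENE_7304.patch: the class BlockJoinOffsets and all 
of its method names are hypothetical, plain int arrays stand in for the 
numeric doc values field, and it assumes the usual block layout in which the 
children come first and the parent is the last doc of its block.

```java
/** Toy model of offset-based block joins; int arrays stand in for numeric doc values. */
final class BlockJoinOffsets {
    // Distance from a child doc to its parent; 0 marks a parent doc (assumed encoding).
    final int[] toParent;
    // Distance from a parent doc back to its first child; 0 for children and childless parents.
    final int[] toFirstChild;

    /** childCounts[i] is the number of children in block i; each block is children, then parent. */
    BlockJoinOffsets(int... childCounts) {
        int maxDoc = 0;
        for (int count : childCounts) {
            maxDoc += count + 1;
        }
        toParent = new int[maxDoc];
        toFirstChild = new int[maxDoc];
        int doc = 0;
        for (int count : childCounts) {
            int parent = doc + count;
            for (int child = doc; child < parent; child++) {
                toParent[child] = parent - child;
            }
            toFirstChild[parent] = count;
            doc = parent + 1;
        }
    }

    boolean isParent(int doc) {
        return toParent[doc] == 0;
    }

    /** Parent of the block containing doc (doc itself if it is a parent). */
    int parentOf(int doc) {
        return doc + toParent[doc];
    }

    /** First child of the given parent (the parent itself if it has no children). */
    int firstChildOf(int parent) {
        return parent - toFirstChild[parent];
    }

    /** Parent of the preceding block, or -1 if doc is in the first block. */
    int previousParent(int doc) {
        return firstChildOf(parentOf(doc)) - 1;
    }

    public static void main(String[] args) {
        // Three blocks: docs 0-1 are children of 2; doc 3 is a childless parent; docs 4-6 are children of 7.
        BlockJoinOffsets index = new BlockJoinOffsets(2, 0, 3);
        System.out.println("parentOf(5) = " + index.parentOf(5));
        System.out.println("previousParent(5) = " + index.previousParent(5));
    }
}
```

In this sketch the previous parent is found from two offset lookups, which is 
the step the bitset currently provides at query time. How the actual patch 
encodes the two offsets and distinguishes parents from children may differ.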



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

