[jira] [Updated] (LUCENE-7304) Doc values based block join implementation

Martijn van Groningen (JIRA) Thu, 22 Jun 2017 05:07:26 -0700

     [ https://issues.apache.org/jira/browse/LUCENE-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Martijn van Groningen updated LUCENE-7304:
------------------------------------------
    Attachment: LUCENE-7304.patch

Updated the patch. Added more tests and cleaned up a bit.

To reiterate what this patch does: the query uses both an indexed field and a 
doc values field. The doc values field is used when 
{{DocIdSetIterator#advance(...)}} is invoked, to find the first child of a 
parent and then instruct the child iterator to advance to that first child. 
The doc values field serves roughly the same purpose that the {{BitSet}} 
serves for {{ToParentBlockJoinQuery}}. The indexed field is used for normal 
forward advancing ({{DocIdSetIterator#nextDoc()}}).
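
As a rough illustration of the advance step described above (not the code 
from the patch), here is a minimal sketch in plain Java. It assumes children 
precede their parent in docid order. The class and method names are 
hypothetical, a {{long[]}} stands in for the numeric doc values field, and a 
sorted {{int[]}} stands in for the child iterator.

```java
import java.util.Arrays;

/** Illustrative only: plain arrays stand in for doc values and the child iterator. */
final class DocValuesAdvanceSketch {

    /** Assumption: for a parent doc, offsets[parentDoc] is the distance back to its first child. */
    static int firstChildOf(long[] offsets, int parentDoc) {
        return parentDoc - (int) offsets[parentDoc];
    }

    /**
     * Mimics advancing the child iterator to the first child of parentDoc.
     * Returns the first matching child docid inside the parent's block, or -1 if none matches.
     */
    static int advanceChildren(int[] sortedMatchingChildren, long[] offsets, int parentDoc) {
        int firstChild = firstChildOf(offsets, parentDoc);
        int idx = Arrays.binarySearch(sortedMatchingChildren, firstChild);
        if (idx < 0) {
            idx = -idx - 1; // insertion point: first element >= firstChild
        }
        if (idx < sortedMatchingChildren.length && sortedMatchingChildren[idx] < parentDoc) {
            return sortedMatchingChildren[idx];
        }
        return -1;
    }

    public static void main(String[] args) {
        // Two blocks: docs 0-2 are children of parent 3, docs 4-5 are children of parent 6.
        long[] offsets = {3, 2, 1, 3, 2, 1, 2};
        int[] matchingChildren = {1, 5};
        System.out.println(firstChildOf(offsets, 6));
        System.out.println(advanceChildren(matchingChildren, offsets, 6));
    }
}
```

The point of the sketch is that the first child is found with a single 
per-document lookup instead of a backwards scan over a heap-resident bitset.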

I'm still unsure whether this query should also use a doc values field for 
forward advancing. Each child would then store the offset to the next child. 
The last child's offset would be zero, meaning the parent is the next 
document. I think the upside of using only doc values fields is that it 
becomes easier to validate that the docid block structure is correct.
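
To make the forward-advancing idea concrete, here is a small sketch under 
stated assumptions (again not code from the patch). A {{long[]}} stands in 
for the hypothetical "offset to next child" doc values field, and all names 
are made up for illustration. The bounds check hints at how a broken block 
structure could be detected while walking the offsets.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: walks the children of one block via next-child offsets. */
final class NextChildOffsetSketch {

    /**
     * Collects the children of a block, starting at firstChild.
     * Assumption: an offset of zero marks the last child, whose parent is the next docid.
     */
    static List<Integer> childrenFrom(long[] nextChildOffset, int firstChild) {
        List<Integer> children = new ArrayList<>();
        int doc = firstChild;
        while (true) {
            children.add(doc);
            long step = nextChildOffset[doc];
            if (step == 0) {
                return children; // last child reached
            }
            if (step < 0 || doc + step >= nextChildOffset.length) {
                throw new IllegalStateException("broken block structure at doc " + doc);
            }
            doc += (int) step;
        }
    }

    /** The parent is the document directly after the last child. */
    static int parentAfter(List<Integer> children) {
        return children.get(children.size() - 1) + 1;
    }

    public static void main(String[] args) {
        // Docs 0-2 are children, doc 3 is their parent; the parent's slot is unused here.
        long[] nextChildOffset = {1, 1, 0, -1};
        List<Integer> children = childrenFrom(nextChildOffset, 0);
        System.out.println(children + " -> parent " + parentAfter(children));
    }
}
```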

> Doc values based block join implementation
> ------------------------------------------
>
>                 Key: LUCENE-7304
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7304
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Martijn van Groningen
>            Priority: Minor
>         Attachments: LUCENE-5092-20140313.patch, LUCENE-7304-20160531.patch, 
> LUCENE-7304-20160606.patch, LUCENE_7304.patch, LUCENE_7304.patch, 
> LUCENE-7304.patch, LUCENE-7304.patch
>
>
> At query time the block join relies on a bitset to find the previous 
> parent doc while advancing the doc id iterator. On large indices these 
> bitsets can consume large amounts of JVM heap space. Also, because of the 
> way these bitsets are populated, the 'FixedBitSet' implementation is 
> typically used.
> The idea I had was to replace the bitset usage with a numeric doc values 
> field that stores offsets. Each child doc stores how many docids it is from 
> its parent doc, and each parent stores how many docids it is from its first 
> child. At query time this information can be used to perform the block join.
> I think another benefit of this approach is that external tools can now 
> easily determine if a doc is part of a block of documents and perhaps this 
> also helps index time sorting?
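
The offset encoding from the description above can be sketched as follows. 
This is only an illustration under assumptions: children are indexed 
immediately before their parent, a {{long[]}} indexed by docid stands in for 
the numeric doc values field, and the class and method names are hypothetical.

```java
/** Illustrative only: computes per-document offsets for consecutive blocks. */
final class BlockOffsetEncodingSketch {

    /**
     * childrenPerBlock[i] is the number of children in block i; each block is
     * laid out as its children followed by the parent.
     */
    static long[] encode(int[] childrenPerBlock) {
        int total = 0;
        for (int n : childrenPerBlock) {
            total += n + 1; // children plus the parent
        }
        long[] offsets = new long[total];
        int doc = 0;
        for (int n : childrenPerBlock) {
            int parent = doc + n;
            for (int child = doc; child < parent; child++) {
                offsets[child] = parent - child; // distance forward to the parent
            }
            offsets[parent] = n; // distance back to the first child
            doc = parent + 1;
        }
        return offsets;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(encode(new int[] {3, 2})));
    }
}
```

In this single-array sketch, a stored value alone does not say whether a 
document is a parent or a child, so a real implementation would need some way 
to tell them apart.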



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
