[jira] [Commented] (LUCENE-3171) BlockJoinQuery/Collector

2012-05-12 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13273926#comment-13273926
 ] 

Michael McCandless commented on LUCENE-3171:


I wrote this blog post giving a quick overview: 
http://blog.mikemccandless.com/2012/01/searching-relational-content-with.html

> BlockJoinQuery/Collector
> 
>
> Key: LUCENE-3171
> URL: https://issues.apache.org/jira/browse/LUCENE-3171
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: modules/other
>Reporter: Michael McCandless
> Fix For: 3.4, 4.0
>
> Attachments: LUCENE-3171.patch, LUCENE-3171.patch, LUCENE-3171.patch
>
>
> I created a single-pass Query + Collector to implement nested docs.
> The approach is similar to LUCENE-2454, in that the app must index
> documents in "join order", as a block (IW.add/updateDocuments), with
> the parent doc at the end of the block, except that this impl is one
> pass.
> Once you join at indexing time, you can take any query that matches
> child docs and join it up to the parent docID space, using
> BlockJoinQuery.  You then use BlockJoinCollector, which sorts parent
> docs by provided Sort, to gather results, grouped by parent; this
> collector finds any BlockJoinQuerys (using Scorer.visitScorers) and
> retains the child docs corresponding to each collected parent doc.
> After searching is done, you retrieve the TopGroups from a provided
> BlockJoinQuery.
> Like LUCENE-2454, this is less general than the arbitrary joins in
> Solr (SOLR-2272) or parent/child from ElasticSearch
> (https://github.com/elasticsearch/elasticsearch/issues/553), since you
> must do the join at indexing time as a doc block, but it should be
> able to handle nested joins as well as joins to multiple tables,
> though I don't yet have test cases for these.
> I put this in a new Join module (modules/join); I think as we
> refactor join impls we should put them here.
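
A minimal sketch of the block-join idea from the description above, using only java.util.BitSet rather than any Lucene class. The class name BlockJoinSketch, the method joinToParents, and the docID layout are assumptions made for illustration; this is not the code in the attached patches. It assumes docs were indexed as blocks with the parent last, so the parent of a child hit is the first parent bit at or after the child's docID.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of joining child hits up to parents; hypothetical, not Lucene's BlockJoinQuery. */
final class BlockJoinSketch {

    /**
     * Groups matching child docIDs under their parent docID.
     * parentBits marks which docIDs are parents; each parent is the last doc of its block.
     */
    static Map<Integer, List<Integer>> joinToParents(BitSet parentBits, int[] childHits) {
        Map<Integer, List<Integer>> groups = new LinkedHashMap<>();
        for (int child : childHits) {
            // The enclosing parent is the first set bit at or after the child docID.
            int parent = parentBits.nextSetBit(child);
            if (parent < 0 || parent == child) {
                continue; // no enclosing block, or the hit is itself a parent doc
            }
            groups.computeIfAbsent(parent, k -> new ArrayList<>()).add(child);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Two blocks in join order: docs 0,1,2 are children of parent 3;
        // docs 4,5 are children of parent 6.
        BitSet parents = new BitSet();
        parents.set(3);
        parents.set(6);
        int[] childHits = {1, 2, 5}; // docIDs matched by some child-level query
        System.out.println(joinToParents(parents, childHits));
    }
}
```

The real collector also sorts parents and keeps scores; the sketch only shows why indexing in join order makes the child-to-parent mapping a single bit-set lookup.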

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3171) BlockJoinQuery/Collector

2012-05-11 Thread David Webb (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13273759#comment-13273759
 ] 

David Webb commented on LUCENE-3171:


Is there a wiki page on how to use this?  I need to implement an index with 
nested docs and an example schema and query would be awesome. Thanks!





[jira] [Commented] (LUCENE-3171) BlockJoinQuery/Collector

2011-06-26 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13055096#comment-13055096
 ] 

Michael McCandless commented on LUCENE-3171:


bq. The possible inefficiency is the same as the one for any sparsely filled 
OpenBitSet.

Ahh, OK.  Though I suspect this (the linear scan OpenBitSet does for 
next/prevSetBit) is a minor cost overall, if indeed the app has so many child 
docs per parent that a sparse bit set would be warranted?  I.e., the 
Query/Collector would still be visiting these many child docs per parent, I 
guess?  (Unless the query hits few results.)
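
To make the "linear scan" concrete, here is a rough sketch of how a bit set backed by a long[] can find the previous set bit. PrevSetBitSketch and its method are hypothetical names, and this is not OpenBitSet's actual source; it only illustrates that each empty 64-bit word between the start index and the previous set bit costs one loop iteration. It assumes 0 <= index < words.length * 64.

```java
/** Illustrative word-at-a-time backward scan; hypothetical, not OpenBitSet's code. */
final class PrevSetBitSketch {

    /** Returns the index of the last set bit at or before index, or -1 if there is none. */
    static int prevSetBit(long[] words, int index) {
        int w = index >>> 6; // word that holds the starting bit
        // Keep only bits 0..(index % 64) of the starting word.
        long word = words[w] & (-1L >>> (63 - (index & 63)));
        while (true) {
            if (word != 0) {
                // Highest set bit within this word.
                return (w << 6) + 63 - Long.numberOfLeadingZeros(word);
            }
            if (--w < 0) {
                return -1; // ran off the front of the array
            }
            word = words[w]; // one iteration per empty word skipped
        }
    }

    public static void main(String[] args) {
        long[] words = new long[4]; // room for 256 bits
        words[0] = 1L << 3;         // only bit 3 is set
        System.out.println(prevSetBit(words, 200));
    }
}
```

With groups much larger than 64 docs the loop walks several empty words per lookup, which is the cost being weighed here against the work the query already does per child doc.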

I don't think a javadoc warning is really required for this... but I'm fine if 
you want to add one?

I'll commit this soon and resolve LUCENE-2454 as duplicate!





[jira] [Commented] (LUCENE-3171) BlockJoinQuery/Collector

2011-06-21 Thread Paul Elschot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13052770#comment-13052770
 ] 

Paul Elschot commented on LUCENE-3171:

The possible inefficiency is the same as the one for any sparsely filled 
OpenBitSet.

Another implementation (should be another issue, but since you asked...) could 
be a set of increasing integers, based on a balanced tree structure with a 
moderate fanout (e.g. 32), and all integer values relative to the minimum 
determined by the data for the pointer from the parent. The whole thing could 
be stored in one int[], the pointers would be (forward) indexes into this one 
array, and each internal node would consist of two rows of integers (one data, 
one pointers), and each row would be compressed as a frame of reference into 
the array.

This thing can implement {code}int next(int x){code} and 
{code}int previous(int x){code} easily, and an iterator over this can implement 
{code}advance(target){code} for a DocIdSetIterator, and because of the symmetry 
it can also do that in the reverse direction as needed here.
Compression at higher levels might not be necessary.

For now, there is code for this, except for the frame of reference.
Occasionally the need for a more space-efficient filter shows up on the mailing 
lists, so if anyone wants to give this a try...
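
As a sketch of the next/previous contract only: the class below is not the balanced tree with frame-of-reference compression described above. It swaps in a flat sorted int[] with binary search, which is far simpler and not compressed. SortedIntSetSketch is a hypothetical name, and the choice that next(x) returns the smallest value >= x and previous(x) the largest value <= x (with -1 for "none", assuming non-negative doc IDs) is an assumption.

```java
import java.util.Arrays;

/** Flat sorted-array stand-in for a set of increasing integers; hypothetical illustration. */
final class SortedIntSetSketch {
    private final int[] values; // strictly increasing, non-negative

    SortedIntSetSketch(int[] sortedValues) {
        this.values = sortedValues.clone();
    }

    /** Smallest stored value >= x, or -1 if there is none. */
    int next(int x) {
        int i = Arrays.binarySearch(values, x);
        if (i < 0) {
            i = -i - 1; // insertion point: first element greater than x
        }
        return i < values.length ? values[i] : -1;
    }

    /** Largest stored value <= x, or -1 if there is none. */
    int previous(int x) {
        int i = Arrays.binarySearch(values, x);
        if (i < 0) {
            i = -i - 2; // element just before the insertion point
        }
        return i >= 0 ? values[i] : -1;
    }

    public static void main(String[] args) {
        SortedIntSetSketch parents = new SortedIntSetSketch(new int[] {3, 6, 100});
        System.out.println(parents.next(4) + " " + parents.previous(4));
    }
}
```

Both lookups are O(log n) regardless of how far apart the stored values are, which is the property that matters for large groups; the tree layout would add the space savings.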







[jira] [Commented] (LUCENE-3171) BlockJoinQuery/Collector

2011-06-21 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13052642#comment-13052642
 ] 

Michael McCandless commented on LUCENE-3171:


bq. BlockJoinQuery still needs hashCode/equals

Woops, thanks, I'll add!

{quote}
and a javadoc note (as I remarked earlier at 2454) about the possible 
inefficiency of the use of OpenBitSet for larger group sizes. When the typical 
group size gets a lot bigger than the number of bits in a long, another 
implementation might be faster. This remark in the javadocs would allow us to 
wait for someone to come along with bigger group sizes and a real performance 
problem here.
{quote}

Hmm: do you have an improvement in mind for OpenBitSet.prevSetBit to better 
handle large groups?  Or, where is this possible inefficiency (is it something 
specific)?

bq. I would prefer to use single pass and for now I only need the parent docs. 
That means that I have no preference for 2454 or this one.

I wonder how often apps "typically" need just the parent docs vs the groups (w/ 
child docs)...

But, still this patch only calls .nextSetBit() once per group so that ought to 
be faster than LUCENE-2454, I think... hmm, unless you typically only have 1 
child match per parent.





[jira] [Commented] (LUCENE-3171) BlockJoinQuery/Collector

2011-06-21 Thread Paul Elschot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13052513#comment-13052513
 ] 

Paul Elschot commented on LUCENE-3171:

BlockJoinQuery still needs hashCode/equals, and a javadoc note (as I remarked 
earlier at 2454) about the possible inefficiency of the use of OpenBitSet for 
larger group sizes. When the typical group size gets a lot bigger than the 
number of bits in a long, another implementation might be faster. This remark 
in the javadocs would allow us to wait for someone to come along with bigger 
group sizes and a real performance problem here.

I would prefer to use single pass and for now I only need the parent docs. That 
means that I have no preference for 2454 or this one.


