[ https://issues.apache.org/jira/browse/LUCENE-3602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13167493#comment-13167493 ]
Michael McCandless commented on LUCENE-3602:
--------------------------------------------

Patch looks good!

* I like the test...

* Maybe rename actualQuery to fromQuery? (So it's clear that JoinQuery runs fromQuery using fromSearcher, joining on fromSearcher.fromField to toSearcher.toField).

* Why preComputedFromDocs...? Like if you were to cache something, wouldn't you want to cache the toSearcher's bitset instead?

* Maybe rename JoinQueryWeight.joinResult to topLevelJoinResult, to contrast it w/ the per-segment scoring? And add a comment explaining that we compute it once (on first segment) and all later segments then reuse it?

* I wonder if we could make this a Filter instead, somehow? Ie, at its core it converts a top-level bitset in the fromSearcher doc space into the joined bitset in the toSearcher doc space. It could even maybe just be a static method taking in fromBitset and returning toBitset, which could operate per-segment on the toSearcher side? (Separately: I wonder if JoinQuery should do something with the scores of the fromQuery....? Not right now but maybe later...).

* Why does the JoinQuery javadoc say "The downside of this is that in a sharded environment not all documents might get joined / linked." as a downside to the top-level approach? Maybe reword that to state that all joined to/from docs must reside in the same shard? In theory we could (later) make a shard-friendly approach? Eg, first pass builds up all unique Terms in the fromSearcher.fromField for docs matching the query (across all shards) and 2nd pass is basically a TermFilter on those...

* Not sure it matters, but... including the preComputedFromDocs in hashCode/equals is quite costly (it goes bit by bit...). Maybe it shouldn't be included, since it contains details about the particular searcher that the query had been run against? (In theory Query instances are searcher independent.)
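For illustration only, the "static method taking in fromBitset and returning toBitset" idea, together with the two-pass shape mentioned for sharding (collect the unique join terms, then filter on them), might be sketched roughly as below. This is a toy model built on plain java.util collections, not the patch and not Lucene API: the class BitSetJoinSketch, its method names, and the String[] arrays standing in for a single-valued fromField/toField are all assumptions made up for the example.

```java
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

// Hypothetical toy model: arrays indexed by docID stand in for the
// (single-valued) join fields of fromSearcher and toSearcher.
class BitSetJoinSketch {

    // Pass 1: collect the unique join terms of all matching "from" docs.
    static Set<String> collectJoinTerms(BitSet fromDocs, String[] fromDocTerm) {
        Set<String> terms = new HashSet<>();
        for (int doc = fromDocs.nextSetBit(0);
             doc >= 0 && doc < fromDocTerm.length;
             doc = fromDocs.nextSetBit(doc + 1)) {
            if (fromDocTerm[doc] != null) {
                terms.add(fromDocTerm[doc]);
            }
        }
        return terms;
    }

    // Pass 2: behaves like a filter over the collected terms; it only
    // touches "to"-side data, so it could run once per segment or shard.
    static BitSet filterByTerms(Set<String> terms, String[] toDocTerm) {
        BitSet toDocs = new BitSet(toDocTerm.length);
        for (int doc = 0; doc < toDocTerm.length; doc++) {
            if (toDocTerm[doc] != null && terms.contains(toDocTerm[doc])) {
                toDocs.set(doc);
            }
        }
        return toDocs;
    }

    // fromBitset in, toBitset out.
    static BitSet join(BitSet fromDocs, String[] fromDocTerm, String[] toDocTerm) {
        return filterByTerms(collectJoinTerms(fromDocs, fromDocTerm), toDocTerm);
    }

    public static void main(String[] args) {
        BitSet fromDocs = new BitSet();
        fromDocs.set(0);
        fromDocs.set(2);
        String[] fromDocTerm = {"a", "b", "c"};
        String[] toDocTerm = {"c", "x", "a", "a"};
        System.out.println(join(fromDocs, fromDocTerm, toDocTerm));
    }
}
```

In this model the only state passed between the two passes is the set of terms, which is why the second pass does not need the from-side docs to live in the same shard.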

In general I think the approach in the patch is somewhat inefficient, because it always iterates over every possible term in fromSearcher.fromField, checking the docs for each to see if there is a match in the query. Ie, it's like FieldCache, in that it un-inverts, but it's uninverting on every query.

I wonder if we could use DocTermOrds instead? (Or, FieldCache.DocTermsIndex or DocValues.BYTES_*, if we know fromSearcher.fromField is single-valued). This way we uninvert once (on init), and then doing the join should be much faster since for each fromDocID we can look up the term(s) to join on. Likewise on the toSearcher side, if we had doc <-> ord/term loaded we could do the forward (term -> ord) lookup quickly (in-memory binary search).

But then this will obviously use RAM... so we should have the choice (and start w/ the current patch!).

> Add join query to Lucene
> ------------------------
>
> Key: LUCENE-3602
> URL: https://issues.apache.org/jira/browse/LUCENE-3602
> Project: Lucene - Java
> Issue Type: New Feature
> Components: modules/join
> Reporter: Martijn van Groningen
> Attachments: LUCENE-3602.patch, LUCENE-3602.patch
>
>
> Solr has had a (pseudo) join query for a while now. I think this should also be
> available in Lucene.
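
As a rough sketch of the "uninvert once (on init)" suggestion in the comment above: the toy class below builds a per-document term lookup and a sorted term dictionary one time, then answers each join with a per-fromDocID lookup plus an in-memory binary search (term -> ord) on the to side. It uses only java.util and does not call DocTermOrds, FieldCache, or DocValues; the class UninvertedJoinSketch, its fields, and the single-valued String[] inputs are hypothetical stand-ins chosen for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;
import java.util.TreeMap;

// Hypothetical toy model of a join that un-inverts once and is then
// reused across queries, trading RAM for per-query speed.
class UninvertedJoinSketch {
    private final String[] fromDocTerm; // fromDocID -> join term (null if none)
    private final String[] toDict;      // sorted unique terms on the "to" side
    private final int[][] toDocsByOrd;  // term ord -> toDocIDs carrying that term
    private final int toDocCount;

    // The un-inverting work happens here, once.
    UninvertedJoinSketch(String[] fromDocTerm, String[] toDocTerm) {
        this.fromDocTerm = fromDocTerm.clone();
        this.toDocCount = toDocTerm.length;
        TreeMap<String, List<Integer>> byTerm = new TreeMap<>();
        for (int doc = 0; doc < toDocTerm.length; doc++) {
            if (toDocTerm[doc] != null) {
                byTerm.computeIfAbsent(toDocTerm[doc], t -> new ArrayList<>()).add(doc);
            }
        }
        // TreeMap iterates in sorted key order, so ords follow term order.
        toDict = byTerm.keySet().toArray(new String[0]);
        toDocsByOrd = new int[toDict.length][];
        int ord = 0;
        for (List<Integer> docs : byTerm.values()) {
            toDocsByOrd[ord++] = docs.stream().mapToInt(Integer::intValue).toArray();
        }
    }

    // Per query: visit only the matching "from" docs, never every term.
    BitSet join(BitSet fromDocs) {
        BitSet toDocs = new BitSet(toDocCount);
        for (int doc = fromDocs.nextSetBit(0);
             doc >= 0 && doc < fromDocTerm.length;
             doc = fromDocs.nextSetBit(doc + 1)) {
            String term = fromDocTerm[doc];
            if (term == null) {
                continue;
            }
            int ord = Arrays.binarySearch(toDict, term); // term -> ord lookup
            if (ord >= 0) {
                for (int toDoc : toDocsByOrd[ord]) {
                    toDocs.set(toDoc);
                }
            }
        }
        return toDocs;
    }
}
```

The per-query cost in this model scales with the number of matching from docs rather than with the number of unique terms in the field, at the price of holding the lookup arrays in memory, which is the trade-off the comment points out.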