Hi Zack. Here's a small update from my side:
https://solr.apache.org/guide/solr/latest/query-guide/join-query-parser.html#joining-multiple-shard-collections
It describes an alternative approach: collocating shards and joining multi-shard collections.

On Fri, Sep 20, 2024 at 9:56 PM Zachariah Kendall <therealdr...@icloud.com.invalid> wrote:

> Mikhail, no update from my end.
>
> We are using this feature (in prod) with its current behavior/limitations.
>
> If you have any questions about it lemme know and I can try to help.
>
>
> On 2023/01/22 19:16:32 Mikhail Khludnev wrote:
> > Up^.
> >
> > Hello!
> > Was there an answer?
> > Thanks
> >
> > On Wed, Dec 21, 2022 at 9:38 PM Zack Kendall <za...@gmail.com> wrote:
> >
> > > I'm trying to understand the cross-collection JOIN
> > > <https://solr.apache.org/guide/solr/latest/query-guide/join-query-parser.html#cross-collection-join>
> > > documentation, behavior, choices, and viability.
> > >
> > > *# Terminology language choice*
> > >
> > > """routerField - If the documents are routed to shards using the
> > > CompositeID router by the join field, then that field name should be
> > > specified in the configuration here. This will allow the parser to
> > > optimize the resulting HashRange query."""
> > >
> > > """routed - If true, the cross collection join query will use each
> > > shard’s hash range to determine the set of join keys to retrieve for
> > > that shard. This parameter improves the performance of the
> > > cross-collection join, but it depends on the local collection being
> > > routed by the to field. If this parameter is not specified, the cross
> > > collection join query will try to determine the correct value
> > > automatically."""
> > >
> > > *Question 1*: Why overload terminology like "route" when these
> > > parameters do NOT route AFAICT. Based on my reading of the code all
> > > they do is add a hash_range fq parameter to the remote join query
> > > request. Filtering results is not routing, so this fosters confusion.
> > > Is there reasoning behind this or just happenstance?
> > >
> > > *# Implied vs Actual behavior*
> > >
> > > My reading of the code base is this: the hash_range parameter is
> > > always populated with the "fromField" value. The routerField is only
> > > used to check against the "toField" for equality to enable the
> > > hash_range parameter usage, this is only done as a fall back if
> > > "routed" is not set.
> > >
> > > It's a little strange to me that "routerField" is not used as a router
> > > field, or even as a hash field. It is only used as a flag for "if a
> > > query is joining to THIS field then use hash_range filter on the
> > > fromField" (or at least that's how I read the code).
> > >
> > > *Question 2:* Is my reading of the code correct? Can we try to update
> > > the documentation to be more explicit about this?
> > >
> > >
> > > *# Routing *
> > >
> > > *Question 3:* Is there a reason why actual routing was not used? I'm
> > > not familiar with the Solr code base, but it seems like it'd be nicer
> > > to instead use existing routing behavior in this context instead of
> > > querying all and filtering results. This seems like it would need 2
> > > things: First, the _route_ value from the current "local" request, and
> > > second, either the local client (like how solrj does) or the remote
> > > "/export" handler would need to recognize and handle this parameter.
> > > Is that obviously doable or not doable? Trying to understand why that
> > > approach wasn't taken originally.
> > >
> > >
> > > *# Hashing*
> > >
> > > Here is the behavior touted in the docs for HashRangeQueryParser
> > > <https://solr.apache.org/guide/solr/latest/query-guide/other-parsers.html#hash-range-query-parser>.
> > > """In the cross collection join case, the hash range query parser is
> > > used to ensure that each shard only gets the set of join keys that
> > > would end up on that shard. This query parser uses the
> > > MurmurHash3_x86_32.
> > > This is the same as the default hashing for the default composite ID
> > > router in Solr."""
> > >
> > > The documentation mentions "CompositeID router", which we know is
> > > based on prefixes (split on "!") being hashed and routed with the
> > > first/top 16 bits of info (with the later 16 bits provided by the rest
> > > of the doc "id" on inserts).
> > >
> > > The CrossCollectionJoinQuery uses 16 bits from the current/local shard
> > > range, which seems fine and good. However, the HashRangeQuery appears
> > > to hash the entire field
> > > <https://github.com/apache/solr/blob/26195c82493422cb9d6d4bdf9d4452046e7b3f67/solr/core/src/java/org/apache/solr/search/join/HashRangeQuery.java#L116-L117>.
> > > So I'm struggling to understand how this would work, especially since
> > > the join field and the "route" field are sourced from the same value.
> > > Either the join field is a compositeId in which case the
> > > HashRangeQuery code appears to be invalid, as it would not hash "A!B"
> > > the same as the actual router would hash "A", or the join field is not
> > > a compositeId in which case for it to work it would have to be the
> > > exact value as the actual compositeId prefix field something like this
> > > doc: {"id":"A!B", "myJoinField": "A"}. (Or maybe using
> > > "router.field=myJoinField" works without the compositeId/"!" format?).
> > > And if the join field is not a compositeId, then the only thing you
> > > could join on is the broad category tenant/product/etc that is used as
> > > the compositeId prefix, which would severely limit the use-case of the
> > > plugin, preventing joins on something more akin to
> > > record-ids/foreign-keys, and only allowing you to narrow down the
> > > results by what you know ahead of time to cram into the "v=" query
> > > field.
> > >
> > > *Question 4:* Not a specific question so much as "am I onto something
> > > here or am I missing something and off base?"
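
Inline note on the hashing point above: the mismatch can be illustrated with a rough sketch. This is not Solr code. It assumes, as the quoted text describes, that the compositeId router takes the top 16 bits from the hash of the prefix and the lower 16 bits from the hash of the rest of the id, and that the hash range query hashes the whole field value. The function names are made up for illustration, hashes are shown as unsigned 32-bit values over UTF-8 bytes, and the "/bits" syntax and multi-level prefixes are ignored.

```python
MASK32 = 0xFFFFFFFF


def _rotl32(x: int, r: int) -> int:
    """Rotate a 32-bit value left by r bits."""
    return ((x << r) | (x >> (32 - r))) & MASK32


def murmur3_x86_32(data: bytes, seed: int = 0) -> int:
    """Plain MurmurHash3 x86 32-bit, returned as an unsigned int."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & MASK32
    n = len(data)
    rounded = n & ~3
    # Body: process 4-byte little-endian blocks.
    for i in range(0, rounded, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * c1) & MASK32
        k = _rotl32(k, 15)
        k = (k * c2) & MASK32
        h ^= k
        h = _rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & MASK32
    # Tail: the remaining 1 to 3 bytes, if any.
    tail = data[rounded:]
    if tail:
        k = int.from_bytes(tail, "little")
        k = (k * c1) & MASK32
        k = _rotl32(k, 15)
        k = (k * c2) & MASK32
        h ^= k
    # Finalization mix.
    h ^= n
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h


def composite_hash(doc_id: str) -> int:
    """Hypothetical simplification of compositeId routing: top 16 bits
    from the prefix hash, lower 16 bits from the hash of the rest."""
    if "!" not in doc_id:
        return murmur3_x86_32(doc_id.encode("utf-8"))
    prefix, rest = doc_id.split("!", 1)
    high = murmur3_x86_32(prefix.encode("utf-8")) & 0xFFFF0000
    low = murmur3_x86_32(rest.encode("utf-8")) & 0x0000FFFF
    return high | low


def whole_field_hash(value: str) -> int:
    """Hash of the entire field value, as the hash range filter is
    described to do in the quoted text."""
    return murmur3_x86_32(value.encode("utf-8"))


if __name__ == "__main__":
    print("router hash of id 'A!B'   :", hex(composite_hash("A!B")))
    print("whole-field hash of 'A!B' :", hex(whole_field_hash("A!B")))
    print("whole-field hash of 'A'   :", hex(whole_field_hash("A")))
```

In this sketch, a join field holding just "A" shares its top 16 bits with the router hash of "A!B", so it would land in the same hash range whenever the shard boundaries fall on those top bits. A join field holding "A!B", hashed whole, has no such relationship to the router hash.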
> > >
> > > Actually reading through the test code, now I see that my hypothesized
> > > "it could only work if router key and join field are the same value"
> > > is actually what is tested. The data is set-up
> > > <https://github.com/apache/solr/blob/a18f5b3c7cf2ce3f4d1cd11288e82ba0f48f7dfd/solr/core/src/test/org/apache/solr/search/join/CrossCollectionJoinQueryTest.java#L128-L130>
> > > with product_id as the compositeId prefix. Then all the test queries
> > > <https://github.com/apache/solr/blob/a18f5b3c7cf2ce3f4d1cd11288e82ba0f48f7dfd/solr/core/src/test/org/apache/solr/search/join/CrossCollectionJoinQueryTest.java#L166-L217>
> > > are joins on another field with the same product_Id value. So that
> > > explains how it can work.
> > >
> > > *Alternative Use-Case*
> > > While I'm here I guess I'll fill in the use-case I was hoping for
> > > based on how we currently do local joins. We want to have two
> > > collections which both route on the same tenantId, whereas our join is
> > > on more of a foreign-key, as seen below.
> > >
> > > // Collection-1
> > > {
> > >   "id": "tenantId!abc"
> > >   "entity": "userUpload",
> > >   "entity_id": "abc",
> > >   "uploadedBy": "123",
> > > }
> > >
> > > // Collection-2
> > > {
> > >   "id": "tenantId!123",
> > >   "entity": "user",
> > >   "entity_id": "123",
> > >   "user_groups": ["xyz",...]
> > > }
> > >
> > > // Query Collection-1, join example adapted to crossCollection. This
> > > // will include user-upload documents that were uploaded-by the user
> > > // in group xyz.
> > > {!join method="crossCollection"
> > >   fromIndex="Collection-2"  // remote
> > >   from="entity_id"          // remote
> > >   to="uploadedBy"           // local
> > >   v="user_groups:xyz"       // remote search filter
> > > }
> > >
> > > This join query works locally and we wish it would work remotely,
> > > cross-collection, but it appears incompatible with the current
> > > routing/hashing behavior of the plugin.
> > >
> > > At this point I have worked through it enough that I understand how it
> > > currently works, and even rereading the docs it kinda makes more sense
> > > now like the information was there the whole time, but I think this is
> > > still worth raising for awareness and discussion. I don't currently
> > > have the need/time to update the plugin to expand its behavior. But I
> > > might be able to update the documentation to make it more clear so
> > > that others don't go through the same rollercoaster and deep dive that
> > > I've gone through.
> > >
> > > Thanks a bunch for any assistance or information regarding this!
> > >
> > > - Zack
> > >
> >
> >
> > --
> > Sincerely yours
> > Mikhail Khludnev
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@solr.apache.org
> For additional commands, e-mail: dev-h...@solr.apache.org
>
>

--
Sincerely yours
Mikhail Khludnev