Re: Re: Trying to understand cross-collection-join routing/hashing choices and behavior

Mikhail Khludnev Sun, 22 Sep 2024 07:40:19 -0700

Hi Zack.
Here's a small update from my side:
https://solr.apache.org/guide/solr/latest/query-guide/join-query-parser.html#joining-multiple-shard-collections
It describes an alternative approach for co-locating shards and joining
multi-shard collections.



On Fri, Sep 20, 2024 at 9:56 PM Zachariah Kendall
<therealdr...@icloud.com.invalid> wrote:

> Mikhail, no update from my end.
>
> We are using this feature (in prod) with its current behavior/limitations.
>
> If you have any questions about it lemme know and I can try to help.
>
>
> On 2023/01/22 19:16:32 Mikhail Khludnev wrote:
> > Up^.
> >
> > Hello!
> > Was there an answer?
> > Thanks
> >
> > On Wed, Dec 21, 2022 at 9:38 PM Zack Kendall <za...@gmail.com>
> > wrote:
> >
> > > > I'm trying to understand the cross-collection JOIN
> > > > <https://solr.apache.org/guide/solr/latest/query-guide/join-query-parser.html#cross-collection-join>
> > > > documentation, behavior, choices, and viability.
> > >
> > > *# Terminology language choice*
> > >
> > > """routerField - If the documents are routed to shards using the
> > > CompositeID router by the join field, then that field name should be
> > > specified in the configuration here. This will allow the parser to
> optimize
> > > the resulting HashRange query."""
> > >
> > > """routed - If true, the cross collection join query will use each
> shard’s
> > > hash range to determine the set of join keys to retrieve for that
> shard.
> > > This parameter improves the performance of the cross-collection join,
> but
> > > it depends on the local collection being routed by the to field. If
> this
> > > parameter is not specified, the cross collection join query will try to
> > > determine the correct value automatically."""
> > >
> > > > *Question 1*: Why overload terminology like "route" when these
> > > > parameters do NOT route, AFAICT? Based on my reading of the code, all
> > > > they do is add a hash_range fq parameter to the remote join query
> > > > request. Filtering results is not routing, so this fosters confusion.
> > > > Is there reasoning behind this, or is it just happenstance?
> > >
> > > *# Implied vs Actual behavior*
> > >
> > > > My reading of the code base is this: the hash_range parameter is always
> > > > populated with the "fromField" value. The routerField is only checked
> > > > for equality against the "toField" to enable the hash_range parameter,
> > > > and that check is only a fallback for when "routed" is not set.
> > > >
> > > > It's a little strange to me that "routerField" is not used as a router
> > > > field, or even as a hash field. It is only used as a flag for "if a
> > > > query is joining to THIS field then use hash_range filter on the
> > > > fromField" (or at least that's how I read the code).
> > >
> > > > *Question 2:* Is my reading of the code correct? Can we try to update
> > > > the documentation to be more explicit about this?
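
The reading described above can be paraphrased as a short Python sketch. This is only an illustration of the behavior as Zack describes it, not the Solr source: the function names are hypothetical, and the `{!hash_range f=... l=... u=...}` form is assumed from the hash range query parser documentation.

```python
from typing import List, Optional, Tuple


def use_hash_range(routed: Optional[bool], router_field: Optional[str], to_field: str) -> bool:
    """Decide whether a hash_range filter is added to the remote request."""
    # An explicit "routed" setting wins.
    if routed is not None:
        return routed
    # Fallback: routerField acts only as a flag compared against the "to" field.
    return router_field is not None and router_field == to_field


def hash_range_fq(from_field: str, lower: int, upper: int) -> str:
    """Build the filter; note it is always applied to the *from* field."""
    return f"{{!hash_range f={from_field} l={lower} u={upper}}}"


def remote_filters(
    routed: Optional[bool],
    router_field: Optional[str],
    from_field: str,
    to_field: str,
    shard_range: Tuple[int, int],
) -> List[str]:
    """Extra fq parameters sent with the remote join query for one local shard."""
    if not use_hash_range(routed, router_field, to_field):
        return []
    lower, upper = shard_range
    return [hash_range_fq(from_field, lower, upper)]
```

Under this reading nothing is routed anywhere: the remote collection is still queried, and the only effect of `routed`/`routerField` is whether the filter is attached.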
> > >
> > >
> > > > *# Routing*
> > >
> > > > *Question 3:* Is there a reason why actual routing was not used? I'm
> > > > not familiar with the Solr code base, but it seems like it would be
> > > > nicer to use the existing routing behavior in this context instead of
> > > > querying all and filtering results. This seems like it would need two
> > > > things: first, the _route_ value from the current "local" request, and
> > > > second, either the local client (like SolrJ does) or the remote
> > > > "/export" handler would need to recognize and handle this parameter.
> > > > Is that obviously doable or not doable? I'm trying to understand why
> > > > that approach wasn't taken originally.
> > >
> > >
> > > *# Hashing*
> > >
> > > > Here is the behavior touted in the docs for HashRangeQueryParser
> > > > <https://solr.apache.org/guide/solr/latest/query-guide/other-parsers.html#hash-range-query-parser>.
> > > > """In the cross collection join case, the hash range query parser is
> > > > used to ensure that each shard only gets the set of join keys that
> > > > would end up on that shard. This query parser uses the
> > > > MurmurHash3_x86_32. This is the same as the default hashing for the
> > > > default composite ID router in Solr."""
> > >
> > > > The documentation mentions "CompositeID router", which we know is
> > > > based on prefixes (split on "!") being hashed and routed with the
> > > > first/top 16 bits of info (with the lower 16 bits provided by the rest
> > > > of the doc "id" on inserts).
> > >
> > > > The CrossCollectionJoinQuery uses 16 bits from the current/local shard
> > > > range, which seems fine and good. However, the HashRangeQuery appears
> > > > to hash the entire field
> > > > <https://github.com/apache/solr/blob/26195c82493422cb9d6d4bdf9d4452046e7b3f67/solr/core/src/java/org/apache/solr/search/join/HashRangeQuery.java#L116-L117>.
> > > > So I'm struggling to understand how this would work, especially since
> > > > the join field and the "route" field are sourced from the same value.
> > > > Either the join field is a compositeId, in which case the
> > > > HashRangeQuery code appears to be invalid, as it would not hash "A!B"
> > > > the same way the actual router would hash "A"; or the join field is
> > > > not a compositeId, in which case for it to work it would have to be
> > > > exactly the same value as the actual compositeId prefix, something
> > > > like this doc: {"id":"A!B", "myJoinField": "A"}. (Or maybe using
> > > > "router.field=myJoinField" works without the compositeId/"!" format?)
> > > >
> > > > And if the join field is not a compositeId, then the only thing you
> > > > could join on is the broad category (tenant/product/etc.) that is used
> > > > as the compositeId prefix. That would severely limit the use case of
> > > > the plugin: it would prevent joins on something more akin to
> > > > record IDs/foreign keys, and only allow you to narrow down the results
> > > > by what you know ahead of time to cram into the "v=" query field.
> > >
> > > > *Question 4:* Not a specific question so much as "am I onto something
> > > > here, or am I missing something and off base?"
> > >
> > > > Actually, reading through the test code, I now see that my hypothesis
> > > > ("it could only work if router key and join field are the same value")
> > > > is exactly what is tested. The data is set up
> > > > <https://github.com/apache/solr/blob/a18f5b3c7cf2ce3f4d1cd11288e82ba0f48f7dfd/solr/core/src/test/org/apache/solr/search/join/CrossCollectionJoinQueryTest.java#L128-L130>
> > > > with product_id as the compositeId prefix. Then all the test queries
> > > > <https://github.com/apache/solr/blob/a18f5b3c7cf2ce3f4d1cd11288e82ba0f48f7dfd/solr/core/src/test/org/apache/solr/search/join/CrossCollectionJoinQueryTest.java#L166-L217>
> > > > are joins on another field with the same product_id value. So that
> > > > explains how it can work.
> > >
> > > > *# Alternative Use-Case*
> > > >
> > > > While I'm here, I'll fill in the use case I was hoping for, based on
> > > > how we currently do local joins. We want to have two collections that
> > > > both route on the same tenantId, whereas our join is on more of a
> > > > foreign key, as seen below.
> > >
> > > > // Collection-1
> > > > {
> > > >     "id": "tenantId!abc",
> > > >     "entity": "userUpload",
> > > >     "entity_id": "abc",
> > > >     "uploadedBy": "123"
> > > > }
> > > >
> > > > // Collection-2
> > > > {
> > > >     "id": "tenantId!123",
> > > >     "entity": "user",
> > > >     "entity_id": "123",
> > > >     "user_groups": ["xyz",...]
> > > > }
> > >
> > > > // Query Collection-1, join example adapted to crossCollection. This
> > > > // will include user-upload documents that were uploaded by a user in
> > > > // group xyz.
> > > {!join method="crossCollection"
> > >   fromIndex="Collection-2" // remote
> > >   from="entity_id"  // remote
> > >   to="uploadedBy" // local
> > >   v="user_groups:xyz" // remote search filter
> > > }
> > >
> > > This join query works locally and we wish it would work remotely,
> > > cross-collection, but it appears incompatible with the current
> > > routing/hashing behavior of the plugin.
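
For reference, the result this join is expected to produce can be sketched in plain Python over in-memory documents. This ignores sharding entirely and only shows the intended semantics; the `join` helper and the second document in each collection are hypothetical additions for illustration.

```python
# Local collection (Collection-1): uploads, keyed to a user via "uploadedBy".
collection_1 = [
    {"id": "tenantId!abc", "entity": "userUpload", "entity_id": "abc", "uploadedBy": "123"},
    {"id": "tenantId!def", "entity": "userUpload", "entity_id": "def", "uploadedBy": "456"},
]

# Remote collection (Collection-2): users with their groups.
collection_2 = [
    {"id": "tenantId!123", "entity": "user", "entity_id": "123", "user_groups": ["xyz"]},
    {"id": "tenantId!456", "entity": "user", "entity_id": "456", "user_groups": ["other"]},
]


def join(local_docs, remote_docs, from_field, to_field, remote_filter):
    """Return local docs whose to_field matches a from_field of a filtered remote doc."""
    # Step 1: run the remote search filter and collect the join keys.
    keys = {doc[from_field] for doc in remote_docs if remote_filter(doc)}
    # Step 2: keep the local docs whose "to" value is one of those keys.
    return [doc for doc in local_docs if doc.get(to_field) in keys]


matches = join(
    collection_1,
    collection_2,
    from_field="entity_id",
    to_field="uploadedBy",
    remote_filter=lambda doc: "xyz" in doc["user_groups"],
)
```

Here the join key ("123") is a foreign key, not the tenantId prefix both collections route on, which is exactly the case the hash range filter does not appear to accommodate.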
> > >
> > > > At this point I have worked through it enough that I understand how it
> > > > currently works, and on rereading the docs it kind of makes more sense
> > > > now, like the information was there the whole time. But I think this
> > > > is still worth raising for awareness and discussion. I don't currently
> > > > have the need/time to update the plugin to expand its behavior, but I
> > > > might be able to update the documentation to make it clearer, so that
> > > > others don't go through the same rollercoaster and deep dive that I
> > > > went through.
> > >
> > > Thanks a bunch for any assistance or information regarding this!
> > >
> > > - Zack
> > >
> >
> >
> > --
> > Sincerely yours
> > Mikhail Khludnev
> >

-- 
Sincerely yours
Mikhail Khludnev
