I'm trying to understand the cross-collection JOIN
<https://solr.apache.org/guide/solr/latest/query-guide/join-query-parser.html#cross-collection-join>
documentation, behavior, choices, and viability.

*# Terminology language choice*

"""routerField - If the documents are routed to shards using the
CompositeID router by the join field, then that field name should be
specified in the configuration here. This will allow the parser to optimize
the resulting HashRange query."""

"""routed - If true, the cross collection join query will use each shard’s
hash range to determine the set of join keys to retrieve for that shard.
This parameter improves the performance of the cross-collection join, but
it depends on the local collection being routed by the to field. If this
parameter is not specified, the cross collection join query will try to
determine the correct value automatically."""

*Question 1*: Why overload terminology like "route" when, AFAICT, these
parameters do NOT route? Based on my reading of the code, all they do is add
a hash_range fq parameter to the remote join query request. Filtering
results is not routing, so this fosters confusion. Is there reasoning behind
this, or is it just happenstance?

*# Implied vs Actual behavior*

My reading of the code base is this: the hash_range parameter is always
populated with the "fromField" value. The routerField is only checked for
equality against the "toField" to enable the hash_range parameter, and that
check is only a fallback for when "routed" is not set.

It's a little strange to me that "routerField" is not used as a router
field, or even as a hash field. It is only used as a flag for "if a query
is joining to this field then use hash_range filter on the fromField" (or
at least that's how I read the code).
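
To make that reading concrete, here is a minimal Python sketch of the decision as I understand it. This is not the Solr code: the function names are hypothetical, and the exact shape of the hash_range filter string is my approximation based on the f/l/u parameters described in the docs.

```python
from typing import List, Optional, Tuple


def uses_hash_range(routed: Optional[bool],
                    router_field: Optional[str],
                    to_field: str) -> bool:
    """Model of my reading: an explicit "routed" wins, otherwise
    routerField is only compared against the join's "to" field."""
    if routed is not None:
        return routed
    return router_field is not None and router_field == to_field


def remote_filters(from_field: str,
                   shard_range: Tuple[int, int],
                   routed: Optional[bool],
                   router_field: Optional[str],
                   to_field: str) -> List[str]:
    """Extra fq values sent with the remote request (hypothetical helper).

    Note that the filter is built on from_field, never on router_field.
    """
    if not uses_hash_range(routed, router_field, to_field):
        return []
    lower, upper = shard_range
    # Approximate filter syntax, assumed from the documented f/l/u params.
    return [f"{{!hash_range f={from_field} l={lower} u={upper}}}"]


if __name__ == "__main__":
    print(remote_filters("entity_id", (0, 65535), None, "uploadedBy", "uploadedBy"))
    print(remote_filters("entity_id", (0, 65535), None, "tenant_id", "uploadedBy"))
```

In this model routerField never reaches the remote query at all, which is why calling it a "router" field reads as misleading to me.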

*Question 2:* Is my reading of the code correct? Can we try to update the
documentation to be more explicit about this?


*# Routing*

*Question 3:* Is there a reason why actual routing was not used? I'm not
familiar with the Solr code base, but it seems like it'd be nicer to
instead use existing routing behavior in this context instead of querying
all and filtering results. This seems like it would need 2 things: First,
the _route_ value from the current "local" request, and second, either the
local client (like how solrj does) or the remote "/export" handler would
need to recognize and handle this parameter. Is that obviously doable or
not doable? Trying to understand why that approach wasn't taken originally.


*# Hashing*

Here is the behavior touted in the docs for HashRangeQueryParser
<https://solr.apache.org/guide/solr/latest/query-guide/other-parsers.html#hash-range-query-parser>:
"""In the cross collection join case, the hash range query parser is used
to ensure that each shard only gets the set of join keys that would end up
on that shard. This query parser uses the MurmurHash3_x86_32. This is the
same as the default hashing for the default composite ID router in Solr."""

The documentation mentions "CompositeID router", which we know is based on
prefixes (split on "!") being hashed and routed with the first/top 16 bits
of the hash (with the lower 16 bits provided by the rest of the doc "id" on
inserts).

The CrossCollectionJoinQuery uses 16 bits from the current/local shard
range, which seems fine and good. However, the HashRangeQuery appears to hash
the entire field
<https://github.com/apache/solr/blob/26195c82493422cb9d6d4bdf9d4452046e7b3f67/solr/core/src/java/org/apache/solr/search/join/HashRangeQuery.java#L116-L117>.
So I'm struggling to understand how this would work, especially since the
join field and the "route" field are sourced from the same value. Either
the join field is a compositeId in which case the HashRangeQuery code
appears to be invalid, as it would not hash "A!B" the same as the actual
router would hash "A", or the join field is not a compositeId in which case
for it to work it would have to be the exact value as the actual
compositeId prefix field something like this doc: {"id":"A!B",
"myJoinField": "A"}. (Or maybe using "router.field=myJoinField" works
without the compositeId/"!" format?). And if the join field is not a
compositeId, then the only thing you could join on is the broad category
tenant/product/etc that is used as the compositeId prefix, which would
severely limit the use-case of the plugin, preventing joins on something
more akin to record-ids/foreign-keys, and only allowing you to narrow down
the results by what you know ahead of time to cram into the "v=" query
field.
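
To illustrate the mismatch I mean, here is a small self-contained Python sketch. It assumes a simplified model of the compositeId router (two-part "prefix!rest" ids only, top 16 bits from the prefix hash, lower 16 bits from the rest) and a plain MurmurHash3_x86_32 over the UTF-8 bytes with seed 0. The function names are mine, not Solr's, and I have not checked this against the actual router for edge cases.

```python
def murmur3_x86_32(data: bytes, seed: int = 0) -> int:
    """MurmurHash3 x86 32-bit, returned as an unsigned 32-bit int."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & 0xFFFFFFFF
    n = len(data)
    rounded = n & ~3  # length of the part made of whole 4-byte blocks

    def mix_k(k: int) -> int:
        k = (k * c1) & 0xFFFFFFFF
        k = ((k << 15) | (k >> 17)) & 0xFFFFFFFF
        return (k * c2) & 0xFFFFFFFF

    for i in range(0, rounded, 4):
        h ^= mix_k(int.from_bytes(data[i:i + 4], "little"))
        h = ((h << 13) | (h >> 19)) & 0xFFFFFFFF
        h = (h * 5 + 0xE6546B64) & 0xFFFFFFFF

    tail = data[rounded:]
    if tail:
        h ^= mix_k(int.from_bytes(tail, "little"))

    # Finalization (avalanche) step.
    h ^= n
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & 0xFFFFFFFF
    h ^= h >> 16
    return h


def composite_id_hash(doc_id: str) -> int:
    """Simplified compositeId model (assumption): top 16 bits come from
    the prefix hash, lower 16 bits from the hash of the rest."""
    prefix, sep, rest = doc_id.partition("!")
    if not sep:
        return murmur3_x86_32(doc_id.encode("utf-8"))
    top = murmur3_x86_32(prefix.encode("utf-8")) & 0xFFFF0000
    low = murmur3_x86_32(rest.encode("utf-8")) & 0x0000FFFF
    return top | low


if __name__ == "__main__":
    routed = composite_id_hash("A!B")                 # how the doc is placed
    whole = murmur3_x86_32("A!B".encode("utf-8"))     # join field holds "A!B"
    prefix_only = murmur3_x86_32("A".encode("utf-8")) # join field holds "A"
    for label, value in [("router(A!B)", routed),
                         ("hash(A!B)", whole),
                         ("hash(A)", prefix_only)]:
        print(f"{label:12} top16=0x{value >> 16:04x} full=0x{value:08x}")
```

Under this model only the "join field holds the bare prefix" case shares its top 16 bits with the routed hash by construction; hashing the whole "A!B" string will in general give an unrelated value.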

*Question 4:* Not a specific question so much as "am I onto something here
or am I missing something and off base?"

Actually, reading through the test code, I now see that my hypothesized "it
could only work if router key and join field are the same value" is
actually what is tested. The data is set up
<https://github.com/apache/solr/blob/a18f5b3c7cf2ce3f4d1cd11288e82ba0f48f7dfd/solr/core/src/test/org/apache/solr/search/join/CrossCollectionJoinQueryTest.java#L128-L130>
with product_id as the compositeId prefix. Then all the test queries
<https://github.com/apache/solr/blob/a18f5b3c7cf2ce3f4d1cd11288e82ba0f48f7dfd/solr/core/src/test/org/apache/solr/search/join/CrossCollectionJoinQueryTest.java#L166-L217>
are joins on another field with the same product_id value. So that explains
how it can work.

*# Alternative Use-Case*
While I'm here I guess I'll fill in the use-case I was hoping for based on
how we currently do local joins. We want to have two collections which both
route on the same tenantId, whereas our join is on more of a foreign-key,
as seen below.

// Collection-1
{
    "id": "tenantId!abc",
    "entity": "userUpload",
    "entity_id": "abc",
    "uploadedBy": "123"
}

// Collection-2
{
    "id": "tenantId!123",
    "entity": "user",
    "entity_id": "123",
    "user_groups": ["xyz",...]
}

// Query Collection-1, join example adapted to crossCollection. This will
// include user-upload documents that were uploaded by a user in group xyz.
{!join method="crossCollection"
  fromIndex="Collection-2" // remote
  from="entity_id"  // remote
  to="uploadedBy" // local
  v="user_groups:xyz" // remote search filter
}

This query works locally and should work remotely, cross-collection, but it
appears incompatible with the current routing/hashing behavior of the
plugin.
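
For clarity, this is the result I would expect from that join, written as a plain-Python sketch rather than anything Solr-specific. The two extra documents ("tenantId!def" and "tenantId!456") are made up for illustration, and the join helper is hypothetical.

```python
from typing import Callable, Dict, List

Doc = Dict[str, object]

# Local collection (Collection-1); the second doc is hypothetical.
collection_1: List[Doc] = [
    {"id": "tenantId!abc", "entity": "userUpload",
     "entity_id": "abc", "uploadedBy": "123"},
    {"id": "tenantId!def", "entity": "userUpload",
     "entity_id": "def", "uploadedBy": "456"},
]

# Remote collection (Collection-2); the second doc is hypothetical.
collection_2: List[Doc] = [
    {"id": "tenantId!123", "entity": "user",
     "entity_id": "123", "user_groups": ["xyz"]},
    {"id": "tenantId!456", "entity": "user",
     "entity_id": "456", "user_groups": ["other"]},
]


def join(local_docs: List[Doc], remote_docs: List[Doc],
         from_field: str, to_field: str,
         remote_filter: Callable[[Doc], bool]) -> List[Doc]:
    """Collect from_field values of matching remote docs, then keep the
    local docs whose to_field is one of those values."""
    keys = {d[from_field] for d in remote_docs if remote_filter(d)}
    return [d for d in local_docs if d.get(to_field) in keys]


uploads_by_xyz_users = join(
    collection_1, collection_2,
    from_field="entity_id", to_field="uploadedBy",
    remote_filter=lambda d: "xyz" in d.get("user_groups", []),
)

if __name__ == "__main__":
    print([d["id"] for d in uploads_by_xyz_users])
```

The join key here ("123") is unrelated to the routing prefix ("tenantId"), so a hash_range filter computed from the join key would not line up with how either collection is sharded.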

At this point I have worked through it enough that I understand how it
currently works, and even rereading the docs it kinda makes more sense now
like the information was there the whole time, but I think this is still
worth raising for awareness and discussion. I don't currently have the
need/time to update the plugin to expand its behavior. But I might be able
to update the documentation to make it more clear so that others don't go
through the same rollercoaster and deep dive that I've gone through.

Thanks a bunch for any assistance or information regarding this!
