Re: Cross-node joins

Scott Blum Fri, 25 Sep 2015 12:30:56 -0700

Hi Erick,

Thanks for the thoughtful reply!


The context is essentially that I have Groups and Users, and a User can
belong to multiple groups.  So if I need to do a query like "Find all Users
who are members of a Group, for which the Group has certain
characteristics", then I need to do something like {!join from=GroupId
to=UserGroupIds}GroupPermission:admin.  We've already sharded our corpus
such that any given user and that user's associated data have to be on the
same core, but we can't shard the groups that way, since a user could
belong to multiple groups.

Thanks for the pointer to SOLR-4905
<https://issues.apache.org/jira/browse/SOLR-4905>, that would probably work
for us, as we could put all the group docs into a separate collection,
replicate it everywhere, and do local cross-collection joins.  My main
worry there is that having to shard our data in such a way as to support this
one case would be a lot of extra operational work over time, and lock us
into a pretty prescriptive data architecture just to solve this one issue.

SOLR-7090 <https://issues.apache.org/jira/browse/SOLR-7090> is closer to
what I was hoping for.  Perhaps I could do something to help that effort.
I didn't realize that existed; I've been looking at LUCENE-3759
<https://issues.apache.org/jira/browse/LUCENE-3759> and wondering how to
make that go.

> In essence, This Is A Hard Problem in the Solr world to
> make performant. You'd have to get all of the data from the "from"
> core across the wire to the "to" node, potentially this would
> be the entire corpus.
>

Hopefully it wouldn't be that bad?  My understanding of how queries are
really processed is pretty naive, but I'm imagining that if you have a top
level query containing a collection-wide join, you'd make one distributed
request (to all shards) to resolve the join into a term query, then a
second one to process the top level request, sending the term list out to
each shard.  I get that there's a pathological case there where the number
of terms explodes, but in theory this wouldn't be too different from
something you do from a client:

1) Run the join query as a facet query.  Instead of retrieving any docs,
just facet the "from" field to get a term list.
2) Run a normal query with the resulting term list.
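
To make the two client-side steps concrete, here is a rough Python sketch.
It only builds the request parameters and parses a facet response; it does
not talk to a server.  The function names are made up for illustration, the
field names come from the example above, and it assumes the JSON response
writer with its flat facet list layout plus the {!terms} query parser being
available.  Treat it as an outline of the idea, not a tested recipe.

```python
def facet_params(group_query, from_field):
    """Step 1: parameters for a facet-only request (no docs returned)."""
    return {
        "q": group_query,
        "rows": 0,               # we only want the facet terms
        "facet": "true",
        "facet.field": from_field,
        "facet.limit": -1,       # return every term, not just the top N
        "facet.mincount": 1,     # skip terms with no matching docs
        "wt": "json",
        "json.nl": "flat",       # facet list comes back as [term, count, ...]
    }


def terms_from_facets(response, from_field):
    """Pull the term list out of a parsed JSON facet response."""
    flat = response["facet_counts"]["facet_fields"][from_field]
    return [str(term) for term in flat[0::2]]  # even slots are the terms


def terms_query_params(terms, to_field, rows=10):
    """Step 2: parameters for a normal query filtered by the term list."""
    # Assumes no term contains a comma, the default {!terms} separator.
    query = "{!terms f=%s}%s" % (to_field, ",".join(terms))
    return {"q": query, "rows": rows, "wt": "json"}


# Hypothetical canned response standing in for what step 1 would return.
canned = {"facet_counts": {"facet_fields": {"GroupId": ["g1", 3, "g2", 1]}}}
group_ids = terms_from_facets(canned, "GroupId")
step2 = terms_query_params(group_ids, "UserGroupIds")
```

The pathological case is visible here too: the term list in step 2 grows
with the number of matching groups, so a broad group query produces a very
large second request.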


> You might look at some of the Streaming Aggregation stuff, that
> has some capabilities here too.
>

That's on my radar too.  I did start reading about it, but it looked like
joins were still Work-In-Progress (SOLR-7584
<https://issues.apache.org/jira/browse/SOLR-7584>), and at any rate the
streaming stuff seems so bleeding edge to me (the only doc I've been able
to find on it is from Heliosearch) that I was daunted.

Thanks!
Scott
