[
https://issues.apache.org/jira/browse/SOLR-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16459653#comment-16459653
]
Jeroen Steggink commented on SOLR-11384:
----------------------------------------
Is this still relevant? We have a streaming expression for graph traversal.
> add support for distributed graph query
> ---------------------------------------
>
> Key: SOLR-11384
> URL: https://issues.apache.org/jira/browse/SOLR-11384
> Project: Solr
> Issue Type: Improvement
> Security Level: Public (Default Security Level. Issues are Public)
> Reporter: Kevin Watters
> Priority: Minor
>
> Creating this ticket to track the work that I've done on distributed
> graph traversal support in Solr.
> The current GraphQuery only works on a single core, which introduces some
> limits on where it can be used and adds complexity if you want to scale it.
> I believe there's a strong desire to support a fully distributed method of
> doing the graph query. I'm working on a patch; it's not complete yet, but if
> anyone would like to have a look at the approach and implementation, I
> welcome feedback.
> The flow for the distributed graph query is almost exactly the same as the
> normal graph query. The only difference is how it discovers the "frontier
> query" at each level of the traversal.
> When a distributed graph query request comes in, each shard begins by running
> the root query to determine where to start on its shard. Each participating
> shard then discovers its edges for the next hop. Those edges are then
> broadcast to all other participating shards. Each shard then receives all the
> parts of the frontier query, assembles it, and executes it.
> This process continues on each shard until there are no new edges left, or
> the maxDepth of the traversal has been reached.
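> As a rough sketch, the loop each shard runs looks roughly like the following;
> the abstract helper methods are illustrative stand-ins for the shard-local
> query execution and the frontier exchange, not actual names from the patch:
>
>     import java.util.HashSet;
>     import java.util.Set;
>
>     /**
>      * Illustrative per-shard traversal loop for the distributed graph query.
>      * The abstract helpers are hypothetical stand-ins for the shard-local
>      * query execution and the frontier broadcast/receive steps.
>      */
>     abstract class ShardTraversalSketch {
>
>         Set<String> traverse(Set<String> rootNodes, int maxDepth) {
>             Set<String> collected = new HashSet<>(rootNodes);
>             Set<String> frontier = rootNodes;
>             int depth = 0;
>             while (!frontier.isEmpty() && depth < maxDepth) {
>                 // edges found on this shard for the next hop
>                 Set<String> localEdges = discoverEdges(frontier);
>                 // publish this shard's edges to every other participating shard
>                 broadcastEdges(localEdges);
>                 // collect the frontier parts from all shards and assemble them
>                 Set<String> assembledFrontier = receiveFrontierParts();
>                 // run the assembled frontier query against this shard's index
>                 frontier = executeFrontierQuery(assembledFrontier);
>                 // stop once no new edges are discovered
>                 frontier.removeAll(collected);
>                 collected.addAll(frontier);
>                 depth++;
>             }
>             return collected;
>         }
>
>         abstract Set<String> discoverEdges(Set<String> frontier);
>         abstract void broadcastEdges(Set<String> edges);
>         abstract Set<String> receiveFrontierParts();
>         abstract Set<String> executeFrontierQuery(Set<String> frontier);
>     }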
> The approach is to introduce a FrontierBroker that resides as a singleton on
> each of the Solr nodes in the cluster. When a graph query is created, it
> can do a getInstance() on it so it can listen for the frontier parts coming in.
> Initially, I was using an external Kafka broker to handle this, and it did
> work pretty well. The new approach is to migrate the FrontierBroker to be a
> request handler in Solr, and potentially to use the SolrJ client to publish
> the edges to each node in the cluster.
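> To illustrate that last idea, a shard could publish its edges to a
> hypothetical "/frontier" request handler on each participating node via
> SolrJ; the handler path and the queryId/level/edges parameters below are
> assumptions, not part of the patch:
>
>     import org.apache.solr.client.solrj.SolrClient;
>     import org.apache.solr.client.solrj.SolrRequest;
>     import org.apache.solr.client.solrj.impl.HttpSolrClient;
>     import org.apache.solr.client.solrj.request.GenericSolrRequest;
>     import org.apache.solr.common.params.ModifiableSolrParams;
>
>     import java.util.List;
>
>     /** Pushes one shard's frontier edges to a hypothetical /frontier handler. */
>     class FrontierPublisherSketch {
>
>         void publish(List<String> nodeBaseUrls, String queryId, int level,
>                      String serializedEdges) throws Exception {
>             for (String baseUrl : nodeBaseUrls) {
>                 try (SolrClient client = new HttpSolrClient.Builder(baseUrl).build()) {
>                     ModifiableSolrParams params = new ModifiableSolrParams();
>                     params.set("queryId", queryId);        // identifies the in-flight graph query
>                     params.set("level", level);            // traversal depth of this frontier part
>                     params.set("edges", serializedEdges);  // this shard's contribution to the frontier
>                     GenericSolrRequest req = new GenericSolrRequest(
>                             SolrRequest.METHOD.POST, "/frontier", params);
>                     client.request(req);
>                 }
>             }
>         }
>     }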
> There are a few outstanding design questions. First, how do we know which
> shards are participating in the current query request? Is that information
> easy to get at?
> Second, we currently serialize a query object between the shards; perhaps we
> should consider a slightly different abstraction and serialize lists of
> "edge" objects between the nodes. The point of this would be to batch the
> exploration/traversal of the current frontier and help avoid large bursts
> of memory being required.
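> To illustrate, a minimal "edge" abstraction with a batching helper might look
> like this; the field names and batch size are made up for the sketch:
>
>     import java.util.ArrayList;
>     import java.util.List;
>
>     /** Illustrative edge objects shipped between nodes in fixed-size batches. */
>     class EdgeBatchSketch {
>
>         static class Edge {
>             final String fromNodeId;  // value of the "from" field on the current frontier doc
>             final String toNodeId;    // value of the "to" field to match at the next hop
>             Edge(String fromNodeId, String toNodeId) {
>                 this.fromNodeId = fromNodeId;
>                 this.toNodeId = toNodeId;
>             }
>         }
>
>         /** Splits discovered edges into batches so no single message grows unbounded. */
>         static List<List<Edge>> toBatches(List<Edge> edges, int batchSize) {
>             List<List<Edge>> batches = new ArrayList<>();
>             for (int i = 0; i < edges.size(); i += batchSize) {
>                 batches.add(new ArrayList<>(edges.subList(i, Math.min(i + batchSize, edges.size()))));
>             }
>             return batches;
>         }
>     }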
> Third, what sort of caching strategy should be introduced for the frontier
> queries, if any? And if we do cache there, how and when should the entries
> be expired and auto-warmed?