Re: [PR] SOLR-17158 Terminate distributed processing quickly when query limit is reached [solr]

via GitHub Thu, 26 Sep 2024 21:57:35 -0700


gus-asf commented on code in PR #2666:
URL: https://github.com/apache/solr/pull/2666#discussion_r1778007524



##########
solr/core/src/java/org/apache/solr/handler/component/ParallelHttpShardHandler.java:
##########
@@ -41,57 +39,78 @@
 @NotThreadSafe
 public class ParallelHttpShardHandler extends HttpShardHandler {
 
+  @SuppressWarnings("unused")
   private static final Logger log = 
LoggerFactory.getLogger(MethodHandles.lookup().lookupClass());
 
   private final ExecutorService commExecutor;
 
+  /*
+   * Unlike the basic HttpShardHandler, this class allows us to exit submit 
before
+   * pending is incremented and the responseFutureMap is updated. If the 
runnables that

Review Comment:
   That quote sounds scary because the code is complex and hard to describe. I 
am totally in agreement that we could refactor things further. (it was actually 
just refactored recently to use CompleteableFutures). However, I'm trying very 
hard to resist the (strong) temptation to completely redesign everything since 
I'm hoping a version of this can go in to 9x.
   
   Here's a quick(?!?!) attempt to summarize (and possible first seed material 
of a blog post :) ):
   
   The SearchHandler currently has 2 modes that it alternates until deciding it 
is done... First it enters a while loop that I will call [sending 
mode](https://github.com/apache/solr/blob/d9d2af471610c66d56b43981215afec8e41ff65b/solr/core/src/java/org/apache/solr/handler/component/SearchHandler.java#L517C11-L552C1)
 where it sends every outgoing request it knows about using 
`shardHandler.submit(sreq, shard, params);`. Then it enters a while loop that I 
will call [receiving mode 
](https://github.com/apache/solr/blob/d9d2af471610c66d56b43981215afec8e41ff65b/solr/core/src/java/org/apache/solr/handler/component/SearchHandler.java#L557C1-L577C14)
 where it uses `ShardResponse srsp = tolerant ? 
shardHandler.takeCompletedIncludingErrors() : 
shardHandler.takeCompletedOrError()` to drain the responses that it has 
received unless something new shows up in outgoing. 
   
   These two modes are contained in a while loop that is checking that there 
are no more outgoing requests. So the code reverts to send mode any time 
there's something to send, and exits if it finishes receiving with nothing 
further to send. This outer loop is needed because the receive process may 
queue new requests, but if we are done receiving no more requests can be 
queued, so at that point if nothing needs sending it's safe to continue.
   
   In the classic single threaded HttpShardHandler, we can use a single atomic 
boolean incremented every time we send something and decremented when we 
receive something to keep track of how many things it was asked to send, and if 
we've received them all. 
   
   In the new ParallelHttpShardHandler the send loop only schedules the send 
(to be performed by a thread in an executor before passing the thread, and the 
send mode loop will exit without any guarantee that even one http request is in 
flight yet. The Runnable that is passed to the executor calls 
`this.lbClient.requestAsync(lbReq);` and that method calls out into the actual 
http communication libraries which might be jetty http, jdk http, or whatever 
the latest fad for handling http is in the future. As such it doesn't seem wise 
to trust that it completes without exception. The callback is added to the 
future on the next line, but we can't be sure that that will happen.
   
   Key take away: With parallel shard submit, we have two levels of tracking, 
one for initial intentions to make a request (at the SearchHandler level) and 
one for "has the Runnable been processed at all (at the shard handler level). 
   
   This is important because [the `take(boolean)` based 
methods](https://github.com/apache/solr/blob/d9d2af471610c66d56b43981215afec8e41ff65b/solr/core/src/java/org/apache/solr/handler/component/HttpShardHandler.java#L310C1-L318C4)
 have a loop that depends on obtaining a response for every thing we attempted. 
If we don't **track** what we attempt and **track** that the *attempt was made* 
(regardless of success/fail) then we run the risk of exiting that loop 
**before** we have ***started*** any of our http requests (in the executor 
threads which could delay/pause any amount of time)
   
   So the sentence you quote about "needing to know everything we scheduled has 
been attempted", is referring specifically the new ParallelHttpShardHandler 
class question of whether or not the Runnable has been attempted by the 
Executor. So it's about _request initiation_ (remeber it's calling 
run**Async**()!), and has nothing to do with request completion.
   
   Not discussed for simplicity:
    - tests of responses received vs the shard count that go hay wire the 
pending request count never exceeds the shard count.
    - fake responses created to satisfy the shard count test when shards are 
down.
    - empty strings as shard urls that inflate the shard count when shards are 
down necessitating the fake response.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] SOLR-17158 Terminate distributed processing quickly when query limit is reached [solr]

Reply via email to