Thomas Tauber-Marshall has posted comments on this change. Change subject: IMPALA-4135: Thrift threaded server times-out connections during high load ......................................................................
Patch Set 1: (12 comments) The main problem with subclassing TThreadedServer is that TThreadedServer::Task is defined in the thrift cpp file, so we can't access it. We could just copy that part, but between that, the copy of serve() that we have to make, and boiler plate around the constructors that would be necessary, subclasses doesn't really save any copied code other than init() and stop(), so it seems cleaner to me to keep the relationships between the classes simple by just copying everything rather than subclassing. I'm not sure what to do about formatting - currently, TAcceptQueueServer.h/cpp are written in Thrift's style (eg. wrapped lines indented 2 spaces rather than 4), though I may have gotten some of it wrong. Of course, I can switch it to Impala style if desired. For the cluster repro - I've crafted a query that consistently gets the "timed out" error on the 16 node cluster without the fix. However, with the fix, after running somewhat longer, it fails with a 'memory limit exceeded' error. Obviously it would be preferable to have a query that both repros and runs to completion on a fixed cluster, so I'm working on that, but it does seem like the patch works. http://gerrit.cloudera.org:8080/#/c/4519/1//COMMIT_MSG Commit Message: PS1, Line 18: This patch has been tested locally with the repro shown in IMPALA-4135 > Can you add the test, modified if necessary? Done http://gerrit.cloudera.org:8080/#/c/4519/1/be/src/rpc/thrift-server.cc File be/src/rpc/thrift-server.cc: Line 419: server_.reset(new TAcceptQueueThreadedServer(processor_, server_socket, > let's make the choice of server configurable by a flag that we can turn off Done http://gerrit.cloudera.org:8080/#/c/4519/1/be/src/server/CMakeLists.txt File be/src/server/CMakeLists.txt: PS1, Line 24: add_library(ThriftAcceptQueueThreadedServer : TAcceptQueueThreadedServer.cpp : ) : add_dependencies(ThriftAcceptQueueThreadedServer thrift-deps) > Was there a reason you wanted this as a separate library? I think it's fine Done http://gerrit.cloudera.org:8080/#/c/4519/1/be/src/server/TAcceptQueueThreadedServer.cpp File be/src/server/TAcceptQueueThreadedServer.cpp: Line 144: inputTransportFactory_->getTransport(client); > do you see much slowdown in synchronizing around these few lines that creat Most of this is fast and synchronizing it doesn't affect performance much. Line 164: shared_ptr<Thread> thread = shared_ptr<Thread>(threadFactory_->newThread(runnable)); > Is this definitely thread-safe? I think this might be the bottleneck (or ma The bottleneck is 'thread->start()' below. I'm fairly confident that's thread safe, as essentially all it does is create a new boost::thread. Line 200: ThreadPool<shared_ptr<TTransport>> acceptPool("server-accept", "accept-worker", 1, 10000, > How did you arrive at 1 as the default number of worker threads? What happe I saw the same behavior as you, where using more than 1 thread here was actually slower than just 1. I don't know why that's happening (Michael's fix for IMPALA-4026 currently under review doesn't seem to fix it), but given that just using 1 thread was enough to solve the problem for the repro in the JIRA, and given that only using one prevents us from having to worry about thread safety, it seems like the best choice, assuming that it also works in a real cluster repro. I could also make it configurable. PS1, Line 211: Offer > This can return false, right? Done PS1, Line 233: if (stop_) { > shouldn't this be out of the while loop? to guarantee a proper cleanup. Its not in the while loop. Line 236: acceptPool.DrainAndShutdown(); > I'm not sure we ever actually stop the thrift servers, so it might be moot Done PS1, Line 252: stop_ = false > why set stop_ back to false? I guess so that it can be restarted? (this was copied from the thrift code, and I think its better to leave it as close to the original behavior as possible) http://gerrit.cloudera.org:8080/#/c/4519/1/be/src/server/TAcceptQueueThreadedServer.h File be/src/server/TAcceptQueueThreadedServer.h: PS1, Line 1: // This file was copied from apache::thrift::server::TThreadedServer.cpp v0.9.0, with the : // significant changes noted inline below. : /* : * Licensed to the Apache Software Foundation (ASF) under one : * or more contributor license agreements. See the NOTICE file : * distributed with this work for additional information : * regarding copyright ownership. The ASF licenses this file : * to you under the Apache License, Version 2.0 (the : * "License"); you may not use this file except in compliance : * with the License. You may obtain a copy of the License at : * : * http://www.apache.org/licenses/LICENSE-2.0 : * : * Unless required by applicable law or agreed to in writing, : * software distributed under the License is distributed on an : * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY : * KIND, either express or implied. See the License for the : * specific language governing permissions and limitations : * under the License. : */ > AFAIK it's completely fine - copying from another ASF project is legit. Done PS1, Line 44: and effectively creating a new, : * internally managed, accept queue. > not exactly - the connections have been accepted, the next state is transpo Done -- To view, visit http://gerrit.cloudera.org:8080/4519 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ie50e728974ef31a9d49132a0b3f7cde2a4f3356d Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Thomas Tauber-Marshall <tmarsh...@cloudera.com> Gerrit-Reviewer: Henry Robinson <he...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Matthew Jacobs <m...@cloudera.com> Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com> Gerrit-Reviewer: Thomas Tauber-Marshall <tmarsh...@cloudera.com> Gerrit-HasComments: Yes