Thomas Tauber-Marshall has posted comments on this change.

Change subject: IMPALA-4135: Thrift threaded server times-out connections 
during high load
......................................................................


Patch Set 1:

(12 comments)

The main problem with subclassing TThreadedServer is that TThreadedServer::Task 
is defined in the thrift cpp file, so we can't access it. We could just copy 
that part, but between that, the copy of serve() that we have to make, and 
boiler plate around the constructors that would be necessary, subclasses 
doesn't really save any copied code other than init() and stop(), so it seems 
cleaner to me to keep the relationships between the classes simple by just 
copying everything rather than subclassing.

I'm not sure what to do about formatting - currently, TAcceptQueueServer.h/cpp 
are written in Thrift's style (eg. wrapped lines indented 2 spaces rather than 
4), though I may have gotten some of it wrong. Of course, I can switch it to 
Impala style if desired.

For the cluster repro - I've crafted a query that consistently gets the "timed 
out" error on the 16 node cluster without the fix. However, with the fix, after 
running somewhat longer, it fails with a 'memory limit exceeded' error. 
Obviously it would be preferable to have a query that both repros and runs to 
completion on a fixed cluster, so I'm working on that, but it does seem like 
the patch works.

http://gerrit.cloudera.org:8080/#/c/4519/1//COMMIT_MSG
Commit Message:

PS1, Line 18: This patch has been tested locally with the repro shown in 
IMPALA-4135
> Can you add the test, modified if necessary?
Done


http://gerrit.cloudera.org:8080/#/c/4519/1/be/src/rpc/thrift-server.cc
File be/src/rpc/thrift-server.cc:

Line 419:       server_.reset(new TAcceptQueueThreadedServer(processor_, 
server_socket,
> let's make the choice of server configurable by a flag that we can turn off
Done


http://gerrit.cloudera.org:8080/#/c/4519/1/be/src/server/CMakeLists.txt
File be/src/server/CMakeLists.txt:

PS1, Line 24: add_library(ThriftAcceptQueueThreadedServer
            :     TAcceptQueueThreadedServer.cpp
            :   )
            : add_dependencies(ThriftAcceptQueueThreadedServer thrift-deps)
> Was there a reason you wanted this as a separate library? I think it's fine
Done


http://gerrit.cloudera.org:8080/#/c/4519/1/be/src/server/TAcceptQueueThreadedServer.cpp
File be/src/server/TAcceptQueueThreadedServer.cpp:

Line 144:     inputTransportFactory_->getTransport(client);
> do you see much slowdown in synchronizing around these few lines that creat
Most of this is fast and synchronizing it doesn't affect performance much.


Line 164:     shared_ptr<Thread> thread = 
shared_ptr<Thread>(threadFactory_->newThread(runnable));
> Is this definitely thread-safe? I think this might be the bottleneck (or ma
The bottleneck is 'thread->start()' below. I'm fairly confident that's thread 
safe, as essentially all it does is create a new boost::thread.


Line 200:   ThreadPool<shared_ptr<TTransport>> acceptPool("server-accept", 
"accept-worker", 1, 10000,
> How did you arrive at 1 as the default number of worker threads? What happe
I saw the same behavior as you, where using more than 1 thread here was 
actually slower than just 1. I don't know why that's happening (Michael's fix 
for IMPALA-4026 currently under review doesn't seem to fix it), but given that 
just using 1 thread was enough to solve the problem for the repro in the JIRA, 
and given that only using one prevents us from having to worry about thread 
safety, it seems like the best choice, assuming that it also works in a real 
cluster repro. I could also make it configurable.


PS1, Line 211: Offer
> This can return false, right?
Done


PS1, Line 233: if (stop_) {
> shouldn't this be out of the while loop? to guarantee a proper cleanup.
Its not in the while loop.


Line 236:       acceptPool.DrainAndShutdown();
> I'm not sure we ever actually stop the thrift servers, so it might be moot 
Done


PS1, Line 252: stop_ = false
> why set stop_ back to false?
I guess so that it can be restarted? (this was copied from the thrift code, and 
I think its better to leave it as close to the original behavior as possible)


http://gerrit.cloudera.org:8080/#/c/4519/1/be/src/server/TAcceptQueueThreadedServer.h
File be/src/server/TAcceptQueueThreadedServer.h:

PS1, Line 1: // This file was copied from 
apache::thrift::server::TThreadedServer.cpp v0.9.0, with the
           : // significant changes noted inline below.
           : /*
           :  * Licensed to the Apache Software Foundation (ASF) under one
           :  * or more contributor license agreements. See the NOTICE file
           :  * distributed with this work for additional information
           :  * regarding copyright ownership. The ASF licenses this file
           :  * to you under the Apache License, Version 2.0 (the
           :  * "License"); you may not use this file except in compliance
           :  * with the License. You may obtain a copy of the License at
           :  *
           :  *   http://www.apache.org/licenses/LICENSE-2.0
           :  *
           :  * Unless required by applicable law or agreed to in writing,
           :  * software distributed under the License is distributed on an
           :  * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
           :  * KIND, either express or implied. See the License for the
           :  * specific language governing permissions and limitations
           :  * under the License.
           :  */
> AFAIK it's completely fine - copying from another ASF project is legit.
Done


PS1, Line 44: and effectively creating a new,
            :  * internally managed, accept queue.
> not exactly - the connections have been accepted, the next state is transpo
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/4519
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ie50e728974ef31a9d49132a0b3f7cde2a4f3356d
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tmarsh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <m...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tmarsh...@cloudera.com>
Gerrit-HasComments: Yes

Reply via email to