[ 
https://issues.apache.org/jira/browse/DRILL-5050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15685263#comment-15685263
 ] 

ASF GitHub Bot commented on DRILL-5050:
---------------------------------------

Github user parthchandra commented on a diff in the pull request:

    https://github.com/apache/drill/pull/659#discussion_r88991123
  
    --- Diff: contrib/native/client/readme.boost ---
    @@ -0,0 +1,53 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +Building Boost for Drill on MacOs/Linux
    +--------------------------------
    +
    +These instructions are using Boost version 1.60.0 which is recommended
    +
    +Assuming there is a BOOST_BUILD_DIR 
    +
    +$ cd $BOOST_BUILD_DIR
    +$ tar zxf boost_1_60_0.tar.gz
    +$ cd $BOOST_BUILD_DIR/boost_1_60_0
    +$ ./bootstrap.sh --prefix=$BOOST_BUILD_DIR/boost_1_60_0/
    +$ ./b2 tools/bcp
    +$ cd $BOOST_BUILD_DIR/drill_boost_1_60_0
    +
    +# Use boost bcp to rename the boost namespace to drill_boost
    +# the following builds a subset of boost without icu. You may need to add more modules to include icu.
    +# bcp documentation can be found here: http://www.boost.org/doc/libs/1_60_0/tools/bcp/doc/html/index.html
    +
    +$ $BOOST_BUILD_DIR/boost_1_60_0/dist/bin/bcp --namespace=drill_boost --namespace-alias --boost=$BOOST_BUILD_DIR/boost_1_60_0/ shared_ptr random context chrono date_time regex system timer thread asio smart_ptr bind config build regex config assign $BOOST_BUILD_DIR/drill_boost_1_60_0
    --- End diff --
    
    Good catch. I started with the required components, let bcp figure out the
    dependencies, compiled the client (using only the files included by bcp),
    and added the dependencies that bcp didn't find. So multiprecision and
    functional are probably already included, but it's better to list them
    explicitly.



> C++ client library has symbol resolution issues when loaded by a process that 
> already uses boost::asio
> ------------------------------------------------------------------------------------------------------
>
>                 Key: DRILL-5050
>                 URL: https://issues.apache.org/jira/browse/DRILL-5050
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Client - C++
>    Affects Versions: 1.6.0
>         Environment: MacOs
>            Reporter: Parth Chandra
>            Assignee: Parth Chandra
>             Fix For: 2.0.0
>
>
> h4. Summary
> On MacOS, the Drill ODBC driver hangs when loaded by any process that might 
> also be using {{boost::asio}}. This is observed in trying to connect to Drill 
> via the ODBC driver using Tableau.
> h4. Analysis
> The problem is seen in the Drill client library on MacOS. In the method 
> {code}
>  DrillClientImpl::recvHandshake
> .
> .
>     m_io_service.reset();
>     if (DrillClientConfig::getHandshakeTimeout() > 0){
>         m_deadlineTimer.expires_from_now(boost::posix_time::seconds(DrillClientConfig::getHandshakeTimeout()));
>         m_deadlineTimer.async_wait(boost::bind(
>                     &DrillClientImpl::handleHShakeReadTimeout,
>                     this,
>                     boost::asio::placeholders::error
>                     ));
>         DRILL_MT_LOG(DRILL_LOG(LOG_TRACE) << "Started new handshake wait timer with "
>                 << DrillClientConfig::getHandshakeTimeout() << " seconds." << std::endl;)
>     }
>     async_read(
>             this->m_socket,
>             boost::asio::buffer(m_rbuf, LEN_PREFIX_BUFLEN),
>             boost::bind(
>                 &DrillClientImpl::handleHandshake,
>                 this,
>                 m_rbuf,
>                 boost::asio::placeholders::error,
>                 boost::asio::placeholders::bytes_transferred)
>             );
>     DRILL_MT_LOG(DRILL_LOG(LOG_DEBUG) << "DrillClientImpl::recvHandshake: async read waiting for server handshake response.\n";)
>     m_io_service.run();
> .
> .
> {code}
> The call to {{io_service::run}} returns without invoking any of the handlers 
> that have been registered. The {{io_service}} object has two tasks in its 
> queue, the timer task, and the socket read task. However, in the run method, 
> the state of the {{io_service}} object appears to change and the number of 
> outstanding tasks becomes zero. The run method therefore returns immediately. 
> Subsequently, any query request sent to the server hangs as data is never 
> pulled off the socket.
> This is bizarre behaviour and typically points to build problems. 
> Further investigation revealed something more interesting. {{boost::asio}} 
> is a header-only library. In other words, there is no actual library 
> {{libboost_asio}}. All the code is compiled into any binary that includes 
> the {{boost::asio}} headers. It so happens that the Tableau process has a 
> library (libtabquery) that uses {{boost::asio}}, so the code for 
> {{boost::asio}} is already loaded into process memory. When the drill client 
> library is loaded by the loader (via the ODBC driver), it brings its own 
> copy of the {{boost::asio}} code. At runtime, the drill client code jumps to 
> an address that resolves to an address inside the libtabquery copy of 
> {{boost::asio}}. And that code returns incorrectly.
> Really? How is that even allowed? Two copies of {{boost::asio}} in the same 
> process? Even if that is allowed, since the code is included at compile time, 
> calls to the {{boost::asio}} library should be resolved using internal 
> linkage. And if the call to {{boost::asio}} is not resolved statically, the 
> dynamic loader would encounter two symbols with the same name and would give 
> us an error. And even if the linker picks one of the symbols, as long as the 
> code is the same (for example if both libraries use the same version of 
> boost) can that cause a problem? Even more importantly, how do we fix that?
> h4. Some assembly required
> The disassembled libdrillClient shows this code inside recvHandshake
> {code}
> 000000000003dd8f    movq    -0xb0(%rbp), %rdi       
> 000000000003dd96    addq    $0xc0, %rdi
> 000000000003dd9d    callq   0x1bff42                ## symbol stub for: __ZN5boost4asio10io_service3runEv
> 000000000003dda2    movq    -0xb0(%rbp), %rdi
> 000000000003dda9    cmpq    $0x0, 0x190(%rdi)
> 000000000003ddb4    movq    %rax, -0x158(%rbp)
> {code}
> and later in the code 
> {code}
> 0000000000057216    retq    
> 0000000000057217    nopw    (%rax,%rax)
> __ZN5boost4asio10io_service3runEv:                 ## definition of io_service::run
> 0000000000057220    pushq   %rbp
> 0000000000057221    movq    %rsp, %rbp
> 0000000000057224    subq    $0x30, %rsp
> 0000000000057228    leaq    -0x18(%rbp), %rax
> 000000000005722c    movq    %rdi, -0x8(%rbp)        
> 0000000000057230    movq    -0x8(%rbp), %rdi
> 0000000000057234    movq    %rdi, -0x28(%rbp)
> {code}
> Note that in recvHandshake the call instruction targets the offset 
> 0x1bff42. This offset happens to be beyond the end of the library. It 
> certainly isn't the offset at which the io_service::run method is defined 
> (0x57220).
> The linker is definitely not resolving the address statically, but we had 
> already guessed that. It is, in fact, jumping to a stub, and at runtime the 
> stub's target is being resolved to the address of the {{io_service::run}} 
> method in libtabquery.
> Just to check, in the debugger, we can see the following two implementations 
> of {{io_service::run}} in the process
> {code}
> libtabquery.dylib`boost::asio::io_service::run():
>    0x10d597a10:  pushq  %rbp
>    0x10d597a11:  movq   %rsp, %rbp
>    0x10d597a14:  pushq  %rbx
>    0x10d597a15:  subq   $0x18, %rsp
>    0x10d597a19:  movq   %rdi, %rbx
>    0x10d597a1c:  movl   $0x0, -0x18(%rbp)
>    0x10d597a23:  callq  0x10d5b73a4               ; symbol stub for: boost::system::system_category()
>    0x10d597a28:  movq   %rax, -0x10(%rbp)
>    0x10d597a2c:  movq   0x8(%rbx), %rdi
>    0x10d597a30:  leaq   -0x18(%rbp), %rsi
>    0x10d597a34:  callq  0x10d5b71e2               ; symbol stub for: boost::asio::detail::task_io_service::run(boost::system::error_code&)
>    0x10d597a39:  cmpl   $0x0, -0x18(%rbp)
>    0x10d597a3d:  jne    0x10d597a46               ; boost::asio::io_service::run() + 54
>    0x10d597a3f:  addq   $0x18, %rsp
>    0x10d597a43:  popq   %rbx
>    0x10d597a44:  popq   %rbp
>    0x10d597a45:  retq   
>    0x10d597a46:  leaq   -0x18(%rbp), %rdi
>    0x10d597a4a:  callq  0x10d5b71a6               ; symbol stub for: boost::asio::detail::do_throw_error(boost::system::error_code const&)
>    0x10d597a4f:  nop        
> libdrillClient.dylib`boost::asio::io_service::run() at io_service.ipp:57:
>    0x11f158300:  pushq  %rbp
>    0x11f158301:  movq   %rsp, %rbp
>    0x11f158304:  subq   $0x30, %rsp
>    0x11f158308:  leaq   -0x18(%rbp), %rax
>    0x11f15830c:  movq   %rdi, -0x8(%rbp)
>    0x11f158310:  movq   -0x8(%rbp), %rdi
>    0x11f158314:  movq   %rdi, -0x28(%rbp)
>    0x11f158318:  movq   %rax, %rdi
>    0x11f15831b:  callq  0x11f2c210c               ; symbol stub for: boost::system::error_code::error_code()
>    0x11f158320:  leaq   -0x18(%rbp), %rsi
>    0x11f158324:  movq   -0x28(%rbp), %rax           
>    0x11f158328:  movq   0x8(%rax), %rdi
>    0x11f15832c:  callq  0x11f2c3516               ; symbol stub for: boost::asio::detail::task_io_service::run(boost::system::error_code&)
>    0x11f158331:  leaq   -0x18(%rbp), %rdi
>    0x11f158335:  movq   %rax, -0x20(%rbp)
>    0x11f158339:  callq  0x11f2c1bf6               ; symbol stub for: boost::asio::detail::throw_error(boost::system::error_code const&)
>    0x11f15833e:  movq   -0x20(%rbp), %rax
>    0x11f158342:  addq   $0x30, %rsp
>    0x11f158346:  popq   %rbp
>    0x11f158347:  retq   
> {code}
> As suspected, the code for the two versions of {{io_service::run}} is 
> different, so if the code is executing the wrong version, then the behaviour 
> will be, expectedly, unexpected.
> h4. What does not work
> Linking statically with boost has no effect. The code is inlined in the first 
> place and is effectively part of the dynamic library already. 
> Changing the load order of the libraries (by specifying 
> LD_LIBRARY_PATH/DYLD_LIBRARY_PATH) does not help. This is because the 
> application library is already loaded into the process.
> The linker -prebind flag does not help. The prebind flag is intended to tell 
> the linker to resolve all addresses at link time. Why this did not work is 
> not clear.
>  
> Both libtabquery.dylib and libdrillClient.dylib contain symbols (functions) 
> from the {{boost::asio}} package. At runtime, the MacOS loader binds the 
> drillClient library's calls to the functions defined in libtabquery. This 
> causes the code to behave unpredictably and eventually the ODBC driver 
> 'hangs' waiting for data from the server.
>  
> Because the symbol linkage is being determined at runtime, changing the 
> linker settings in the Drill client build has no effect. This is true even if 
> you build with static linkage (a remarkable feature of MacOS!). Also, the 
> boost builds between libtabquery and libdrillClient are different even if we 
> use the same boost version; the compiled code is different. This is a 
> critical part of the problem because if the compiled code were the same there 
> would be no problem if the code was called using the libtabquery version 
> instead of the libdrillClient version.
>  
> h4. Solution
> The only way to resolve this is to use a 'shaded' version of boost in the 
> drill client library. Luckily for us, C++ namespaces, boost's bcp tool, and 
> CMake together provide a way to rename the boost namespace to any name we 
> like and use it in the drill client code. This places every boost symbol in 
> a new namespace, so each one gets a different name and the symbol name 
> conflict does not arise.
> Using this build of boost, and using static linking (just to make sure) in 
> the Drill client library, one is able to connect to and run queries against 
> Drill from Tableau.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)