[
https://issues.apache.org/jira/browse/DRILL-5050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Zelaine Fong reassigned DRILL-5050:
-----------------------------------
Assignee: Chunhui Shi (was: Parth Chandra)
Assigning to [~cshi] for review.
> C++ client library has symbol resolution issues when loaded by a process that
> already uses boost::asio
> ------------------------------------------------------------------------------------------------------
>
> Key: DRILL-5050
> URL: https://issues.apache.org/jira/browse/DRILL-5050
> Project: Apache Drill
> Issue Type: Bug
> Components: Client - C++
> Affects Versions: 1.6.0
> Environment: MacOs
> Reporter: Parth Chandra
> Assignee: Chunhui Shi
> Fix For: 2.0.0
>
>
> h4. Summary
> On MacOS, the Drill ODBC driver hangs when loaded by any process that
> also uses {{boost::asio}}. This was observed when connecting to Drill
> via the ODBC driver from Tableau.
> h4. Analysis
> The problem is seen in the Drill client library on MacOS. In the method
> {code}
> DrillClientImpl::recvHandshake
> .
> .
> m_io_service.reset();
> if (DrillClientConfig::getHandshakeTimeout() > 0){
>
> m_deadlineTimer.expires_from_now(boost::posix_time::seconds(DrillClientConfig::getHandshakeTimeout()));
> m_deadlineTimer.async_wait(boost::bind(
> &DrillClientImpl::handleHShakeReadTimeout,
> this,
> boost::asio::placeholders::error
> ));
> DRILL_MT_LOG(DRILL_LOG(LOG_TRACE) << "Started new handshake wait timer with "
> << DrillClientConfig::getHandshakeTimeout() << " seconds." << std::endl;)
> }
> async_read(
> this->m_socket,
> boost::asio::buffer(m_rbuf, LEN_PREFIX_BUFLEN),
> boost::bind(
> &DrillClientImpl::handleHandshake,
> this,
> m_rbuf,
> boost::asio::placeholders::error,
> boost::asio::placeholders::bytes_transferred)
> );
> DRILL_MT_LOG(DRILL_LOG(LOG_DEBUG)
> << "DrillClientImpl::recvHandshake: async read waiting for server handshake response.\n";)
> m_io_service.run();
> .
> .
> {code}
> The call to {{io_service::run}} returns without invoking any of the handlers
> that have been registered. The {{io_service}} object has two tasks in its
> queue, the timer task, and the socket read task. However, in the run method,
> the state of the {{io_service}} object appears to change and the number of
> outstanding tasks becomes zero. The run method therefore returns immediately.
> Subsequently, any query request sent to the server hangs as data is never
> pulled off the socket.
> This is bizarre behaviour and typically points to build problems.
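The symptom can be pictured with a toy event loop. This is only a sketch under stated assumptions: ToyIoService, drop_pending_work and run_handshake are hypothetical names, not part of boost::asio or Drill, and the real io_service bookkeeping is more involved. It shows why a run loop that sees zero outstanding tasks returns at once without invoking any handler.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>

// Toy stand-in for an event loop. This is NOT boost::asio; every name
// here is hypothetical and exists only for illustration.
class ToyIoService {
public:
    // Queue a handler, much as async_wait/async_read register pending work.
    void post(std::function<void()> handler) {
        assert(handler && "handler must be callable");
        pending_.push_back(std::move(handler));
    }

    // Simulate the outstanding-work count unexpectedly dropping to zero.
    void drop_pending_work() { pending_.clear(); }

    // Invoke handlers until none are outstanding; return how many ran.
    std::size_t run() {
        std::size_t executed = 0;
        while (!pending_.empty()) {
            std::function<void()> handler = std::move(pending_.front());
            pending_.pop_front();
            handler();
            ++executed;
        }
        return executed;
    }

private:
    std::deque<std::function<void()>> pending_;
};

// Registers two tasks (timer and socket read), optionally loses them,
// then runs the loop. Returns the number of handlers that were invoked.
inline std::size_t run_handshake(bool lose_work) {
    ToyIoService io;
    int invoked = 0;
    io.post([&invoked] { ++invoked; });  // stands in for the timer task
    io.post([&invoked] { ++invoked; });  // stands in for the socket read task
    if (lose_work) {
        io.drop_pending_work();
    }
    return io.run();
}
```

With lose_work set to false both handlers run. With it set to true, run() returns immediately and nothing ever reads from the socket, which matches the hang described above.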
> Further investigation revealed something more interesting. {{boost::asio}} is a
> header-only library. In other words, there is no actual library
> {{libboost_asio}}. All the code is compiled into whichever binary includes the
> headers of {{boost::asio}}. It so happens that the Tableau process has a
> library (libtabquery) that uses {{boost::asio}}, so the code for
> {{boost::asio}} is already loaded into process memory. When the loader loads the
> Drill client library (via the ODBC driver), the library brings
> its own copy of the {{boost::asio}} code. At runtime, the Drill
> client code jumps to an address that resolves to an address inside the
> libtabquery copy of {{boost::asio}}. And that code returns incorrectly.
> Really? How is that even allowed? Two copies of {{boost::asio}} in the same
> process? Even if that is allowed, since the code is included at compile time,
> calls to the {{boost::asio}} library should be resolved using internal
> linkage. And if the call to {{boost::asio}} is not resolved statically, the
> dynamic loader would encounter two symbols with the same name and would give
> us an error. And even if the linker picks one of the symbols, as long as the
> code is the same (for example if both libraries use the same version of
> boost) can that cause a problem? Even more importantly, how do we fix that?
> h4. Some assembly required
> The disassembled libdrillClient shows this code inside recvHandshake
> {code}
> 000000000003dd8f movq -0xb0(%rbp), %rdi
> 000000000003dd96 addq $0xc0, %rdi
> 000000000003dd9d callq 0x1bff42 ## symbol stub for: __ZN5boost4asio10io_service3runEv
> 000000000003dda2 movq -0xb0(%rbp), %rdi
> 000000000003dda9 cmpq $0x0, 0x190(%rdi)
> 000000000003ddb4 movq %rax, -0x158(%rbp)
> {code}
> and later in the code
> {code}
> 0000000000057216 retq
> 0000000000057217 nopw (%rax,%rax)
> __ZN5boost4asio10io_service3runEv: ## definition of io_service::run
> 0000000000057220 pushq %rbp
> 0000000000057221 movq %rsp, %rbp
> 0000000000057224 subq $0x30, %rsp
> 0000000000057228 leaq -0x18(%rbp), %rax
> 000000000005722c movq %rdi, -0x8(%rbp)
> 0000000000057230 movq -0x8(%rbp), %rdi
> 0000000000057234 movq %rdi, -0x28(%rbp)
> {code}
> Note that in recvHandshake the call instruction jumps to an address that is
> an offset (0x1bff42). This offset happens to be beyond the end of the
> library. It certainly isn't the offset at which the io_service::run method is
> defined (0x57220).
> The linker is definitely not resolving the address statically, but we had
> already guessed that. It is, in fact, jumping to a stub method and at
> runtime this address is being resolved to the address of the
> {{io_service::run}} method in libtabquery.
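The effect of binding through a stub at runtime can be modelled with a small lookup table. This is a sketch, assuming a simplified rule in which the first image to export a name wins. ToyLoader, export_symbol, bind and the two run_from_* functions are hypothetical, and this is not the actual loader algorithm.

```cpp
#include <cassert>
#include <map>
#include <string>

using RunFn = int (*)();

// Two hypothetical builds of the "same" function, one per loaded image.
// The return values only identify which copy was called.
inline int run_from_libtabquery() { return 1; }
inline int run_from_libdrillClient() { return 2; }

// Toy process-wide symbol table. Assumption for illustration only:
// the first image to export a given name keeps it.
class ToyLoader {
public:
    void export_symbol(const std::string& name, RunFn fn) {
        table_.emplace(name, fn);  // emplace leaves an existing entry untouched
    }

    // Resolve a stub to whatever definition the table holds for the name.
    RunFn bind(const std::string& name) const {
        auto it = table_.find(name);
        return it == table_.end() ? nullptr : it->second;
    }

private:
    std::map<std::string, RunFn> table_;
};

// Models the call site in recvHandshake: the call goes through a stub
// that is bound by name at runtime, not to the local definition.
inline int call_run_via_stub() {
    ToyLoader loader;
    const std::string symbol = "__ZN5boost4asio10io_service3runEv";
    loader.export_symbol(symbol, &run_from_libtabquery);     // already in the process
    loader.export_symbol(symbol, &run_from_libdrillClient);  // loaded later
    RunFn stub_target = loader.bind(symbol);
    assert(stub_target != nullptr);
    return stub_target();
}
```

In this model call_run_via_stub() ends up in the libtabquery copy even though the caller carries its own definition, because the lookup is by name only.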
> Just to check, in the debugger, we can see the following two implementations
> of {{io_service::run}} in the process
> {code}
> libtabquery.dylib`boost::asio::io_service::run():
> 0x10d597a10: pushq %rbp
> 0x10d597a11: movq %rsp, %rbp
> 0x10d597a14: pushq %rbx
> 0x10d597a15: subq $0x18, %rsp
> 0x10d597a19: movq %rdi, %rbx
> 0x10d597a1c: movl $0x0, -0x18(%rbp)
> 0x10d597a23: callq 0x10d5b73a4 ; symbol stub for: boost::system::system_category()
> 0x10d597a28: movq %rax, -0x10(%rbp)
> 0x10d597a2c: movq 0x8(%rbx), %rdi
> 0x10d597a30: leaq -0x18(%rbp), %rsi
> 0x10d597a34: callq 0x10d5b71e2 ; symbol stub for: boost::asio::detail::task_io_service::run(boost::system::error_code&)
> 0x10d597a39: cmpl $0x0, -0x18(%rbp)
> 0x10d597a3d: jne 0x10d597a46 ; boost::asio::io_service::run() + 54
> 0x10d597a3f: addq $0x18, %rsp
> 0x10d597a43: popq %rbx
> 0x10d597a44: popq %rbp
> 0x10d597a45: retq
> 0x10d597a46: leaq -0x18(%rbp), %rdi
> 0x10d597a4a: callq 0x10d5b71a6 ; symbol stub for: boost::asio::detail::do_throw_error(boost::system::error_code const&)
> 0x10d597a4f: nop
> libdrillClient.dylib`boost::asio::io_service::run() at io_service.ipp:57:
> 0x11f158300: pushq %rbp
> 0x11f158301: movq %rsp, %rbp
> 0x11f158304: subq $0x30, %rsp
> 0x11f158308: leaq -0x18(%rbp), %rax
> 0x11f15830c: movq %rdi, -0x8(%rbp)
> 0x11f158310: movq -0x8(%rbp), %rdi
> 0x11f158314: movq %rdi, -0x28(%rbp)
> 0x11f158318: movq %rax, %rdi
> 0x11f15831b: callq 0x11f2c210c ; symbol stub for: boost::system::error_code::error_code()
> 0x11f158320: leaq -0x18(%rbp), %rsi
> 0x11f158324: movq -0x28(%rbp), %rax
> 0x11f158328: movq 0x8(%rax), %rdi
> 0x11f15832c: callq 0x11f2c3516 ; symbol stub for: boost::asio::detail::task_io_service::run(boost::system::error_code&)
> 0x11f158331: leaq -0x18(%rbp), %rdi
> 0x11f158335: movq %rax, -0x20(%rbp)
> 0x11f158339: callq 0x11f2c1bf6 ; symbol stub for: boost::asio::detail::throw_error(boost::system::error_code const&)
> 0x11f15833e: movq -0x20(%rbp), %rax
> 0x11f158342: addq $0x30, %rsp
> 0x11f158346: popq %rbp
> 0x11f158347: retq
> {code}
> As suspected, the code for the two versions of {{io_service::run}} is
> different, so if the code is executing the wrong version, then the behaviour
> will be, expectedly, unexpected.
> h4. What does not work
> Linking statically with boost has no effect. The code is inlined in the first
> place and is effectively part of the dynamic library already.
> Changing the load order of the libraries (by specifying
> LD_LIBRARY_PATH/DYLD_LIBRARY_PATH) does not help. This is because the
> application library is already loaded into the process.
> The linker -prebind flag does not help. The prebind flag is intended to tell
> the linker to resolve all addresses at link time. Why this did not work is
> not clear.
>
> Both libtabquery.dylib and libdrillClient.dylib contain symbols (functions)
> from the {{boost::asio}} package. At runtime, the MacOS loader binds the
> drillClient library's calls to the functions defined in libtabquery. This causes
> the code to behave unpredictably and eventually the ODBC driver 'hangs'
> waiting for data from the server.
>
> Because the symbol linkage is being determined at runtime, changing the
> linker settings in the Drill client build has no effect. This is true even if
> you build with static linkage (a remarkable feature of MacOS!). Also, the
> boost builds between libtabquery and libdrillClient are different even if we
> use the same boost version; the compiled code is different. This is a
> critical part of the problem because if the compiled code were the same there
> would be no problem if the code was called using the libtabquery version
> instead of the libdrillClient version.
>
> h4. Solution
> The only way to resolve this is to use a 'shaded' version of boost in the
> Drill client library. Luckily for us, C++ namespaces, boost's bcp tool, and
> CMake together provide a way to rename the boost namespace to any name we
> like and use it in the Drill client code. This effectively renames every
> boost symbol by placing it in a new namespace, so the
> symbol name conflict does not arise.
> Using this build of boost, and using static linking (just to make sure) in
> the Drill client library, one is able to connect to and run queries against
> Drill from Tableau.
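The idea behind shading can be sketched in a few lines of C++. This is a minimal illustration, not the actual Drill build: toy_boost, toy_drill_boost and client_boost are hypothetical namespaces standing in for the original and the renamed copy of boost. Because the namespace is part of every qualified name, the renamed copy yields distinct types and distinct symbol names, so a lookup by name can no longer land in another library's definition.

```cpp
#include <cassert>
#include <typeinfo>

// Hypothetical stand-in for the boost copy another library already uses.
namespace toy_boost { namespace asio {
struct io_service {
    int run() { return 1; }
};
}}  // namespace toy_boost::asio

// Hypothetical stand-in for the same code rebuilt under a renamed namespace.
namespace toy_drill_boost { namespace asio {
struct io_service {
    int run() { return 2; }
};
}}  // namespace toy_drill_boost::asio

// The two io_service types are unrelated as far as the language is
// concerned, so their member functions do not share a symbol name.
inline bool shaded_type_is_distinct() {
    return typeid(toy_boost::asio::io_service) !=
           typeid(toy_drill_boost::asio::io_service);
}

// Client code picks the shaded copy once, through a namespace alias,
// and is otherwise written exactly as before.
namespace client_boost = toy_drill_boost;

inline int client_run() {
    client_boost::asio::io_service io;
    int result = io.run();
    assert(result == 2);  // the shaded definition ran, not toy_boost's
    return result;
}
```

The alias keeps the change local: only the line that selects the namespace differs between a shaded and an unshaded build of the client code.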
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)