Hi Jason, here is one thing that should probably be corrected in pycapnp. While 
debugging these issues, I discovered that neither the nupic.bindings nor the 
pycapnp extensions were following Python extension best practices for symbol 
visibility. All symbols in both were visible and could therefore preempt 
identically named symbols in other extension modules (and possibly in the 
Python runtime), and vice versa, depending on the order in which the 
extensions were loaded. This was in fact taking place between pycapnp and 
nupic.bindings in both directions, depending on which one was imported first.

https://docs.python.org/3.0/extending/extending.html#providing-a-c-api-for-an-extension-module
 stipulates that all symbols implemented in a Python extension must be hidden, 
except for the extension initialization function.

In nupic.bindings, I addressed this using the combination of visibility 
compilation flags that also enhance code optimization (see 
https://github.com/numenta/nupic.core/blob/affa66e6ce87d92ebd23d2ffe99270e9f89f8a27/CommonCompilerConfig.cmake#L258-L259)
 and an export map as a catch-all for other cases over which I had no control, 
such as the statically-linked libstdc++ (see 
https://github.com/numenta/nupic.core/blob/b36835251bdffba6b367ef8df39e93df9866f785/src/CMakeLists.txt#L771-L795).
 The export map is a good safeguard that I recommend using in addition to the 
visibility control compilation flags.
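
To make the visibility point concrete, here is a minimal sketch of an extension-style translation unit in which only one entry point is exported. The names `EXT_EXPORT`, `helperInternal`, `helperShared` and `init_example_extension` are hypothetical and come from neither pycapnp nor nupic.bindings. The sketch assumes a GCC-compatible compiler and a build that passes `-fvisibility=hidden -fvisibility-inlines-hidden`; whether those are exactly the flags at the linked CMake lines is an assumption on my part.

```cpp
// Hypothetical extension source. Assumed build line (not taken from either project):
//   g++ -shared -fPIC -fvisibility=hidden -fvisibility-inlines-hidden example.cpp

// Marks the few symbols that must remain visible to the dynamic loader.
#if defined(__GNUC__)
#  define EXT_EXPORT __attribute__((visibility("default")))
#else
#  define EXT_EXPORT
#endif

namespace {
// Internal linkage: this function never reaches the dynamic symbol table.
int helperInternal(int x) { return x * 2; }
}  // namespace

// External linkage, but hidden when compiled with -fvisibility=hidden, so it
// cannot preempt (or be preempted by) a same-named symbol in another module.
int helperShared(int x) { return helperInternal(x) + 1; }

// Stand-in for the module initialization function, the only exported symbol.
extern "C" EXT_EXPORT int init_example_extension() { return helperShared(20); }
```

An export map then acts as the second line of defense: with GNU ld, a version script that lists the initialization function under `global:` and `*` under `local:` also hides symbols pulled in from static libraries that were not compiled with the flags above.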

Best,
Vitaly

From: Vitaly Kruglikov
Subject: Re: [capnproto] Errors converting pycapnp builders to C++ capnp 
builders on Ubuntu 16.04 when using our manylinux Python extensions

Hi Jason, thank you for following up. Scott and I are also thinking about a 
long-term solution for this. In the meantime, I root-caused the issue. It was 
the confluence of runtime symbol preemption and C++ ABI incompatibility between 
the manylinux build environment and the eventual runtime environment (Ubuntu 
16.04), where pycapnp’s capnproto code is compiled. A detailed analysis may be 
found in 
https://discourse.numenta.org/t/segmentation-fault-while-running-basic-swarm/877/23?u=vkruglikov.
 And my short-term solution is described in 
https://discourse.numenta.org/t/segmentation-fault-while-running-basic-swarm/877/24?u=vkruglikov.

Best,
Vitaly

From: Jason Paryani
Subject: Re: [capnproto] Errors converting pycapnp builders to C++ capnp 
builders on Ubuntu 16.04 when using our manylinux Python extensions

Hey Vitaly,

I'm not sure what the best solution here is. Ideally, we want both pycapnp and 
nupic to always link to the exact same compiled version of C++ Cap'n Proto.

Perhaps it should be pycapnp's responsibility to always bundle a compiled 
distribution of C++ Cap'n Proto, complete with headers. Then in nupic's build 
process, it could call some pycapnp function that supplies said directories 
(similar to how numpy.get_include() works). Does that sound reasonable to you?

On Thu, Jul 28, 2016 at 6:17 PM, vitaly numenta <[email protected]> wrote:
We see errors converting pycapnp builders to C++ capnp builders on Ubuntu 16.04
when using our Python extensions compiled under the "manylinux" environment
(CentOS-6.8 with gcc 4.8.2).

We pass a pycapnp builder to C++ using this code:
https://github.com/numenta/nupic.core/blob/064f8b1ef003d5ee07405cd5ac41583f83ab1d35/src/nupic/py_support/PyCapnp.hpp#L71

When we cast the schema parser to `pycapnp_SchemaParser*` and deref the
`thisptr` attribute, the values appear bogus, suggesting an incorrect cast.

Pycapnp was installed on Ubuntu 16.04 and builds the extensions and capnproto
using gcc 5.4.0.

Is it possible that the SchemaParser or SchemaLoader struct from the pycapnp
extension built with gcc 5.4.0 has different alignment/layout than expected by
the cast in the NuPIC C extension compiled with gcc 4.8.2?


More details...

First, an overview of capnp's integration into nupic and nupic.bindings: The
`nupic` pure python package gets capnp via the `pycapnp==0.5.8` package, which
contains its own version of compiled `capnproto` sources. `nupic.bindings`, a
python extension built in `nupic.core`, includes its own version of capnp 0.5.3
sources compiled into the extension's shared libraries, such as
`_algorithms.so`, `_math.so`, etc. Nupic.bindings contains the C++
implementation of classes and supporting logic used by nupic.

So, when nupic is used, there are two versions of compiled capnproto in play:
one from pycapnp imported by nupic, and another built into the nupic.bindings
extension. On the Ubuntu 16.04 system, pycapnp's capnproto C++ sources were
compiled via gcc/g++ 5.4.0 during installation of pycapnp on that system. The
capnproto C++ sources in nupic.bindings were compiled on CentOS-6.8 using
gcc/g++ 4.8.2 during the build of the "manylinux" nupic.bindings wheel. Note
that those toolchains are a MAJOR VERSION apart and the two extensions compile
the capnproto C++ sources independently using their own sets of compiler/linker
flags and options (not to mention that the two versions of capnp sources,
although similar, might not be identical).
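
One way to compare the two builds is to have each one report the compiler and ABI macros it was compiled with. The following is a minimal sketch, not something either project ships; `toolchainFingerprint` is a hypothetical helper name. It relies on the predefined GCC macros `__GNUC__` and `__GXX_ABI_VERSION`, and on `_GLIBCXX_USE_CXX11_ABI`, which libstdc++ defines starting with GCC 5 to select between its old and new `std::string`/`std::list` ABI. The thread does not establish which of these, if any, accounts for the failure.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: summarizes the toolchain/ABI macros of the build that
// compiled this translation unit, so two extensions can be compared.
std::string toolchainFingerprint() {
    std::ostringstream out;
#if defined(__GNUC__)
    // Note: Clang also defines these macros for GCC compatibility.
    out << "gcc=" << __GNUC__ << "." << __GNUC_MINOR__ << "."
        << __GNUC_PATCHLEVEL__;
#else
    out << "gcc=n/a";
#endif
#if defined(__GXX_ABI_VERSION)
    out << " gxx_abi=" << __GXX_ABI_VERSION;
#endif
#if defined(_GLIBCXX_USE_CXX11_ABI)
    // Defined by libstdc++ headers from GCC 5 onward (1 = new ABI, 0 = old).
    out << " cxx11_abi=" << _GLIBCXX_USE_CXX11_ABI;
#else
    out << " cxx11_abi=undefined";
#endif
    return out.str();
}
```

Compiling this into each extension and printing both results would show at a glance whether the two copies of capnproto were built under different toolchain and ABI settings.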

When nupic wants to serialize a nupic.bindings-based object, nupic passes the
python Builder object instantiated by pycapnp to the nupic.bindings python
extension, whose C++ code extracts the C++ Builder from the python Builder. For
example, in the case of the Random class, nupic.bindings' _math.so extracts the
C++ RandomProto::Builder instance from the python Builder instance at
https://github.com/numenta/nupic.core/blob/0.4.4/src/nupic/bindings/math.i#L374-L375,
then passes the extracted builder instance to the C++ Random object's
`write` method for serialization.

So, the nupic.bindings extension's shared libs pass C++ capnp objects
instantiated by pycapnp's build of capnp to nupic.bindings-based methods that
act on those capnp objects using methods in nupic.bindings' own build of
capnproto. To reiterate, objects instantiated by pycapnp's build of capnproto
are being operated on by methods in nupic.bindings' own build of capnproto code.

This integration happens to work when pycapnp and nupic.bindings are both
compiled/linked on the same platform. Also, it seems to work when the two are
compiled/linked with nearby versions of toolchains, such as pycapnp being built
on Ubuntu 14.04 with gcc/g++ 4.8.4 and nupic.bindings being built on CentOS-6.8
with gcc/g++ 4.8.2.

However, the integration misbehaves when installed on Ubuntu Server 16.04. In
this case, pycapnp==0.5.8 is built (as the result of installation from PyPI) on
Ubuntu 16.04 by gcc/g++ 5.4.0, but the manylinux nupic.bindings wheel was built
on CentOS-6.8 using gcc/g++ 4.8.2. The detailed root-cause analysis is in
https://github.com/numenta/nupic.core/issues/1013#issuecomment-235736477 (look
for "ROOT-CAUSE ANALYSIS" in that github issue). The short version of it is:

1. nupic.bindings extracts the C++ capnp Builder object from the python Builder
that was instantiated by the pycapnp python extension. nupic.bindings uses this
function, which is linked into _math.so, to extract the C++ Builder object:

```
template<class T> typename T::Builder getBuilder(PyObject* pyBuilder)
{
    // Look up the module-level schema parser owned by pycapnp.
    PyObject* capnpModule = PyImport_AddModule("capnp.lib.capnp");
    PyObject* pySchemaParser = PyObject_GetAttrString(capnpModule,
                                                      "_global_schema_parser");

    // Reinterpret the Python object as pycapnp's extension type and register T.
    pycapnp_SchemaParser* schemaParser = (pycapnp_SchemaParser*)pySchemaParser;
    schemaParser->thisptr->loadCompiledTypeAndDependencies<T>();

    // Extract the C++ builder held by the pycapnp object and convert it to T.
    pycapnp_DynamicStructBuilder* dynamicStruct =
        (pycapnp_DynamicStructBuilder*)pyBuilder;
    capnp::DynamicStruct::Builder& builder = dynamicStruct->thisptr;
    typename T::Builder proto = builder.as<T>();
    return proto;
}
```

2. The statement `schemaParser->thisptr->loadCompiledTypeAndDependencies<T>()`
invokes the `capnp::SchemaParser::loadCompiledTypeAndDependencies()` method on
`thisptr`, which is a pointer to the `capnp::SchemaParser` instance
instantiated by pycapnp's capnp code.

3. However, because `nupic::getBuilder<RandomProto>` is compiled into
nupic.bindings' python extension that includes its own version of capnp (in
_math.so, in this case), the call to
`capnp::SchemaParser::loadCompiledTypeAndDependencies<T>()` resolved to capnp in
_math.so, instead of the capnp code in the pycapnp build that instantiated this
`capnp::SchemaParser` object.

4. This is where things get hairy: when we use gdb to examine the contents of
the `capnp::SchemaLoader` referenced by the extracted `capnp::SchemaParser`
(that was instantiated by pycapnp's capnp code) at the point where
`capnp::SchemaLoader::loadNative` is called inside the nupic.bindings's own
build of capnp, we observe that the instance member contents don't make any
sense. There is apparently some mismatch taking place between the
capnp::SchemaLoader object instantiated by pycapnp's capnp code (built with g++
5.4.0) and the corresponding capnp::SchemaLoader class in the manylinux
nupic.bindings wheel (built with g++ 4.8.2):

```
(gdb) p this
$17 = (capnp::SchemaLoader * const) 0x103cba0
(gdb) p *this
$18 = {impl = {mutex = {futex = 4031237736, static EXCLUSIVE_HELD = 2147483648,
static EXCLUSIVE_REQUESTED = 1073741824, static SHARED_COUNT_MASK = 1073741823},
value = {disposer = 0x7ffff07208e8, ptr = 0xfffffffffffffffd}}}

or in hex like this:

(gdb) p/x *this
$26 = {impl = {mutex = {futex = 0xf047ce68,
static EXCLUSIVE_HELD = 0x80000000, static EXCLUSIVE_REQUESTED = 0x40000000,
static SHARED_COUNT_MASK = 0x3fffffff}, value = {disposer = 0x7ffff07208e8,
ptr = 0xfffffffffffffffd}}}
```

In particular, we note that the instance member `mutex.futex` has an invalid
value 0xf047ce68 (it should have been 0 at this point in the single-threaded
execution); impl.value.ptr also has an invalid value of 0xfffffffffffffffd - it
should have been either null or a valid pointer. Subsequently, when
kj::Mutex::lock attempts to lock the futex, the system call never returns,
because of the bogus value in mutex.futex.
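
As a rough illustration of the kind of mismatch hypothesized above, the sketch below shows how an object written under one layout produces nonsensical member values when read under a different layout of the "same" type. The two structs are hypothetical stand-ins and do NOT reflect the real `capnp::SchemaLoader` in either build. The sketch copies bytes with `std::memcpy` so that it stays well-defined, whereas the real failure involved code from one build operating directly on an object from the other.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical layout used by the build that CREATED the object.
struct LoaderAsBuilt {
    std::uint64_t handle;  // e.g. a pointer-sized value stored first
    std::uint32_t futex;   // lock word, 0 when unlocked
    std::uint32_t flags;
};

// Hypothetical layout assumed by the build that READS the object.
struct LoaderAsRead {
    std::uint32_t futex;   // this build expects the lock word first
    std::uint32_t flags;
    std::uint64_t handle;
};

static_assert(sizeof(LoaderAsBuilt) == sizeof(LoaderAsRead),
              "sketch assumes both layouts occupy the same number of bytes");

// Reinterprets the bytes of `made` under the reader's layout and returns the
// value the reader would see as its lock word.
std::uint32_t futexSeenByOtherBuild(const LoaderAsBuilt& made) {
    LoaderAsRead seen;
    std::memcpy(&seen, &made, sizeof seen);
    return seen.futex;
}
```

With `LoaderAsBuilt made{0x00007ffff07208e8ULL, 0u, 0u}`, the creator's `futex` is 0, yet `futexSeenByOtherBuild(made)` returns a nonzero value taken from part of `handle`. A lock routine handed such a value would treat the mutex as already held.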

--
You received this message because you are subscribed to the Google Groups 
"Cap'n Proto" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
Visit this group at https://groups.google.com/group/capnproto.
