We see errors converting pycapnp builders to C++ capnp builders on Ubuntu 16.04 when using our Python extensions compiled under the "manylinux" environment (CentOS-6.8 with gcc 4.8.2).

We pass a pycapnp builder to C++ using this code: https://github.com/numenta/nupic.core/blob/064f8b1ef003d5ee07405cd5ac41583f83ab1d35/src/nupic/py_support/PyCapnp.hpp#L71

When we cast the schema parser to `pycapnp_SchemaParser*` and dereference the `thisptr` attribute, the values appear bogus, suggesting an incorrect cast. Pycapnp was installed on Ubuntu 16.04, where it builds its extensions and capnproto using gcc 5.4.0. Is it possible that the SchemaParser or SchemaLoader struct from the pycapnp extension built with gcc 5.4.0 has a different alignment/layout than expected by the cast in the NuPIC C extension compiled with gcc 4.8.2?

More details...

First, an overview of capnp's integration into nupic and nupic.bindings:

The `nupic` pure python package gets capnp via the `pycapnp==0.5.8` package, which contains its own version of compiled `capnproto` sources. `nupic.bindings`, a python extension built in `nupic.core`, includes its own version of capnp 0.5.3 sources compiled into the extension's shared libraries, such as `_algorithms.so`, `_math.so`, etc. Nupic.bindings contains the C++ implementation of classes and supporting logic used by nupic.

So, when nupic is used, there are two versions of compiled capnproto in play: one from pycapnp imported by nupic, and another built into the nupic.bindings extension.

On the Ubuntu 16.04 system, pycapnp's capnproto C++ sources were compiled via gcc/g++ 5.4.0 during installation of pycapnp on that system. The capnproto C++ sources in nupic.bindings were compiled on CentOS-6.8 using gcc/g++ 4.8.2 during the build of the "manylinux" nupic.bindings wheel. Note that those toolchains are a MAJOR VERSION apart and the two extensions compile the capnproto C++ sources independently using their own sets of compiler/linker flags and options (not to mention that the two versions of capnp sources, although similar, might not be identical).
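
One way to confirm which toolchain each build actually used would be to compile a small probe into each extension and compare the results. The following is a minimal sketch, assuming GCC or clang with libstdc++. The names `describeToolchain` and `requireSameToolchain` are hypothetical and are not part of nupic.core or pycapnp. `__GNUC__`, `__GNUC_MINOR__` and `__GNUC_PATCHLEVEL__` are predefined compiler macros, and `_GLIBCXX_USE_CXX11_ABI` is the libstdc++ dual-ABI macro that first appeared with GCC 5. Whether that ABI switch has anything to do with the failure described here is not established.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical helper (not part of nupic.core or pycapnp): describes the
// compiler and libstdc++ ABI setting this translation unit was built with.
inline std::string describeToolchain()
{
  std::ostringstream out;
#if defined(__clang__)
  // clang also defines __GNUC__, so it has to be checked first.
  out << "clang " << __clang_major__ << "." << __clang_minor__;
#elif defined(__GNUC__)
  out << "gcc " << __GNUC__ << "." << __GNUC_MINOR__ << "."
      << __GNUC_PATCHLEVEL__;
#else
  out << "unknown compiler";
#endif
#if defined(_GLIBCXX_USE_CXX11_ABI)
  out << ", _GLIBCXX_USE_CXX11_ABI=" << _GLIBCXX_USE_CXX11_ABI;
#else
  out << ", _GLIBCXX_USE_CXX11_ABI undefined";
#endif
  return out.str();
}

// Hypothetical debug-build guard: `other` is a description obtained from the
// other extension; the assert fires if the two builds do not match.
inline void requireSameToolchain(const std::string& other)
{
  assert(other == describeToolchain() &&
         "capnp builds were compiled with different toolchains");
}
```

Each extension would need to expose its own `describeToolchain()` result (for example through a module-level function) for the comparison to be meaningful; a gcc 4.8 build would be expected to report the ABI macro as undefined.
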

When nupic wants to serialize a nupic.bindings-based object, nupic passes the python Builder object instantiated by pycapnp to the nupic.bindings python extension, whose C++ code extracts the C++ Builder from the python Builder. For example, in the case of the Random class, nupic.bindings' _math.so extracts the C++ RandomProto::Builder instance from the python Builder instance at https://github.com/numenta/nupic.core/blob/0.4.4/src/nupic/bindings/math.i#L374-L375, then passes the extracted builder instance to the C++ Random object's `write` method for serialization.

So, the nupic.bindings extension's shared libs pass C++ capnp objects instantiated by pycapnp's build of capnp to nupic.bindings-based methods that act on those capnp objects using methods in nupic.bindings' own build of capnproto. To reiterate, objects instantiated by pycapnp's build of capnproto are being operated on by methods in nupic.bindings' own build of capnproto code.

This integration happens to work when pycapnp and nupic.bindings are both compiled/linked on the same platform. It also seems to work when the two are compiled/linked with nearby versions of toolchains, such as pycapnp being built on Ubuntu 14.04 with gcc/g++ 4.8.4 and nupic.bindings being built on CentOS-6.8 with gcc/g++ 4.8.2.

However, the integration misbehaves when installed on Ubuntu Server 16.04. In this case, pycapnp==0.5.8 is built (as the result of installation from PyPI) on Ubuntu 16.04 by gcc/g++ 5.4.0, but the manylinux nupic.bindings wheel was built on CentOS-6.8 using gcc/g++ 4.8.2.

The detailed root-cause analysis is in https://github.com/numenta/nupic.core/issues/1013#issuecomment-235736477 (look for "ROOT-CAUSE ANALYSIS" in that GitHub issue). The short version of it is:

1. nupic.bindings extracts the C++ capnp Builder object from the python Builder that was instantiated by the pycapnp python extension.

nupic.bindings uses this function that's linked into _math.so to extract the C++ Builder object:

```
template<class T>
typename T::Builder getBuilder(PyObject* pyBuilder)
{
  PyObject* capnpModule = PyImport_AddModule("capnp.lib.capnp");
  PyObject* pySchemaParser = PyObject_GetAttrString(capnpModule,
                                                    "_global_schema_parser");
  pycapnp_SchemaParser* schemaParser = (pycapnp_SchemaParser*)pySchemaParser;
  schemaParser->thisptr->loadCompiledTypeAndDependencies<T>();

  pycapnp_DynamicStructBuilder* dynamicStruct =
      (pycapnp_DynamicStructBuilder*)pyBuilder;
  capnp::DynamicStruct::Builder& builder = dynamicStruct->thisptr;
  typename T::Builder proto = builder.as<T>();
  return proto;
}
```

2. The statement `schemaParser->thisptr->loadCompiledTypeAndDependencies<T>()` invokes the `capnp::SchemaParser::loadCompiledTypeAndDependencies()` method on `thisptr`, which is a pointer to the `capnp::SchemaParser` instance instantiated by pycapnp's capnp code.

3. However, because `nupic::getBuilder<RandomProto>` is compiled into nupic.bindings' python extension that includes its own version of capnp (in _math.so, in this case), the call to `capnp::SchemaParser::loadCompiledTypeAndDependencies<T>()` resolves to the capnp code in _math.so instead of the capnp code in the pycapnp build that instantiated this `capnp::SchemaParser` object.

4. This is where things get hairy: when we use gdb to examine the contents of the `capnp::SchemaLoader` referenced by the extracted `capnp::SchemaParser` (that was instantiated by pycapnp's capnp code) at the point where `capnp::SchemaLoader::loadNative` is called inside nupic.bindings' own build of capnp, we observe that the instance member contents don't make any sense.

There is apparently some mismatch taking place between the capnp::SchemaLoader object instantiated by pycapnp's capnp code (built with g++ 5.4.0) and the corresponding capnp::SchemaLoader class in the manylinux nupic.bindings wheel (built with g++ 4.8.2):

```
(gdb) p this
$17 = (capnp::SchemaLoader * const) 0x103cba0
(gdb) p *this
$18 = {impl = {mutex = {futex = 4031237736, static EXCLUSIVE_HELD = 2147483648, static EXCLUSIVE_REQUESTED = 1073741824, static SHARED_COUNT_MASK = 1073741823}, value = {
      disposer = 0x7ffff07208e8, ptr = 0xfffffffffffffffd}}}

or in hex like this:
(gdb) p/x *this
$26 = {impl = {mutex = {futex = 0xf047ce68, static EXCLUSIVE_HELD = 0x80000000, static EXCLUSIVE_REQUESTED = 0x40000000, static SHARED_COUNT_MASK = 0x3fffffff}, value = {
      disposer = 0x7ffff07208e8, ptr = 0xfffffffffffffffd}}}
```

In particular, we note that the instance member `mutex.futex` has an invalid value 0xf047ce68 (it should have been 0 at this point in the single-threaded execution); `impl.value.ptr` also has an invalid value of 0xfffffffffffffffd, where it should have been either null or a valid pointer. Subsequently, when `kj::Mutex::lock` attempts to lock the futex, the system call never returns, because of the bogus value in `mutex.futex`.
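
The layout-mismatch hypothesis can be illustrated in isolation. The sketch below uses two made-up structs, `LoaderAsBuiltByA` and `LoaderAsSeenByB`, which are not the real kj/capnp definitions and say nothing about what the actual layouts are. It only shows how a field that the writer set to 0 can read back as a non-zero value when the reader assumes a different member order for the same bytes.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical layouts, NOT the real capnp/kj classes. They stand in for
// "the same class as understood by two independently compiled builds".
struct LoaderAsBuiltByA {   // layout used by the code that creates the object
  std::uint64_t disposer;
  std::uint64_t ptr;
  std::uint32_t futex;
};

struct LoaderAsSeenByB {    // layout assumed by the code that reads the object
  std::uint32_t futex;
  std::uint64_t disposer;
  std::uint64_t ptr;
};

static_assert(sizeof(LoaderAsBuiltByA) == sizeof(LoaderAsSeenByB),
              "the demo assumes both views have the same size");

// Reinterprets the bytes of an A-layout object using the B layout. memcpy is
// used instead of a pointer cast so that the demo itself stays well-defined.
inline LoaderAsSeenByB readAsB(const LoaderAsBuiltByA& a)
{
  LoaderAsSeenByB b;
  std::memcpy(&b, &a, sizeof(b));
  return b;
}

// Builds an "unlocked" object with layout A and returns the futex value that
// code compiled against layout B would observe.
inline std::uint32_t futexSeenByB()
{
  LoaderAsBuiltByA a = {};
  a.disposer = 0x00007fffdeadbeefULL;  // arbitrary address-like bit pattern
  a.ptr = 0;
  a.futex = 0;                         // the writer's view: mutex is unlocked
  assert(a.futex == 0);
  // B's futex overlaps the first four bytes of A's disposer field, so the
  // reader sees part of that bit pattern instead of 0.
  return readAsB(a).futex;
}
```

In the real failure, the analogous effect would be `kj::Mutex::lock` seeing a non-zero futex word on an object that its creator considers unlocked. Whether member order, padding, or something else differs between the two builds is not something this sketch determines.
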
