kevingurney commented on code in PR #34563:
URL: https://github.com/apache/arrow/pull/34563#discussion_r1167214786


##########
matlab/tools/cmake/BuildMatlabArrowInterface.cmake:
##########
@@ -0,0 +1,80 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+# -------
+# Config
+# -------
+
+# Build configuration for libmexclass.
+set(CUSTOM_PROXY_FACTORY_INCLUDE_DIR 
"${CMAKE_SOURCE_DIR}/src/cpp/arrow/matlab/proxy;${CMAKE_SOURCE_DIR}/src/cpp")
+set(CUSTOM_PROXY_FACTORY_SOURCES 
"${CMAKE_SOURCE_DIR}/src/cpp/arrow/matlab/proxy/factory.cc")
+set(CUSTOM_PROXY_SOURCES 
"${CMAKE_SOURCE_DIR}/src/cpp/arrow/matlab/array/proxy/float64_array.cc")
+set(CUSTOM_PROXY_INCLUDE_DIR 
"${CMAKE_SOURCE_DIR}/src/cpp;${ARROW_INCLUDE_DIR}")
+set(CUSTOM_PROXY_LINK_LIBRARIES ${ARROW_LINK_LIB})
+# On Windows, arrow.dll must be installed regardless of
+# whether Arrow_FOUND is true or false. Therefore, we explicitly
+# copy ARROW_SHARED_LIB to the installation folder +libmexclass/+proxy.
+set(CUSTOM_PROXY_RUNTIME_LIBRARIES ${ARROW_SHARED_LIB})
+set(CUSTOM_PROXY_FACTORY_HEADER_FILENAME "factory.h")
+set(CUSTOM_PROXY_FACTORY_CLASS_NAME "arrow::matlab::proxy::Factory")

Review Comment:
   **Update**: We think we've identified the issue!
   
   ---
   
   ## Overview
   
   After a lot of debugging and research, we realized that when the Arrow C++ 
libraries (i.e. `arrow.dll`) are built in `Debug` mode, they are linked against 
a `Debug` version of the [MSVC 
Runtime](https://learn.microsoft.com/en-us/cpp/c-runtime-library/crt-library-features)
 (i.e. `VCRUNTIME140D.dll` - note the **D** for **D**ebug in the DLL name).
   
   At the same time, `mexclass.dll`, `arrowproxy.dll`, and `gateway.mexw64` are 
all built by default against a non-`Debug` version of the MSVC Runtime (i.e. 
`VCRUNTIME140.dll` - note the lack of **D** in the DLL name).
   
   ---
   
   ### Checking the MSVC Runtime using `dumpbin`
   
   We were able to verify the different MSVC Runtime versions in use by the 
various binaries by running `dumpbin` in an instance of `x64 Native Tools 
Command Prompt for VS 2019`:
   
   ```shell
   $ dumpbin /dependents arrow.dll
   Microsoft (R) COFF/PE Dumper Version 14.29.30137.0
   Copyright (C) Microsoft Corporation.  All rights reserved.
   
   
   Dump of file arrow.dll
   
   File Type: DLL
   
     Image has the following dependencies:
   
       KERNEL32.dll
       SHELL32.dll
       ole32.dll
       MSVCP140D.dll
       VCRUNTIME140D.dll
       VCRUNTIME140_1D.dll
       ucrtbased.dll
   
     Summary
   
           1000 .00cfg
          33000 .data
           6000 .idata
         142000 .pdata
         606000 .rdata
          29000 .reloc
           1000 .rsrc
        117D000 .text
           1000 .tls
   
   ```
   
   Note the presence of `VCRUNTIME140D.dll` in the `dumpbin` output for 
`arrow.dll` above.
   
   ---
   
   ### Incompatibility of Multiple Different MSVC Runtimes
   
   Unfortunately, there are no guarantees that one version of the MSVC Runtime 
will be compatible with another. This limitation is documented in the MSVC 
documentation page ["C runtime (CRT) and C++ standard library (STL) `.lib` 
files"](https://learn.microsoft.com/en-us/cpp/c-runtime-library/crt-library-features?view=msvc-170#what-problems-exist-if-an-application-uses-more-than-one-crt-version).
   
   Included below is a relevant excerpt from the documentation:
   
   > A single process may load multiple EXE and DLL images, each with its own 
CRT. Each of those CRTs may use a different allocator, may have different 
internal structure layouts, and may use different storage arrangements. It 
means allocated memory, CRT resources, or classes passed across a DLL boundary 
can cause problems in memory management, internal static usage, or layout 
interpretation. For example, if a class is allocated in one DLL but passed to 
and deleted by another, which CRT deallocator is used? The errors caused can 
range from the subtle to the immediately fatal, and therefore direct transfer 
of such resources is discouraged.
   
   It turns out that the case being described above is exactly what was 
happening when calling `arrow->ToString()` (defined in `arrow.dll`) from 
`Float64Array::Print` (defined in `arrowproxy.dll`).
   
   Calling `ToString()` on an `arrow::Array` instance crosses over into 
`arrow.dll` (which uses a completely separate MSVC Runtime from 
`arrowproxy.dll` when built in `Debug` mode). When `ToString()` allocates 
memory for a `std::string` within the heap space of the MSVC Runtime used by 
`arrow.dll` and then returns that `std::string` to `arrowproxy.dll`, undefined 
behavior is invoked, which eventually leads to the memory corruption we 
observed.
   
   ---
   
   ### Fixing the Issue
   
   To fix the memory corruption, we modified 
`matlab/tools/cmake/BuildMatlabArrowInterface.cmake` so that all of the client 
`libmexclass` binaries (i.e. `mexclass.dll`, `arrowproxy.dll`, and 
`gateway.mexw64`) link against a `Debug` version of the MSVC Runtime - just 
like `arrow.dll` does. We were able to accomplish this by [setting the 
`MSVC_RUNTIME_LIBRARY` target property to `"MultiThreadedDebugDLL"` (i.e. 
`/MDd`)](https://cmake.org/cmake/help/latest/prop_tgt/MSVC_RUNTIME_LIBRARY.html)
 to match the `Debug` build of `arrow.dll`.
   
   This excerpt from the [MSVC 
documentation](https://learn.microsoft.com/en-us/cpp/c-runtime-library/crt-library-features?view=msvc-170#what-problems-exist-if-an-application-uses-more-than-one-crt-version)
 alludes to this fix:
   
   > It's also possible to avoid some of these issues if all of the images in 
your process use the same dynamically loaded version of the CRT. To ensure that 
all components use the same DLL version of the CRT, build them by using the /MD 
option, and use the same compiler toolset and property settings.
   
   The MSVC documentation also provides more information about the potential 
issues that can arise when passing resources between DLLs in the page 
["Potential errors passing CRT objects across DLL 
boundaries"](https://learn.microsoft.com/en-us/cpp/c-runtime-library/potential-errors-passing-crt-objects-across-dll-boundaries).
   
   The [common advice we've seen around the 
Internet](https://stackoverflow.com/questions/22797418/how-do-i-safely-pass-objects-especially-stl-objects-to-and-from-a-dll)
 relating to this topic is that there is no straightforward, "safe" way to 
export APIs from a DLL which return STL objects (e.g. `std::string` returned by 
`arrow::Array::ToString`). My guess is that other community members have 
already thought through these limitations of the Arrow C++ APIs which return 
STL objects before. However, I think it may be worth adding a note / warning to 
the ["Debug Builds" section of the "Building on Windows" documentation for the 
Arrow C++ 
libraries](https://arrow.apache.org/docs/developers/cpp/windows.html#debug-builds)
 about these potential MSVC Runtime compatibility issues so that others 
hopefully won't have to spend as much effort debugging issues like this in the 
future. *We would be happy to follow up with a separate pull request to enhance 
the documentation with this information!*
   
   ---
   
   ### Open Questions
   
   Although we have identified the source of the memory corruption, there is 
still an open question about whether it is "safe" to build `mexclass.dll`, 
`arrowproxy.dll`, and `gateway.mexw64` against `VCRUNTIME140D.dll`.
   
   MEX functions are eventually loaded into the MATLAB process address space 
when called, so it isn't immediately clear whether this will interact as 
expected in all cases with the MSVC Runtime being used by MATLAB (i.e. 
`matlab.exe`). To get a better understanding of the potential implications 
here, it would make sense for us to consult with some other colleagues at 
MathWorks who have more domain expertise in this area.
   
   It may also make sense to consider modifying the [`CMakeLists.txt` for 
`mathworks/libmexclass`](https://github.com/mathworks/libmexclass/blob/main/libmexclass/cpp/CMakeLists.txt)
 to automatically link against a `Debug` MSVC Runtime when building a client 
project in `Debug` mode. If this isn't desirable as a default behavior, it may 
still be worth ading a note about potential MSVC Runtime compatibility issues 
to the `libmexclass` documentation.
   
   ---
   
   ### Moving Forward
   
   Putting the open questions about the safety of linking against a `Debug` 
MSVC Runtime in a MEX function to the side, it seems reasonable to treat this 
as a non-pressing issue since the vast majority of MATLAB users won't be 
building the MATLAB Interface to Arrow against a `Debug` build of `arrow.dll`. 
In addition, the fix we have identified (i.e. build the client `libmexclass` 
libraries with `/MDd`) appears to be at least a partial workaround for 
community members who do want to use a `Debug` build of `arrow.dll`.
   
   We apologize again for taking some time to get to the bottom of this. It was 
quite surprising that this is what ended up being the source of the memory 
corruption. In order to get to the root cause, we first needed to narrow down 
the reproduction steps and get a better understanding of the differences 
between `Release` and `Debug` builds. We wanted to ensure this memory 
corruption wasn't indicative of a deeper issue in the implementation of 
`libmexclass` before building more code on top of it within the upstream 
`apache/arrow` code base.
   
   Thank you to everyone in the community for your patience while debugging 
this! We should now be able to concentrate more of our energy on pushing the 
MATLAB interface forward since this doesn't seem to be a blocking issue. Please 
don't hesitate to let us know if you have questions about any of this!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to