frmnboi opened a new issue #10488:
URL: https://github.com/apache/arrow/issues/10488


   I'm trying to write a C++ extension to add a new column to a table I have.  
I create the table with pyarrow in Python, but I want to call a C++ function 
to operate on the data, in place if possible.  Currently, I have this:
   
   **helperfuncs.cpp**
   ```
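   #include <pybind11/pybind11.h>
   #include <Python.h>
   
   #include <arrow/python/pyarrow.h>
   #include <arrow/array/builder_primitive.h>
   
   #include <memory>     // std::shared_ptr
   #include <new>        // std::bad_alloc
   #include <stdexcept>  // std::length_error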
   
   
   //arrow::Array;
   //arrow::ChunkedArray;
   
   std::shared_ptr<arrow::DoubleArray> vol_adj_close(
       std::shared_ptr<arrow::DoubleArray>& close,
       std::shared_ptr<arrow::Int64Array>& volume)
   {
       // auto close=std::static_pointer_cast<arrow::DoubleArray>(closeraw);
       // auto volume=std::static_pointer_cast<arrow::DoubleArray>(volumeraw);
       if (close->length()!=volume->length())
           throw std::length_error("Arrays are not of equal length");
       arrow::DoubleBuilder builder;
       arrow::Status status = builder.Resize(close->length());
       if (!status.ok()) {
           throw std::bad_alloc();
       }
       for (int64_t i = 0; i < volume->length(); i++) {
           builder.UnsafeAppend(close->Value(i) / volume->Value(i));
       }
       std::shared_ptr<arrow::DoubleArray> array;
       arrow::Status st = builder.Finish(&array);
       // Check the status returned by Finish().
       if (!st.ok()) {
           throw std::bad_alloc();
       }
       return array;
   }
   
   // int import_pyarrow()
   // {
   //     return arrow::py::import_pyarrow();
   // }
   
   
   PYBIND11_MODULE(helperfuncs, m) {
       //arrow::py::import_pyarrow();
       m.doc() = "Pyarrow Extensions";
       m.def("vol_adj_close", &vol_adj_close,
             pybind11::call_guard<pybind11::gil_scoped_release>());
       //m.def("import_pyarrow",&import_pyarrow);
   }
   ```
   This was adapted from the one example I could find of pybind11 and pyarrow 
working together:
   
[https://github.com/vaexio/vaex-arrow-ext](https://github.com/vaexio/vaex-arrow-ext)
   
   I compile this using CMake with the following excerpt:
   
   **CMake**
   ```
   find_package(PythonInterp REQUIRED)
   include_directories(${PYTHON_INCLUDE_DIRS})
   
   add_subdirectory(.../pybind11/)
   
   pybind11_add_module(helperfuncs helperfuncs.cpp  MODULE)
   
   add_compile_options(-std=c++20 -O2 -shared -fPIC)
   find_package(Arrow REQUIRED)
   target_include_directories(helperfuncs PUBLIC
       .../Python_venv_3.8.5/lib/python3.8/site-packages/pyarrow/include)
   target_link_libraries(helperfuncs PRIVATE arrow_shared) #arrow_static
   ```
   
   and call it in Python with the following excerpt:
   
   **test.py**
   ```
   import helperfuncs
   
   voladjclose = helperfuncs.vol_adj_close(
       data['close'].combine_chunks(), data['volume'].combine_chunks())
   ```
   
   where the unchunked **data['close']** is a pyarrow.lib.DoubleArray object 
and unchunked **data['volume']** is a pyarrow.lib.Int64Array object.
   
   Using CMake, this code compiles to a shared library that can be 
successfully imported into Python as the **helperfuncs** module.  However, 
two issues arise:
   
   1. The commented lines in the PYBIND11_MODULE function were failed attempts 
at running the **import_pyarrow** function required for C++ extensions.  I 
assume this is a C++ function, as there does seem to be a 
**py::import_pyarrow()** function, but with them enabled, importing the module 
into Python fails with a linker error: **ImportError: 
.../helperfuncs.cpython-38-x86_64-linux-gnu.so: undefined symbol: 
_ZN5arrow2py14import_pyarrowEv**
   2. Calling vol_adj_close() directly on the arrays, as in the Python 
snippet, gives the following type error:
   
   **vol_adj_close(): incompatible function arguments. The following argument 
types are supported:
       1. (arg0: arrow::NumericArray<arrow::DoubleType>, arg1: 
arrow::NumericArray<arrow::Int64Type>) -> 
arrow::NumericArray<arrow::DoubleType>**
       
      This one confuses me greatly. From what I can see in the documentation 
and from testing the code:
      
      pa.Array  ----------------> <class 'pyarrow.lib.Array'>
      pa.NumericArray -----> <class 'pyarrow.lib.NumericArray'>
      
      The 
[documentation](https://arrow.apache.org/docs/python/generated/pyarrow.NumericArray.html?highlight=numericarray#pyarrow.NumericArray)
 seems to indicate that a NumericArray is a specific type of Array, so an 
implicit conversion should not be causing an issue.  I do not see any way to 
convert an Array to NumericArray or vice versa in the documentation otherwise.  
      
      Is there a difference between Python's **pyarrow.lib.DoubleArray** and 
C++'s **arrow::NumericArray<arrow::DoubleType>**?
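
For reference, the mangled name in the linker error from item 1 can be decoded with the compiler's own demangler. The sketch below assumes GCC or Clang, since it relies on `abi::__cxa_demangle` from `<cxxabi.h>`. `demangle` is a hypothetical helper name introduced for illustration and is not part of Arrow or pybind11.

```cpp
#include <cxxabi.h>  // abi::__cxa_demangle (GCC/Clang, Itanium C++ ABI)

#include <cstdlib>   // std::free
#include <memory>
#include <string>

// Hypothetical helper: returns the human-readable form of an Itanium-ABI
// mangled symbol, or the input unchanged if it cannot be demangled.
std::string demangle(const std::string& mangled) {
    int status = 0;
    // __cxa_demangle returns a malloc'ed buffer, so release it with free().
    std::unique_ptr<char, void (*)(void*)> out(
        abi::__cxa_demangle(mangled.c_str(), nullptr, nullptr, &status),
        std::free);
    if (status != 0 || !out) {
        return mangled;
    }
    return std::string(out.get());
}
```

Calling `demangle("_ZN5arrow2py14import_pyarrowEv")` yields `arrow::py::import_pyarrow()`, so the unresolved symbol is the function itself rather than something it depends on. That suggests the library defining it is not being linked into the module. In a typical pyarrow installation this is presumably `libarrow_python`, which the CMake excerpt above does not list, but that is an assumption and not something this issue confirms.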
   
   
   
   On a final note, I know that pyarrow already provides an element-wise 
division operation that would cover this particular problem, but here I am 
trying to see if I can get a C++ extension up and running for more complex 
problems.  
   
   
   

