[jira] [Commented] (ARROW-13939) how to do resampling of arrow table using cython

Weston Pace (Jira) Mon, 13 Sep 2021 15:12:52 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-13939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17414611#comment-17414611
 ]


Weston Pace commented on ARROW-13939:
-------------------------------------

> the cython documentation is single page with not much useful info.

PRs are always welcome.

> lets say i know that its of IntScalar, how to extract it int a = 
> doSomethingOnCResult(val)

Scalars have an "as_py" method.  You can inspect its implementation to see how 
it works in Cython.

>  I believe this resembles to what i'm doing, having this resampling code in 
> Cython. If I'm wrong please let me know.

You are not wrong, but no one else is doing this in Cython so you will need to 
come up with a lot of functionality yourself and it will be a considerable 
amount of work.  The pyarrow philosophy has been to keep all array manipulation 
in C++.  The existing Cython code is pretty much limited to metadata 
manipulation.  The easiest path forward (in terms of effort) is likely to be 
extending Arrow-C++.  Alternatively, you could investigate whether something 
like this is supported by DataFusion.  There is some initial support for 
Python bindings for DataFusion in development.  I do believe that these kinds 
of functions will come to Arrow-C++ (and thus pyarrow) someday, but I can't 
give you any kind of estimate as there is no open JIRA ticket for them.

>  I have no idea how to do this. Please can you point me to some resources.

* To access an Array's buffers in python (as a bytes object) you can do 
arr.buffers()[buffer_index].to_pybytes()
* To access an Array's buffers in cython you can do something similar but the 
method to call on the buffer is "data()" (for const uint8_t*) or 
"mutable_data()" (for uint8_t*)

The format of these buffers is described in the [Arrow Columnar 
Format](https://arrow.apache.org/docs/format/Columnar.html) and advice on how 
to manipulate them is beyond the scope of a JIRA issue.
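
As a rough illustration of what reading those buffers involves, here is a 
minimal sketch that decodes an int64 array from its validity and data buffers 
using only the Python standard library.  The function name and the hand-built 
buffers are hypothetical; with pyarrow the bytes would come from 
arr.buffers()[0].to_pybytes() and arr.buffers()[1].to_pybytes().  It assumes a 
primitive int64 array stored little-endian, and it is not how pyarrow itself 
does this.

```python
import struct


def decode_int64_array(validity, data, length, offset=0):
    """Decode `length` int64 slots, starting at `offset`, from raw Arrow buffers.

    validity: bytes of the validity bitmap, or None when the array has no nulls.
    data:     bytes of the data buffer (8 bytes per slot, assumed little-endian).
    """
    values = []
    for i in range(offset, offset + length):
        # Bit i of the bitmap lives in byte i // 8, least-significant bit
        # first.  A set bit means the slot is valid (not null).
        is_valid = validity is None or (validity[i // 8] >> (i % 8)) & 1
        if is_valid:
            (value,) = struct.unpack_from("<q", data, i * 8)
            values.append(value)
        else:
            # The bytes of a null slot are unspecified, so they are ignored.
            values.append(None)
    return values


# Hand-built buffers for the logical array [1, null, 3].
validity = bytes([0b00000101])      # slots 0 and 2 are valid
data = struct.pack("<3q", 1, 0, 3)  # the null slot holds a placeholder 0
decoded = decode_int64_array(validity, data, length=3)
```

The offset parameter matters because a sliced array shares its parent's 
buffers and only records where its own slots start.  In Cython the same logic 
would index the pointer returned by "data()" directly instead of copying the 
buffer to bytes.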

> Everything is in cython. so I pass my larger table from python to cython 
> resampling function.

It sounds like your starting data is a pyarrow "Table", so the data will 
already be in C++ (there are no Python objects for the individual array 
elements).  You will probably want to use the [array 
builders](https://arrow.apache.org/docs/cpp/api/builder.html) to build up your 
result, but I do not believe there is any Cython API for these.


> how to do resampling of arrow table using cython
> ------------------------------------------------
>
>                 Key: ARROW-13939
>                 URL: https://issues.apache.org/jira/browse/ARROW-13939
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++, Python
>            Reporter: krishna deepak
>            Priority: Minor
>
> Please can someone point me to resources, how to write a resampling code in 
> cython for Arrow table.
>  # Will iterating the whole table be slow in cython?
>  # which is the best to use to append new elements to. Is there a way i 
> create an empty table of same schema and keep appending to it. Or should I 
> use vectors/list and then pass them to create a table.
> Performance is very important for me. Any help is highly appreciated.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
