[jira] [Comment Edited] (ARROW-13939) how to do resampling of arrow table using cython

Weston Pace (Jira) Mon, 13 Sep 2021 15:13:38 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-13939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17414611#comment-17414611
 ]


Weston Pace edited comment on ARROW-13939 at 9/13/21, 10:12 PM:
----------------------------------------------------------------

> the cython documentation is single page with not much useful info.

PRs are always welcome.

> lets say i know that its of IntScalar, how to extract it int a = 
> doSomethingOnCResult(val)

Scalars have an "as_py" method.  You can inspect its implementation to see how 
the value is extracted in Cython.

>  I believe this resembles to what i'm doing, having this resampling code in 
> Cython. If I'm wrong please let me know.

You are not wrong, but no one else is doing this in Cython so you will need to 
come up with a lot of functionality yourself and it will be a considerable 
amount of work.  The pyarrow philosophy has been to keep all array manipulation 
in cpp.  The existing Cython code is pretty much limited to metadata 
manipulation.  The easiest path forward (in terms of man-hours of effort) is 
likely to be extending Arrow-cpp.  Alternatively, you could investigate if 
something like this is supported by datafusion.  There is some initial support 
for python bindings for datafusion in development.  I do believe that these 
kinds of functions will come to Arrow-cpp (and thus pyarrow) someday but I 
can't give you any kind of estimate as there is no open JIRA ticket for them. 

>  I have no idea how to do this. Please can you point me to some resources.

* To access an Array's buffers in python (as a bytes object) you can do 
arr.buffers()[buffer_index].to_pybytes()
* To access an Array's buffers in cython you can do something similar but the 
method to call on the buffer is "data()" (for const uint8_t*) or 
"mutable_data()" (for uint8_t*)
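
As a rough illustration of what those raw buffers contain, here is a sketch 
that decodes an int64 array from its validity bitmap and values buffer using 
only the standard library. The helper name decode_int64_buffers is 
hypothetical, and the sketch assumes little-endian data and a primitive int64 
array (two buffers: validity, then values).

```python
import struct


def decode_int64_buffers(validity, data, length, offset=0):
    """Decode an Arrow-style int64 array from its two raw buffers.

    validity: bytes of the validity bitmap, or None when there are no nulls.
    data: bytes of the values buffer (assumed little-endian int64).
    length / offset: logical length and starting element of the array.
    """
    values = struct.unpack_from(f"<{length}q", data, offset * 8)
    out = []
    for i, v in enumerate(values):
        bit = offset + i
        # Validity bits are packed least-significant-bit first in each byte.
        valid = validity is None or (validity[bit // 8] >> (bit % 8)) & 1
        out.append(v if valid else None)
    return out


# With pyarrow installed, the inputs would come from the array itself:
#   arr = pa.array([1, None, 3], type=pa.int64())
#   validity_buf, data_buf = arr.buffers()
#   validity = None if validity_buf is None else validity_buf.to_pybytes()
#   decode_int64_buffers(validity, data_buf.to_pybytes(), len(arr), arr.offset)
```

The same bit and byte arithmetic applies in Cython, where you would index the 
pointer returned by "data()" instead of a bytes object.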

The format of these buffers is described in the [Arrow Columnar 
Format](https://arrow.apache.org/docs/format/Columnar.html) and advice on how 
to manipulate them is beyond the scope of a JIRA issue.

> Everything is in cython. so I pass my larger table from python to cython 
> resampling function.

It sounds like your starting data is a pyarrow "Table" and so the data will be 
in C++ (there are no python objects for the individual array elements).  You 
will probably want to use the [array 
builders](https://arrow.apache.org/docs/cpp/api/builder.html) to build up your 
result, but I do not believe there is any Cython API for these.



was (Author: westonpace):
> the cython documentation is single page with not much useful info.

PRs are always welcome.

> lets say i know that its of IntScalar, how to extract it int a = 
> doSomethingOnCResult(val)

Scalars have an "as_py" method.  You can inspect its implementation to see how 
the value is extracted in Cython.

>  I believe this resembles to what i'm doing, having this resampling code in 
> Cython. If I'm wrong please let me know.

You are not wrong, but no one else is doing this in Cython so you will need to 
come up with a lot of functionality yourself and it will be a considerable 
amount of work.  The pyarrow philosophy has been to keep all array manipulation 
in C++.  The existing Cython code is pretty much limited to metadata 
manipulation.  The easiest path forward (in terms of man-hours of effort) is 
likely to be extending Arrow-C++.  Alternatively, you could investigate if 
something like this is supported by datafusion.  There is some initial support 
for python bindings for datafusion in development.  I do believe that these 
kinds of functions will come to Arrow-C++ (and thus pyarrow) someday but I 
can't give you any kind of estimate as there is no open JIRA ticket for them. 

>  I have no idea how to do this. Please can you point me to some resources.

* To access an Array's buffers in python (as a bytes object) you can do 
arr.buffers()[buffer_index].to_pybytes()
* To access an Array's buffers in cython you can do something similar but the 
method to call on the buffer is "data()" (for const uint8_t*) or 
"mutable_data()" (for uint8_t*)

The format of these buffers is described in the [Arrow Columnar 
Format](https://arrow.apache.org/docs/format/Columnar.html) and advice on how 
to manipulate them is beyond the scope of a JIRA issue.

> Everything is in cython. so I pass my larger table from python to cython 
> resampling function.

It sounds like your starting data is a pyarrow "Table" and so the data will be 
in C++ (there are no python objects for the individual array elements).  You 
will probably want to use the [array 
builders](https://arrow.apache.org/docs/cpp/api/builder.html) to build up your 
result, but I do not believe there is any Cython API for these.


> how to do resampling of arrow table using cython
> ------------------------------------------------
>
>                 Key: ARROW-13939
>                 URL: https://issues.apache.org/jira/browse/ARROW-13939
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++, Python
>            Reporter: krishna deepak
>            Priority: Minor
>
> Please can someone point me to resources, how to write a resampling code in 
> cython for Arrow table.
>  # Will iterating the whole table be slow in cython?
>  # which is the best to use to append new elements to. Is there a way i 
> create an empty table of same schema and keep appending to it. Or should I 
> use vectors/list and then pass them to create a table.
> Performance is very important for me. Any help is highly appreciated.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
