[
https://issues.apache.org/jira/browse/ARROW-13939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17414611#comment-17414611
]
Weston Pace edited comment on ARROW-13939 at 9/13/21, 10:12 PM:
----------------------------------------------------------------
> the cython documentation is single page with not much useful info.
PRs are always welcome.
> lets say i know that its of IntScalar, how to extract it int a =
> doSomethingOnCResult(val)
Scalars have an "as_py" method. You can inspect its implementation to see how
it works in Cython.
> I believe this resembles to what i'm doing, having this resampling code in
> Cython. If I'm wrong please let me know.
You are not wrong, but no one else is doing this in Cython, so you will need to
build a lot of functionality yourself, and it will be a considerable amount of
work. The pyarrow philosophy has been to keep all array manipulation in C++.
The existing Cython code is pretty much limited to metadata manipulation. The
easiest path forward (in terms of man-hours of effort) is likely to be
extending Arrow C++. Alternatively, you could investigate whether something
like this is supported by DataFusion; some initial support for Python bindings
for DataFusion is in development. I do believe that these kinds of functions
will come to Arrow C++ (and thus pyarrow) someday, but I can't give you any
kind of estimate, as there is no open JIRA ticket for them.
> I have no idea how to do this. Please can you point me to some resources.
* To access an Array's buffers in Python (as a bytes object) you can call
arr.buffers()[buffer_index].to_pybytes()
* To access an Array's buffers in Cython you can do something similar, but the
method to call on the buffer is "data()" (for const uint8_t*) or
"mutable_data()" (for uint8_t*)
The format of these buffers is described in the [Arrow Columnar
Format](https://arrow.apache.org/docs/format/Columnar.html). Advice on how to
manipulate them is beyond the scope of a JIRA issue.
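As a rough illustration of what those raw bytes contain, the sketch below
decodes a primitive int64 array from its two buffers using only the Python
standard library. The helper name "decode_int64_array" and the hand-built
sample bytes are hypothetical. It assumes a little-endian values buffer, an
array offset of zero, and a validity bitmap in which a set bit marks a
non-null slot (least-significant bit first), as the columnar format describes.

```python
import struct


def decode_int64_array(validity, values, length):
    """Decode `length` int64 slots from raw Arrow-style buffers.

    validity: bytes of the validity bitmap, or None when there are no nulls.
    values:   bytes of the values buffer (little-endian int64, 8 bytes each).
    Hypothetical helper; assumes the array has offset 0.
    """
    decoded = []
    for i in range(length):
        # Slot i is valid when bit (i % 8) of byte (i // 8) is set.
        is_valid = validity is None or (validity[i // 8] >> (i % 8)) & 1
        if is_valid:
            (item,) = struct.unpack_from("<q", values, i * 8)
            decoded.append(item)
        else:
            # The bytes stored in a null slot are unspecified, so ignore them.
            decoded.append(None)
    return decoded


# Hand-built buffers for the logical array [1, None, 3].
validity = bytes([0b00000101])        # bits 0 and 2 set: slots 0 and 2 valid
values = struct.pack("<3q", 1, 0, 3)  # the null slot holds an arbitrary 0
result = decode_int64_array(validity, values, 3)
```

With pyarrow, the two inputs would correspond to arr.buffers()[0] and
arr.buffers()[1] of an unsliced int64 array, each converted with
to_pybytes(); the first buffer can be None when the array has no nulls.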
> Everything is in cython. so I pass my larger table from python to cython
> resampling function.
It sounds like your starting data is a pyarrow "Table", so the data will be
in C++ (there are no Python objects for the individual array elements). You
will probably want to use the [array
builders](https://arrow.apache.org/docs/cpp/api/builder.html) to build up your
result, but I do not believe there is any Cython API for these.
> how to do resampling of arrow table using cython
> ------------------------------------------------
>
> Key: ARROW-13939
> URL: https://issues.apache.org/jira/browse/ARROW-13939
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++, Python
> Reporter: krishna deepak
> Priority: Minor
>
> Please can someone point me to resources, how to write a resampling code in
> cython for Arrow table.
> # Will iterating the whole table be slow in cython?
> # which is the best to use to append new elements to. Is there a way i
> create an empty table of same schema and keep appending to it. Or should I
> use vectors/list and then pass them to create a table.
> Performance is very important for me. Any help is highly appreciated.