[ 
https://issues.apache.org/jira/browse/KUDU-1276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16638033#comment-16638033
 ] 

Wes McKinney commented on KUDU-1276:
------------------------------------

I just took a look at the patch. The approach is not very efficient. The 
optimal path for an application like Kudu to get data into pandas is via 
Arrow. I would suggest writing a C++ converter that yields an Arrow (columnar) 
record batch, wrapping that in a {{pyarrow.RecordBatch}}, and then calling 
{{to_pandas()}} (cf. http://wesmckinney.com/blog/high-perf-arrow-to-pandas/).

> Add a vectorized read/write interface for pandas DataFrame objects
> ------------------------------------------------------------------
>
>                 Key: KUDU-1276
>                 URL: https://issues.apache.org/jira/browse/KUDU-1276
>             Project: Kudu
>          Issue Type: New Feature
>          Components: client, python
>            Reporter: Wes McKinney
>            Assignee: Jordan Birdsell
>            Priority: Major
>
> A pandas read/write interface would make Kudu significantly easier to use for 
> average Python data users.
> The layering is as follows:
> - Writer: "Vectorized" insert that accepts a C/C++ array of values plus an 
> array (either bits or bytes) indicating nullness for nullable slots
> - Reader: Converts a row batch to NumPy arrays with missing data 
> representation suitable for use in pandas. Ideally should not create more 
> than one PyString object for each observed string value. Binary can be 
> encoded as UTF-8 strings, while Timestamp will need to be converted to 
> nanoseconds for pandas.
> This would also give a very performant and relatively GIL-free data ingest 
> path into Kudu (and Kudu consumers like Impala) without a great deal of 
> Python+Cython coding.
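
The layering in the description can be illustrated with a small pure-Python 
sketch. All function names here are hypothetical and none of this is Kudu 
client API; it also assumes a least-significant-bit-first validity bitmap 
(1 = not null) and timestamps stored as microseconds since the epoch. A real 
implementation would do this work in C/C++ rather than in Python loops.

```python
from array import array


def pack_validity(is_valid):
    """Pack booleans into a bitmap, LSB first; a set bit means 'not null'."""
    bitmap = bytearray((len(is_valid) + 7) // 8)
    for i, ok in enumerate(is_valid):
        if ok:
            bitmap[i // 8] |= 1 << (i % 8)
    return bytes(bitmap)


def is_null(bitmap, i):
    """Return True if slot i is marked null in the bitmap."""
    return ((bitmap[i // 8] >> (i % 8)) & 1) == 0


def decode_strings(raw_values):
    """Decode UTF-8 bytes, reusing one str object per distinct value."""
    seen = {}
    out = []
    for raw in raw_values:
        if raw is None:
            out.append(None)
            continue
        s = seen.get(raw)
        if s is None:
            s = raw.decode("utf-8")
            seen[raw] = s
        out.append(s)
    return out


def micros_to_nanos(ts_micros):
    """Convert microseconds since the epoch to nanoseconds for pandas."""
    return ts_micros * 1000


# Writer side: a typed value array plus a nullness bitmap for the same slots.
values = array("q", [10, 0, 30])
validity = pack_validity([True, False, True])

# Reader side: repeated string values share a single Python object.
decoded = decode_strings([b"x", b"y", b"x", None])
```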



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
