[ 
https://issues.apache.org/jira/browse/KUDU-1276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16638292#comment-16638292
 ] 

Wes McKinney commented on KUDU-1276:
------------------------------------

Ha, I have a completely laughable amount of spare bandwidth right now =) I'm 
working on growing my team at Ursa Labs and have Kudu-Arrow-pandas improvements 
on at least the 12-18 month horizon. In the meantime, if someone could add 
Apache Arrow to the Kudu build toolchain, that would make it easier for us 
to tackle this when we get to it (if no one beats us there).

> Add a vectorized read/write interface for pandas DataFrame objects
> ------------------------------------------------------------------
>
>                 Key: KUDU-1276
>                 URL: https://issues.apache.org/jira/browse/KUDU-1276
>             Project: Kudu
>          Issue Type: New Feature
>          Components: client, python
>            Reporter: Wes McKinney
>            Assignee: Jordan Birdsell
>            Priority: Major
>
> A pandas read/write interface would make Kudu significantly easier to use for 
> average Python data users.
> The layering is as follows:
> - Writer: "Vectorized" insert that accepts a C/C++ array of values plus an 
> array (either bits or bytes) indicating nullness for nullable slots
> - Reader: Converts a row batch to NumPy arrays with missing data 
> representation suitable for use in pandas. Ideally should not create more 
> than one PyString object for each observed string value. Binary can be 
> encoded as a UTF-8 string, while Timestamp will need to be converted to 
> nanoseconds for pandas.
> This would also give a very performant and relatively GIL-free data ingest 
> path to Kudu (and Kudu consumers like Impala) without a great deal of 
> Python+Cython coding.
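
The reader layer described above can be illustrated with a minimal pure-Python
sketch. It is not the Kudu client API: the function names are hypothetical,
plain lists stand in for the NumPy arrays a real implementation would fill
from C/C++ or Cython, the null indicator is assumed to be one byte per slot
(nonzero meaning null), and timestamps are assumed to arrive as microseconds
since the epoch.

```python
import math


def decode_float_column(values, null_bytes):
    """Replace null slots with NaN, the missing-value marker pandas uses for floats."""
    return [math.nan if is_null else float(v)
            for v, is_null in zip(values, null_bytes)]


def decode_string_column(raw_values, null_bytes):
    """Decode UTF-8 bytes, creating only one str object per distinct value."""
    interned = {}  # raw bytes -> the single decoded str shared by all equal slots
    out = []
    for raw, is_null in zip(raw_values, null_bytes):
        if is_null:
            out.append(None)  # missing marker for object columns
            continue
        s = interned.get(raw)
        if s is None:
            s = raw.decode("utf-8")
            interned[raw] = s
        out.append(s)
    return out


def micros_to_nanos(values, null_bytes):
    """Scale microsecond timestamps (an assumption) to the nanoseconds pandas expects."""
    return [None if is_null else v * 1000
            for v, is_null in zip(values, null_bytes)]
```

For example, decode_string_column([b"kudu-table", b"", b"kudu-table"],
[0, 1, 0]) returns a list whose first and third entries are the same str
object, which is the "one PyString per observed value" property the
description asks for.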



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
