[
https://issues.apache.org/jira/browse/KUDU-1276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16637767#comment-16637767
]
Adar Dembo commented on KUDU-1276:
----------------------------------
[~jtbirdsell] [your recent Pandas patch|https://gerrit.cloudera.org/c/10809/]
added support for reading from Kudu into Pandas, but not in the way described
by Wes in this bug's description. I'm not familiar with Pandas, so is there more
work to be done here on the reading side? Or is it done, albeit in a different
way than Wes suggested?
> Add a vectorized read/write interface for pandas DataFrame objects
> ------------------------------------------------------------------
>
> Key: KUDU-1276
> URL: https://issues.apache.org/jira/browse/KUDU-1276
> Project: Kudu
> Issue Type: New Feature
> Components: client, python
> Reporter: Wes McKinney
> Assignee: Jordan Birdsell
> Priority: Major
>
> A pandas read/write interface would make Kudu significantly easier to use for
> average Python data users.
> The layering is as follows:
> - Writer: "Vectorized" insert that accepts a C/C++ array of values plus an
> array (either bits or bytes) indicating nullness for nullable slots
> - Reader: Converts a row batch to NumPy arrays with missing data
> representation suitable for use in pandas. Ideally should not create more
> than one PyString object for each observed string value. Binary can be
> encoded as a UTF-8 string, while Timestamp will need to be converted to
> nanoseconds for pandas.
> This would also give a very performant and relatively GIL-free data ingest
> path into Kudu (and Kudu consumers like Impala) without a great deal of
> Python+Cython coding.
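
For readers less familiar with the layering in the quoted description, here is a
minimal sketch of the reader side: pivoting a batch of rows into per-column
value arrays plus a null bitmap, keeping one string object per distinct value,
and scaling timestamps to nanoseconds. It uses only the Python standard library
and is not the Kudu client API. The function names, the column type tags, and
the assumption that timestamps arrive as integer microseconds are all
hypothetical and chosen for illustration; a real implementation would likely
fill NumPy arrays from Cython instead.

```python
from array import array

MICROS_TO_NANOS = 1_000  # assumes incoming timestamps are integer microseconds


def pivot_rows(rows, column_types):
    """Pivot row tuples into one (values, null_bitmap) pair per column.

    column_types is a hypothetical list of tags, one per column:
    "int64", "string", or "timestamp_micros".
    """
    n = len(rows)
    columns = []
    for idx, ctype in enumerate(column_types):
        bitmap = bytearray((n + 7) // 8)  # bit i set means row i is null
        seen = {}                         # one str object per distinct value
        if ctype == "string":
            values = [None] * n
        else:
            values = array("q", [0] * n)  # packed signed 64-bit integers
        for i, row in enumerate(rows):
            cell = row[idx]
            if cell is None:
                bitmap[i // 8] |= 1 << (i % 8)
            elif ctype == "string":
                # Reuse the first object seen for each equal string.
                values[i] = seen.setdefault(cell, cell)
            elif ctype == "timestamp_micros":
                values[i] = cell * MICROS_TO_NANOS
            else:
                values[i] = cell
        columns.append((values, bitmap))
    return columns


def is_null(bitmap, i):
    """Return True if row i is marked null in the bitmap."""
    return bool(bitmap[i // 8] >> (i % 8) & 1)
```

The same (values, null bitmap) pair is the shape the description proposes for
the vectorized writer, so a sketch like this could in principle serve both
directions. Null slots in the packed integer columns hold a placeholder 0 and
are only meaningful together with the bitmap.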