[ https://issues.apache.org/jira/browse/ARROW-1705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16345292#comment-16345292 ]
ASF GitHub Bot commented on ARROW-1705:
---------------------------------------
pitrou commented on a change in pull request #1530: ARROW-1705: [Python] allow building array from dicts
URL: https://github.com/apache/arrow/pull/1530#discussion_r164792503
##########
File path: cpp/src/arrow/python/builtin_convert.cc
##########
@@ -722,25 +736,60 @@ class ListConverter : public TypedConverterVisitor<ListBuilder, ListConverter> {
public:
Status Init(ArrayBuilder* builder) override;
- Status AppendItem(const OwnedRef& item) {
+ Status AppendItem(PyObject* obj) override {
RETURN_NOT_OK(typed_builder_->Append());
- PyObject* item_obj = item.obj();
- const auto list_size = static_cast<int64_t>(PySequence_Size(item_obj));
- return value_converter_->AppendData(item_obj, list_size);
+ const auto list_size = static_cast<int64_t>(PySequence_Size(obj));
+ return value_converter_->AppendMultiple(obj, list_size);
}
protected:
std::shared_ptr<SeqConverter> value_converter_;
};
+class StructConverter : public TypedConverterVisitor<StructBuilder, StructConverter> {
+ public:
+ Status Init(ArrayBuilder* builder) override;
+
+ Status AppendItem(PyObject* obj) override {
+ RETURN_NOT_OK(typed_builder_->Append());
+ if (!PyDict_Check(obj)) {
+ return Status::TypeError("dict value expected for struct type");
+ }
+ // NOTE we're ignoring any extraneous dict items
+ for (int i = 0; i < num_fields_; i++) {
+ PyObject* nameobj = PyList_GET_ITEM(field_name_list_.obj(), i);
+ PyObject* valueobj = PyDict_GetItem(obj, nameobj); // borrowed
+ RETURN_IF_PYERROR();
+ RETURN_NOT_OK(value_converters_[i]->AppendSingle(valueobj ? valueobj : Py_None));
Review comment:
Ok, I tried it with the following micro-benchmark:
```
$ python -m timeit -s "import pyarrow as pa; ty=pa.struct([pa.field('x', pa.int32())]); data=[None]*1000000" "pa.array(data, type=ty)"
```
* unpatched: 49.3 msec per loop
* with unique_ptr instead of shared_ptr: 44.3 msec per loop
* with raw pointers in addition to shared_ptr: 43.6 msec per loop
The raw-pointer variant is only about 0.7 msec (under 2%) faster than
unique_ptr, and an additional array of raw pointers adds some
complication, so I'm tempted to go with the unique_ptr solution (which
AFAICS looks sane).
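For reference, here is a rough pure-Python model of what the `AppendItem`
loop in the hunk above does for each row, as far as can be read from the
diff alone: a non-dict row is rejected, a missing key is appended as None,
and extraneous keys are ignored. The names `append_struct_item` and
`convert_rows` are hypothetical and exist only for this sketch; they are not
part of Arrow. Handling of rows that are themselves None (as in the
benchmark) happens outside the hunk and is not modelled here.

```python
def append_struct_item(obj, field_names, columns):
    """Model one AppendItem call; `columns` maps field name -> list of values."""
    if not isinstance(obj, dict):
        raise TypeError("dict value expected for struct type")
    for name in field_names:
        # dict.get mirrors the `valueobj ? valueobj : Py_None` fallback:
        # a missing key becomes None, and keys not in field_names are ignored.
        columns[name].append(obj.get(name))


def convert_rows(rows, field_names):
    """Convert a sequence of dicts into one value list per struct field."""
    columns = {name: [] for name in field_names}
    for row in rows:
        append_struct_item(row, field_names, columns)
    return columns


if __name__ == "__main__":
    rows = [{"x": 1, "y": "a"}, {"x": 2}, {"y": "c", "z": 99}]
    print(convert_rows(rows, ["x", "y"]))
    # prints {'x': [1, 2, None], 'y': ['a', None, 'c']}
```

The one-list-per-field layout is only meant to echo the one-converter-per-field
structure (`value_converters_[i]`) in the C++ code; it says nothing about the
relative cost of shared_ptr, unique_ptr or raw pointers measured above.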
> [Python] Create StructArray from sequence of dicts given a known data type
> --------------------------------------------------------------------------
>
> Key: ARROW-1705
> URL: https://issues.apache.org/jira/browse/ARROW-1705
> Project: Apache Arrow
> Issue Type: New Feature
> Components: Python
> Reporter: Wes McKinney
> Assignee: Antoine Pitrou
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.9.0
>
>
> See https://github.com/apache/arrow/issues/1217