[
https://issues.apache.org/jira/browse/ARROW-539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15932321#comment-15932321
]
Miki Tebeka commented on ARROW-539:
-----------------------------------
Moving an email conversation to here (where it belongs).
[~tebeka] said:
{quote}
I'm working on ARROW-539 (see
https://github.com/tebeka/arrow/compare/master...ARROW-539)
The code is almost done (sans testing). The issue I'm facing is in _add_parts
(line 246). The first for loop (line 251) adds Column while the 2nd for loop
(line 255) adds Array. This causes Table.from_arrays to complain. I've looked
for a way to convert an Array to Column (or the other way around) and didn't
find anything obvious. I can of course convert to Python list and back but this
is wasteful.
Any pointers on how to do this?
(Later I'll convert the StringArray to a Dictionary as you suggested, starting
simple ...)
{quote}
And [~wesmckinn] answered:
{quote}
Looks like _schema_from_arrays must be refactored to have the type
check inside the for loop like
https://github.com/apache/arrow/blob/master/python/pyarrow/table.pyx#L684
Unfortunately, I don't think this is going to work (for performance reasons):
{noformat}
arrays.append(from_pylist([parts[name]] * size))
{noformat}
https://github.com/tebeka/arrow/compare/master...ARROW-539#diff-211255c3faff2b62de57e1dca9f60e13R256
I think you're going to need to create a function (in C++) like
{noformat}
type = int32()
arr = array_from_constant(type, 0)
{noformat}
You can then use these integers to make a DictionaryArray using
something like DictionaryArray.from_arrays
https://github.com/apache/arrow/blob/master/python/pyarrow/array.pyx#L406
Doesn't look like that handles Arrow arrays as inputs. Probably this
deserves a new API that looks like
{{DictionaryArray.from_indices(indices, type)}}.
New JIRA for this
https://issues.apache.org/jira/browse/ARROW-643
{quote}
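To make the suggestion above concrete, here is a minimal pure-Python sketch of the idea: represent a constant partition column as a buffer of zero indices plus a one-entry dictionary, rather than repeating the value in a Python list. The names constant_dictionary_column and decode are hypothetical and are not pyarrow APIs; they only stand in for the proposed array_from_constant and DictionaryArray.from_indices, whose real signatures are not settled in this thread.

```python
from array import array


def constant_dictionary_column(value, size):
    """Dictionary-encode a column holding `value` repeated `size` times.

    Hypothetical helper, not part of pyarrow. Returns (indices, dictionary):
    `indices` is a compact buffer of C ints, all zero, and `dictionary`
    holds the single distinct value.
    """
    # Repeating a one-element typed array builds the zero-filled index
    # buffer without creating one Python object per row.
    indices = array("i", [0]) * size
    dictionary = [value]
    return indices, dictionary


def decode(indices, dictionary):
    """Materialize the logical column; only useful for checking the encoding."""
    return [dictionary[i] for i in indices]


indices, dictionary = constant_dictionary_column("2017", 5)
```

The value is stored once however many rows the file has, which is presumably the motivation for avoiding from_pylist([parts[name]] * size).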
> [Python] Support reading Parquet datasets with standard partition directory
> schemes
> -----------------------------------------------------------------------------------
>
> Key: ARROW-539
> URL: https://issues.apache.org/jira/browse/ARROW-539
> Project: Apache Arrow
> Issue Type: New Feature
> Components: Python
> Reporter: Wes McKinney
> Assignee: Miki Tebeka
> Attachments: partitioned_parquet.tar.gz
>
>
> Currently, we only support multi-file directories with a flat structure
> (non-partitioned).
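
For illustration only, a small sketch of how partition values might be recovered from a file path, assuming the "standard" scheme means Hive-style key=value directory names. The issue text does not spell out the layout, and parse_partition_path and the example path are hypothetical.

```python
def parse_partition_path(path):
    """Extract key=value directory segments from a relative file path.

    Hypothetical helper; assumes '/'-separated paths whose last segment
    is the data file name.
    """
    parts = {}
    for segment in path.split("/")[:-1]:  # skip the trailing file name
        key, sep, value = segment.partition("=")
        if sep:  # ignore directories that are not of the form key=value
            parts[key] = value
    return parts


parts = parse_partition_path("year=2017/month=03/part-0.parquet")
```

Values come back as strings; inferring a type for each partition key would be a separate step.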
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)