[ https://issues.apache.org/jira/browse/ARROW-539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15932321#comment-15932321 ]
Miki Tebeka commented on ARROW-539:
-----------------------------------

Moving an email conversation to here (where it belongs).

[~tebeka] said:
{quote}
I'm working on ARROW-539 (see https://github.com/tebeka/arrow/compare/master...ARROW-539)

The code is almost done (sans testing). The issue I'm facing is in _add_parts (line 246). The first for loop (line 251) adds Column while the 2nd for loop (line 255) adds Array. This causes Table.from_arrays to complain.

I've looked for a way to convert an Array to a Column (or the other way around) and didn't find anything obvious. I can of course convert to a Python list and back, but this is wasteful.

Any pointers on how to do this?

(Later I'll convert the StringArray to a Dictionary as you suggested, starting simple ...)
{quote}

And [~wesmckinn] answered:
{quote}
Looks like _schema_from_arrays must be refactored to have the type check inside the for loop like https://github.com/apache/arrow/blob/master/python/pyarrow/table.pyx#L684

Unfortunately, I don't think this is going to work (for performance reasons):

{noformat}
arrays.append(from_pylist([parts[name]] * size))
{noformat}

https://github.com/tebeka/arrow/compare/master...ARROW-539#diff-211255c3faff2b62de57e1dca9f60e13R256

I think you're going to need to create a function (in C++) like

{noformat}
type = int32()
arr = array_from_constant(type, 0)
{noformat}

You can then use these integers to make a DictionaryArray using something like DictionaryArray.from_arrays https://github.com/apache/arrow/blob/master/python/pyarrow/array.pyx#L406

Doesn't look like that handles Arrow arrays as inputs. Probably this deserves a new API that looks like {{DictionaryArray.from_indices(indices, type)}}.

New JIRA for this https://issues.apache.org/jira/browse/ARROW-643
{quote}

> [Python] Support reading Parquet datasets with standard partition directory schemes
> -----------------------------------------------------------------------------------
>
>                 Key: ARROW-539
>                 URL: https://issues.apache.org/jira/browse/ARROW-539
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Python
>            Reporter: Wes McKinney
>            Assignee: Miki Tebeka
>         Attachments: partitioned_parquet.tar.gz
>
>
> Currently, we only support multi-file directories with a flat structure
> (non-partitioned).

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
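
The idea in the comment above can be illustrated outside of Arrow. A partition column holds a single value repeated for every row, so it can be stored as a one-entry dictionary plus a constant array of integer indices instead of a Python list of repeated objects. The following is a minimal pure-Python sketch using only the standard library. It is not the pyarrow API: this `array_from_constant` (which here takes a type code and an explicit size, an assumption not in the comment), the `DictionaryColumn` class with its `from_indices` constructor, and `partition_column` are hypothetical stand-ins that only borrow the names proposed in the discussion.

```python
from array import array
from dataclasses import dataclass


def array_from_constant(typecode: str, value: int, size: int) -> array:
    """Hypothetical stand-in: a typed array holding `value` repeated `size` times."""
    # Repeating a one-element typed array avoids building a list of Python objects.
    return array(typecode, [value]) * size


@dataclass
class DictionaryColumn:
    """Hypothetical dictionary-encoded column: each index points into `dictionary`."""

    indices: array
    dictionary: list

    @classmethod
    def from_indices(cls, indices: array, dictionary: list) -> "DictionaryColumn":
        # Reject indices that do not refer to an existing dictionary entry.
        if any(i < 0 or i >= len(dictionary) for i in indices):
            raise IndexError("index out of range for dictionary")
        return cls(indices, dictionary)

    def __len__(self) -> int:
        return len(self.indices)

    def to_pylist(self) -> list:
        # Decode back to plain values (only needed for inspection).
        return [self.dictionary[i] for i in self.indices]


def partition_column(value: str, size: int) -> DictionaryColumn:
    """Encode one partition value (e.g. "2017" from a year=2017 directory) for `size` rows."""
    indices = array_from_constant("i", 0, size)
    return DictionaryColumn.from_indices(indices, [value])
```

Calling `partition_column("2017", 4)` yields a column of length 4 whose only stored string is `"2017"`. The per-row cost is one small integer, which is the property the constant-index approach in the comment is after.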