[ https://issues.apache.org/jira/browse/ARROW-3247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17047208#comment-17047208 ]
Micah Kornfield commented on ARROW-3247:
----------------------------------------
{quote}
<map-repetition> group <name> (MAP) {
  repeated group key_value {
    required <key-type> key;
    <value-repetition> <value-type> value;
  }
}
{quote}
Sorry, I'm not seeing the difference between this Map schema and the one listed
in the Parquet spec (pasted above).
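As a rough illustration of why the layout quoted above has a separate repeated
key_value level, here is a minimal Python sketch of the Dremel-style
repetition/definition levels for the key column. It assumes an OPTIONAL map
with a REQUIRED key; the helper name key_levels is hypothetical and is not part
of Arrow, Parquet or fastparquet.

```python
from typing import Dict, List, Optional, Tuple

Level = Tuple[int, int, Optional[str]]


def key_levels(rows: List[Optional[Dict[str, Optional[str]]]]) -> List[Level]:
    """Hypothetical helper: (repetition, definition, key) triples for the key
    column of an OPTIONAL map > REPEATED key_value > REQUIRED key."""
    out: List[Level] = []
    for row in rows:
        if row is None:
            # The map itself is NULL: nothing on the path is defined.
            out.append((0, 0, None))
        elif not row:
            # The map is defined but has no key_value entries (empty map).
            out.append((0, 1, None))
        else:
            for i, key in enumerate(row):
                # The first entry starts a new row (repetition 0); later
                # entries repeat at the key_value level (repetition 1).
                out.append((0 if i == 0 else 1, 2, key))
    return out


rows = [{"a": "x", "b": None}, None, {}]
levels = key_levels(rows)
for level in levels:
    print(level)
```

In this sketch the NULL map and the empty map end up with different definition
levels (0 and 1), which is the distinction the extra nesting level allows.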
Other issues covering this:
https://issues.apache.org/jira/browse/ARROW-1644 (this is the one I'm actively
updating with subtasks)
https://issues.apache.org/jira/browse/ARROW-2587?filter=-1
https://issues.apache.org/jira/browse/ARROW-1599?filter=-1
Discussion on mailing list:
https://mail-archives.apache.org/mod_mbox/arrow-dev/202002.mbox/%3CCAJPUwMBP_CyfsVn0nCQx%3DP6AFuGaAcYRr-x9Y0GtJ7d2QTZRHA%40mail.gmail.com%3E
> [Python] Support spark parquet array and map types
> --------------------------------------------------
>
> Key: ARROW-3247
> URL: https://issues.apache.org/jira/browse/ARROW-3247
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Martin Durant
> Priority: Minor
> Labels: parquet
>
> As far as I understand, there is already some support for nested
> array/dict/structs in arrow. However, spark Map and List types are structured
> one level deeper (I believe to allow for both NULL and empty entries).
> Surprisingly, fastparquet can load these. I do not know the plan for
> arbitrary nested object support, but it should be made clear.
> Schema of spark-generated file from the fastparquet test suite:
> {code:java}
> - spark_schema:
> | - map_op_op: MAP, OPTIONAL
> |   - key_value: REPEATED
> |   | - key: BYTE_ARRAY, UTF8, REQUIRED
> |     - value: BYTE_ARRAY, UTF8, OPTIONAL
> | - map_op_req: MAP, OPTIONAL
> |   - key_value: REPEATED
> |   | - key: BYTE_ARRAY, UTF8, REQUIRED
> |     - value: BYTE_ARRAY, UTF8, REQUIRED
> | - map_req_op: MAP, REQUIRED
> |   - key_value: REPEATED
> |   | - key: BYTE_ARRAY, UTF8, REQUIRED
> |     - value: BYTE_ARRAY, UTF8, OPTIONAL
> | - map_req_req: MAP, REQUIRED
> |   - key_value: REPEATED
> |   | - key: BYTE_ARRAY, UTF8, REQUIRED
> |     - value: BYTE_ARRAY, UTF8, REQUIRED
> | - arr_op_op: LIST, OPTIONAL
> |   - list: REPEATED
> |     - element: BYTE_ARRAY, UTF8, OPTIONAL
> | - arr_op_req: LIST, OPTIONAL
> |   - list: REPEATED
> |     - element: BYTE_ARRAY, UTF8, REQUIRED
> | - arr_req_op: LIST, REQUIRED
> |   - list: REPEATED
> |     - element: BYTE_ARRAY, UTF8, OPTIONAL
>   - arr_req_req: LIST, REQUIRED
>     - list: REPEATED
>       - element: BYTE_ARRAY, UTF8, REQUIRED
> {code}
> (please forgive that some of this has already been mentioned elsewhere; this
> is one of the entries in the list at
> [https://github.com/dask/fastparquet/issues/374] as a feature that is useful
> in fastparquet)