[ https://issues.apache.org/jira/browse/ARROW-11260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
David Li reassigned ARROW-11260:
--------------------------------
Assignee: David Li (was: Ben Kietzman)
> [C++][Dataset] Don't require dictionaries for reading dataset with
> schema-based Partitioning
> --------------------------------------------------------------------------------------------
>
> Key: ARROW-11260
> URL: https://issues.apache.org/jira/browse/ARROW-11260
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Joris Van den Bossche
> Assignee: David Li
> Priority: Major
> Labels: dataset
> Fix For: 4.0.0
>
>
> This is a follow-up to ARROW-10247 (see also
> https://github.com/apache/arrow/pull/9130#issuecomment-760801124). We
> currently require the user to pass manually specified dictionary values when
> reading a dataset with a Partitioning based on a schema with dictionary-typed
> fields.
> In practice, this means the user has to, for example, parse the file
> paths to collect every value a partition field can take, after which
> Arrow parses the same paths again to construct the dataset object.
> _Naively_, it seems it should be possible to let Arrow infer the
> dictionary _values_ even when an explicit schema with a dictionary
> field is provided for the Partitioning (i.e. when the partitioning schema
> itself is not inferred from the file paths).
> An example use case is a Partitioning schema with both
> dictionary and non-dictionary fields. When the schema is discovered, it is
> all or nothing (all dictionary fields or no dictionary fields).
> cc [~bkietz]
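
To illustrate the manual step the description refers to, here is a minimal sketch of the path parsing a user currently has to do before constructing the Partitioning. It assumes a directory-style layout where each path segment below a base directory maps to one partition field. The helper name `collect_partition_values`, the field names, and the example paths are hypothetical and not taken from the issue.

```python
from pathlib import PurePosixPath


def collect_partition_values(paths, field_names, base_dir=""):
    """Collect the distinct values of each partition field from file paths.

    Assumes (hypothetically) that the i-th directory segment below
    ``base_dir`` holds the value of the i-th partition field.
    """
    values = {name: [] for name in field_names}
    for path in paths:
        rel = PurePosixPath(path)
        if base_dir:
            rel = rel.relative_to(base_dir)
        segments = rel.parts[:-1]  # drop the file name itself
        for name, segment in zip(field_names, segments):
            # Keep first-seen order and skip values already recorded.
            if segment not in values[name]:
                values[name].append(segment)
    return values


# Hypothetical dataset layout: <base_dir>/<year>/<key>/<file>
paths = [
    "data/2020/a/part-0.parquet",
    "data/2020/b/part-0.parquet",
    "data/2021/a/part-0.parquet",
]
values = collect_partition_values(paths, ["year", "key"], base_dir="data")
```

The collected values for the dictionary-typed fields would then be converted to Arrow arrays and passed as the manually specified dictionaries. Arrow walks the same paths again when it discovers the dataset, which is the duplicated work this issue proposes to remove.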