[ 
https://issues.apache.org/jira/browse/ARROW-11260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li reassigned ARROW-11260:
--------------------------------

    Assignee: David Li  (was: Ben Kietzman)

> [C++][Dataset] Don't require dictionaries for reading dataset with 
> schema-based Partitioning
> --------------------------------------------------------------------------------------------
>
>                 Key: ARROW-11260
>                 URL: https://issues.apache.org/jira/browse/ARROW-11260
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Joris Van den Bossche
>            Assignee: David Li
>            Priority: Major
>              Labels: dataset
>             Fix For: 4.0.0
>
>
> As a follow-up on ARROW-10247 (see also
> https://github.com/apache/arrow/pull/9130#issuecomment-760801124): we
> currently require the user to pass manually specified dictionary values when
> reading a dataset with a Partitioning based on a schema with dictionary-typed
> fields.
> In practice, this means the user has to, for example, parse the file paths
> to collect every value the partition field can take, after which Arrow
> parses the same paths again to construct the dataset object.
> _Naively_, it seems that it should be possible to let Arrow infer the
> dictionary _values_ even when the user provides an explicit schema with a
> dictionary field for the Partitioning (i.e. when the partitioning schema
> itself is not inferred from the file paths).
> An example use case is a Partitioning schema with both dictionary and
> non-dictionary fields. When the schema is discovered instead, it is all or
> nothing (all dictionary fields or no dictionary fields).
> cc [~bkietz]
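
To make the redundant step described above concrete, here is a minimal sketch of the path parsing a user currently has to do by hand before constructing the Partitioning. It is plain Python and does not call Arrow. The hive-style `key=value` directory layout, the example paths, the field names `year` and `part`, and the helper `collect_partition_values` are all assumptions for illustration; none of them come from the issue.

```python
from collections import defaultdict


def collect_partition_values(paths):
    """Collect the distinct values of each `key=value` path segment.

    Hypothetical helper: it mimics the manual step a user performs today
    to obtain the dictionary values for dictionary-typed partition fields.
    Values are kept as strings, in order of first appearance.
    """
    values = defaultdict(list)
    for path in paths:
        for segment in path.split("/"):
            key, sep, value = segment.partition("=")
            # Segments without "=" (e.g. the file name) are skipped.
            if sep and value not in values[key]:
                values[key].append(value)
    return dict(values)


# Hypothetical file paths of a partitioned dataset.
paths = [
    "year=2020/part=a/data.parquet",
    "year=2020/part=b/data.parquet",
    "year=2021/part=a/data.parquet",
]

all_values = collect_partition_values(paths)

# Only the dictionary-typed field (assumed here to be "part") needs
# explicit values; "year" is assumed to be a plain integer field.
dictionaries = {"part": all_values["part"]}
print(dictionaries)
```

Arrow walks the same paths again when it builds the dataset, which is why inferring the dictionary values there would make this user-side pass unnecessary.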



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
