[jira] [Created] (ARROW-11260) [C++][Dataset] Don't require dictionaries for reading dataset with schema-based Partitioning

Joris Van den Bossche (Jira) Fri, 15 Jan 2021 03:25:05 -0800

Joris Van den Bossche created ARROW-11260:
---------------------------------------------


             Summary: [C++][Dataset] Don't require dictionaries for reading 
dataset with schema-based Partitioning
                 Key: ARROW-11260
                 URL: https://issues.apache.org/jira/browse/ARROW-11260
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
            Reporter: Joris Van den Bossche


This is a follow-up to ARROW-10247 (see also 
https://github.com/apache/arrow/pull/9130#issuecomment-760801124). We currently 
require the user to pass manually specified dictionary values when reading a 
dataset with a Partitioning based on a schema with dictionary-typed fields. 

In practice, that means the user needs to, for example, parse the file paths 
to collect all the possible values each partition field can take, after which 
Arrow parses the same paths again to construct the dataset object. 
_Naively_, it seems that it should be possible to let Arrow infer the 
dictionary _values_, even when an explicit schema with a dictionary field is 
provided for the Partitioning (i.e. when the partitioning schema itself is not 
inferred from the file paths).
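
To illustrate the duplicated work, here is a minimal sketch of the path parsing 
a user currently has to do by hand. It assumes hive-style `key=value` directory 
names; the helper `collect_partition_values` and the file listing are 
hypothetical and not part of Arrow.

```python
from collections import defaultdict


def collect_partition_values(paths):
    """Collect the distinct values of each `key=value` path segment.

    Values are kept in order of first appearance; segments without an
    `=` (such as the file name) are ignored.
    """
    values = defaultdict(list)
    for path in paths:
        for segment in path.split("/"):
            key, sep, value = segment.partition("=")
            if sep and value not in values[key]:
                values[key].append(value)
    return dict(values)


# Hypothetical file listing for a dataset partitioned by `year` and `part`.
paths = [
    "year=2020/part=a/data.parquet",
    "year=2020/part=b/data.parquet",
    "year=2021/part=a/data.parquet",
]
partition_values = collect_partition_values(paths)
```

The collected values would then be converted to Arrow arrays and handed over as 
the manually specified dictionaries (in the Python bindings presumably through 
the `dictionaries` keyword of `pyarrow.dataset.partitioning`), after which Arrow 
walks the same paths a second time.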

An example use case is a Partitioning schema with both dictionary and 
non-dictionary fields. When the schema is discovered instead, it is all or 
nothing: either all partition fields are dictionary fields or none are.

cc [~bkietz]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
