[ https://issues.apache.org/jira/browse/ARROW-11260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265942#comment-17265942 ]
Joris Van den Bossche commented on ARROW-11260:
-----------------------------------------------
Sidenote: this actually worked for a short time in pyarrow 0.17.0. I added a
test for it in the PR that fixed it
(https://github.com/apache/arrow/pull/6641#issuecomment-600746259), but the
test apparently got lost in a rebase ;). I suppose the behaviour changed in
pyarrow 1.0.0, after we ensured that dictionary-typed partition fields "know"
the full dictionary of all possible values in the dataset
(https://github.com/apache/arrow/pull/7536#issuecomment-649500017).
The test that worked at the time, and that we forgot to include:
{code:python}
import pytest

import pyarrow as pa
import pyarrow.dataset as ds


@pytest.mark.pandas
def test_partitioning_dictionary_key(mockfs):
    # ARROW-8088 specifying partition key as dictionary type
    # (mockfs is the filesystem fixture from pyarrow's dataset tests)
    schema = pa.schema([
        pa.field('group', pa.dictionary(pa.int8(), pa.int32())),
        pa.field('key', pa.dictionary(pa.int8(), pa.string()))
    ])
    part = ds.DirectoryPartitioning(schema)
    dataset = ds.dataset(
        "subdir", format="parquet", filesystem=mockfs, partitioning=part
    )
    table = dataset.to_table()
    assert table.column('group').type.equals(schema.types[0])
    assert table.column('group').to_pylist() == [1] * 5 + [2] * 5
    assert table.column('key').type.equals(schema.types[1])
    assert table.column('key').to_pylist() == ['xxx'] * 5 + ['yyy'] * 5
{code}
(but note that this doesn't check that each chunk of the ChunkedArray for the
partition columns has all dictionary values, which is the feature that was
added later)
> [C++][Dataset] Don't require dictionaries for reading dataset with
> schema-based Partitioning
> --------------------------------------------------------------------------------------------
>
> Key: ARROW-11260
> URL: https://issues.apache.org/jira/browse/ARROW-11260
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Joris Van den Bossche
> Priority: Major
> Labels: dataset
> Fix For: 4.0.0
>
>
> This is a follow-up on ARROW-10247 (see also
> https://github.com/apache/arrow/pull/9130#issuecomment-760801124). We
> currently require the user to pass manually specified dictionary values when
> reading a dataset with a Partitioning based on a schema with dictionary-typed
> fields.
> In practice that means that the user needs, for example, to parse the file
> paths to get all the possible values the partition field can take, while
> Arrow will afterwards do the same again to construct the dataset object.
> _Naively_, it seems that it should be possible to let Arrow infer the
> dictionary _values_, even when providing an explicit schema with a dictionary
> field for the Partitioning (i.e. when the partitioning schema itself is not
> inferred from the file paths).
> An example use case is when you have a Partitioning schema with both
> dictionary and non-dictionary fields. When discovering the schema, you can
> only have all or nothing (all dictionary fields or no dictionary fields).
> cc [~bkietz]
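
To make the manual step from the description concrete, here is a rough sketch
of the path parsing a user currently has to do. The function name
`collect_partition_values` and the example paths are hypothetical (the paths
just mirror the values asserted in the test above), and it assumes a plain
directory layout with one path segment per partition field:

```python
from collections import defaultdict
from pathlib import PurePosixPath


def collect_partition_values(paths, base_dir, field_names):
    """Collect the distinct directory segments seen for each partition field."""
    seen = defaultdict(dict)  # dict used as an insertion-ordered set
    for path in paths:
        relative = PurePosixPath(path).relative_to(base_dir)
        # Everything except the final component (the file name) is a segment.
        segments = relative.parts[:-1]
        for name, segment in zip(field_names, segments):
            seen[name][segment] = None
    return {name: list(seen[name]) for name in field_names}


# Hypothetical file listing, for illustration only.
paths = [
    "subdir/1/xxx/file0.parquet",
    "subdir/2/yyy/file1.parquet",
]
values = collect_partition_values(paths, "subdir", ["group", "key"])
print(values)  # {'group': ['1', '2'], 'key': ['xxx', 'yyy']}
```

The collected values are strings, so they would still need to be cast to each
field's value type (int32 for 'group' in the test above) before being handed
to the Partitioning as dictionary values.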