[jira] [Commented] (ARROW-11260) [C++][Dataset] Don't require dictionaries for reading dataset with schema-based Partitioning

Joris Van den Bossche (Jira) Fri, 15 Jan 2021 03:29:04 -0800


    [ https://issues.apache.org/jira/browse/ARROW-11260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265942#comment-17265942 ]


Joris Van den Bossche commented on ARROW-11260:
-----------------------------------------------

Side note: this actually also worked for a short time in pyarrow 0.17.0. I 
added a test for it in the PR that fixed it 
(https://github.com/apache/arrow/pull/6641#issuecomment-600746259), but that 
test apparently got lost in a rebase ;). I suppose the behaviour changed in 
pyarrow 1.0.0, after we ensured that dictionary-typed partition fields "know" 
the full dictionary of all possible values in the dataset 
(https://github.com/apache/arrow/pull/7536#issuecomment-649500017). 

This is the test that worked at the time, and that we forgot to include:

{code:python}
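import pytest
import pyarrow as pa
import pyarrow.dataset as ds


@pytest.mark.pandas
def test_partitioning_dictionary_key(mockfs):
    # ARROW-8088 specifying partition key as dictionary type
    # (`mockfs` is the mock filesystem fixture from pyarrow's dataset tests)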
    schema = pa.schema([
        pa.field('group', pa.dictionary(pa.int8(), pa.int32())),
        pa.field('key', pa.dictionary(pa.int8(), pa.string()))
    ])
    part = ds.DirectoryPartitioning(schema)

    dataset = ds.dataset(
        "subdir", format="parquet", filesystem=mockfs, partitioning=part
    )
    table = dataset.to_table()

    assert table.column('group').type.equals(schema.types[0])
    assert table.column('group').to_pylist() == [1] * 5 + [2] * 5
    assert table.column('key').type.equals(schema.types[1])
    assert table.column('key').to_pylist() == ['xxx'] * 5 + ['yyy'] * 5
{code}

(Note, however, that this test doesn't check that each chunk of the 
ChunkedArray for the partition columns has all dictionary values, which is the 
feature that was added later.)

> [C++][Dataset] Don't require dictionaries for reading dataset with 
> schema-based Partitioning
> --------------------------------------------------------------------------------------------
>
>                 Key: ARROW-11260
>                 URL: https://issues.apache.org/jira/browse/ARROW-11260
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Joris Van den Bossche
>            Priority: Major
>              Labels: dataset
>             Fix For: 4.0.0
>
>
> This is a follow-up on ARROW-10247 (see also 
> https://github.com/apache/arrow/pull/9130#issuecomment-760801124). We 
> currently require the user to pass manually specified dictionary values when 
> reading a dataset with a Partitioning based on a schema with dictionary-typed 
> fields. 
> In practice that means that the user needs, for example, to parse the file 
> paths to get all the possible values the partition field can take, while 
> Arrow will afterwards do the same again to construct the dataset object. 
> _Naively_, it seems that it should be possible to let Arrow infer the 
> dictionary _values_, even when providing an explicit schema with a dictionary 
> field for the Partitioning (i.e. when the partitioning schema itself is not 
> inferred from the file paths).
> An example use case is when you have a Partitioning schema with both 
> dictionary and non-dictionary fields. When discovering the schema, you can 
> only have all or nothing (all dictionary fields or no dictionary fields).
> cc [~bkietz]
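
For illustration, the manual path parsing described above might look roughly like the following sketch. It uses only the Python standard library; the helper name `collect_partition_values` and the example paths are hypothetical, and it assumes a directory-style layout of `<base>/<group>/<key>/<file>`.

```python
from pathlib import PurePosixPath


def collect_partition_values(paths, field_names, base_dir=""):
    """Collect the sorted distinct values of each partition field from
    directory-partitioned file paths (one path segment per field)."""
    values = {name: set() for name in field_names}
    for path in paths:
        rel = PurePosixPath(path)
        if base_dir:
            rel = rel.relative_to(base_dir)
        # Drop the file name; the remaining segments are partition values.
        segments = rel.parts[:-1]
        for name, segment in zip(field_names, segments):
            values[name].add(segment)
    return {name: sorted(found) for name, found in values.items()}


# Hypothetical file listing for a dataset rooted at "subdir".
paths = [
    "subdir/1/xxx/file0.parquet",
    "subdir/2/yyy/file1.parquet",
]
values = collect_partition_values(paths, ["group", "key"], base_dir="subdir")
```

The collected values are still strings at this point. They would have to be converted to the value type of each dictionary field and handed to the Partitioning, which is exactly the duplicated work this issue proposes to let Arrow do itself.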



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
