[jira] [Commented] (ARROW-7345) [Python] Writing partitions with NaNs silently drops data

Joris Van den Bossche (Jira) Mon, 09 Dec 2019 06:07:19 -0800


    [ 
https://issues.apache.org/jira/browse/ARROW-7345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16991615#comment-16991615
 ]


Joris Van den Bossche commented on ARROW-7345:
----------------------------------------------

[~karldw] you are correct that it is currently the use of pandas' groupby that 
results in this behaviour.
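
For illustration, the pandas behaviour can be reproduced without Arrow. This is a minimal sketch, not the code path in pyarrow itself: the frame below is a hand-built stand-in for the two-row table in the report, and it assumes only that pandas' default `groupby` is what does the grouping.

```python
import numpy as np
import pandas as pd

# Stand-in for the table in the report: one row has a missing key in "b".
df = pd.DataFrame({"a": [1, 2], "b": [2.0, np.nan]})

# With the default groupby, rows whose key is NaN form no group at all,
# so they never reach whatever is done per group (here: counting rows).
default_keys = [float(key) for key, _ in df.groupby("b")]
rows_kept = sum(len(group) for _, group in df.groupby("b"))

print(default_keys)             # [2.0]
print(rows_kept, "of", len(df), "rows survive the groupby")
```

Later pandas releases (1.1 and newer) accept `groupby(..., dropna=False)`, which keeps the NaN group; whether that is the right fix for `write_to_dataset` is a separate question.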

There are vague plans to implement this groupby step natively in Arrow; see, e.g., 
ARROW-2628 and ARROW-5002.

> [Python] Writing partitions with NaNs silently drops data
> ---------------------------------------------------------
>
>                 Key: ARROW-7345
>                 URL: https://issues.apache.org/jira/browse/ARROW-7345
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.15.1
>            Reporter: Karl Dunkle Werner
>            Priority: Minor
>              Labels: dataset, parquet
>
> When writing a partitioned table, if the partitioning column has NA values, 
> they're silently dropped. I think it would be helpful if there were a warning. 
> Even better, from my perspective, would be writing out those partitions with 
> a directory name like {{partition_col=NaN}}. 
> Here's a small example where only the {{b = 2}} group is written out and the 
> {{b = NaN}} group is dropped.
> {code:python}
> import os
> import tempfile
> import pyarrow.json
> import pyarrow.parquet
> from pathlib import Path
> # Create a dataset with NaN:
> json_str = """
> {"a": 1, "b": 2}
> {"a": 2, "b": null}
> """
> with tempfile.NamedTemporaryFile() as tf:
>     tf = Path(tf.name)
>     tf.write_text(json_str)
>     table = pyarrow.json.read_json(tf)
> # Write out a partitioned dataset, using the NaN-containing column
> with tempfile.TemporaryDirectory() as out_dir:
>     pyarrow.parquet.write_to_dataset(table, out_dir, partition_cols=["b"])
>     print(os.listdir(out_dir))
>     read_table = pyarrow.parquet.read_table(out_dir)
> print(f"Wrote out {table.shape[0]} rows, read back {read_table.shape[0]} row")
> # Output:
> #> ['b=2.0']
> #> Wrote out 2 rows, read back 1 row
> {code}
>  
> It looks like this is caused by pandas dropping NaNs when doing [the {{groupby}} 
> here|https://github.com/apache/arrow/blob/b16a3b53092ccfbc67e5a4e5c90be5913a67c8a5/python/pyarrow/parquet.py#L1434].
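
The directory layout requested in the quoted report (a `partition_col=NaN` directory instead of dropped rows) can be sketched in plain Python. This is only an illustration of the idea: `partition_dir_name` and `partition_rows` are hypothetical helpers, not pyarrow functions, and the `NaN` label is taken from the report's suggestion rather than from any existing Arrow convention.

```python
import math
from collections import defaultdict


def partition_dir_name(column, value):
    """Hypothetical helper: directory name for one partition value."""
    is_missing = value is None or (isinstance(value, float) and math.isnan(value))
    return f"{column}=NaN" if is_missing else f"{column}={value}"


def partition_rows(rows, column):
    """Group rows by partition directory, keeping rows with a missing key."""
    groups = defaultdict(list)
    for row in rows:
        groups[partition_dir_name(column, row.get(column))].append(row)
    return dict(groups)


rows = [{"a": 1, "b": 2}, {"a": 2, "b": None}]
partitions = partition_rows(rows, "b")

print(sorted(partitions))                          # ['b=2', 'b=NaN']
print(sum(len(group) for group in partitions.values()), "rows kept")
```

Grouping on the directory name rather than on the raw value sidesteps the fact that NaN does not compare equal to itself, so every missing key lands in the same bucket.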



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
