Lance Dacey created ARROW-10694:
-----------------------------------
Summary: [Python] ds.write_dataset() generates empty files for each final partition
Key: ARROW-10694
URL: https://issues.apache.org/jira/browse/ARROW-10694
Project: Apache Arrow
Issue Type: Bug
Affects Versions: 2.0.0
Environment: Ubuntu 18.04
Python 3.8.6
adlfs master branch
Reporter: Lance Dacey
ds.write_dataset() generates an empty file for each final partition folder,
which causes errors when reading the dataset or converting the dataset to a table.
I believe this may be caused by fs.mkdir(). Without a trailing slash in the
path, an empty file is created in the "dev" container:
{code:java}
import fsspec

fs = fsspec.filesystem(protocol='abfs', account_name=base.login,
                       account_key=base.password)
fs.mkdir("dev/test2")
{code}
If the trailing slash is added, a proper folder is created:
{code:java}
fs.mkdir("dev/test2/")
{code}
Here is a full example of what happens with ds.write_dataset:
{code:java}
import pyarrow as pa
import pyarrow.dataset as ds
from pyarrow.dataset import DirectoryPartitioning

schema = pa.schema(
    [
        ("year", pa.int16()),
        ("month", pa.int8()),
        ("day", pa.int8()),
        ("report_date", pa.date32()),
        ("employee_id", pa.string()),
        ("designation", pa.dictionary(index_type=pa.int16(),
                                      value_type=pa.string())),
    ]
)

part = DirectoryPartitioning(
    pa.schema([("year", pa.int16()), ("month", pa.int8()), ("day", pa.int8())])
)

ds.write_dataset(data=table,
                 base_dir="dev/test-dataset",
                 basename_template="test-{i}.parquet",
                 format="parquet",
                 partitioning=part,
                 schema=schema,
                 filesystem=fs)
dataset.files
# sample printed below; note the empty files
[
'dev/test-dataset/2018/1/1/test-0.parquet',
'dev/test-dataset/2018/10/1',
'dev/test-dataset/2018/10/1/test-27.parquet',
'dev/test-dataset/2018/3/1',
'dev/test-dataset/2018/3/1/test-6.parquet',
'dev/test-dataset/2020/1/1',
'dev/test-dataset/2020/1/1/test-2.parquet',
'dev/test-dataset/2020/10/1',
'dev/test-dataset/2020/10/1/test-29.parquet',
'dev/test-dataset/2020/11/1',
'dev/test-dataset/2020/11/1/test-32.parquet',
'dev/test-dataset/2020/2/1',
'dev/test-dataset/2020/2/1/test-5.parquet',
'dev/test-dataset/2020/7/1',
'dev/test-dataset/2020/7/1/test-20.parquet',
'dev/test-dataset/2020/8/1',
'dev/test-dataset/2020/8/1/test-23.parquet',
'dev/test-dataset/2020/9/1',
'dev/test-dataset/2020/9/1/test-26.parquet'
]{code}
As you can see, there is an empty file for each "day" partition. I was not
able to read the dataset at all until I manually deleted the first empty file
in the dataset (2018/1/1).
I then get an error when I try to use the to_table() method:
{code:java}
OSError                                   Traceback (most recent call last)
<ipython-input-127-6fb0d79c4511> in <module>
----> 1 dataset.to_table()

/opt/conda/lib/python3.8/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset.Dataset.to_table()

/opt/conda/lib/python3.8/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset.Scanner.to_table()

/opt/conda/lib/python3.8/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()

/opt/conda/lib/python3.8/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

OSError: Could not open parquet input source 'dev/test-dataset/2018/10/1': Invalid: Parquet file size is 0 bytes
{code}
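The manual deletion described below can also be scripted. Here is a sketch using only the standard library against a throwaway local copy of the layout; the helper name and paths are my own, and against abfs you would call the filesystem object's equivalent listing/removal methods instead:

```python
import os
import tempfile

def remove_empty_files(root):
    """Delete zero-byte files under root and return their paths."""
    removed = []
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            if os.path.getsize(path) == 0:
                os.remove(path)
                removed.append(path)
    return sorted(removed)

# Throwaway layout mimicking one partition with a zero-byte marker.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "2018", "10"))
open(os.path.join(root, "2018", "10", "1"), "w").close()  # empty "file"
with open(os.path.join(root, "2018", "10", "test-27.parquet"), "w") as f:
    f.write("placeholder bytes")  # stands in for real parquet data

print(remove_empty_files(root))  # only the zero-byte "1" entry
```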
If I manually delete the empty file, I can then use the to_table() function:
{code:java}
dataset.to_table(
    filter=(ds.field("year") == 2020) & (ds.field("month") == 10)
).to_pandas()
{code}
Is this a bug with pyarrow, adlfs, or fsspec?
--
This message was sent by Atlassian Jira
(v8.3.4#803005)