[ https://issues.apache.org/jira/browse/ARROW-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227033#comment-17227033 ]
David Li commented on ARROW-10480:
----------------------------------
I believe this is due to
[a56e483126|https://github.com/apache/arrow/commit/a56e483126], where
ParquetWriter now uses open_output_stream instead of open for non-legacy
filesystems; open_output_stream detects a compression codec from the file
extension, so a ".gz" suffix gzips the entire output file. Passing a legacy
filesystem explicitly doesn't work around it either, because
pyarrow.fs._ensure_filesystem maps the legacy filesystems to new ones.

[~jorisvandenbossche] what do you think? We often use filenames ending in
.parquet.snappy or .snappy.parquet, and the former now gets tripped up by the
detection - though I'm not sure how common that naming is.
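
To make the detection concrete, here is a minimal sketch (assuming pyarrow
2.0.0, where open_output_stream defaults to compression='detect'; the file
name here is just for illustration):

{code:python}
import pyarrow.fs

fs = pyarrow.fs.LocalFileSystem()

# With the default compression='detect', the ".gz" suffix wraps the sink
# in a gzip-compressing stream, so the bytes land gzipped on disk.
with fs.open_output_stream("test.pyarrow.gz") as sink:
    sink.write(b"these bytes end up gzip-wrapped as a whole file")

# Passing compression=None disables the detection and writes raw bytes.
with fs.open_output_stream("test.pyarrow.gz", compression=None) as sink:
    sink.write(b"these bytes are written as-is")
{code}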
> [Python] Parquet write_table creates gzipped Parquet file, not Parquet with
> gzip compression
> --------------------------------------------------------------------------------------------
>
> Key: ARROW-10480
> URL: https://issues.apache.org/jira/browse/ARROW-10480
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 2.0.0
> Reporter: David Li
> Priority: Major
> Fix For: 2.0.1
>
>
> Writing "foo.parquet.gz" in Arrow 2.0.0 creates a gzipped Parquet file, which
> Arrow can't read back, while in 1.0.1 it created a Parquet file with gzip
> compression. Hence I think this is a regression.
> In Arrow 2.0.0:
> {noformat}
> > pip freeze
> numpy==1.19.4
> pyarrow==2.0.0
> > python write.py
> Arrow: 2.0.0
> Read/write with PyArrow:
> test.pyarrow.gz: gzip compressed data, from Unix, original size modulo 2^32 630
> Traceback (most recent call last):
>   File "write.py", line 12, in <module>
>     print(pq.read_table("test.pyarrow.gz"))
>   File "/home/lidavidm/Code/twosigma/arrow-regression/venv/lib/python3.8/site-packages/pyarrow/parquet.py", line 1607, in read_table
>     dataset = _ParquetDatasetV2(
>   File "/home/lidavidm/Code/twosigma/arrow-regression/venv/lib/python3.8/site-packages/pyarrow/parquet.py", line 1452, in __init__
>     [fragment], schema=fragment.physical_schema,
>   File "pyarrow/_dataset.pyx", line 761, in pyarrow._dataset.Fragment.physical_schema.__get__
>   File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
>   File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
> OSError: Could not open parquet input source 'test.pyarrow.gz': Invalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file. {noformat}
> But in Arrow 1.0.1:
> {noformat}
> > pip freeze
> numpy==1.19.4
> pyarrow==1.0.1
> > python write.py
> Arrow: 1.0.1
> Read/write with PyArrow:
> test.pyarrow.gz: Apache Parquet
> pyarrow.Table
> ints: int64 {noformat}
> Reproduction:
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
> import subprocess
>
> print("Arrow:", pa.__version__)
> print()
> print("Read/write with PyArrow:")
>
> # Write a small table with gzip *column* compression to a ".gz"-suffixed path.
> table = pa.table([pa.array(range(4))], names=["ints"])
> pq.write_table(table, "test.pyarrow.gz", compression="GZIP")
>
> # `file` shows what actually landed on disk; reading it back fails on 2.0.0.
> subprocess.check_call(["file", "test.pyarrow.gz"])
> print(pq.read_table("test.pyarrow.gz"))
> {code}
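>
> A possible workaround sketch (my assumption, untested against 2.0.0): open
> the sink explicitly with compression=None so the extension detection never
> runs, then hand the stream to write_table instead of the path:
> {code:python}
> import pyarrow as pa
> import pyarrow.fs
> import pyarrow.parquet as pq
>
> table = pa.table([pa.array(range(4))], names=["ints"])
> fs = pyarrow.fs.LocalFileSystem()
>
> # compression=None bypasses the extension-based codec detection, so the
> # gzip below applies to the Parquet columns, not to the whole file.
> with fs.open_output_stream("test.pyarrow.gz", compression=None) as sink:
>     pq.write_table(table, sink, compression="GZIP")
>
> print(pq.read_table("test.pyarrow.gz"))
> {code}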
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)