Jorge Leitão created ARROW-15073:
------------------------------------
Summary: [C++][Python] LZ4-compressed parquet files are unreadable by (py)spark
Key: ARROW-15073
URL: https://issues.apache.org/jira/browse/ARROW-15073
Project: Apache Arrow
Issue Type: Bug
Reporter: Jorge Leitão
The following snippet reproduces the issue:
{code:python}
import pyarrow as pa  # pyarrow==6.0.1
import pyarrow.parquet
import pyspark.sql  # pyspark==3.1.2

path = "bla.parquet"

t = pa.table(
    [pa.array([0, 1, None, 3, None, 5, 6, 7, None, 9])],
    schema=pa.schema([pa.field("int64", pa.int64(), nullable=False)]),
)

pyarrow.parquet.write_table(
    t,
    path,
    use_dictionary=False,
    compression="LZ4",
)

spark = pyspark.sql.SparkSession.builder.getOrCreate()
result = spark.read.parquet(path).select("int64").collect()
{code}
On the Spark side, this fails with a Thrift protocol error:
{code:java}
Caused by: shaded.parquet.org.apache.thrift.protocol.TProtocolException:
Required field 'codec' was not present! Struct: ColumnMetaData(type:INT64,
encodings:[PLAIN, RLE], path_in_schema:[int64], codec:null, num_values:10,
total_uncompressed_size:142, total_compressed_size:104, data_page_offset:4,
statistics:Statistics(max:09 00 00 00 00 00 00 00, min:00 00 00 00 00 00 00 00,
null_count:0, max_value:09 00 00 00 00 00 00 00, min_value:00 00 00 00 00 00 00
00), encoding_stats:[PageEncodingStats(page_type:DATA_PAGE, encoding:PLAIN,
count:1)])
{code}
Found while debugging the root cause of
https://github.com/pola-rs/polars/issues/2018
--
This message was sent by Atlassian Jira
(v8.20.1#820001)