Jorge Leitão created ARROW-15073:
------------------------------------
Summary: [C++][Python] LZ4-compressed parquet files are unreadable by (py)spark
Key: ARROW-15073
URL: https://issues.apache.org/jira/browse/ARROW-15073
Project: Apache Arrow
Issue Type: Bug
Reporter: Jorge Leitão
The following snippet reproduces the issue:
{code:python}
import pyarrow as pa  # pyarrow==6.0.1
import pyarrow.parquet
import pyspark.sql  # pyspark==3.1.2

path = "bla.parquet"

t = pa.table(
    [pa.array([0, 1, None, 3, None, 5, 6, 7, None, 9])],
    schema=pa.schema([pa.field("int64", pa.int64(), nullable=False)]),
)

pyarrow.parquet.write_table(
    t,
    path,
    use_dictionary=False,
    compression="LZ4",
)

spark = pyspark.sql.SparkSession.builder.getOrCreate()
result = spark.read.parquet(path).select("int64").collect()
{code}
On the Spark side, this fails with a Thrift protocol error:
{code:java}
Caused by: shaded.parquet.org.apache.thrift.protocol.TProtocolException:
Required field 'codec' was not present! Struct: ColumnMetaData(type:INT64,
encodings:[PLAIN, RLE], path_in_schema:[int64], codec:null, num_values:10,
total_uncompressed_size:142, total_compressed_size:104, data_page_offset:4,
statistics:Statistics(max:09 00 00 00 00 00 00 00, min:00 00 00 00 00 00 00 00,
null_count:0, max_value:09 00 00 00 00 00 00 00, min_value:00 00 00 00 00 00 00
00), encoding_stats:[PageEncodingStats(page_type:DATA_PAGE, encoding:PLAIN,
count:1)])
{code}
Found while debugging the root cause of
https://github.com/pola-rs/polars/issues/2018
--
This message was sent by Atlassian Jira
(v8.20.1#820001)