hayman42 opened a new issue, #14105:
URL: https://github.com/apache/datafusion/issues/14105

   ### Describe the bug
   
   The following error occurs on a Spark executor when I execute a Spark query on Parquet files stored in HDFS.
   The files were created with the DataFusion Python API.
   
   ```
   Caused by: java.lang.IllegalArgumentException
        at java.nio.Buffer.limit(Buffer.java:275)
        at org.apache.hadoop.io.compress.lz4.Lz4Decompressor.decompress(Lz4Decompressor.java:232)
        at org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:88)
        at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:105)
        at java.io.DataInputStream.readFully(DataInputStream.java:195)
        at java.io.DataInputStream.readFully(DataInputStream.java:169)
        at org.apache.parquet.bytes.BytesInput$StreamBytesInput.toByteArray(BytesInput.java:286)
        at org.apache.parquet.bytes.BytesInput.toByteBuffer(BytesInput.java:237)
        at org.apache.parquet.bytes.BytesInput.toInputStream(BytesInput.java:246)
        at org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary.<init>(PlainValuesDictionary.java:154)
        at org.apache.parquet.column.Encoding$1.initDictionary(Encoding.java:96)
        at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.<init>(VectorizedColumnReader.java:123)
        at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initColumnReader(VectorizedParquetRecordReader.java:423)
        at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:413)
        at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:321)
        at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:219)
        at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:125)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:297)
        ... 19 more
   ```
   
   I am new to the DataFusion ecosystem, so apologies in advance if I have misunderstood something.
   
   ### To Reproduce
   
   First, I used the DataFusion Python API to convert the TPC-H `.tbl` files to Parquet, using the `'lz4'` compression option:
   ```python
   df.write_parquet(parquet_filename, compression='lz4')
   # Then copy <parquet_filename> to HDFS
   ```
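
   For completeness, this is roughly the shape of the conversion script. The table name, file paths, and CSV options are placeholders for illustration; the `read_csv` arguments reflect my understanding of the DataFusion Python API rather than the exact code I ran.
   ```python
   from datafusion import SessionContext

   ctx = SessionContext()
   tbl_path = "lineitem.tbl"              # placeholder: a TPC-H .tbl file
   parquet_filename = "lineitem.parquet"  # placeholder: output path

   # TPC-H .tbl files have no header row and are '|'-delimited
   df = ctx.read_csv(tbl_path, has_header=False, delimiter="|")
   df.write_parquet(parquet_filename, compression="lz4")
   ```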
   
   Then I read the Parquet files back from HDFS to process them through Spark SQL:
   ```python
   for table in table_names:
       df = spark.read.parquet(path)  # path: HDFS location of this table's Parquet file
       df.createOrReplaceTempView(table)
   ...
   df = spark.sql(sql)
   df.head() # ERROR
   ```
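
   To help triage, here is a small diagnostic sketch (not part of my original script) that inspects which compression codec tag was actually written into the file, using PyArrow; the Parquet format defines both a legacy `LZ4` codec and the newer `LZ4_RAW`.
   ```python
   import pyarrow.parquet as pq

   # Print the codec recorded for each column chunk in row group 0,
   # e.g. 'SNAPPY', 'LZ4', or 'LZ4_RAW'
   meta = pq.ParquetFile("lineitem.parquet").metadata
   rg = meta.row_group(0)
   for i in range(rg.num_columns):
       col = rg.column(i)
       print(col.path_in_schema, col.compression)
   ```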
   
   Version Info
   ```
   spark version = 3.4.2
   hadoop version = 3.1.2
   datafusion version = 43.1
   ```
   
   ### Expected behavior
   
   Hadoop's Lz4Decompressor should be able to decompress Parquet files generated by DataFusion.
   
   ### Additional context
   
   When I use the 'snappy' option, it works fine:
   ```python
   df.write_parquet(parquet_filename, compression='snappy')
   ```
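
   My (unverified) guess is that this is an LZ4 framing mismatch: the stack trace shows Hadoop's own `Lz4Decompressor`, which expects Hadoop's block framing for the legacy `LZ4` codec, while the Parquet spec also defines `LZ4_RAW`. If `write_parquet` accepts the same codec names as the underlying Rust Parquet writer, the following might avoid the legacy codec entirely; note that `'lz4_raw'` is an assumption on my part, not an option I have tested.
   ```python
   # Assumption: 'lz4_raw' is a valid compression name here; if it is
   # not, 'snappy' is the codec I have confirmed to work end to end.
   df.write_parquet(parquet_filename, compression="lz4_raw")
   ```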

