hayman42 opened a new issue, #14105:
URL: https://github.com/apache/datafusion/issues/14105
### Describe the bug
The following error occurs on a Spark executor when I execute a Spark query on
Parquet files stored in HDFS.
These files were created by the DataFusion Python API.
```
Caused by: java.lang.IllegalArgumentException
    at java.nio.Buffer.limit(Buffer.java:275)
    at org.apache.hadoop.io.compress.lz4.Lz4Decompressor.decompress(Lz4Decompressor.java:232)
    at org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:88)
    at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:105)
    at java.io.DataInputStream.readFully(DataInputStream.java:195)
    at java.io.DataInputStream.readFully(DataInputStream.java:169)
    at org.apache.parquet.bytes.BytesInput$StreamBytesInput.toByteArray(BytesInput.java:286)
    at org.apache.parquet.bytes.BytesInput.toByteBuffer(BytesInput.java:237)
    at org.apache.parquet.bytes.BytesInput.toInputStream(BytesInput.java:246)
    at org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary.<init>(PlainValuesDictionary.java:154)
    at org.apache.parquet.column.Encoding$1.initDictionary(Encoding.java:96)
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.<init>(VectorizedColumnReader.java:123)
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initColumnReader(VectorizedParquetRecordReader.java:423)
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:413)
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:321)
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:219)
    at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:125)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:297)
    ... 19 more
```
I am new to the DataFusion ecosystem, so apologies in advance if I have
misunderstood something.
### To Reproduce
First I used the DataFusion Python API to convert TPC-H `.tbl` files to
Parquet, using the `'lz4'` compression option:
```python
df.write_parquet(parquet_filename, compression='lz4')
# Then put <parquet_filename> to hdfs
```
Then I put the Parquet files on HDFS and processed them through spark-sql:
```python
for table in table_names:
    df = spark.read.parquet(path)  # HDFS path for this table's parquet files
    df.createOrReplaceTempView(table)
    ...
df = spark.sql(sql)
df.head()  # ERROR
```
Version Info
```
spark version = 3.4.2
hadoop version = 3.1.2
datafusion version = 43.1
```
### Expected behavior
Hadoop's Lz4Decompressor should be able to decompress Parquet files generated
by DataFusion.
### Additional context
When I use the `'snappy'` option instead, it works fine:
```python
df.write_parquet(parquet_filename, compression='snappy')
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]