Blizzara opened a new issue, #13737:
URL: https://github.com/apache/datafusion/issues/13737
### Is your feature request related to a problem or challenge?
We're using ListingTable with an object_store. Sometimes our input dataset
contains empty Parquet files (literally 0 bytes in length). Our Spark-based
code paths succeed in "reading" those files (by skipping them), but
DataFusion fails hard:
```
ParquetError(EOF("file size of 0 is less than footer"))
```
This error is presumably thrown by
https://github.com/apache/arrow-rs/blob/06a015770098a569b67855dfaa18bdfa7c18ff92/parquet/src/file/metadata/reader.rs#L543.
I think a possible fix would be to filter out empty files, for example in
https://github.com/apache/datafusion/blob/28e4c64dc738227cd6a4cdf7db48685338582c04/datafusion/core/src/datasource/listing/helpers.rs#L173
and
https://github.com/apache/datafusion/blob/28e4c64dc738227cd6a4cdf7db48685338582c04/datafusion/core/src/datasource/listing/url.rs#L265
with something like
```
.try_filter(|object_meta| futures::future::ready(object_meta.size > 0))
```
This would align with Spark:
https://github.com/apache/spark/blob/b2c8b3069ef4f5288a5964af0da6f6b23a769e6b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala#L82C9-L82C23
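To make the intended behaviour concrete, here is a minimal std-only sketch of the filtering idea. It is a synchronous analogue, not DataFusion's actual code: `FileMeta` and `skip_empty_files` are hypothetical names standing in for object_store's `ObjectMeta` and for the stream filter in the listing helpers. Like `try_filter`, it only tests successful entries and lets listing errors pass through.

```rust
/// Hypothetical stand-in for object_store's `ObjectMeta`,
/// reduced to the two fields this sketch needs.
#[derive(Debug, Clone, PartialEq)]
pub struct FileMeta {
    pub location: String,
    pub size: u64,
}

/// Drops zero-byte files from a listing. Errors are kept untouched so the
/// caller still sees listing failures (the same contract `try_filter` has
/// on a stream of `Result`s).
pub fn skip_empty_files<I, E>(listing: I) -> impl Iterator<Item = Result<FileMeta, E>>
where
    I: IntoIterator<Item = Result<FileMeta, E>>,
{
    listing.into_iter().filter(|entry| match entry {
        // Keep only files that actually contain bytes.
        Ok(meta) => meta.size > 0,
        // Never swallow an error: propagate it to the consumer.
        Err(_) => true,
    })
}
```

With this shape, a 0-byte file never reaches the Parquet reader, so the footer check is never attempted for it, while non-empty but corrupt files would still fail as they do today.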
Thoughts? Alternatively, I can fork ListingTable internally if this isn't
something we want upstream. I'm also open to other ideas 😄
### Describe the solution you'd like
_No response_
### Describe alternatives you've considered
_No response_
### Additional context
_No response_