[I] Ignore empty (parquet) files when using ListingTable [datafusion]

via GitHub Wed, 11 Dec 2024 12:03:09 -0800


Blizzara opened a new issue, #13737:
URL: https://github.com/apache/datafusion/issues/13737


   ### Is your feature request related to a problem or challenge?
   
   We're using ListingTable with an object_store. Our input dataset sometimes 
contains empty Parquet files (literally empty: 0 bytes long). Our Spark-based 
code paths succeed in "reading" those files by skipping them, but DataFusion 
fails hard:
   ```
   ParquetError(EOF("file size of 0 is less than footer"))
   ```
   
   This error is presumably thrown by 
https://github.com/apache/arrow-rs/blob/06a015770098a569b67855dfaa18bdfa7c18ff92/parquet/src/file/metadata/reader.rs#L543.
 
   
   I think a possible fix would be to filter out empty files, for example in 
https://github.com/apache/datafusion/blob/28e4c64dc738227cd6a4cdf7db48685338582c04/datafusion/core/src/datasource/listing/helpers.rs#L173
 and 
https://github.com/apache/datafusion/blob/28e4c64dc738227cd6a4cdf7db48685338582c04/datafusion/core/src/datasource/listing/url.rs#L265
   with something like 
   ```
   .try_filter(|object_meta| futures::future::ready(object_meta.size > 0))
   ```
   
   This would align with Spark: 
https://github.com/apache/spark/blob/b2c8b3069ef4f5288a5964af0da6f6b23a769e6b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala#L82C9-L82C23
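
To make the intended behaviour concrete, here is a minimal std-only sketch of the filter. It is an illustration under assumptions, not the DataFusion implementation: `FileMeta` and `skip_empty_files` are hypothetical names, `FileMeta` only loosely mirrors the listing metadata an object store returns, and the real listing is an async stream rather than an iterator.

```rust
/// Hypothetical stand-in for the metadata of one listed object.
/// This is NOT the real `object_store` type, just the two fields
/// the proposal cares about.
#[derive(Debug, Clone, PartialEq)]
pub struct FileMeta {
    pub location: String,
    pub size: u64,
}

/// Drop zero-byte entries from a listing so they are never handed
/// to the Parquet reader, which cannot find a footer in an empty file.
pub fn skip_empty_files(listing: impl IntoIterator<Item = FileMeta>) -> Vec<FileMeta> {
    listing
        .into_iter()
        // Keep only objects that contain at least one byte.
        .filter(|meta| meta.size > 0)
        .collect()
}
```

Passing a listing with one 0-byte file and one non-empty file should return only the non-empty entry, which is the same outcome as Spark skipping the empty file.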
   
   Thoughts? Alternatively, I can fork ListingTable internally if this isn't 
something we want upstream. I'm also open to other ideas. 😄 
   
   ### Describe the solution you'd like
   
   _No response_
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   _No response_

