tustvold opened a new issue, #2208:
URL: https://github.com/apache/arrow-datafusion/issues/2208

   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   
   Hive-compatible metastores, such as AWS Glue (#2206), do not store the 
individual files within a partition, and instead rely on listing the files in 
object storage at query time.
   
   This becomes problematic when interacting with data that either:
   
   * Is not partitioned in the way that Hive expects
   * Has been rewritten, leaving behind Parquet files that no longer form part of the 
most recent snapshot (e.g. Delta Lake / IOx)
   
   **Describe the solution you'd like**
   
   Much like we currently support a FileFormat of CSV or Parquet, I would like 
to support a FileFormat of `SymlinkTextInputFormat`. This is just a 
newline-delimited list of files, stored in object storage alongside a table or 
partition.
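   
   As a rough illustration of how little parsing is involved, here is a minimal 
sketch of reading such a manifest. It assumes one target path or URI per line 
and that blank lines can be ignored; `parse_symlink_manifest` is a hypothetical 
helper, not an existing DataFusion API.

```rust
/// Parse the contents of a symlink manifest into a list of target paths.
///
/// Assumptions (not confirmed against the Hive implementation): one path or
/// URI per line, surrounding whitespace is insignificant, and blank lines
/// are skipped.
fn parse_symlink_manifest(contents: &str) -> Vec<String> {
    contents
        .lines() // splits on both "\n" and "\r\n"
        .map(str::trim)
        .filter(|line| !line.is_empty())
        .map(str::to_owned)
        .collect()
}
```

   The resulting list would then take the place of the object store listing 
that is currently used to discover the files in a table or partition.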
   
   The best documentation for this functionality I can find is 
[here](https://athena.guide/articles/stitching-tables-with-symlinktextinputformat/),
 and there is documentation 
[here](https://docs.delta.io/latest/presto-integration.html) on how it is used 
to enable inter-operation between Presto and Delta Lake. 
   
   *I'm not entirely sure how the query engine determines the format of the 
symlink targets, but I guess it must use the file suffix??*
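   
   If the suffix guess above is right, format detection could look something 
like the sketch below. This only illustrates that guess and is not a 
description of what Hive or Presto actually do; `TargetFormat` and 
`infer_format_from_suffix` are hypothetical names.

```rust
use std::path::Path;

/// Formats a symlink target might resolve to (hypothetical enum).
#[derive(Debug, PartialEq, Eq)]
enum TargetFormat {
    Parquet,
    Csv,
}

/// Guess the format of a symlink target from its file suffix.
/// Returns `None` when there is no suffix or it is not recognised.
fn infer_format_from_suffix(target: &str) -> Option<TargetFormat> {
    let ext = Path::new(target).extension()?.to_str()?;
    match ext.to_ascii_lowercase().as_str() {
        "parquet" => Some(TargetFormat::Parquet),
        "csv" => Some(TargetFormat::Csv),
        _ => None,
    }
}
```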
   
   **Describe alternatives you've considered**
   
   We could choose not to support this.
   
   **Additional context**
   
   I am not hugely familiar with the precise inner workings of the Hive 
ecosystem, as I've only interacted with tooling that uses it under the hood. I 
could therefore be mistaken on some aspects; if so, please feel free to correct 
me :smile: 
   

