timvw commented on issue #2445: URL: https://github.com/apache/arrow-datafusion/issues/2445#issuecomment-1119938026
I do agree that the capabilities we actually need are rather limited (compared to a full filesystem spec), so it makes sense not to name the abstraction FileSystem. Should we also define what we expect in terms of ACID properties?

@alamb The globbing is mainly relevant in raw/ingestion folders. E.g., we end up with a structure such as:

/nyc-taxidata/input/yellow_tripdata_2021-11.csv
/nyc-taxidata/input/yellow_tripdata_2021-12.csv
/nyc-taxidata/input/yellow_tripdata_2022-01.csv
/nyc-taxidata/input/green_tripdata_2021-12.csv
/nyc-taxidata/input/green_tripdata_2022-01.csv
/nyc-taxidata/input/green_tripdata_2022-02.csv

In a typical job we would then process and prepare the data for consumption:

/nyc-taxidata/accepted/yellow_tripdata/year=2022/month=1/blah.parquet
/nyc-taxidata/accepted/green_tripdata/year=2022/month=1/blah.parquet

I don't need access to all sorts of key filters (compared to all the key filters in a system such as [HBase](https://hbase.apache.org/2.3/apidocs/index.html)), but globbing is not something I would push back to the end user. (In Hadoop, globbing is also supported by the alternative (S3, Azure) Hadoop filesystem implementations.)
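To make the globbing case concrete, here is a minimal sketch of how it could work over the input keys listed above. It is written in Python only for brevity and is not DataFusion's API: `KEYS`, `literal_prefix`, and `glob_keys` are hypothetical names, and the idea (list by the literal prefix, then filter client-side) assumes a store that only offers prefix listing. Note that `fnmatchcase` lets `*` match `/` as well, which is looser than Hadoop-style globs.

```python
from fnmatch import fnmatchcase

# Keys copied from the example input folder in the comment above.
KEYS = [
    "/nyc-taxidata/input/yellow_tripdata_2021-11.csv",
    "/nyc-taxidata/input/yellow_tripdata_2021-12.csv",
    "/nyc-taxidata/input/yellow_tripdata_2022-01.csv",
    "/nyc-taxidata/input/green_tripdata_2021-12.csv",
    "/nyc-taxidata/input/green_tripdata_2022-01.csv",
    "/nyc-taxidata/input/green_tripdata_2022-02.csv",
]


def literal_prefix(pattern):
    """Return the part of the pattern before its first wildcard character."""
    for i, ch in enumerate(pattern):
        if ch in "*?[":
            return pattern[:i]
    return pattern


def glob_keys(keys, pattern):
    """Hypothetical helper: narrow by literal prefix, then apply the glob."""
    prefix = literal_prefix(pattern)
    # Stand-in for a prefix listing call against the object store.
    listed = [k for k in keys if k.startswith(prefix)]
    # Client-side filtering with shell-style matching (case-sensitive).
    return [k for k in listed if fnmatchcase(k, pattern)]


yellow_2021 = glob_keys(KEYS, "/nyc-taxidata/input/yellow_tripdata_2021-*.csv")
january_2022 = glob_keys(KEYS, "/nyc-taxidata/input/*_tripdata_2022-01.csv")
```

With this layout, `yellow_2021` holds the two yellow 2021 files and `january_2022` holds the yellow and green files for 2022-01, so the end user writes one pattern instead of enumerating keys.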
