[GitHub] [arrow-datafusion] timvw commented on issue #2445: ObjectStore Directory Semantics

GitBox Fri, 06 May 2022 12:14:02 -0700


timvw commented on issue #2445:
URL: 
https://github.com/apache/arrow-datafusion/issues/2445#issuecomment-1119938026


   I agree that the capabilities we actually need are rather limited 
compared to a full filesystem spec, so it makes sense not to name the 
abstraction FileSystem. Should we also define what we expect in terms of ACID 
properties?
   
   
   @alamb The globbing is mainly relevant in raw/ingestion folders... 
   
   E.g., we end up with a structure such as:
   /nyc-taxidata/input/yellow_tripdata_2021-11.csv
   /nyc-taxidata/input/yellow_tripdata_2021-12.csv
   /nyc-taxidata/input/yellow_tripdata_2022-01.csv
   /nyc-taxidata/input/green_tripdata_2021-12.csv
   /nyc-taxidata/input/green_tripdata_2022-01.csv
   /nyc-taxidata/input/green_tripdata_2022-02.csv
   
   In a typical job we would then process and prepare the data for consumption:
   /nyc-taxidata/accepted/yellow_tripdata/year=2022/month=1/blah.parquet
   /nyc-taxidata/accepted/green_tripdata/year=2022/month=1/blah.parquet
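
   As a rough illustration of the mapping above, here is a minimal Python 
sketch that derives the partitioned output prefix from an input key. The 
regular expression and the function name `accepted_prefix` are assumptions 
made for this example; they are not part of DataFusion or of our actual job.

```python
import re
from typing import Optional

# Hypothetical pattern for the input keys listed above:
#   /nyc-taxidata/input/<dataset>_<YYYY>-<MM>.csv
_INPUT_KEY = re.compile(
    r"^/nyc-taxidata/input/"
    r"(?P<dataset>[a-z]+_tripdata)_(?P<year>\d{4})-(?P<month>\d{2})\.csv$"
)


def accepted_prefix(input_key: str) -> Optional[str]:
    """Return the year=/month= output prefix for an input key, or None if it does not match."""
    match = _INPUT_KEY.match(input_key)
    if match is None:
        return None
    # int() drops the leading zero so that "01" becomes month=1, as in the layout above.
    year = int(match.group("year"))
    month = int(match.group("month"))
    return f"/nyc-taxidata/accepted/{match.group('dataset')}/year={year}/month={month}/"
```

   The file name under the prefix (blah.parquet above) is left to the writer.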
   
   I don't need access to all sorts of key filters (compared to all the key 
filters in a system such as [HBase](https://hbase.apache.org/2.3/apidocs/index.html)), 
but globbing is not something I would push back to the end user. (In Hadoop this 
is also supported by the alternative (s3, azure) Hadoop filesystem implementations.)
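
   To make the request concrete, here is a minimal sketch of glob matching 
applied to a flat listing of object keys, using Python's standard `fnmatch` 
module. The helper `glob_keys` is hypothetical and only filters an 
already-listed set of keys on the client side; it is not the ObjectStore API 
and says nothing about how a store would implement this efficiently.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List


def glob_keys(keys: Iterable[str], pattern: str) -> List[str]:
    """Keep the keys that match a shell-style pattern, preserving input order.

    Note: with fnmatch, `*` also matches `/`, so a pattern is not confined
    to a single "directory" level unless the caller writes it that way.
    """
    return [key for key in keys if fnmatchcase(key, pattern)]


# The input keys from the example layout above.
keys = [
    "/nyc-taxidata/input/yellow_tripdata_2021-11.csv",
    "/nyc-taxidata/input/yellow_tripdata_2021-12.csv",
    "/nyc-taxidata/input/yellow_tripdata_2022-01.csv",
    "/nyc-taxidata/input/green_tripdata_2021-12.csv",
    "/nyc-taxidata/input/green_tripdata_2022-01.csv",
    "/nyc-taxidata/input/green_tripdata_2022-02.csv",
]

# Select only the yellow taxi files from the ingestion folder.
yellow = glob_keys(keys, "/nyc-taxidata/input/yellow_tripdata_*.csv")
```

   A prefix listing alone could serve the first pattern, but a pattern such 
as `/nyc-taxidata/input/*_2022-01.csv` needs real wildcard matching.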
   
   
   


