[GitHub] [arrow] sanjibansg commented on pull request #12530: ARROW-14612: [C++] Support for filename-based partitioning

GitBox Tue, 08 Mar 2022 09:46:56 -0800


sanjibansg commented on pull request #12530:
URL: https://github.com/apache/arrow/pull/12530#issuecomment-1062040407



   > @westonpace asked me to review this as I opened the ticket originally 
based on a user-request. My main criteria for "does this do what the original 
user had in mind" is "can we **read** from a directory of files in which 
sections of the filenames are variables we want to analyse in our data" - and 
it looks like this both does that and enables us to write these files as well, 
which is really cool!
   > 
   > One thing I do want to check though - if I have a load of files called, 
e.g. `foo_bar_whatever_month_year.csv`, is there a way I can just have `month` 
and `year` as variables without the `foo`, `bar`, and `whatever` or would I 
have to read them in as variables and then just drop those columns later?
   
   Yes, we would have to read them in as variables and then drop those columns 
later. Currently, with this PR, the entire filename (excluding the final 
segment, e.g. `part-0.parquet` or `chunk-0.parquet`) is expected to consist of 
the partitioning values separated by `_`. In the future, we may need to add 
support for a custom separator instead of only the underscore.
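
   To make the behaviour described above concrete, here is a minimal sketch in 
plain Python. It is not the code from this PR and does not use the Arrow API: 
the helpers `parse_filename_partition` and `keep_fields`, the field names, and 
the example filename are all hypothetical and only illustrate the 
"read everything, then drop" workflow.

```python
def parse_filename_partition(filename, field_names, separator="_"):
    """Map the separator-delimited prefix of a filename to field names.

    Hypothetical helper: mirrors the rule described above, where every
    segment except the last one is treated as a partition value.
    """
    # Discard the final segment (e.g. "part-0.parquet").
    *values, _last_segment = filename.split(separator)
    if len(values) != len(field_names):
        raise ValueError(
            f"expected {len(field_names)} partition values, got {len(values)}"
        )
    return dict(zip(field_names, values))


def keep_fields(record, wanted):
    """Drop every partition field that is not listed in `wanted`."""
    return {name: record[name] for name in wanted}


# All five segments must be declared, even the ones we do not care about.
fields = ["a", "b", "c", "month", "year"]
record = parse_filename_partition(
    "foo_bar_whatever_03_2022_part-0.parquet", fields
)

# Keep only the variables of interest afterwards.
selected = keep_fields(record, ["month", "year"])
print(selected)
```

   The values come back as strings here; any type conversion is left out of 
the sketch.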


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
