davlee1972 commented on issue #39447:
URL: https://github.com/apache/arrow/issues/39447#issuecomment-1995879572

   I managed to implement a custom partitioning class with a "scan" function, which works for dataset reads, but `dataset.write_dataset()` only works with one of the three predefined "flavors".
   
   There is a `file_visitor()` callback that can be used to rename files after they are written, but it is very hacky to use, especially if you want to replace existing files.
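   To illustrate the hack: `pyarrow.dataset.write_dataset()` accepts a `file_visitor` callback that receives each written file's path, so renaming has to be done after the fact. Below is a minimal stdlib-only sketch of such a visitor; the `make_renaming_visitor` helper, its template syntax, and the assumption that partition values can be recovered from hive-style `key=value` path segments are all my own illustration, not a pyarrow API.

   ```python
   import os
   import re

   def make_renaming_visitor(template):
       """Build a file_visitor callback that renames each written file.

       `template` may reference partition keys that appear in the file's
       directory path as hive-style `key=value` segments, plus `{i}` for
       the part number recovered from the default "part-{i}" basename.
       (Illustrative sketch only; not part of pyarrow.)
       """
       def visitor(written_file):
           path = written_file.path
           dirname, basename = os.path.split(path)
           # Collect hive-style key=value segments from the directory path.
           parts = dict(
               seg.split("=", 1) for seg in dirname.split(os.sep) if "=" in seg
           )
           # Recover the part number from the default "part-{i}" basename.
           m = re.search(r"part-(\d+)", basename)
           parts["i"] = m.group(1) if m else "0"
           new_name = template.format(**parts)
           os.rename(path, os.path.join(dirname, new_name))
       return visitor
   ```

   The awkward parts are visible right in the sketch: the visitor has to reverse-engineer the partition values from the path, and a plain `os.rename` does not handle replacing existing files or non-local filesystems.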
   
   Implementing a custom partitioning class that works in conjunction with hive partitioning is also a bit awkward.
   
   I think a proper implementation would split directory partitioning and file partitioning into separate configurations.
   
   Hive and directory partitioning would be used for subdirectories; filename partitioning with a filename template would be used for files, e.g.:
   
   `filename_template = "fx_rates.{fx_currency}_{fx_year}_{fx_month}.part-{i}.parquet"`
   
   The default would be `filename_template = "{fx_currency}_{fx_year}_{fx_month}_" + basename_template`.
   
   Also, `{i}` should be optional; in practice it is only needed when a partition's parquet output exceeds 128 MB and must be split into multiple files.
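   The proposed template expansion could be as simple as the following sketch. The `expand_filename` helper and its rule for an optional `{i}` (drop the `part-{i}` segment when no part counter is supplied) are hypothetical, just to pin down the semantics being suggested:

   ```python
   import re

   def expand_filename(template, values, part=None):
       """Expand a hypothetical filename_template (sketch of the proposal).

       `{i}` is optional: when the caller does not supply a part number,
       the "part-{i}" segment and its leading separator are dropped,
       matching the idea that a part counter is only needed when a
       partition spans multiple files.
       """
       if part is None:
           # Remove an optional "part-{i}" segment, e.g. ".part-{i}".
           template = re.sub(r"[._-]?part-\{i\}", "", template)
       else:
           values = {**values, "i": part}
       return template.format(**values)
   ```

   With the template above, a single-file partition would come out as `fx_rates.EUR_2024_03.parquet`, and a multi-file one as `fx_rates.EUR_2024_03.part-0.parquet`, `...part-1.parquet`, and so on.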
   
   We have many existing parquet files with naming conventions that do not conform to any of the currently available flavors.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
