davlee1972 commented on issue #44765: URL: https://github.com/apache/arrow/issues/44765#issuecomment-2827489274
Yup, I started using `partition_base_dir`, which fixes my problem. I took it one step further to auto-populate `partition_base_dir` when just a list of files is used to initialize `pyarrow.dataset.dataset()`. I'm also tracking this base_dir separately, since there is the possibility I may want to WRITE new files to the dataset and will need to refresh the dataset using the base_dir instead of the original list of files.

```python
import os

from pyarrow import fs

# if there is no base_dir then set base_dir based on the files' common path
if files and not base_dir:
    # calculate base_dir if not set
    if self.partitioning_flavor and self.partitioning_flavor in [
        "directory",
        "hive",
    ]:
        # get number of partition columns
        partition_col_count = len(self.partitioning["schema"])
        # build dict with file path: number of path segments,
        # keeping only entries that are actual files
        file_paths_dict = {
            file_path: len(file_path.split("/"))
            for file_path in files
            if self.filesystem.get_file_info(file_path).type
            == fs.FileType["File"]
        }
        # find the most deeply nested file
        if file_paths_dict:
            max_dir_len = max(file_paths_dict.values())
            for k, v in file_paths_dict.items():
                if v == max_dir_len:
                    file_name = k
                    break
            # construct base_dir by stripping the partition
            # directories and the file name from its path
            base_dir = "/".join(
                file_name.split("/")[: -partition_col_count - 1]
            )
    else:
        # find common directory across files
        common_path = os.path.commonpath(files)
        if common_path:
            base_dir = common_path
```
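For reference, a minimal sketch of how the computed base_dir feeds back into dataset construction. The file paths and layout are hypothetical, but `partition_base_dir` is an actual keyword of `pyarrow.dataset.dataset()`:

```python
import pyarrow.dataset as ds

# hypothetical file list under a hive-partitioned layout
files = [
    "warehouse/trades/year=2024/month=01/part-0.parquet",
    "warehouse/trades/year=2024/month=02/part-0.parquet",
]
# base_dir as the snippet above would compute it for these paths
base_dir = "warehouse/trades"

# initial dataset from the explicit file list; partition_base_dir tells
# pyarrow where the partition directories start, so year/month are parsed
dataset = ds.dataset(
    files,
    format="parquet",
    partitioning="hive",
    partition_base_dir=base_dir,
)

# later, after writing new files into the same tree, refresh by
# rediscovering from base_dir instead of the original file list
dataset = ds.dataset(base_dir, format="parquet", partitioning="hive")
```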