davlee1972 commented on issue #44765: URL: https://github.com/apache/arrow/issues/44765#issuecomment-2827489274
Yup, I started using `partition_base_dir`, which fixes my problem. I took it one step further to auto-populate `partition_base_dir` when just a list of files is used to initialize `pyarrow.dataset.dataset()`. I'm also tracking this base_dir separately, since there is the possibility I may want to WRITE new files to the dataset and will need to refresh the dataset using the base_dir instead of the original list of files.

```python
import os

from pyarrow import fs

# if there is no base_dir then set base_dir based on the files' common path
if files and not base_dir:
    # calculate base_dir if not set
    if self.partitioning_flavor and self.partitioning_flavor in [
        "directory",
        "hive",
    ]:
        # get number of partition columns
        partition_col_count = len(self.partitioning["schema"])
        # build dict with file path: number of path segments,
        # keeping only entries that are actual files
        file_paths_dict = {
            file_path: len(file_path.split("/"))
            for file_path in files
            if self.filesystem.get_file_info(file_path).type
            == fs.FileType["File"]
        }
        # find the most deeply nested file
        if file_paths_dict:
            max_dir_len = max(file_paths_dict.values())
            for k, v in file_paths_dict.items():
                if v == max_dir_len:
                    file_name = k
                    break
            # construct base_dir by stripping the partition
            # directories and the file name from its path
            base_dir = "/".join(
                file_name.split("/")[: -partition_col_count - 1]
            )
    else:
        # find common directory across files
        common_path = os.path.commonpath(files)
        if common_path:
            base_dir = common_path
```
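For reference, a minimal sketch of how the computed base_dir feeds back into dataset construction. The file paths and layout are hypothetical, but `partition_base_dir` is an actual keyword of `pyarrow.dataset.dataset()`:

```python
import pyarrow.dataset as ds

# hypothetical file list under a hive-partitioned layout
files = [
    "warehouse/trades/year=2024/month=01/part-0.parquet",
    "warehouse/trades/year=2024/month=02/part-0.parquet",
]
# base_dir as the snippet above would compute it for these paths
base_dir = "warehouse/trades"

# initial dataset from the explicit file list; partition_base_dir tells
# pyarrow where the partition directories start, so year/month are parsed
dataset = ds.dataset(
    files,
    format="parquet",
    partitioning="hive",
    partition_base_dir=base_dir,
)

# later, after writing new files into the same tree, refresh by
# rediscovering from base_dir instead of the original file list
dataset = ds.dataset(base_dir, format="parquet", partitioning="hive")
```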