davlee1972 commented on issue #44765:
URL: https://github.com/apache/arrow/issues/44765#issuecomment-2827489274
Yup, I started using partition_base_dir, which fixes my problem. I took it
one step further and auto-populate partition_base_dir when just a list of files
is used to initialize pyarrow.dataset.dataset(). I'm also tracking this
base_dir separately, since there is a chance I may want to WRITE new files to
the dataset and will then need to refresh the dataset from the base_dir
instead of the original list of files.
```
# If there is no base_dir, derive one from the files themselves.
# (os and pyarrow.fs imports added for completeness; in context this
# runs inside a method, so the self.* attributes already exist.)
import os

from pyarrow import fs

if files and not base_dir:
    if self.partitioning_flavor in ("directory", "hive"):
        # One directory level per partition column.
        partition_col_count = len(self.partitioning["schema"])
        # Map each actual file to its number of path segments.
        file_paths_dict = {
            file_path: len(file_path.split("/"))
            for file_path in files
            if self.filesystem.get_file_info(file_path).type
            == fs.FileType.File
        }
        if file_paths_dict:
            # Take the most deeply nested file...
            file_name = max(file_paths_dict, key=file_paths_dict.get)
            # ...and drop the filename plus one segment per partition
            # column to recover the base directory.
            base_dir = "/".join(
                file_name.split("/")[: -partition_col_count - 1]
            )
    else:
        # No partitioning flavor: fall back to the common directory
        # shared by all the files.
        common_path = os.path.commonpath(files)
        if common_path:
            base_dir = common_path
```
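
For anyone landing here later, a minimal sketch of how the derived base_dir can be wired back in. The `files` and `base_dir` variables and the hive schema below are illustrative, but partition_base_dir is an actual keyword of pyarrow.dataset.dataset() when the source is a list of files:

```
# Illustrative sketch (not from the snippet above): pass the computed
# base_dir as partition_base_dir so partition fields are parsed from
# the path segments below it.
import pyarrow as pa
import pyarrow.dataset as ds

part = ds.partitioning(
    pa.schema([("year", pa.int32()), ("month", pa.int32())]), flavor="hive"
)

# Initial load from the explicit file list.
dataset = ds.dataset(
    files,
    format="parquet",
    partitioning=part,
    partition_base_dir=base_dir,
)

# Later, after writing new files, rediscover from base_dir so the new
# files are picked up instead of reusing the stale file list.
dataset = ds.dataset(base_dir, format="parquet", partitioning=part)
```

Rediscovering from base_dir rather than the original file list is what makes the refresh after a write actually see the new files.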