[
https://issues.apache.org/jira/browse/ARROW-10440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17294650#comment-17294650
]
Lance Dacey commented on ARROW-10440:
-------------------------------------
Will this change allow us to get a list of the blob paths which were saved as
file fragments?
As a workaround, I am currently using fs.glob() to find the files which were
just saved with a specific basename_template.
{code:python}
pattern = filesystem.sep.join([output_path, f"**{base_template}-*"])
files = filesystem.glob(
    pattern,
    details=False,
    invalidate_cache=True,
)
{code}
However, with the legacy write_to_dataset(), I am able to use the
metadata_collector and then create a list of the file paths like this, which is
more convenient (I do not have to worry about generating unique/predictable
basename templates).
{code:python}
files = []
for piece in collector:
    files.append(
        filesystem.sep.join(
            [output_path, piece.row_group(0).column(0).file_path]
        )
    )
{code}
I need these lists of blobs to pass along to other Airflow tasks, which either
read them as a dataset or generate a list of filters from the paths.
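For illustration, generating filters from the paths could look roughly like the
sketch below. It assumes hive-style key=value directory names, which is not
stated above. partition_filter and filters_from_paths are hypothetical helper
names, and the example paths are made up. The result is a list of AND-groups
that are OR-ed together, the shape accepted by the filters argument of the
pyarrow Parquet readers.

```python
from typing import List, Tuple

Condition = Tuple[str, str, str]


def partition_filter(path: str, sep: str = "/") -> List[Condition]:
    """Return one AND-group of (key, "=", value) conditions for a single path.

    Assumption: partitions are encoded as hive-style `key=value` directories.
    The last segment is treated as the file name and is ignored.
    """
    conditions = []
    for segment in path.split(sep)[:-1]:
        key, found, value = segment.partition("=")
        if found and key and value:
            conditions.append((key, "=", value))
    return conditions


def filters_from_paths(paths: List[str], sep: str = "/") -> List[List[Condition]]:
    """Collect the distinct AND-groups for all paths, keeping first-seen order."""
    filters = []
    for path in paths:
        conditions = partition_filter(path, sep)
        if conditions and conditions not in filters:
            filters.append(conditions)
    return filters


# Made-up paths standing in for the `files` list built above.
files = [
    "container/dataset/year=2021/month=3/part-0.parquet",
    "container/dataset/year=2021/month=3/part-1.parquet",
    "container/dataset/year=2021/month=4/part-0.parquet",
]
filters = filters_from_paths(files)
```

The values are left as strings here. If the partition columns are read back as
integers or dates, they would need to be cast before being used as filters.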
> [C++][Dataset][Python] Add a callback to visit file writers just before
> Finish()
> --------------------------------------------------------------------------------
>
> Key: ARROW-10440
> URL: https://issues.apache.org/jira/browse/ARROW-10440
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Affects Versions: 2.0.0
> Reporter: Ben Kietzman
> Assignee: Ben Kietzman
> Priority: Major
> Fix For: 4.0.0
>
>
> This will fill the role of (for example) {{metadata_collector}} or allow
> stats to be embedded in IPC file footer metadata.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)