[ 
https://issues.apache.org/jira/browse/ARROW-10440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17294650#comment-17294650
 ] 

Lance Dacey commented on ARROW-10440:
-------------------------------------

Will this change allow us to get a list of the blob paths which were saved as 
file fragments?

As a workaround, I am currently using fs.glob() to find the files which were 
just saved with a specific basename_template.


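{code:python}
    # Match files under output_path whose names start with this write's base_template.
    pattern = filesystem.sep.join([output_path, f"**{base_template}-*"])
    files = filesystem.glob(
        pattern,
        detail=False,  # fsspec spells this keyword "detail"; paths only
        invalidate_cache=True,
    )
{code}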
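
The workaround only works if each write uses a prefix that no earlier write 
used. A minimal sketch of that idea using only the standard library; the helper 
names (make_basename_template, select_written), the "run42" prefix, and the 
"<prefix>-{i}.parquet" layout are assumptions for illustration, and fnmatch 
stands in for the filesystem's glob():

```python
import fnmatch
import uuid


def make_basename_template(prefix=None):
    """Return (prefix, template); the template keeps a literal "{i}" placeholder."""
    # Hypothetical helper: a random hex prefix makes each write distinguishable.
    prefix = prefix or uuid.uuid4().hex
    return prefix, prefix + "-{i}.parquet"


def select_written(paths, prefix):
    """Keep only the paths whose file name starts with this write's prefix."""
    return [
        p for p in paths
        if fnmatch.fnmatch(p.rsplit("/", 1)[-1], prefix + "-*")
    ]


prefix, template = make_basename_template("run42")

# Made-up listing; in practice this would come from the filesystem.
listing = [
    "container/data/date=2021-03-01/run42-0.parquet",
    "container/data/date=2021-03-01/older-0.parquet",
    "container/data/date=2021-03-02/run42-1.parquet",
]
written = select_written(listing, prefix)
```

The template string would be passed as basename_template when writing, and the 
prefix kept around to find the files afterwards.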

However, with the legacy write_to_dataset(), I am able to use the 
metadata_collector and then create a list of the file paths like this, which is 
more convenient (I do not have to worry about generating unique/predictable 
basename templates).


{code:python}
    files = []
    for piece in collector:
        relative_path = piece.row_group(0).column(0).file_path
        files.append(filesystem.sep.join([output_path, relative_path]))
{code}


I need the list of blobs to pass along to other Airflow tasks, which either 
read the files as a dataset or use the paths to generate a list of filters.
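
To illustrate the second case, here is a rough sketch of turning written paths 
into filters. It assumes hive-style "key=value" partition directories, which 
the comment above does not actually state; partition_filters and the sample 
paths are made up for the example. The result has the list-of-lists (DNF) shape 
accepted by the filters argument of the pyarrow.parquet readers, with all 
values left as strings:

```python
def partition_filters(paths):
    """Build DNF filters: one AND-group per distinct partition directory."""
    groups = []
    for path in paths:
        directory_parts = path.split("/")[:-1]  # drop the file name
        group = [
            (key, "=", value)
            for key, sep, value in (part.partition("=") for part in directory_parts)
            if sep  # skip path segments that are not key=value
        ]
        # Several files can share a partition directory; keep each group once.
        if group and group not in groups:
            groups.append(group)
    return groups


# Made-up paths standing in for the list of written blobs.
paths = [
    "container/data/date=2021-03-01/run42-0.parquet",
    "container/data/date=2021-03-01/run42-1.parquet",
    "container/data/date=2021-03-02/region=eu/run42-2.parquet",
]
filters = partition_filters(paths)
```

If the partition columns are typed (dates, integers), the string values would 
need to be cast before being used as filters.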


> [C++][Dataset][Python] Add a callback to visit file writers just before 
> Finish()
> --------------------------------------------------------------------------------
>
>                 Key: ARROW-10440
>                 URL: https://issues.apache.org/jira/browse/ARROW-10440
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>    Affects Versions: 2.0.0
>            Reporter: Ben Kietzman
>            Assignee: Ben Kietzman
>            Priority: Major
>             Fix For: 4.0.0
>
>
> This will fill the role of (for example) {{metadata_collector}} or allow 
> stats to be embedded in IPC file footer metadata.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
