westonpace edited a comment on pull request #10628:
URL: https://github.com/apache/arrow/pull/10628#issuecomment-871209883


   This PR now includes #10619.  Some notes for review:
   
    * Ben and I took parallel approaches to this.  Ben's approach was to mirror 
the C++ API and create a FileWriter that wraps CFileWriter.  My approach was to 
create a WrittenFile class, which is just the path & metadata (if present), and 
expose it via `file_visitor` (see the first sketch after this list).  I'm happy 
to switch if we feel the other approach is better.  My rationale was "FileWriter 
is an internal class; it is best to hide the concept and only expose what is 
needed."
    * ARROW-10440 added a visitor that is called right before finish is called 
on a file.  For `metadata_collector` to work I needed a visitor that is called 
right after finish is called on the file (the parquet metadata is only 
available at that point), so I added that second visitor as part of this PR.
    * The existing `metadata_collector` is a bit clunky when working with 
partitioned datasets: the _metadata file does not contain the partition columns 
(see the second sketch after this list).  This does appear to be the intent 
(with common_metadata, if it exists, containing the full schema), but without a 
working Spark/Hadoop setup I can't be completely certain.
    * The existing tests for `metadata_collector` called write_dataset on the 
same directory multiple times and expected multiple files to be created (since 
the legacy writer uses a GUID for naming).  This seems somewhat related to 
ARROW-12358.  I updated the test to call write_dataset once with a partition 
column instead.
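
   For reference, here is a rough sketch of how the `file_visitor` approach 
could look from the Python side.  Only the `file_visitor` name and the 
WrittenFile shape (path + parquet metadata) come from the description above; 
the table, output path, and hive partitioning are made up for illustration:

```python
import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({"year": [2020, 2020, 2021], "value": [1.0, 2.0, 3.0]})

# Collect (path, FileMetaData) pairs as each file is finished.
visited = []

def visitor(written_file):
    # WrittenFile is assumed to carry only the path and the parquet
    # metadata (None for formats that have no metadata).
    visited.append((written_file.path, written_file.metadata))

ds.write_dataset(
    table,
    "/tmp/example_dataset",  # illustrative output directory
    format="parquet",
    partitioning=ds.partitioning(
        pa.schema([("year", pa.int64())]), flavor="hive"
    ),
    file_visitor=visitor,
)

for path, metadata in visited:
    print(path, None if metadata is None else metadata.num_rows)
```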

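   And a minimal sketch of the existing `metadata_collector` pattern the third 
bullet refers to, assuming the documented `pyarrow.parquet` API; note that the 
schema passed to `write_metadata` has to drop the partition column because the 
collected per-file metadata does not include it (the paths and data are again 
made up):

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"year": [2020, 2020, 2021], "value": [1.0, 2.0, 3.0]})

# Collect per-file metadata while writing a partitioned dataset.
metadata_collector = []
pq.write_to_dataset(
    table,
    root_path="/tmp/legacy_dataset",  # illustrative output directory
    partition_cols=["year"],
    metadata_collector=metadata_collector,
)

# The data files do not contain the partition column, so the combined
# _metadata file cannot contain it either; the schema written here has to
# drop "year" to match the collected row groups.
pq.write_metadata(
    table.drop(["year"]).schema,
    "/tmp/legacy_dataset/_metadata",
    metadata_collector=metadata_collector,
)
```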
