westonpace commented on pull request #10628:
URL: https://github.com/apache/arrow/pull/10628#issuecomment-871209883
This PR now includes #10619. Some notes for review:
* Ben and I took parallel approaches to this. Ben's approach was to mirror
the C++ API and create a FileWriter that wraps CFileWriter. My approach was to
create a WrittenFile class, which is just the path & metadata (if present), and
expose it through the `file_visitor` callback (see the first sketch after this
list). I'm happy to switch if we feel the other approach is better. My rationale
was "FileWriter is an internal class; it's best to hide the concept and only
expose what is needed."
* The existing `metadata_collector` is a bit clunky when working with
partitioned datasets: the `_metadata` file does not contain the partition
columns. This does appear to be the intent (with `_common_metadata`, if it
exists, containing the full schema), but without a working Spark/Hadoop setup I
can't be completely certain (see the second sketch below for the collector
pattern in question).
* The existing tests for `metadata_collector` called `write_dataset` on
the same directory multiple times and expected multiple files to be created
(since the legacy writer uses a GUID for naming). This seems somewhat related
to ARROW-12358. I updated the test to call `write_dataset` once with a
partition column instead.
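
To make the `file_visitor` idea concrete, here is a minimal sketch of how I'd expect it to be called. The dataset path, column names, and partitioning are illustrative, and the WrittenFile object is assumed to expose just `path` and `metadata` (None for formats without per-file metadata):

```python
import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({"year": [2020, 2020, 2021], "value": [1.0, 2.0, 3.0]})

# Collect the path and (for parquet) the FileMetaData of every file written.
visited = []

def file_visitor(written_file):
    # written_file.path is the full path of the written file;
    # written_file.metadata is the parquet FileMetaData, or None for
    # formats that do not produce per-file metadata.
    visited.append((written_file.path, written_file.metadata))

ds.write_dataset(
    table,
    "/tmp/visited_dataset",
    format="parquet",
    partitioning=ds.partitioning(pa.schema([("year", pa.int64())]), flavor="hive"),
    file_visitor=file_visitor,
)
```

The caller ends up with everything `metadata_collector` would have gathered, plus the file paths, without ever touching a FileWriter.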
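
For comparison, my understanding of the existing `metadata_collector` pattern for producing the Spark-style `_metadata` / `_common_metadata` pair is roughly the following (paths and column names are illustrative; note that the collected per-file metadata excludes the partition column, so the `_metadata` schema has to exclude it as well):

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"year": [2020, 2020, 2021], "value": [1.0, 2.0, 3.0]})

metadata_collector = []
pq.write_to_dataset(
    table,
    root_path="/tmp/collected_dataset",
    partition_cols=["year"],
    metadata_collector=metadata_collector,
)

# The collected metadata describes the physical files, which no longer
# contain the partition column, so _metadata is written with the schema
# minus "year"; _common_metadata can carry the full dataset schema.
pq.write_metadata(
    table.select(["value"]).schema,
    "/tmp/collected_dataset/_metadata",
    metadata_collector=metadata_collector,
)
pq.write_metadata(table.schema, "/tmp/collected_dataset/_common_metadata")
```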