westonpace commented on pull request #10628:
URL: https://github.com/apache/arrow/pull/10628#issuecomment-871209883
This PR now includes #10619. Some notes for review:
* Ben and I took parallel approaches to this. Ben's approach was to mirror
the C++ API and create a FileWriter that wraps CFileWriter. My approach was to
create a WrittenFile class, which is just the path & metadata (if present), and
expose it through the `file_visitor` callback (see the first sketch after this
list). I'm happy to switch if we feel the other approach is better. My rationale
was "FileWriter is an internal class; it's best to hide the concept and only
expose what is needed."
* The existing `metadata_collector` is a bit clunky when working with
partitioned datasets: the `_metadata` file does not contain the partition
columns. This does appear to be the intent (with `_common_metadata`, if it
exists, containing the full schema), but without a working Spark/Hadoop setup I
can't be completely certain (see the second sketch below for the collector
pattern in question).
* The existing tests for `metadata_collector` called `write_dataset` on
the same directory multiple times and expected multiple files to be created
(since the legacy writer uses a GUID for naming). This seems somewhat related
to ARROW-12358. I updated the test to call `write_dataset` once with a
partition column instead.
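
To make the `file_visitor` idea concrete, here is a minimal sketch of how I'd expect it to be called. The dataset path, column names, and partitioning are illustrative, and the WrittenFile object is assumed to expose just `path` and `metadata` (None for formats without per-file metadata):

```python
import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({"year": [2020, 2020, 2021], "value": [1.0, 2.0, 3.0]})

# Collect the path and (for parquet) the FileMetaData of every file written.
visited = []

def file_visitor(written_file):
    # written_file.path is the full path of the written file;
    # written_file.metadata is the parquet FileMetaData, or None for
    # formats that do not produce per-file metadata.
    visited.append((written_file.path, written_file.metadata))

ds.write_dataset(
    table,
    "/tmp/visited_dataset",
    format="parquet",
    partitioning=ds.partitioning(pa.schema([("year", pa.int64())]), flavor="hive"),
    file_visitor=file_visitor,
)
```

The caller ends up with everything `metadata_collector` would have gathered, plus the file paths, without ever touching a FileWriter.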
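
For comparison, my understanding of the existing `metadata_collector` pattern for producing the Spark-style `_metadata` / `_common_metadata` pair is roughly the following (paths and column names are illustrative; note that the collected per-file metadata excludes the partition column, so the `_metadata` schema has to exclude it as well):

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"year": [2020, 2020, 2021], "value": [1.0, 2.0, 3.0]})

metadata_collector = []
pq.write_to_dataset(
    table,
    root_path="/tmp/collected_dataset",
    partition_cols=["year"],
    metadata_collector=metadata_collector,
)

# The collected metadata describes the physical files, which no longer
# contain the partition column, so _metadata is written with the schema
# minus "year"; _common_metadata can carry the full dataset schema.
pq.write_metadata(
    table.select(["value"]).schema,
    "/tmp/collected_dataset/_metadata",
    metadata_collector=metadata_collector,
)
pq.write_metadata(table.schema, "/tmp/collected_dataset/_common_metadata")
```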