westonpace commented on code in PR #13959:
URL: https://github.com/apache/arrow/pull/13959#discussion_r955239446
##########
cpp/src/arrow/dataset/file_base.h:
##########
@@ -350,23 +351,25 @@ class ARROW_DS_EXPORT FileWriter {
/// \brief Options for writing a dataset.
struct ARROW_DS_EXPORT FileSystemDatasetWriteOptions {
/// Options for individual fragment writing.
- std::shared_ptr<FileWriteOptions> file_write_options;
+ std::shared_ptr<FileWriteOptions> file_write_options =
+ CsvFileFormat().DefaultWriteOptions();
Review Comment:
I think parquet is probably the best default format for writing datasets, as
it will generally produce smaller files on disk. Although I could be convinced
that IPC is better since, as you point out, parquet is also an optional
component. I very much agree the default should not be CSV.
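To make the discussion concrete, here is a minimal self-contained sketch of the pattern being reviewed: a struct member defaulted to a format's write options, with a Parquet-style format substituted for CSV. The classes here (`ParquetLikeFormat`, etc.) are illustrative mocks, not the real Arrow types.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Illustrative mocks (not the actual Arrow classes) of the pattern under
// discussion: defaulting file_write_options via a format's DefaultWriteOptions.
struct FileWriteOptions {
  std::string format_name;
};

struct FileFormat {
  virtual ~FileFormat() = default;
  virtual std::shared_ptr<FileWriteOptions> DefaultWriteOptions() const = 0;
};

// Stands in for a Parquet-style format, per the review suggestion.
struct ParquetLikeFormat : FileFormat {
  std::shared_ptr<FileWriteOptions> DefaultWriteOptions() const override {
    return std::make_shared<FileWriteOptions>(FileWriteOptions{"parquet"});
  }
};

struct WriteOptions {
  // Default member initializer, as in the diff, but pointing at a
  // Parquet-style format instead of CSV.
  std::shared_ptr<FileWriteOptions> file_write_options =
      ParquetLikeFormat().DefaultWriteOptions();
};
```

Callers who care about the output format should still set `file_write_options` explicitly rather than rely on whichever default is chosen.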
##########
cpp/src/arrow/dataset/file_base.h:
##########
@@ -350,23 +351,25 @@ class ARROW_DS_EXPORT FileWriter {
/// \brief Options for writing a dataset.
struct ARROW_DS_EXPORT FileSystemDatasetWriteOptions {
/// Options for individual fragment writing.
- std::shared_ptr<FileWriteOptions> file_write_options;
+ std::shared_ptr<FileWriteOptions> file_write_options =
+ CsvFileFormat().DefaultWriteOptions();
/// FileSystem into which a dataset will be written.
- std::shared_ptr<fs::FileSystem> filesystem;
+ std::shared_ptr<fs::FileSystem> filesystem =
+ std::make_shared<arrow::fs::LocalFileSystem>();
/// Root directory into which the dataset will be written.
std::string base_dir;
/// Partitioning used to generate fragment paths.
- std::shared_ptr<Partitioning> partitioning;
+ std::shared_ptr<Partitioning> partitioning = Partitioning::Default();
/// Maximum number of partitions any batch may be written into, default is 1K.
int max_partitions = 1024;
/// Template string used to generate fragment basenames.
/// {i} will be replaced by an auto incremented integer.
- std::string basename_template;
+ std::string basename_template = "data_{i}.arrow";
Review Comment:
The extension should be based on the format. We should add a `const
std::string& default_extension()` method to `FileFormat` and default this
field to the empty string. Then, in the dataset writer, if the template is
empty, we should use `"part-{i}." + format.default_extension()`. This mimics
what is done in Python (and we could probably remove the Python logic too).
We should also update the docstring for this field to reflect the new
behavior.
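A sketch of the proposed fallback, with a hypothetical free function standing in for whatever the dataset writer would actually do (the function name and placement are assumptions, not the real Arrow API):

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: derive the basename template the dataset writer
// should use, given the user-supplied template (possibly empty) and the
// format's default extension.
std::string ResolveBasenameTemplate(const std::string& basename_template,
                                    const std::string& default_extension) {
  // If the user left the template empty, fall back to a format-derived
  // name, mirroring the existing Python behavior.
  if (basename_template.empty()) {
    return "part-{i}." + default_extension;
  }
  // Otherwise honor the user's template verbatim.
  return basename_template;
}
```

With this in place, a Parquet format would yield `"part-{i}.parquet"` by default while an explicit template like `"data_{i}.arrow"` would pass through untouched.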
##########
cpp/src/arrow/dataset/file_base.h:
##########
@@ -386,7 +389,7 @@ struct ARROW_DS_EXPORT FileSystemDatasetWriteOptions {
/// and only write the row groups to the disk when sufficient rows have accumulated.
/// The final row group size may be less than this value and other options such as
/// `max_open_files` or `max_rows_per_file` lead to smaller row group sizes.
- uint64_t min_rows_per_group = 0;
+ uint64_t min_rows_per_group = 10;
Review Comment:
If partitioning, it is very possible to end up with tiny row groups. For
example, if we partition by year and a batch comes in with 1 million rows
spread across 1000 years, we would end up with 1000 row groups of 1000 rows
each, which is undesirable. However, the default here should be 1 << 20
(1Mi), not 10.
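A toy sketch of how a `min_rows_per_group` threshold batches small writes into larger row groups (this buffer class is illustrative only, not the actual dataset writer):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative mock of min_rows_per_group behavior: rows accumulate in a
// buffer and a row group is only flushed once the threshold is reached.
struct RowGroupBuffer {
  uint64_t min_rows_per_group;
  uint64_t buffered_rows = 0;
  std::vector<uint64_t> flushed_group_sizes;

  void Append(uint64_t num_rows) {
    buffered_rows += num_rows;
    if (buffered_rows >= min_rows_per_group) {
      // Threshold met: write out one row group of everything buffered.
      flushed_group_sizes.push_back(buffered_rows);
      buffered_rows = 0;
    }
  }
};
```

With a 1 << 20 threshold, a thousand 1000-row slices (the per-year partition case above) stay buffered into one eventual large group, whereas a threshold of 10 would flush each tiny slice as its own row group.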
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]