westonpace commented on code in PR #13959:
URL: https://github.com/apache/arrow/pull/13959#discussion_r955239446
##########
cpp/src/arrow/dataset/file_base.h:
##########
@@ -350,23 +351,25 @@ class ARROW_DS_EXPORT FileWriter {
/// \brief Options for writing a dataset.
struct ARROW_DS_EXPORT FileSystemDatasetWriteOptions {
/// Options for individual fragment writing.
- std::shared_ptr<FileWriteOptions> file_write_options;
+ std::shared_ptr<FileWriteOptions> file_write_options =
+ CsvFileFormat().DefaultWriteOptions();
Review Comment:
I think parquet is probably the best default format for writing datasets, as
it will generally produce smaller files on disk. Although I could be convinced
that IPC is better since, as you point out, parquet is also an optional
component. I very much agree the default should not be CSV.
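To make the discussion concrete, here is a minimal self-contained sketch of the pattern being reviewed: a struct member defaulted to a format's write options, with a Parquet-style format substituted for CSV. The classes here (`ParquetLikeFormat`, etc.) are illustrative mocks, not the real Arrow types.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Illustrative mocks (not the actual Arrow classes) of the pattern under
// discussion: defaulting file_write_options via a format's DefaultWriteOptions.
struct FileWriteOptions {
  std::string format_name;
};

struct FileFormat {
  virtual ~FileFormat() = default;
  virtual std::shared_ptr<FileWriteOptions> DefaultWriteOptions() const = 0;
};

// Stands in for a Parquet-style format, per the review suggestion.
struct ParquetLikeFormat : FileFormat {
  std::shared_ptr<FileWriteOptions> DefaultWriteOptions() const override {
    return std::make_shared<FileWriteOptions>(FileWriteOptions{"parquet"});
  }
};

struct WriteOptions {
  // Default member initializer, as in the diff, but pointing at a
  // Parquet-style format instead of CSV.
  std::shared_ptr<FileWriteOptions> file_write_options =
      ParquetLikeFormat().DefaultWriteOptions();
};
```

Callers who care about the output format should still set `file_write_options` explicitly rather than rely on whichever default is chosen.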
##########
cpp/src/arrow/dataset/file_base.h:
##########
@@ -350,23 +351,25 @@ class ARROW_DS_EXPORT FileWriter {
/// \brief Options for writing a dataset.
struct ARROW_DS_EXPORT FileSystemDatasetWriteOptions {
/// Options for individual fragment writing.
- std::shared_ptr<FileWriteOptions> file_write_options;
+ std::shared_ptr<FileWriteOptions> file_write_options =
+ CsvFileFormat().DefaultWriteOptions();
/// FileSystem into which a dataset will be written.
- std::shared_ptr<fs::FileSystem> filesystem;
+ std::shared_ptr<fs::FileSystem> filesystem =
+ std::make_shared<arrow::fs::LocalFileSystem>();
/// Root directory into which the dataset will be written.
std::string base_dir;
/// Partitioning used to generate fragment paths.
- std::shared_ptr<Partitioning> partitioning;
+ std::shared_ptr<Partitioning> partitioning = Partitioning::Default();
/// Maximum number of partitions any batch may be written into, default is 1K.
int max_partitions = 1024;
/// Template string used to generate fragment basenames.
/// {i} will be replaced by an auto incremented integer.
- std::string basename_template;
+ std::string basename_template = "data_{i}.arrow";
Review Comment:
The extension should be based on the format. We should add a `const
std::string& default_extension()` method to `FileFormat` and default this
field to the empty string. Then, in the dataset writer, if the template is
empty, we should use `"part-{i}." + format.default_extension()`. This mimics
what is done in Python (and we could probably remove the Python logic too).
We should also update the docstring for this field to reflect the new
behavior.
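A sketch of the proposed fallback, with a hypothetical free function standing in for whatever the dataset writer would actually do (the function name and placement are assumptions, not the real Arrow API):

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: derive the basename template the dataset writer
// should use, given the user-supplied template (possibly empty) and the
// format's default extension.
std::string ResolveBasenameTemplate(const std::string& basename_template,
                                    const std::string& default_extension) {
  // If the user left the template empty, fall back to a format-derived
  // name, mirroring the existing Python behavior.
  if (basename_template.empty()) {
    return "part-{i}." + default_extension;
  }
  // Otherwise honor the user's template verbatim.
  return basename_template;
}
```

With this in place, a Parquet format would yield `"part-{i}.parquet"` by default while an explicit template like `"data_{i}.arrow"` would pass through untouched.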
##########
cpp/src/arrow/dataset/file_base.h:
##########
@@ -386,7 +389,7 @@ struct ARROW_DS_EXPORT FileSystemDatasetWriteOptions {
/// and only write the row groups to the disk when sufficient rows have accumulated.
/// The final row group size may be less than this value and other options such as
/// `max_open_files` or `max_rows_per_file` lead to smaller row group sizes.
- uint64_t min_rows_per_group = 0;
+ uint64_t min_rows_per_group = 10;
Review Comment:
If partitioning, it is very possible to end up with tiny row groups. For
example, if we partition by year and a batch comes in with 1 million rows
spread across 1000 years, we would end up with 1000 row groups of 1000 rows
each, which is undesirable. However, the default here should be 1 << 20
(1Mi), not 10.
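A toy sketch of how a `min_rows_per_group` threshold batches small writes into larger row groups (this buffer class is illustrative only, not the actual dataset writer):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative mock of min_rows_per_group behavior: rows accumulate in a
// buffer and a row group is only flushed once the threshold is reached.
struct RowGroupBuffer {
  uint64_t min_rows_per_group;
  uint64_t buffered_rows = 0;
  std::vector<uint64_t> flushed_group_sizes;

  void Append(uint64_t num_rows) {
    buffered_rows += num_rows;
    if (buffered_rows >= min_rows_per_group) {
      // Threshold met: write out one row group of everything buffered.
      flushed_group_sizes.push_back(buffered_rows);
      buffered_rows = 0;
    }
  }
};
```

With a 1 << 20 threshold, a thousand 1000-row slices (the per-year partition case above) stay buffered into one eventual large group, whereas a threshold of 10 would flush each tiny slice as its own row group.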
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]