adriangb commented on code in PR #19313:
URL: https://github.com/apache/datafusion/pull/19313#discussion_r2616485834
##########
datafusion/datasource-csv/src/file_format.rs:
##########
@@ -445,11 +445,15 @@ impl FileFormat for CsvFormat {
.as_any()
.downcast_ref::<CsvSource>()
.expect("file_source should be a CsvSource");
- let source =
Arc::new(csv_source.clone().with_csv_options(csv_options));
+ let source = Arc::new(
+ csv_source
+ .clone()
+ .with_csv_options(csv_options)
+ .with_newlines_in_values(newlines_in_values),
Review Comment:
isn't newlines_in_values part of csv_options?
##########
datafusion/datasource-csv/src/source.rs:
##########
@@ -297,6 +315,11 @@ impl FileSource for CsvSource {
fn file_type(&self) -> &str {
"csv"
}
+
+ fn has_newlines_in_values(&self) -> bool {
+ self.newlines_in_values
+ }
Review Comment:
This method overlapps with `newlines_in_values`
##########
datafusion/datasource/src/file.rs:
##########
@@ -84,6 +84,19 @@ pub trait FileSource: Send + Sync {
Ok(())
}
+ /// Returns whether this file source has values that may contain newline
characters.
+ ///
+ /// This is primarily relevant for CSV files where quoted values can
contain
+ /// embedded newlines. When this returns `true`, files cannot be
repartitioned
+ /// by byte ranges because record boundaries cannot be determined by simple
+ /// newline scanning.
+ ///
+ /// The default implementation returns `false`. CSV sources should override
+ /// this method to return the appropriate value based on their
configuration.
+ fn has_newlines_in_values(&self) -> bool {
+ false
+ }
Review Comment:
We shouldn't have a method on the `FileSource` trait for this.
##########
datafusion/datasource-csv/src/source.rs:
##########
@@ -176,6 +179,21 @@ impl CsvSource {
conf.options.truncated_rows = Some(truncate_rows);
conf
}
+
+ /// Whether values may contain newline characters.
+ ///
+ /// When enabled, scanning cannot be parallelized across a single file
+ /// because newlines in values prevent determining record boundaries
+ /// by byte offset alone.
+ pub fn newlines_in_values(&self) -> bool {
+ self.newlines_in_values
+ }
+
+ /// Set whether values may contain newline characters
+ pub fn with_newlines_in_values(mut self, newlines_in_values: bool) -> Self
{
+ self.newlines_in_values = newlines_in_values;
+ self
+ }
Review Comment:
Do these not come from the CSVOptions passed to the constructor? Do we need
a separate field?
##########
datafusion/datasource-arrow/src/source.rs:
##########
@@ -460,9 +460,7 @@ impl FileSource for ArrowSource {
// Use the default trait implementation logic for file format
use datafusion_datasource::file_groups::FileGroupPartitioner;
- if config.file_compression_type.is_compressed()
- || config.new_lines_in_values
- {
+ if config.file_compression_type.is_compressed() {
Review Comment:
Why would the Arrow source care about newlines in values in the first place?
Was this just copy pasted code or something?
##########
datafusion/datasource/src/file.rs:
##########
@@ -97,7 +110,7 @@ pub trait FileSource: Send + Sync {
output_ordering: Option<LexOrdering>,
config: &FileScanConfig,
) -> Result<Option<FileScanConfig>> {
- if config.file_compression_type.is_compressed() ||
config.new_lines_in_values {
+ if config.file_compression_type.is_compressed() ||
self.has_newlines_in_values() {
Review Comment:
This does seem like a problem. Maybe the answer is for `CSVSource` to
override this? It doesn't seem like a particularly safe default.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]