Mister-Meeseeks commented on a change in pull request #23830:
[SPARK-26935][SQL]Skip DataFrameReader's CSV first line scan when not used
URL: https://github.com/apache/spark/pull/23830#discussion_r258608085
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
##########
@@ -508,7 +508,12 @@ class DataFrameReader private[sql](sparkSession:
SparkSession) extends Logging {
sparkSession.sessionState.conf.sessionLocalTimeZone)
val filteredLines: Dataset[String] =
CSVUtils.filterCommentAndEmpty(csvDataset, parsedOptions)
- val maybeFirstLine: Option[String] = filteredLines.take(1).headOption
+ val maybeFirstLine: Option[String] =
+ if (userSpecifiedSchema.isEmpty || parsedOptions.headerFlag) {
+ filteredLines.take(1).headOption
+ } else {
+ None
Review comment:
That's a good point. From a software engineering standpoint it makes more
sense for the logic to be owned by the CSV specific objects, not the
`DataFrameReader` class. (Also to add first line is also consumed by
`CSVUtils.filterHeaderLine `)
~~What if `maybeFirstLine` was a lazy type? The advantage is that the three
downstream consumers could all internally define their logic. We don't have to
worry about it inside `DataFrameReader`. If none of the consumers end up using
the first line, then it doesn't get collected. If they do, then it gets
collected once.~~
*[Edit: Nevermind, this approach won't work with Scala functions]*
What if we created a simple `CSVHeaderBox` class that contains a reference
to the Dataset. It doesn't call take() on the constructor. But the first time
the `unbox` function is called, it collects the header then stores it inside
the class. The advantage is the logic of whether to consume or not consume the
first line is deferred to the underlying consumers. And we still avoid
wastefully collecting more than once (or at all if not needed).
The downside is that everywhere where the header is passed as a string or
Option[string], would instead take a reference to `CSVHeaderBox`. In addition
to adding the new class, it will also require changing the type signature for
the following methods:
* CSVUtils::filterHeaderLine
* CSVHeaderChecker::checkHeaderColumnNames
* CSVDataSource::inferFromDataset
So the PR will result in a larger footprint. If you think this is a good
approach, let me know and I can make a commit for review.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
With regards,
Apache Git Services
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]