Mister-Meeseeks commented on a change in pull request #23830:
[SPARK-26935][SQL]Skip DataFrameReader's CSV first line scan when not used
URL: https://github.com/apache/spark/pull/23830#discussion_r258608085
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
##########
@@ -508,7 +508,12 @@ class DataFrameReader private[sql](sparkSession:
SparkSession) extends Logging {
sparkSession.sessionState.conf.sessionLocalTimeZone)
val filteredLines: Dataset[String] =
CSVUtils.filterCommentAndEmpty(csvDataset, parsedOptions)
- val maybeFirstLine: Option[String] = filteredLines.take(1).headOption
+ val maybeFirstLine: Option[String] =
+ if (userSpecifiedSchema.isEmpty || parsedOptions.headerFlag) {
+ filteredLines.take(1).headOption
+ } else {
+ None
Review comment:
That's a good point. From a software engineering standpoint it makes more
sense for the logic to be owned by the CSV specific objects, not the
`DataFrameReader` class. (Also to add first line is also consumed by
`CSVUtils.filterHeaderLine `)
What if `maybeFirstLine` was a lazy type? The advantage is that the three
downstream consumers could all internally define their logic. We don't have to
worry about it inside `DataFrameReader`. If none of the consumers end up using
the first line, then it doesn't get collected. If they do, then it gets
collected once.
The consequence is that it requires changing the first line argument type to
lazy in the following methods:
* `TextInputCSVDataSource::inferFromDataset`
* `CSVHeaderChecker::checkHeaderColumnNames`
* `CSVUtils::filterHeaderLine`
So the PR change would have a larger footprint. Also I'm not a Scala expert,
so not sure if using a lazy type would have any other weird performance
consequences. But if you think this is a good approach, let me know and I can
make a commit.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
With regards,
Apache Git Services
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]