Mister-Meeseeks commented on a change in pull request #23830: 
[SPARK-26935][SQL]Skip DataFrameReader's CSV first line scan when not used
URL: https://github.com/apache/spark/pull/23830#discussion_r258608085
 
 

 ##########
 File path: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
 ##########
 @@ -508,7 +508,12 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
       sparkSession.sessionState.conf.sessionLocalTimeZone)
     val filteredLines: Dataset[String] =
       CSVUtils.filterCommentAndEmpty(csvDataset, parsedOptions)
-    val maybeFirstLine: Option[String] = filteredLines.take(1).headOption
+    val maybeFirstLine: Option[String] =
+      if (userSpecifiedSchema.isEmpty || parsedOptions.headerFlag) {
+        filteredLines.take(1).headOption
+      } else {
+        None
 
 Review comment:
   That's a good point. From a software engineering standpoint it makes more 
sense for the logic to be owned by the CSV-specific objects, not the 
`DataFrameReader` class. (Also, the first line is consumed by 
`CSVUtils.filterHeaderLine` as well.)
   
   What if `maybeFirstLine` were a lazy type? The advantage is that the three 
downstream consumers could each define their own logic internally, so we don't 
have to worry about it inside `DataFrameReader`. If none of the consumers ends 
up using the first line, it never gets collected. If any of them does, it gets 
collected exactly once. 
   
   The cost is that the first-line argument type would have to change to lazy 
in the following methods:
   
   * `TextInputCSVDataSource::inferFromDataset`
   * `CSVHeaderChecker::checkHeaderColumnNames`
   * `CSVUtils::filterHeaderLine`
   
   So the PR would have a larger footprint. Also, I'm not a Scala expert, so 
I'm not sure whether using a lazy type would have any other unexpected 
performance consequences. But if you think this is a good approach, let me know 
and I can make a commit.
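
To make the lazy idea concrete, here is a minimal sketch in plain Scala, not actual Spark code. It assumes "lazy type" would mean a `lazy val` in the caller passed to by-name parameters (`=> Option[String]`) in the consumers. Every name below (`LazyFirstLineSketch`, `scanFirstLine`, `inferColumns`, `dropHeader`, `run`) is a hypothetical stand-in, and a `Seq[String]` stands in for the `Dataset[String]`.

```scala
// Hypothetical stand-ins only; none of these names are Spark APIs.
object LazyFirstLineSketch {
  // Counts how often the expensive first-line scan actually runs.
  var scans = 0

  // Stand-in for filteredLines.take(1).headOption, which would trigger a job.
  def scanFirstLine(lines: Seq[String]): Option[String] = {
    scans += 1
    lines.headOption
  }

  // Consumer 1: needs the first line only for inference or header handling.
  // `firstLine` is by-name, so it is evaluated only where it is referenced.
  def inferColumns(
      firstLine: => Option[String],
      schemaGiven: Boolean,
      header: Boolean): Seq[String] =
    if (schemaGiven && !header) Seq.empty
    else firstLine.map(_.split(",").toSeq).getOrElse(Seq.empty)

  // Consumer 2: drops lines equal to the header, but only when header is set.
  def dropHeader(
      lines: Seq[String],
      firstLine: => Option[String],
      header: Boolean): Seq[String] =
    if (header) firstLine.map(h => lines.filterNot(_ == h)).getOrElse(lines)
    else lines

  def run(
      lines: Seq[String],
      schemaGiven: Boolean,
      header: Boolean): (Seq[String], Seq[String]) = {
    // lazy val memoizes: the scan runs at most once, on first access.
    lazy val maybeFirstLine: Option[String] = scanFirstLine(lines)
    val columns = inferColumns(maybeFirstLine, schemaGiven, header)
    val data = dropHeader(lines, maybeFirstLine, header)
    (columns, data)
  }
}
```

With a user-specified schema and no header, neither consumer touches the first line, so `scans` stays at 0. With a header, both consumers reference it but the scan still runs only once. The caller no longer needs to know which option combinations require the first line.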

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
