Mister-Meeseeks commented on a change in pull request #23830: [SPARK-26935][SQL]Skip DataFrameReader's CSV first line scan when not used
URL: https://github.com/apache/spark/pull/23830#discussion_r258608085
 
 

 ##########
 File path: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
 ##########
 @@ -508,7 +508,12 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
      sparkSession.sessionState.conf.sessionLocalTimeZone)
     val filteredLines: Dataset[String] =
       CSVUtils.filterCommentAndEmpty(csvDataset, parsedOptions)
-    val maybeFirstLine: Option[String] = filteredLines.take(1).headOption
+    val maybeFirstLine: Option[String] =
+      if (userSpecifiedSchema.isEmpty || parsedOptions.headerFlag) {
+        filteredLines.take(1).headOption
+      } else {
+        None
 
 Review comment:
   That's a good point. From a software engineering standpoint it makes more sense for the logic to be owned by the CSV-specific objects rather than the `DataFrameReader` class. (Also worth noting: the first line is also consumed by `CSVUtils.filterHeaderLine`.)
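   For illustration only, a minimal sketch of what a CSV-owned check could look like (`CsvFirstLine` and `needsFirstLine` are invented names, not existing Spark APIs; the condition mirrors the one in the diff above):

```scala
import org.apache.spark.sql.types.StructType

// Hypothetical helper: keeps the "do we need to look at the first line?" decision
// next to the CSV-specific code instead of inside DataFrameReader.
object CsvFirstLine {
  // The first line is only needed to infer a schema (no user-specified schema)
  // or to strip/validate a header row (header option enabled).
  def needsFirstLine(userSpecifiedSchema: Option[StructType], headerFlag: Boolean): Boolean =
    userSpecifiedSchema.isEmpty || headerFlag
}
```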
   
   ~~What if `maybeFirstLine` were a lazy type? The advantage is that the three downstream consumers could each define their own logic internally, so we wouldn't have to worry about it inside `DataFrameReader`. If none of the consumers end up using the first line, it never gets collected; if any of them do, it gets collected exactly once.~~ *[Edit: Never mind, this approach won't work with Scala functions.]*
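   For reference, a minimal sketch of the struck-through idea, assuming the `filteredLines: Dataset[String]` from the diff above is in scope (this is not the change that was merged):

```scala
// A lazy val defers the take(1) Spark job until something dereferences maybeFirstLine,
// and memoizes the result so the job runs at most once.
lazy val maybeFirstLine: Option[String] = filteredLines.take(1).headOption

// The likely catch, and presumably why the idea was withdrawn: as soon as maybeFirstLine
// is passed as an ordinary, strict argument to a consumer such as CSVUtils.filterHeaderLine,
// it is evaluated anyway, so the laziness does not survive the hand-off.
```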

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
