[GitHub] Mister-Meeseeks commented on a change in pull request #23830: [SPARK-26935][SQL]Skip DataFrameReader's CSV first line scan when not used

GitBox Wed, 20 Feb 2019 11:17:18 -0800

Mister-Meeseeks commented on a change in pull request #23830: 
[SPARK-26935][SQL]Skip DataFrameReader's CSV first line scan when not used
URL: https://github.com/apache/spark/pull/23830#discussion_r258608085


 ##########
 File path: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
 ##########
 @@ -508,7 +508,12 @@ class DataFrameReader private[sql](sparkSession: 
SparkSession) extends Logging {
       sparkSession.sessionState.conf.sessionLocalTimeZone)
     val filteredLines: Dataset[String] =
       CSVUtils.filterCommentAndEmpty(csvDataset, parsedOptions)
-    val maybeFirstLine: Option[String] = filteredLines.take(1).headOption
+    val maybeFirstLine: Option[String] =
+      if (userSpecifiedSchema.isEmpty || parsedOptions.headerFlag) {
+        filteredLines.take(1).headOption
+      } else {
+        None
 
 Review comment:
   That's a good point. From a software engineering standpoint it makes more 
sense for the logic to be owned by the CSV specific objects, not the 
`DataFrameReader` class. (Also to add first line is also consumed by 
`CSVUtils.filterHeaderLine `)
   
   ~~What if `maybeFirstLine` was a lazy type? The advantage is that the three 
downstream consumers could all internally define their logic. We don't have to 
worry about it inside `DataFrameReader`. If none of the consumers end up using 
the first line, then it doesn't get collected. If they do, then it gets 
collected once.~~ 
   
   *[Edit: Nevermind, this approach won't work with Scala functions]*
   
   What if we created a simple `CSVHeaderBox` class that contains a reference 
to the Dataset? It doesn't call take() on the constructor. But the first time 
the `unbox` function is called, it collects the header then stores it inside 
the class. The advantage is the logic of whether to consume or not consume the 
first line is deferred to the underlying consumers. And we still avoid 
wastefully collecting more than once (or at all if not needed).
   
   The downside is that everywhere the header is passed as a string or 
Option[string], would instead take a reference to `CSVHeaderBox`. In addition 
to adding the new class, it will also require changing the type signature for 
the following methods: 
   
   * CSVUtils::filterHeaderLine
   * CSVHeaderChecker::checkHeaderColumnNames
   * CSVDataSource::inferFromDataset
   
   So the PR will result in a larger footprint to the codebase. If you think 
this is a good approach, let me know and I can make a commit for review.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] Mister-Meeseeks commented on a change in pull request #23830: [SPARK-26935][SQL]Skip DataFrameReader's CSV first line scan when not used

Reply via email to