[GitHub] [spark] holdenk commented on a diff in pull request #39907: [SPARK-42359][SQL] Support row skipping when reading CSV files

via GitHub Sat, 08 Jul 2023 16:42:36 -0700


holdenk commented on code in PR #39907:
URL: https://github.com/apache/spark/pull/39907#discussion_r1257386798



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala:
##########
@@ -378,12 +378,15 @@ private[sql] object UnivocityParser {
   def tokenizeStream(
       inputStream: InputStream,
       shouldDropHeader: Boolean,
+      skipLines: Int,
       tokenizer: CsvParser,
       encoding: String): Iterator[Array[String]] = {
+    val handleSkipLines: () => Unit =
+      () => 1.to(skipLines).foreach(_ => tokenizer.parseNext())

Review Comment:
   What's the behaviour when skipLines is greater than the number of lines in the input file?
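
   To make the edge case concrete, here is a minimal sketch that uses a plain Scala `Iterator[String]` as a stand-in for the tokenizer. This is an assumption for illustration only: it is not the univocity `CsvParser` API, whose end-of-input behaviour would need to be checked separately. `skipEager` and `skipSafe` are hypothetical helpers; the first mirrors the shape of the `1.to(skipLines).foreach` loop in the diff, the second relies on `Iterator.drop`.

```scala
object SkipLinesSketch {
  // Hypothetical helper mirroring the shape of the loop in the diff:
  // calls next() exactly n times, regardless of how much input remains.
  def skipEager(lines: Iterator[String], n: Int): Iterator[String] = {
    1.to(n).foreach(_ => lines.next())
    lines
  }

  // Hypothetical alternative: drop() stops quietly once the input is
  // exhausted and leaves an empty iterator behind.
  def skipSafe(lines: Iterator[String], n: Int): Iterator[String] =
    lines.drop(n)
}
```

   With a two-line input and `n = 5`, `skipEager` fails on a standard collection iterator because `next()` is called on an exhausted iterator, while `skipSafe` just returns an empty iterator. Whichever behaviour the PR intends, it would be good to have it documented and covered by a test.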



##########
docs/sql-data-sources-csv.md:
##########
@@ -102,6 +102,12 @@ Data source options of CSV can be set via:
     <td>For reading, uses the first line as names of columns. For writing, writes the names of columns as the first line. Note that if the given path is a RDD of Strings, this header option will remove all lines same with the header if exists. CSV built-in functions ignore this option.</td>
     <td>read/write</td>
   </tr>
+  <tr>
+    <td><code>skipLines</code></td>
+    <td>0</td>
+    <td>Sets the number of non-empty, uncommented lines to skip before parsing CSV files. If the <code>header</code> option is set to <code>true</code>, the first line after the number of <code>skipLines</code> will be taken as the header.</td>
+    <td>read</td>
+  </tr>

Review Comment:
   Does skipLines apply before or after empty and commented lines are filtered out? (E.g. if we have 10 empty lines at the top of partition 1, what is the behaviour?)
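
   The two possible orderings can be sketched as follows. This is only an illustration under stated assumptions: "filtering" is reduced to dropping blank lines (comment handling is omitted), the input is an in-memory `Seq[String]` rather than a partitioned file, and `skipThenFilter` / `filterThenSkip` are hypothetical names, not functions in the PR.

```scala
object SkipOrderSketch {
  // Assumption: "filtering" means dropping blank lines only.
  private def nonBlank(line: String): Boolean = line.trim.nonEmpty

  // Skip counts raw lines, blank ones included; filtering happens afterwards.
  def skipThenFilter(lines: Seq[String], n: Int): Seq[String] =
    lines.drop(n).filter(nonBlank)

  // Blank lines are removed first, so skip counts only non-empty lines.
  def filterThenSkip(lines: Seq[String], n: Int): Seq[String] =
    lines.filter(nonBlank).drop(n)
}
```

   For the input `Seq("", "", "preamble", "id,name", "1,a")` with `n = 1`, `skipThenFilter` keeps `preamble` as the first remaining line, whereas `filterThenSkip` drops it and leaves `id,name` first. The doc wording ("non-empty, uncommented lines") reads like the second ordering, but it would help to state that explicitly.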



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

