holdenk commented on code in PR #39907:
URL: https://github.com/apache/spark/pull/39907#discussion_r1257386798
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala:
##########
@@ -378,12 +378,15 @@ private[sql] object UnivocityParser {
def tokenizeStream(
inputStream: InputStream,
shouldDropHeader: Boolean,
+ skipLines: Int,
tokenizer: CsvParser,
encoding: String): Iterator[Array[String]] = {
+ val handleSkipLines: () => Unit =
+ () => 1.to(skipLines).foreach(_ => tokenizer.parseNext())
Review Comment:
What's the behaviour when skipLines is greater than the number of lines in
the input file?
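To make the question concrete, here is a minimal sketch of the case in question. It does not use the real parser: `FakeTokenizer` is a hypothetical stand-in that assumes `parseNext()` returns `null` once the input is exhausted, and `skipUntilExhausted` is only one possible alternative, not something taken from the PR.

```scala
// Hypothetical stand-in for the tokenizer. Assumption: parseNext()
// returns null once there are no more rows to read.
final class FakeTokenizer(rows: List[Array[String]]) {
  private var remaining = rows
  var calls = 0 // number of parseNext() invocations, for illustration

  def parseNext(): Array[String] = {
    calls += 1
    remaining match {
      case head :: tail => remaining = tail; head
      case Nil => null
    }
  }
}

object SkipLinesSketch {
  // Same shape as handleSkipLines in the diff: parseNext() is called
  // skipLines times regardless of whether the input has already ended.
  def skipUnconditionally(t: FakeTokenizer, skipLines: Int): Unit =
    1.to(skipLines).foreach(_ => t.parseNext())

  // Possible alternative: stop at the first null and report how many
  // rows were actually skipped.
  def skipUntilExhausted(t: FakeTokenizer, skipLines: Int): Int = {
    var skipped = 0
    while (skipped < skipLines && t.parseNext() != null) skipped += 1
    skipped
  }
}
```

With a two-row input and `skipLines = 5`, the first variant keeps calling `parseNext()` after the end of input, so the result depends on what the real tokenizer does in that state. The second variant stops after the first `null`.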
##########
docs/sql-data-sources-csv.md:
##########
@@ -102,6 +102,12 @@ Data source options of CSV can be set via:
<td>For reading, uses the first line as names of columns. For writing,
writes the names of columns as the first line. Note that if the given path is a
RDD of Strings, this header option will remove all lines same with the header
if exists. CSV built-in functions ignore this option.</td>
<td>read/write</td>
</tr>
+ <tr>
+ <td><code>skipLines</code></td>
+ <td>0</td>
+ <td>Sets the number of non-empty, uncommented lines to skip before parsing
CSV files. If the <code>header</code> option is set to <code>true</code>, the
first line after the number of <code>skipLines</code> will be taken as the
header.</td>
+ <td>read</td>
+ </tr>
Review Comment:
Does skipLines apply before or after the filtering? For example, if we have
10 empty lines at the top of partition 1, what is the behaviour?
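The two readings can be sketched as follows. This models lines as plain strings rather than going through the tokenizer, and `isFiltered`, the `#` comment prefix, and both function names are assumptions made for illustration only.

```scala
object SkipOrderSketch {
  // Assumption: a line is filtered out if it is blank or starts with
  // the comment prefix.
  def isFiltered(line: String, comment: String = "#"): Boolean =
    line.trim.isEmpty || line.startsWith(comment)

  // Reading A: skipLines counts raw lines, filtering happens afterwards.
  def skipThenFilter(lines: Seq[String], skipLines: Int): Seq[String] =
    lines.drop(skipLines).filterNot(isFiltered(_))

  // Reading B: filtering happens first, skipLines counts only the
  // non-empty, uncommented lines that remain.
  def filterThenSkip(lines: Seq[String], skipLines: Int): Seq[String] =
    lines.filterNot(isFiltered(_)).drop(skipLines)
}
```

For 10 empty lines followed by `preamble`, `a,b`, `1,2` and `skipLines = 1`, reading A drops one empty line and keeps `preamble` as the first row, while reading B drops `preamble` and starts at `a,b`. The doc wording ("non-empty, uncommented lines") suggests reading B, but it would be good to confirm that and say so explicitly.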
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]