HyukjinKwon commented on a change in pull request #32658:
URL: https://github.com/apache/spark/pull/32658#discussion_r640517394
##########
File path: docs/sql-data-sources-csv.md
##########
@@ -38,3 +36,223 @@ Spark SQL provides `spark.read().csv("file_name")` to read a file or directory o
</div>
</div>
+
+## Data Source Option
+
+Data source options of CSV can be set via:
+* the `.option`/`.options` methods of
+ * `DataFrameReader`
+ * `DataFrameWriter`
+ * `DataStreamReader`
+ * `DataStreamWriter`
+* the built-in functions below
+ * `from_csv`
+ * `to_csv`
+ * `schema_of_csv`
+* `OPTIONS` clause at [CREATE TABLE USING DATA_SOURCE](sql-ref-syntax-ddl-create-table-datasource.html)
+
+
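+For example, the snippet below is a minimal Scala sketch of the approaches above; the `spark` session, the file path, the schema, and the table name are all illustrative:
+
+```scala
+import org.apache.spark.sql.functions.from_csv
+import org.apache.spark.sql.types.{IntegerType, StringType, StructType}
+
+// Assumes an active SparkSession is available as `spark`.
+import spark.implicits._
+
+// 1. The .option/.options methods of DataFrameReader.
+val df = spark.read
+  .option("sep", ";")
+  .options(Map("header" -> "true", "inferSchema" -> "true"))
+  .csv("path/to/people.csv")  // illustrative path
+
+// 2. The from_csv built-in function, passing options as a map.
+val schema = new StructType().add("id", IntegerType).add("name", StringType)
+val parsed = Seq("1;Alice").toDF("value")
+  .select(from_csv($"value", schema, Map("sep" -> ";")).as("csv"))
+
+// 3. The OPTIONS clause of CREATE TABLE USING DATA_SOURCE.
+spark.sql(
+  """CREATE TABLE people
+    |USING csv
+    |OPTIONS (path 'path/to/people.csv', sep ';', header 'true')""".stripMargin)
+```
+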
+<table class="table">
+  <tr><th><b>Property Name</b></th><th><b>Default</b></th><th><b>Meaning</b></th><th><b>Scope</b></th></tr>
+ <tr>
+ <td><code>sep</code></td>
+ <td>,</td>
+    <td>Sets a separator (one or more characters) for each field and value.</td>
+ <td>read/write</td>
+ </tr>
+ <tr>
+ <td><code>encoding</code></td>
+ <td><code>UTF-8</code> for reading, not set for writing</td>
+    <td>For reading, decodes the CSV files by the given encoding type. For writing, specifies the encoding (charset) of saved CSV files.</td>
+ <td>read/write</td>
+ </tr>
+ <tr>
+ <td><code>quote</code></td>
+ <td>"</td>
+    <td>Sets a single character used for escaping quoted values where the separator can be part of the value. For reading, if you would like to turn off quotations, you need to set not <code>null</code> but an empty string. For writing, if an empty string is set, it uses <code>u0000</code> (null character).</td>
+ <td>read/write</td>
+ </tr>
+ <tr>
+ <td><code>quoteAll</code></td>
+ <td>false</td>
+    <td>A flag indicating whether all values should always be enclosed in quotes. It only escapes values containing a quote character by default.</td>
+ <td>write</td>
+ </tr>
+ <tr>
+ <td><code>escape</code></td>
+ <td>\</td>
+    <td>Sets a single character used for escaping quotes inside an already quoted value.</td>
+ <td>read/write</td>
+ </tr>
+ <tr>
+ <td><code>escapeQuotes</code></td>
+ <td>true</td>
+    <td>A flag indicating whether values containing quotes should always be enclosed in quotes. It escapes all values containing a quote character by default.</td>
+ <td>write</td>
+ </tr>
+ <tr>
+ <td><code>comment</code></td>
+ <td>""</td>
+    <td>Sets a single character used for skipping lines beginning with this character.</td>
+ <td>read</td>
+ </tr>
+ <tr>
+ <td><code>header</code></td>
+ <td>false</td>
+    <td>For reading, uses the first line as names of columns. For writing, writes the names of columns as the first line. Note that if the given path is an RDD of Strings, this header option will remove all lines that are the same as the header, if it exists.</td>
+ <td>read/write</td>
+ </tr>
+ <tr>
+ <td><code>inferSchema</code></td>
+ <td>false</td>
+    <td>Infers the input schema automatically from data. It requires one extra pass over the data.</td>
+ <td>read</td>
+ </tr>
+ <tr>
+ <td><code>enforceSchema</code></td>
+ <td>true</td>
+    <td>If it is set to <code>true</code>, the specified or inferred schema will be forcibly applied to datasource files, and headers in CSV files will be ignored. If the option is set to <code>false</code>, the schema will be validated against all headers in CSV files, or the first header in the RDD if the <code>header</code> option is set to <code>true</code>. Field names in the schema and column names in CSV headers are checked by their positions taking into account <code>spark.sql.caseSensitive</code>. Though the default value is <code>true</code>, it is recommended to disable the <code>enforceSchema</code> option to avoid incorrect results.</td>
+ <td>read</td>
+ </tr>
+ <tr>
+ <td><code>ignoreLeadingWhiteSpace</code></td>
+ <td><code>false</code> for reading, <code>true</code> for writing</td>
+    <td>A flag indicating whether or not leading whitespaces from values being read/written should be skipped.</td>
+ <td>read/write</td>
+ </tr>
+ <tr>
+ <td><code>ignoreTrailingWhiteSpace</code></td>
+ <td><code>false</code> for reading, <code>true</code> for writing</td>
+    <td>A flag indicating whether or not trailing whitespaces from values being read/written should be skipped.</td>
+ <td>read/write</td>
+ </tr>
+ <tr>
+ <td><code>nullValue</code></td>
+ <td>""</td>
+    <td>Sets the string representation of a null value. Since 2.0.1, this <code>nullValue</code> param applies to all supported types including the string type.</td>
+ <td>read/write</td>
+ </tr>
+ <tr>
+ <td><code>nanValue</code></td>
+ <td>NaN</td>
+ <td>Sets the string representation of a non-number value.</td>
+ <td>read</td>
+ </tr>
+ <tr>
+ <td><code>positiveInf</code></td>
+ <td>Inf</td>
+ <td>Sets the string representation of a positive infinity value.</td>
+ <td>read</td>
+ </tr>
+ <tr>
+ <td><code>negativeInf</code></td>
+ <td>-Inf</td>
+ <td>Sets the string representation of a negative infinity value.</td>
+ <td>read</td>
+ </tr>
+ <tr>
+ <td><code>dateFormat</code></td>
+ <td>yyyy-MM-dd</td>
+    <td>Sets the string that indicates a date format. Custom date formats follow the formats at <a href="https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html">datetime pattern</a>. This applies to date type.</td>
+ <td>read</td>
+ </tr>
+ <tr>
+ <td><code>timestampFormat</code></td>
+ <td>yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]</td>
+    <td>Sets the string that indicates a timestamp format. Custom date formats follow the formats at <a href="https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html">datetime pattern</a>. This applies to timestamp type.</td>
+ <td>read/write</td>
+ </tr>
+ <tr>
+ <td><code>maxColumns</code></td>
+ <td>20480</td>
+ <td>Defines a hard limit of how many columns a record can have.</td>
+ <td>read</td>
+ </tr>
+ <tr>
+ <td><code>maxCharsPerColumn</code></td>
+ <td>-1</td>
+    <td>Defines the maximum number of characters allowed for any given value being read. The default value <code>-1</code> means unlimited length.</td>
+ <td>read</td>
+ </tr>
+ <tr>
+ <td><code>maxMalformedLogPerPartition</code></td>
+ <td>(none)</td>
+    <td>This parameter is no longer used since Spark 2.2.0. If specified, it is ignored.</td>
+ <td>read</td>
+ </tr>
+ <tr>
+ <td><code>mode</code></td>
+ <td>PERMISSIVE</td>
+    <td>Allows a mode for dealing with corrupt records during parsing. Note that Spark tries to parse only required columns in CSV under column pruning. Therefore, corrupt records can be different based on the required set of fields. This behavior can be controlled by <code>spark.sql.csv.parser.columnPruning.enabled</code> (enabled by default).<br>
+ <ul>
+      <li><code>PERMISSIVE</code>: when it meets a corrupted record, puts the malformed string into a field configured by <code>columnNameOfCorruptRecord</code>, and sets malformed fields to <code>null</code>. To keep corrupt records, a user can set a string type field named <code>columnNameOfCorruptRecord</code> in a user-defined schema. If a schema does not have the field, it drops corrupt records during parsing. A record with fewer or more tokens than the schema is not a corrupted record to CSV. When it meets a record having fewer tokens than the length of the schema, sets <code>null</code> to extra fields. When the record has more tokens than the length of the schema, it drops extra tokens.</li>
+ <li><code>DROPMALFORMED</code>: ignores the whole corrupted records.</li>
+      <li><code>FAILFAST</code>: throws an exception when it meets corrupted records.</li>
+ </ul>
+ </td>
+ <td>read</td>
+ </tr>
+ <tr>
+ <td><code>columnNameOfCorruptRecord</code></td>
+    <td>The value specified in <code>spark.sql.columnNameOfCorruptRecord</code></td>
+    <td>Allows renaming the new field having a malformed string created by <code>PERMISSIVE</code> mode. This overrides <code>spark.sql.columnNameOfCorruptRecord</code>.</td>
+ <td>read</td>
+ </tr>
+ <tr>
+ <td><code>multiLine</code></td>
+ <td>false</td>
+ <td>Parse one record, which may span multiple lines, per file.</td>
+ <td>read</td>
+ </tr>
+ <tr>
+ <td><code>charToEscapeQuoteEscaping</code></td>
+ <td><code>escape</code> or <code>\0</code></td>
+    <td>Sets a single character used for escaping the escape for the quote character. The default value is the escape character when the escape and quote characters are different, <code>\0</code> otherwise.</td>
+ <td>read/write</td>
+ </tr>
+ <tr>
+ <td><code>samplingRatio</code></td>
+ <td>1.0</td>
+    <td>Defines fraction of input CSV rows used for schema inferring.</td>
+ <td>read</td>
+ </tr>
+ <tr>
+ <td><code>emptyValue</code></td>
+ <td>""</td>
+ <td>Sets the string representation of an empty value.</td>
+ <td>read/write</td>
+ </tr>
+ <tr>
+ <td><code>locale</code></td>
+ <td>en-US</td>
+    <td>Sets a locale as language tag in IETF BCP 47 format. For instance, <code>locale</code> is used while parsing dates and timestamps.</td>
+ <td>read</td>
+ </tr>
+ <tr>
+ <td><code>lineSep</code></td>
+    <td><code>\r</code>, <code>\r\n</code>, <code>\n</code> for reading, <code>\n</code> for writing</td>
Review comment:
```suggestion
    <td><code>\r</code>, <code>\r\n</code> and <code>\n</code> (for reading), <code>\n</code> (for writing)</td>
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]