HyukjinKwon commented on a change in pull request #32658:
URL: https://github.com/apache/spark/pull/32658#discussion_r638681682
##########
File path: docs/sql-data-sources-csv.md
##########
@@ -38,3 +38,217 @@ Spark SQL provides `spark.read().csv("file_name")` to read a file or directory o
</div>
</div>
+
+## Data Source Option
+
+Data source options of CSV can be set via:
+* the `.option`/`.options` methods of
+ * `DataFrameReader`
+ * `DataFrameWriter`
+ * `DataStreamReader`
+ * `DataStreamWriter`
+
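+For example, the following minimal sketches (paths such as `path/to/people.csv` are hypothetical placeholders, not files shipped with Spark) show how the options described in the table below are passed. Options can be supplied one at a time with `.option` or together with `.options`:
+
+```python
+from pyspark.sql import SparkSession
+
+spark = SparkSession.builder.getOrCreate()
+
+# Pass options one at a time ...
+df = spark.read.option("sep", ";").option("header", True).csv("path/to/people.csv")
+
+# ... or several at once.
+df = spark.read.options(sep=";", header=True).csv("path/to/people.csv")
+```
+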
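+A hedged sketch of the quoting-related options (`sep`, `quote` and `escape` for reading; `quoteAll` and `escapeQuotes` for writing), reusing the `spark` session above:
+
+```python
+# Read data whose fields are separated by ';', quoted with ' and escaped with \.
+df = spark.read.options(sep=";", quote="'", escape="\\").csv("path/to/input.csv")
+
+# Write with every value quoted, not only values containing special characters.
+df.write.option("quoteAll", True).csv("path/to/output")
+```
+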
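+`header` and `inferSchema` together give named, typed columns at the cost of one extra pass over the data; a minimal sketch:
+
+```python
+# Use the first line as column names and infer column types from the data.
+df = spark.read.options(header=True, inferSchema=True).csv("path/to/people.csv")
+df.printSchema()
+```
+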
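+As the `enforceSchema` entry below recommends, disabling it validates a user-specified schema against the CSV header. A sketch, assuming a two-column file:
+
+```python
+from pyspark.sql.types import IntegerType, StringType, StructField, StructType
+
+schema = StructType([
+    StructField("name", StringType()),
+    StructField("age", IntegerType()),
+])
+
+# With enforceSchema=False, the schema is validated against the file's header
+# instead of being applied blindly by position.
+df = (spark.read.schema(schema)
+      .options(header=True, enforceSchema=False)
+      .csv("path/to/people.csv"))
+```
+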
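+Whitespace trimming is off by default on the read path; a sketch enabling it on both sides of each value:
+
+```python
+# Strip leading and trailing whitespace from values while reading.
+df = spark.read.options(
+    ignoreLeadingWhiteSpace=True,
+    ignoreTrailingWhiteSpace=True,
+).csv("path/to/people.csv")
+```
+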
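+The special-value options map literal strings to null, NaN and the infinities. A sketch with commonly used representations (the exact strings depend on how the file was produced):
+
+```python
+# Treat "NULL" as null, and parse non-numbers and infinities in float columns.
+df = spark.read.options(
+    nullValue="NULL",
+    nanValue="NaN",
+    positiveInf="Inf",
+    negativeInf="-Inf",
+).csv("path/to/metrics.csv")
+```
+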
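+`dateFormat` and `timestampFormat` take patterns from the linked datetime pattern page. A sketch that reads US-style dates into typed columns (the DDL schema string is illustrative):
+
+```python
+# Parse date and timestamp columns with custom patterns.
+df = (spark.read.schema("d DATE, ts TIMESTAMP")
+      .options(dateFormat="MM/dd/yyyy", timestampFormat="MM/dd/yyyy HH:mm:ss")
+      .csv("path/to/events.csv"))
+```
+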
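+In `PERMISSIVE` mode, malformed rows can be kept in a dedicated string column instead of being dropped. A sketch (the column name `_corrupt_record` matches the usual default of `spark.sql.columnNameOfCorruptRecord`):
+
+```python
+# Keep malformed input lines in the _corrupt_record column.
+df = (spark.read.schema("name STRING, age INT, _corrupt_record STRING")
+      .options(mode="PERMISSIVE", columnNameOfCorruptRecord="_corrupt_record")
+      .csv("path/to/people.csv"))
+
+df.where(df._corrupt_record.isNotNull()).show(truncate=False)
+```
+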
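+Finally, `multiLine` lets quoted values contain embedded newlines, so one record may span several physical lines; a sketch:
+
+```python
+# Parse records that span multiple lines (e.g. quoted free-text fields).
+df = spark.read.options(multiLine=True, header=True).csv("path/to/comments.csv")
+```
+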
+<table class="table">
+  <tr><th><b>Property Name</b></th><th><b>Default</b></th><th><b>Meaning</b></th><th><b>Scope</b></th></tr>
+ <tr>
+ <td><code>sep</code></td>
+ <td>None</td>
+    <td>Sets a separator (one or more characters) for each field and value. If None is set, it uses the default value, <code>,</code>.</td>
+ <td>read/write</td>
+ </tr>
+ <tr>
+ <td><code>encoding</code></td>
+ <td>None</td>
+    <td>For reading, decodes the CSV files by the given encoding type. If None is set, it uses the default value, <code>UTF-8</code>. For writing, sets the encoding (charset) of saved csv files. If None is set, the default UTF-8 charset will be used.</td>
+ <td>read/write</td>
+ </tr>
+ <tr>
+ <td><code>quote</code></td>
+ <td>None</td>
+    <td>Sets a single character used for escaping quoted values where the separator can be part of the value. If None is set, it uses the default value, <code>"</code>. If you would like to turn off quotations, you need to set an empty string. If an empty string is set, it uses <code>u0000</code> (null character).</td>
+ <td>read/write</td>
+ </tr>
+ <tr>
+ <td><code>quoteAll</code></td>
+ <td>None</td>
+    <td>A flag indicating whether all values should always be enclosed in quotes. If None is set, it uses the default value <code>false</code>, only quoting values containing a quote character.</td>
+ <td>write</td>
+ </tr>
+ <tr>
+ <td><code>escape</code></td>
+ <td>None</td>
+    <td>Sets a single character used for escaping quotes inside an already quoted value. If None is set, it uses the default value, <code>\</code>.</td>
+ <td>read/write</td>
+ </tr>
+ <tr>
+ <td><code>escapeQuotes</code></td>
+ <td>None</td>
+    <td>A flag indicating whether values containing quotes should always be enclosed in quotes. If None is set, it uses the default value <code>true</code>, escaping all values containing a quote character.</td>
+ <td>write</td>
+ </tr>
+ <tr>
+ <td><code>comment</code></td>
+ <td>None</td>
+    <td>Sets a single character used for skipping lines beginning with this character. By default (None), it is disabled.</td>
+ <td>read</td>
+ </tr>
+ <tr>
+ <td><code>header</code></td>
+ <td>None</td>
+    <td>For reading, uses the first line as names of columns. For writing, writes the names of columns as the first line. If None is set, it uses the default value, <code>false</code>. Note that if the given path is an RDD of Strings, this header option will remove all lines that are the same as the header, if they exist.</td>
+ <td>read/write</td>
+ </tr>
+ <tr>
+ <td><code>inferSchema</code></td>
+ <td>None</td>
+    <td>Infers the input schema automatically from data. It requires one extra pass over the data. If None is set, it uses the default value, <code>false</code>.</td>
+ <td>read</td>
+ </tr>
+ <tr>
+ <td><code>enforceSchema</code></td>
+ <td>None</td>
+    <td>If it is set to <code>true</code>, the specified or inferred schema will be forcibly applied to datasource files, and headers in CSV files will be ignored. If the option is set to <code>false</code>, the schema will be validated against all headers in CSV files or the first header in RDD if the <code>header</code> option is set to <code>true</code>. Field names in the schema and column names in CSV headers are checked by their positions taking into account <code>spark.sql.caseSensitive</code>. If None is set, <code>true</code> is used by default. Though the default value is <code>true</code>, it is recommended to disable the <code>enforceSchema</code> option to avoid incorrect results.</td>
+ <td>read</td>
+ </tr>
+ <tr>
+ <td><code>ignoreLeadingWhiteSpace</code></td>
+ <td>None</td>
+    <td>A flag indicating whether or not leading whitespaces from values being read/written should be skipped. If None is set, it uses the default value, <code>false</code> for reading, and <code>true</code> for writing.</td>
+ <td>read/write</td>
+ </tr>
+ <tr>
+ <td><code>ignoreTrailingWhiteSpace</code></td>
+ <td>None</td>
+    <td>A flag indicating whether or not trailing whitespaces from values being read/written should be skipped. If None is set, it uses the default value, <code>false</code> for reading, and <code>true</code> for writing.</td>
+ <td>read/write</td>
+ </tr>
+ <tr>
+ <td><code>nullValue</code></td>
+ <td>None</td>
+    <td>Sets the string representation of a null value. If None is set, it uses the default value, empty string. Since 2.0.1, this <code>nullValue</code> param applies to all supported types including the string type.</td>
+ <td>read/write</td>
+ </tr>
+ <tr>
+ <td><code>nanValue</code></td>
+ <td>None</td>
+    <td>Sets the string representation of a non-number value. If None is set, it uses the default value, <code>NaN</code>.</td>
+ <td>read</td>
+ </tr>
+ <tr>
+ <td><code>positiveInf</code></td>
+ <td>None</td>
+    <td>Sets the string representation of a positive infinity value. If None is set, it uses the default value, <code>Inf</code>.</td>
+ <td>read</td>
+ </tr>
+ <tr>
+ <td><code>negativeInf</code></td>
+ <td>None</td>
+    <td>Sets the string representation of a negative infinity value. If None is set, it uses the default value, <code>-Inf</code>.</td>
+ <td>read</td>
+ </tr>
+ <tr>
+ <td><code>dateFormat</code></td>
+ <td>None</td>
+    <td>Sets the string that indicates a date format. Custom date formats follow the formats at <a href="https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html">datetime pattern</a>. This applies to date type. If None is set, it uses the default value, <code>yyyy-MM-dd</code>.</td>
+    <td>read/write</td>
+ </tr>
+ <tr>
+ <td><code>timestampFormat</code></td>
+ <td>None</td>
+    <td>Sets the string that indicates a timestamp format. Custom date formats follow the formats at <a href="https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html">datetime pattern</a>. This applies to timestamp type. If None is set, it uses the default value, <code>yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]</code>.</td>
+ <td>read/write</td>
+ </tr>
+ <tr>
+ <td><code>maxColumns</code></td>
+ <td>None</td>
+    <td>Defines a hard limit of how many columns a record can have. If None is set, it uses the default value, <code>20480</code>.</td>
+ <td>read</td>
+ </tr>
+ <tr>
+ <td><code>maxCharsPerColumn</code></td>
+ <td>None</td>
+    <td>Defines the maximum number of characters allowed for any given value being read. If None is set, it uses the default value, <code>-1</code> meaning unlimited length.</td>
+ <td>read</td>
+ </tr>
+ <tr>
+ <td><code>maxMalformedLogPerPartition</code></td>
+ <td>None</td>
+    <td>This parameter is no longer used since Spark 2.2.0. If specified, it is ignored.</td>
+ <td>read</td>
+ </tr>
+ <tr>
+ <td><code>mode</code></td>
+ <td>None</td>
+    <td>Allows a mode for dealing with corrupt records during parsing. If None is set, it uses the default value, <code>PERMISSIVE</code>. Note that Spark tries to parse only required columns in CSV under column pruning. Therefore, corrupt records can be different based on the required set of fields. This behavior can be controlled by <code>spark.sql.csv.parser.columnPruning.enabled</code> (enabled by default).<br>
+    <ul>
+    <li><code>PERMISSIVE</code>: when it meets a corrupted record, puts the malformed string into a field configured by <code>columnNameOfCorruptRecord</code>, and sets malformed fields to <code>null</code>. To keep corrupt records, a user can set a string type field named <code>columnNameOfCorruptRecord</code> in a user-defined schema. If a schema does not have the field, it drops corrupt records during parsing. A record with fewer/more tokens than the schema is not a corrupted record to CSV. When it meets a record having fewer tokens than the length of the schema, it sets <code>null</code> to the extra fields. When the record has more tokens than the length of the schema, it drops extra tokens.</li>
+    <li><code>DROPMALFORMED</code>: ignores whole corrupted records.</li>
+    <li><code>FAILFAST</code>: throws an exception when it meets corrupted records.</li>
+ </ul>
+ </td>
+ <td>read</td>
+ </tr>
+ <tr>
+ <td><code>columnNameOfCorruptRecord</code></td>
+ <td>None</td>
+    <td>Allows renaming the new field that holds the malformed string created by <code>PERMISSIVE</code> mode. This overrides <code>spark.sql.columnNameOfCorruptRecord</code>. If None is set, it uses the value specified in <code>spark.sql.columnNameOfCorruptRecord</code>.</td>
+ <td>read</td>
+ </tr>
+ <tr>
+ <td><code>multiLine</code></td>
+ <td>None</td>
+    <td>Parses one record, which may span multiple lines, per file. If None is set, it uses the default value, <code>false</code>.</td>
+ <td>read</td>
+ </tr>
+ <tr>
+ <td><code>charToEscapeQuoteEscaping</code></td>
+ <td>None</td>
+    <td>Sets a single character used for escaping the escape for the quote character. If None is set, the default value is the escape character when escape and quote characters are different, <code>\0</code> otherwise.</td>
+ <td>read/write</td>
+ </tr>
+ <tr>
+ <td><code>samplingRatio</code></td>
+ <td>None</td>
+    <td>Defines fraction of rows used for schema inferring. If None is set, it uses the default value, <code>1.0</code>.</td>
+ <td>read</td>
+ </tr>
+ <tr>
+ <td><code>emptyValue</code></td>
+ <td>None</td>
+    <td>Sets the string representation of an empty value. If None is set, it uses the default value, <code>""</code>.</td>
+ <td>read/write</td>
+ </tr>
+ <tr>
+ <td><code>locale</code></td>
+ <td>None</td>
+    <td>Sets a locale as language tag in IETF BCP 47 format. If None is set, it uses the default value, <code>en-US</code>. For instance, <code>locale</code> is used while parsing dates and timestamps.</td>
+ <td>read</td>
+ </tr>
+ <tr>
+ <td><code>lineSep</code></td>
+ <td>None</td>
+    <td>Defines the line separator that should be used for parsing. If None is set, it covers all <code>\\r</code>, <code>\\r\\n</code> and <code>\\n</code>. Maximum length is 1 character.</td>
Review comment:
what about writing?