HyukjinKwon commented on a change in pull request #32658:
URL: https://github.com/apache/spark/pull/32658#discussion_r638681682
##########
File path: docs/sql-data-sources-csv.md
##########
@@ -38,3 +38,217 @@ Spark SQL provides `spark.read().csv("file_name")` to read a file or directory o
</div>
</div>
+
+## Data Source Option
+
+Data source options of CSV can be set via:
+* the `.option`/`.options` methods of
+ * `DataFrameReader`
+ * `DataFrameWriter`
+ * `DataStreamReader`
+ * `DataStreamWriter`
+
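+For example, the following minimal sketches (paths such as `path/to/people.csv` are hypothetical placeholders, not files shipped with Spark) show how the options described in the table below are passed. Options can be supplied one at a time with `.option` or together with `.options`:
+
+```python
+from pyspark.sql import SparkSession
+
+spark = SparkSession.builder.getOrCreate()
+
+# Pass options one at a time ...
+df = spark.read.option("sep", ";").option("header", True).csv("path/to/people.csv")
+
+# ... or several at once.
+df = spark.read.options(sep=";", header=True).csv("path/to/people.csv")
+```
+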
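+A hedged sketch of the quoting-related options (`sep`, `quote` and `escape` for reading; `quoteAll` and `escapeQuotes` for writing), reusing the `spark` session above:
+
+```python
+# Read data whose fields are separated by ';', quoted with ' and escaped with \.
+df = spark.read.options(sep=";", quote="'", escape="\\").csv("path/to/input.csv")
+
+# Write with every value quoted, not only values containing special characters.
+df.write.option("quoteAll", True).csv("path/to/output")
+```
+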
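+`header` and `inferSchema` together give named, typed columns at the cost of one extra pass over the data; a minimal sketch:
+
+```python
+# Use the first line as column names and infer column types from the data.
+df = spark.read.options(header=True, inferSchema=True).csv("path/to/people.csv")
+df.printSchema()
+```
+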
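+As the `enforceSchema` entry below recommends, disabling it validates a user-specified schema against the CSV header. A sketch, assuming a two-column file:
+
+```python
+from pyspark.sql.types import IntegerType, StringType, StructField, StructType
+
+schema = StructType([
+    StructField("name", StringType()),
+    StructField("age", IntegerType()),
+])
+
+# With enforceSchema=False, the schema is validated against the file's header
+# instead of being applied blindly by position.
+df = (spark.read.schema(schema)
+      .options(header=True, enforceSchema=False)
+      .csv("path/to/people.csv"))
+```
+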
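+Whitespace trimming is off by default on the read path; a sketch enabling it on both sides of each value:
+
+```python
+# Strip leading and trailing whitespace from values while reading.
+df = spark.read.options(
+    ignoreLeadingWhiteSpace=True,
+    ignoreTrailingWhiteSpace=True,
+).csv("path/to/people.csv")
+```
+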
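+The special-value options map literal strings to null, NaN and the infinities. A sketch with commonly used representations (the exact strings depend on how the file was produced):
+
+```python
+# Treat "NULL" as null, and parse non-numbers and infinities in float columns.
+df = spark.read.options(
+    nullValue="NULL",
+    nanValue="NaN",
+    positiveInf="Inf",
+    negativeInf="-Inf",
+).csv("path/to/metrics.csv")
+```
+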
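+`dateFormat` and `timestampFormat` take patterns from the linked datetime pattern page. A sketch that reads US-style dates into typed columns (the DDL schema string is illustrative):
+
+```python
+# Parse date and timestamp columns with custom patterns.
+df = (spark.read.schema("d DATE, ts TIMESTAMP")
+      .options(dateFormat="MM/dd/yyyy", timestampFormat="MM/dd/yyyy HH:mm:ss")
+      .csv("path/to/events.csv"))
+```
+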
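+In `PERMISSIVE` mode, malformed rows can be kept in a dedicated string column instead of being dropped. A sketch (the column name `_corrupt_record` matches the usual default of `spark.sql.columnNameOfCorruptRecord`):
+
+```python
+# Keep malformed input lines in the _corrupt_record column.
+df = (spark.read.schema("name STRING, age INT, _corrupt_record STRING")
+      .options(mode="PERMISSIVE", columnNameOfCorruptRecord="_corrupt_record")
+      .csv("path/to/people.csv"))
+
+df.where(df._corrupt_record.isNotNull()).show(truncate=False)
+```
+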
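+Finally, `multiLine` lets quoted values contain embedded newlines, so one record may span several physical lines; a sketch:
+
+```python
+# Parse records that span multiple lines (e.g. quoted free-text fields).
+df = spark.read.options(multiLine=True, header=True).csv("path/to/comments.csv")
+```
+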
+<table class="table">
+  <tr><th><b>Property Name</b></th><th><b>Default</b></th><th><b>Meaning</b></th><th><b>Scope</b></th></tr>
+ <tr>
+ <td><code>sep</code></td>
+ <td>None</td>
+    <td>Sets a separator (one or more characters) for each field and value. If None is set, it uses the default value, <code>,</code>.</td>
+ <td>read/write</td>
+ </tr>
+ <tr>
+ <td><code>encoding</code></td>
+ <td>None</td>
+    <td>For reading, decodes the CSV files by the given encoding type. If None is set, it uses the default value, <code>UTF-8</code>. For writing, sets the encoding (charset) of saved csv files. If None is set, the default UTF-8 charset will be used.</td>
+ <td>read/write</td>
+ </tr>
+ <tr>
+ <td><code>quote</code></td>
+ <td>None</td>
+    <td>Sets a single character used for escaping quoted values where the separator can be part of the value. If None is set, it uses the default value, <code>"</code>. If you would like to turn off quotations, you need to set an empty string. If an empty string is set, it uses <code>u0000</code> (null character).</td>
+ <td>read/write</td>
+ </tr>
+ <tr>
+ <td><code>quoteAll</code></td>
+ <td>None</td>
+    <td>A flag indicating whether all values should always be enclosed in quotes. If None is set, it uses the default value <code>false</code>, only quoting values containing a quote character.</td>
+ <td>write</td>
+ </tr>
+ <tr>
+ <td><code>escape</code></td>
+ <td>None</td>
+    <td>Sets a single character used for escaping quotes inside an already quoted value. If None is set, it uses the default value, <code>\</code>.</td>
+ <td>read/write</td>
+ </tr>
+ <tr>
+ <td><code>escapeQuotes</code></td>
+ <td>None</td>
+    <td>A flag indicating whether values containing quotes should always be enclosed in quotes. If None is set, it uses the default value <code>true</code>, escaping all values containing a quote character.</td>
+ <td>write</td>
+ </tr>
+ <tr>
+ <td><code>comment</code></td>
+ <td>None</td>
+    <td>Sets a single character used for skipping lines beginning with this character. By default (None), it is disabled.</td>
+ <td>read</td>
+ </tr>
+ <tr>
+ <td><code>header</code></td>
+ <td>None</td>
+    <td>For reading, uses the first line as names of columns. For writing, writes the names of columns as the first line. If None is set, it uses the default value, <code>false</code>. Note that if the given path is an RDD of Strings, this header option will remove all lines that are the same as the header, if they exist.</td>
+ <td>read/write</td>
+ </tr>
+ <tr>
+ <td><code>inferSchema</code></td>
+ <td>None</td>
+    <td>Infers the input schema automatically from data. It requires one extra pass over the data. If None is set, it uses the default value, <code>false</code>.</td>
+ <td>read</td>
+ </tr>
+ <tr>
+ <td><code>enforceSchema</code></td>
+ <td>None</td>
+    <td>If it is set to <code>true</code>, the specified or inferred schema will be forcibly applied to datasource files, and headers in CSV files will be ignored. If the option is set to <code>false</code>, the schema will be validated against all headers in CSV files or the first header in RDD if the <code>header</code> option is set to <code>true</code>. Field names in the schema and column names in CSV headers are checked by their positions taking into account <code>spark.sql.caseSensitive</code>. If None is set, <code>true</code> is used by default. Though the default value is <code>true</code>, it is recommended to disable the <code>enforceSchema</code> option to avoid incorrect results.</td>
+ <td>read</td>
+ </tr>
+ <tr>
+ <td><code>ignoreLeadingWhiteSpace</code></td>
+ <td>None</td>
+    <td>A flag indicating whether or not leading whitespaces from values being read/written should be skipped. If None is set, it uses the default value, <code>false</code> for reading, and <code>true</code> for writing.</td>
+ <td>read/write</td>
+ </tr>
+ <tr>
+ <td><code>ignoreTrailingWhiteSpace</code></td>
+ <td>None</td>
+    <td>A flag indicating whether or not trailing whitespaces from values being read/written should be skipped. If None is set, it uses the default value, <code>false</code> for reading, and <code>true</code> for writing.</td>
+ <td>read/write</td>
+ </tr>
+ <tr>
+ <td><code>nullValue</code></td>
+ <td>None</td>
+    <td>Sets the string representation of a null value. If None is set, it uses the default value, empty string. Since 2.0.1, this <code>nullValue</code> param applies to all supported types including the string type.</td>
+ <td>read/write</td>
+ </tr>
+ <tr>
+ <td><code>nanValue</code></td>
+ <td>None</td>
+    <td>Sets the string representation of a non-number value. If None is set, it uses the default value, <code>NaN</code>.</td>
+ <td>read</td>
+ </tr>
+ <tr>
+ <td><code>positiveInf</code></td>
+ <td>None</td>
+    <td>Sets the string representation of a positive infinity value. If None is set, it uses the default value, <code>Inf</code>.</td>
+ <td>read</td>
+ </tr>
+ <tr>
+ <td><code>negativeInf</code></td>
+ <td>None</td>
+    <td>Sets the string representation of a negative infinity value. If None is set, it uses the default value, <code>-Inf</code>.</td>
+ <td>read</td>
+ </tr>
+ <tr>
+ <td><code>dateFormat</code></td>
+ <td>None</td>
+    <td>Sets the string that indicates a date format. Custom date formats follow the formats at <a href="https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html">datetime pattern</a>. This applies to date type. If None is set, it uses the default value, <code>yyyy-MM-dd</code>.</td>
+    <td>read/write</td>
+ </tr>
+ <tr>
+ <td><code>timestampFormat</code></td>
+ <td>None</td>
+    <td>Sets the string that indicates a timestamp format. Custom date formats follow the formats at <a href="https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html">datetime pattern</a>. This applies to timestamp type. If None is set, it uses the default value, <code>yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]</code>.</td>
+ <td>read/write</td>
+ </tr>
+ <tr>
+ <td><code>maxColumns</code></td>
+ <td>None</td>
+    <td>Defines a hard limit of how many columns a record can have. If None is set, it uses the default value, <code>20480</code>.</td>
+ <td>read</td>
+ </tr>
+ <tr>
+ <td><code>maxCharsPerColumn</code></td>
+ <td>None</td>
+    <td>Defines the maximum number of characters allowed for any given value being read. If None is set, it uses the default value, <code>-1</code> meaning unlimited length.</td>
+ <td>read</td>
+ </tr>
+ <tr>
+ <td><code>maxMalformedLogPerPartition</code></td>
+ <td>None</td>
+    <td>This parameter is no longer used since Spark 2.2.0. If specified, it is ignored.</td>
+ <td>read</td>
+ </tr>
+ <tr>
+ <td><code>mode</code></td>
+ <td>None</td>
+    <td>Allows a mode for dealing with corrupt records during parsing. If None is set, it uses the default value, <code>PERMISSIVE</code>. Note that Spark tries to parse only required columns in CSV under column pruning. Therefore, corrupt records can be different based on the required set of fields. This behavior can be controlled by <code>spark.sql.csv.parser.columnPruning.enabled</code> (enabled by default).<br>
+    <ul>
+    <li><code>PERMISSIVE</code>: when it meets a corrupted record, puts the malformed string into a field configured by <code>columnNameOfCorruptRecord</code>, and sets malformed fields to <code>null</code>. To keep corrupt records, a user can set a string type field named <code>columnNameOfCorruptRecord</code> in a user-defined schema. If a schema does not have the field, it drops corrupt records during parsing. A record with fewer/more tokens than the schema is not a corrupted record to CSV. When it meets a record having fewer tokens than the length of the schema, it sets <code>null</code> to the extra fields. When the record has more tokens than the length of the schema, it drops extra tokens.</li>
+    <li><code>DROPMALFORMED</code>: ignores whole corrupted records.</li>
+    <li><code>FAILFAST</code>: throws an exception when it meets corrupted records.</li>
+ </ul>
+ </td>
+ <td>read</td>
+ </tr>
+ <tr>
+ <td><code>columnNameOfCorruptRecord</code></td>
+ <td>None</td>
+    <td>Allows renaming the new field that holds the malformed string created by <code>PERMISSIVE</code> mode. This overrides <code>spark.sql.columnNameOfCorruptRecord</code>. If None is set, it uses the value specified in <code>spark.sql.columnNameOfCorruptRecord</code>.</td>
+ <td>read</td>
+ </tr>
+ <tr>
+ <td><code>multiLine</code></td>
+ <td>None</td>
+    <td>Parses one record, which may span multiple lines, per file. If None is set, it uses the default value, <code>false</code>.</td>
+ <td>read</td>
+ </tr>
+ <tr>
+ <td><code>charToEscapeQuoteEscaping</code></td>
+ <td>None</td>
+    <td>Sets a single character used for escaping the escape for the quote character. If None is set, the default value is the escape character when escape and quote characters are different, <code>\0</code> otherwise.</td>
+ <td>read/write</td>
+ </tr>
+ <tr>
+ <td><code>samplingRatio</code></td>
+ <td>None</td>
+    <td>Defines fraction of rows used for schema inferring. If None is set, it uses the default value, <code>1.0</code>.</td>
+ <td>read</td>
+ </tr>
+ <tr>
+ <td><code>emptyValue</code></td>
+ <td>None</td>
+    <td>Sets the string representation of an empty value. If None is set, it uses the default value, <code>""</code>.</td>
+ <td>read/write</td>
+ </tr>
+ <tr>
+ <td><code>locale</code></td>
+ <td>None</td>
+    <td>Sets a locale as language tag in IETF BCP 47 format. If None is set, it uses the default value, <code>en-US</code>. For instance, <code>locale</code> is used while parsing dates and timestamps.</td>
+ <td>read</td>
+ </tr>
+ <tr>
+ <td><code>lineSep</code></td>
+ <td>None</td>
+    <td>Defines the line separator that should be used for parsing. If None is set, it covers all <code>\\r</code>, <code>\\r\\n</code> and <code>\\n</code>. Maximum length is 1 character.</td>
Review comment:
what about writing?