HyukjinKwon commented on a change in pull request #32658:
URL: https://github.com/apache/spark/pull/32658#discussion_r640517394
##########
File path: docs/sql-data-sources-csv.md
##########
@@ -38,3 +36,223 @@ Spark SQL provides `spark.read().csv("file_name")` to read a file or directory o
</div>
</div>
+
+## Data Source Option
+
+Data source options of CSV can be set via:
+* the `.option`/`.options` methods of
+ * `DataFrameReader`
+ * `DataFrameWriter`
+ * `DataStreamReader`
+ * `DataStreamWriter`
+* the built-in functions below
+ * `from_csv`
+ * `to_csv`
+ * `schema_of_csv`
+* `OPTIONS` clause at [CREATE TABLE USING DATA_SOURCE](sql-ref-syntax-ddl-create-table-datasource.html)
+
+
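+For example, the snippet below is a minimal Scala sketch of the approaches above; the `spark` session, the file path, the schema, and the table name are all illustrative:
+
+```scala
+import org.apache.spark.sql.functions.from_csv
+import org.apache.spark.sql.types.{IntegerType, StringType, StructType}
+
+// Assumes an active SparkSession is available as `spark`.
+import spark.implicits._
+
+// 1. The .option/.options methods of DataFrameReader.
+val df = spark.read
+  .option("sep", ";")
+  .options(Map("header" -> "true", "inferSchema" -> "true"))
+  .csv("path/to/people.csv")  // illustrative path
+
+// 2. The from_csv built-in function, passing options as a map.
+val schema = new StructType().add("id", IntegerType).add("name", StringType)
+val parsed = Seq("1;Alice").toDF("value")
+  .select(from_csv($"value", schema, Map("sep" -> ";")).as("csv"))
+
+// 3. The OPTIONS clause of CREATE TABLE USING DATA_SOURCE.
+spark.sql(
+  """CREATE TABLE people
+    |USING csv
+    |OPTIONS (path 'path/to/people.csv', sep ';', header 'true')""".stripMargin)
+```
+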
+<table class="table">
+  <tr><th><b>Property Name</b></th><th><b>Default</b></th><th><b>Meaning</b></th><th><b>Scope</b></th></tr>
+ <tr>
+ <td><code>sep</code></td>
+ <td>,</td>
+    <td>Sets a separator (one or more characters) for each field and value.</td>
+ <td>read/write</td>
+ </tr>
+ <tr>
+ <td><code>encoding</code></td>
+ <td><code>UTF-8</code> for reading, not set for writing</td>
+    <td>For reading, decodes the CSV files by the given encoding type. For writing, specifies the encoding (charset) of saved CSV files.</td>
+ <td>read/write</td>
+ </tr>
+ <tr>
+ <td><code>quote</code></td>
+ <td>"</td>
+    <td>Sets a single character used for escaping quoted values where the separator can be part of the value. For reading, if you would like to turn off quotations, you need to set not <code>null</code> but an empty string. For writing, if an empty string is set, it uses <code>u0000</code> (null character).</td>
+ <td>read/write</td>
+ </tr>
+ <tr>
+ <td><code>quoteAll</code></td>
+ <td>false</td>
+    <td>A flag indicating whether all values should always be enclosed in quotes. It only escapes values containing a quote character by default.</td>
+ <td>write</td>
+ </tr>
+ <tr>
+ <td><code>escape</code></td>
+ <td>\</td>
+    <td>Sets a single character used for escaping quotes inside an already quoted value.</td>
+ <td>read/write</td>
+ </tr>
+ <tr>
+ <td><code>escapeQuotes</code></td>
+ <td>true</td>
+    <td>A flag indicating whether values containing quotes should always be enclosed in quotes. It escapes all values containing a quote character by default.</td>
+ <td>write</td>
+ </tr>
+ <tr>
+ <td><code>comment</code></td>
+ <td>""</td>
+    <td>Sets a single character used for skipping lines beginning with this character.</td>
+ <td>read</td>
+ </tr>
+ <tr>
+ <td><code>header</code></td>
+ <td>false</td>
+    <td>For reading, uses the first line as names of columns. For writing, writes the names of columns as the first line. Note that if the given path is an RDD of Strings, this header option will remove all lines that are the same as the header, if it exists.</td>
+ <td>read/write</td>
+ </tr>
+ <tr>
+ <td><code>inferSchema</code></td>
+ <td>false</td>
+    <td>Infers the input schema automatically from data. It requires one extra pass over the data.</td>
+ <td>read</td>
+ </tr>
+ <tr>
+ <td><code>enforceSchema</code></td>
+ <td>true</td>
+    <td>If it is set to <code>true</code>, the specified or inferred schema will be forcibly applied to datasource files, and headers in CSV files will be ignored. If the option is set to <code>false</code>, the schema will be validated against all headers in CSV files, or the first header in the RDD if the <code>header</code> option is set to <code>true</code>. Field names in the schema and column names in CSV headers are checked by their positions taking into account <code>spark.sql.caseSensitive</code>. Though the default value is <code>true</code>, it is recommended to disable the <code>enforceSchema</code> option to avoid incorrect results.</td>
+ <td>read</td>
+ </tr>
+ <tr>
+ <td><code>ignoreLeadingWhiteSpace</code></td>
+ <td><code>false</code> for reading, <code>true</code> for writing</td>
+    <td>A flag indicating whether or not leading whitespaces from values being read/written should be skipped.</td>
+ <td>read/write</td>
+ </tr>
+ <tr>
+ <td><code>ignoreTrailingWhiteSpace</code></td>
+ <td><code>false</code> for reading, <code>true</code> for writing</td>
+    <td>A flag indicating whether or not trailing whitespaces from values being read/written should be skipped.</td>
+ <td>read/write</td>
+ </tr>
+ <tr>
+ <td><code>nullValue</code></td>
+ <td>""</td>
+    <td>Sets the string representation of a null value. Since 2.0.1, this <code>nullValue</code> param applies to all supported types including the string type.</td>
+ <td>read/write</td>
+ </tr>
+ <tr>
+ <td><code>nanValue</code></td>
+ <td>NaN</td>
+ <td>Sets the string representation of a non-number value.</td>
+ <td>read</td>
+ </tr>
+ <tr>
+ <td><code>positiveInf</code></td>
+ <td>Inf</td>
+ <td>Sets the string representation of a positive infinity value.</td>
+ <td>read</td>
+ </tr>
+ <tr>
+ <td><code>negativeInf</code></td>
+ <td>-Inf</td>
+ <td>Sets the string representation of a negative infinity value.</td>
+ <td>read</td>
+ </tr>
+ <tr>
+ <td><code>dateFormat</code></td>
+ <td>yyyy-MM-dd</td>
+    <td>Sets the string that indicates a date format. Custom date formats follow the formats at <a href="https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html">datetime pattern</a>. This applies to date type.</td>
+ <td>read</td>
+ </tr>
+ <tr>
+ <td><code>timestampFormat</code></td>
+ <td>yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]</td>
+    <td>Sets the string that indicates a timestamp format. Custom date formats follow the formats at <a href="https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html">datetime pattern</a>. This applies to timestamp type.</td>
+ <td>read/write</td>
+ </tr>
+ <tr>
+ <td><code>maxColumns</code></td>
+ <td>20480</td>
+ <td>Defines a hard limit of how many columns a record can have.</td>
+ <td>read</td>
+ </tr>
+ <tr>
+ <td><code>maxCharsPerColumn</code></td>
+ <td>-1</td>
+    <td>Defines the maximum number of characters allowed for any given value being read. The default value <code>-1</code> means unlimited length.</td>
+ <td>read</td>
+ </tr>
+ <tr>
+ <td><code>maxMalformedLogPerPartition</code></td>
+ <td>(none)</td>
+    <td>This parameter is no longer used since Spark 2.2.0. If specified, it is ignored.</td>
+ <td>read</td>
+ </tr>
+ <tr>
+ <td><code>mode</code></td>
+ <td>PERMISSIVE</td>
+    <td>Allows a mode for dealing with corrupt records during parsing. Note that Spark tries to parse only required columns in CSV under column pruning. Therefore, corrupt records can be different based on the required set of fields. This behavior can be controlled by <code>spark.sql.csv.parser.columnPruning.enabled</code> (enabled by default).<br>
+ <ul>
+      <li><code>PERMISSIVE</code>: when it meets a corrupted record, puts the malformed string into a field configured by <code>columnNameOfCorruptRecord</code>, and sets malformed fields to <code>null</code>. To keep corrupt records, a user can set a string type field named <code>columnNameOfCorruptRecord</code> in a user-defined schema. If a schema does not have the field, it drops corrupt records during parsing. A record with fewer or more tokens than the schema is not a corrupted record to CSV. When it meets a record having fewer tokens than the length of the schema, sets <code>null</code> to extra fields. When the record has more tokens than the length of the schema, it drops extra tokens.</li>
+ <li><code>DROPMALFORMED</code>: ignores the whole corrupted records.</li>
+      <li><code>FAILFAST</code>: throws an exception when it meets corrupted records.</li>
+ </ul>
+ </td>
+ <td>read</td>
+ </tr>
+ <tr>
+ <td><code>columnNameOfCorruptRecord</code></td>
+    <td>The value specified in <code>spark.sql.columnNameOfCorruptRecord</code></td>
+    <td>Allows renaming the new field having a malformed string created by <code>PERMISSIVE</code> mode. This overrides <code>spark.sql.columnNameOfCorruptRecord</code>.</td>
+ <td>read</td>
+ </tr>
+ <tr>
+ <td><code>multiLine</code></td>
+ <td>false</td>
+ <td>Parse one record, which may span multiple lines, per file.</td>
+ <td>read</td>
+ </tr>
+ <tr>
+ <td><code>charToEscapeQuoteEscaping</code></td>
+ <td><code>escape</code> or <code>\0</code></td>
+    <td>Sets a single character used for escaping the escape for the quote character. The default value is the escape character when the escape and quote characters are different, <code>\0</code> otherwise.</td>
+ <td>read/write</td>
+ </tr>
+ <tr>
+ <td><code>samplingRatio</code></td>
+ <td>1.0</td>
+    <td>Defines fraction of input CSV rows used for schema inferring.</td>
+ <td>read</td>
+ </tr>
+ <tr>
+ <td><code>emptyValue</code></td>
+ <td>""</td>
+ <td>Sets the string representation of an empty value.</td>
+ <td>read/write</td>
+ </tr>
+ <tr>
+ <td><code>locale</code></td>
+ <td>en-US</td>
+    <td>Sets a locale as language tag in IETF BCP 47 format. For instance, <code>locale</code> is used while parsing dates and timestamps.</td>
+ <td>read</td>
+ </tr>
+ <tr>
+ <td><code>lineSep</code></td>
+    <td><code>\r</code>, <code>\r\n</code>, <code>\n</code> for reading, <code>\n</code> for writing</td>
Review comment:
```suggestion
    <td><code>\r</code>, <code>\r\n</code> and <code>\n</code> (for reading), <code>\n</code> (for writing)</td>
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]