[GitHub] [spark] HyukjinKwon commented on a change in pull request #32204: [SPARK-34494][SQL][DOCS] Move JSON data source options from Python and Scala into a single page

GitBox Tue, 25 May 2021 03:55:05 -0700


HyukjinKwon commented on a change in pull request #32204:
URL: https://github.com/apache/spark/pull/32204#discussion_r638679043




##########
File path: docs/sql-data-sources-json.md
##########
@@ -94,3 +94,168 @@ SELECT * FROM jsonTable
 </div>
 
 </div>
+
+## Data Source Option
+
+Data source options of JSON can be set via:
+* the `.option`/`.options` methods of
+  *  `DataFrameReader` 
+  *  `DataFrameWriter`
+  *  `DataStreamReader` 
+  *  `DataStreamWriter`
+
+<table class="table">
+  <tr><th><b>Property Name</b></th><th><b>Default</b></th><th><b>Meaning</b></th><th><b>Scope</b></th></tr>
+  <tr>
+    <!-- TODO(SPARK-35433): Add timeZone to Data Source Option for CSV, too. -->
+    <td><code>timeZone</code></td>
+    <td>None</td>
+    <td>Sets the string that indicates a time zone ID to be used to format timestamps in the JSON datasources or partition values. The following formats of <code>timeZone</code> are supported:<br>
+    <ul>
+      <li>Region-based zone ID: It should have the form 'area/city', such as 'America/Los_Angeles'.</li>
+      <li>Zone offset: It should be in the format '(+|-)HH:mm', for example '-08:00' or '+01:00'. Also 'UTC' and 'Z' are supported as aliases of '+00:00'.</li>
+    </ul>
+    Other short names like 'CST' are not recommended because they can be ambiguous. If it isn't set, the current value of the SQL config <code>spark.sql.session.timeZone</code> is used by default.
+    </td>
+    <td>read/write</td>
+  </tr>
+  <tr>
+    <td><code>primitivesAsString</code></td>
+    <td>None</td>
+    <td>Infers all primitive values as a string type. If None is set, it uses the default value, <code>false</code>.</td>
+    <td>read</td>
+  </tr>
+  <tr>
+    <td><code>prefersDecimal</code></td>
+    <td>None</td>
+    <td>Infers all floating-point values as a decimal type. If the values do not fit in decimal, then it infers them as doubles. If None is set, it uses the default value, <code>false</code>.</td>
+    <td>read</td>
+  </tr>
+  <tr>
+    <td><code>allowComments</code></td>
+    <td>None</td>
+    <td>Ignores Java/C++ style comments in JSON records. If None is set, it uses the default value, <code>false</code>.</td>
+    <td>read</td>
+  </tr>
+  <tr>
+    <td><code>allowUnquotedFieldNames</code></td>
+    <td>None</td>
+    <td>Allows unquoted JSON field names. If None is set, it uses the default value, <code>false</code>.</td>
+    <td>read</td>
+  </tr>
+  <tr>
+    <td><code>allowSingleQuotes</code></td>
+    <td>None</td>
+    <td>Allows single quotes in addition to double quotes. If None is set, it uses the default value, <code>true</code>.</td>
+    <td>read</td>
+  </tr>
+  <tr>
+    <td><code>allowNumericLeadingZero</code></td>
+    <td>None</td>
+    <td>Allows leading zeros in numbers (e.g. 00012). If None is set, it uses the default value, <code>false</code>.</td>
+    <td>read</td>
+  </tr>
+  <tr>
+    <td><code>allowBackslashEscapingAnyCharacter</code></td>
+    <td>None</td>
+    <td>Allows accepting quoting of all characters using the backslash quoting mechanism. If None is set, it uses the default value, <code>false</code>.</td>
+    <td>read</td>
+  </tr>
+  <tr>
+    <td><code>mode</code></td>
+    <td>None</td>
+    <td>Allows a mode for dealing with corrupt records during parsing. If None is set, it uses the default value, <code>PERMISSIVE</code>.<br>
+    <ul>
+      <li><code>PERMISSIVE</code>: when it meets a corrupted record, puts the malformed string into a field configured by <code>columnNameOfCorruptRecord</code>, and sets malformed fields to <code>null</code>. To keep corrupt records, a user can set a string type field named <code>columnNameOfCorruptRecord</code> in a user-defined schema. If a schema does not have the field, it drops corrupt records during parsing. When inferring a schema, it implicitly adds a <code>columnNameOfCorruptRecord</code> field in an output schema.</li>
+      <li><code>DROPMALFORMED</code>: ignores the whole corrupted records.</li>
+      <li><code>FAILFAST</code>: throws an exception when it meets corrupted records.</li>
+    </ul>
+    </td>
+    <td>read</td>
+  </tr>
+  <tr>
+    <td><code>columnNameOfCorruptRecord</code></td>
+    <td>None</td>
+    <td>Allows renaming the new field that holds the malformed string created by <code>PERMISSIVE</code> mode. This overrides <code>spark.sql.columnNameOfCorruptRecord</code>. If None is set, it uses the value specified in <code>spark.sql.columnNameOfCorruptRecord</code>.</td>
+    <td>read</td>
+  </tr>
+  <tr>
+    <td><code>dateFormat</code></td>
+    <td>None</td>
+    <td>Sets the string that indicates a date format. Custom date formats follow the formats at <a href="https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html">datetime pattern</a>. This applies to date type. If None is set, it uses the default value, <code>yyyy-MM-dd</code>.</td>
+    <td>read/write</td>
+  </tr>
+  <tr>
+    <td><code>timestampFormat</code></td>
+    <td>None</td>
+    <td>Sets the string that indicates a timestamp format. Custom date formats follow the formats at <a href="https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html">datetime pattern</a>. This applies to timestamp type. If None is set, it uses the default value, <code>yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]</code>.</td>
+    <td>read/write</td>
+  </tr>
+  <tr>
+    <td><code>multiLine</code></td>
+    <td>None</td>
+    <td>Parse one record, which may span multiple lines, per file. If None is set, it uses the default value, <code>false</code>.</td>
+    <td>read</td>
+  </tr>
+  <tr>
+    <td><code>allowUnquotedControlChars</code></td>
+    <td>None</td>
+    <td>Allows JSON Strings to contain unquoted control characters (ASCII characters with value less than 32, including tab and line feed characters) or not.</td>
+    <td>read</td>
+  </tr>
+  <tr>
+    <td><code>encoding</code></td>
+    <td>None</td>
+    <td>For reading, allows forcibly setting one of the standard basic or extended encodings for the JSON files, for example UTF-16BE, UTF-32LE. If None is set, the encoding of input JSON will be detected automatically when the multiLine option is set to <code>true</code>. For writing, specifies the encoding (charset) of saved JSON files. If None is set, the default UTF-8 charset will be used.</td>

Review comment:
       Also, please change `None` in these docs to something else. `None` 
applies only to the Python side.
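
       To illustrate why `None` is Python-specific: in PySpark, the JSON 
reader's keyword arguments default to `None`, which means "not set", and the 
defaults listed in the table then take effect. The following is a minimal, 
Spark-free sketch of that fallback. `DOCUMENTED_DEFAULTS` and 
`effective_options` are hypothetical names used only for illustration, not 
Spark APIs, and the default values are copied from the table in the diff.

```python
# Hypothetical illustration only; this is not Spark's implementation.
# Default values are copied from the option table in the diff above.
DOCUMENTED_DEFAULTS = {
    "primitivesAsString": "false",
    "prefersDecimal": "false",
    "allowComments": "false",
    "allowSingleQuotes": "true",
    "mode": "PERMISSIVE",
    "dateFormat": "yyyy-MM-dd",
}


def effective_options(**user_options):
    """Return the options in effect after ignoring unset (None) values."""
    # None means "the caller did not set this option", so it is skipped.
    explicit = {k: str(v) for k, v in user_options.items() if v is not None}
    # Explicit values win; anything left unset falls back to the documented default.
    return {**DOCUMENTED_DEFAULTS, **explicit}


opts = effective_options(mode="FAILFAST", allowComments=None)
print(opts["mode"])           # FAILFAST
print(opts["allowComments"])  # false
```

Scala and Java callers have no `None` to pass to `.option`, which is why a 
language-neutral wording for the Default column may read better.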




