[GitHub] [spark] HyukjinKwon commented on a change in pull request #32204: [SPARK-34494][SQL][DOCS] Move JSON data source options from Python and Scala into a single page

GitBox Mon, 17 May 2021 11:07:03 -0700


HyukjinKwon commented on a change in pull request #32204:
URL: https://github.com/apache/spark/pull/32204#discussion_r633025267




##########
File path: docs/sql-data-sources-json.md
##########
@@ -94,3 +94,146 @@ SELECT * FROM jsonTable
 </div>
 
 </div>
+
+## Data Source Option
+
+Data source options of JSON can be set via:
+* the `.option`/`.options` methods of `DataFrameReader` or `DataFrameWriter`
+* the `.option`/`.options` methods of `DataStreamReader` or `DataStreamWriter`
+
+<table class="table">
+  <tr><th><b>Property 
Name</b></th><th><b>Default</b></th><th><b>Meaning</b></th><th><b>Scope</b></th></tr>
+  <tr>
+    <td><code>primitivesAsString</code></td>
+    <td>None</td>
+    <td>infers all primitive values as a string type. If None is set, it uses 
the default value, <code>false</code>.</td>
+    <td>read</td>
+  </tr>
+  <tr>
+    <td><code>prefersDecimal</code></td>
+    <td>None</td>
+    <td>infers all floating-point values as a decimal type. If the values do 
not fit in decimal, then it infers them as doubles. If None is set, it uses the 
default value, <code>false</code>.</td>
+    <td>read</td>
+  </tr>
+  <tr>
+    <td><code>allowComments</code></td>
+    <td>None</td>
+    <td>ignores Java/C++ style comment in JSON records. If None is set, it 
uses the default value, <code>false</code></td>
+    <td>read</td>
+  </tr>
+  <tr>
+    <td><code>allowUnquotedFieldNames</code></td>
+    <td>None</td>
+    <td>allows unquoted JSON field names. If None is set, it uses the default 
value, <code>false</code>.</td>
+    <td>read</td>
+  </tr>
+  <tr>
+    <td><code>allowSingleQuotes</code></td>
+    <td>None</td>
+    <td>allows single quotes in addition to double quotes. If None is set, it 
uses the default value, <code>true</code>.</td>
+    <td>read</td>
+  </tr>
+  <tr>
+    <td><code>allowNumericLeadingZero</code></td>
+    <td>None</td>
+    <td>allows leading zeros in numbers (e.g. 00012). If None is set, it uses 
the default value, <code>false</code>.</td>
+    <td>read</td>
+  </tr>
+  <tr>
+    <td><code>allowBackslashEscapingAnyCharacter</code></td>
+    <td>None</td>
+    <td>allows accepting quoting of all character using backslash quoting 
mechanism. If None is set, it uses the default value, <code>false</code>.</td>
+    <td>read</td>
+  </tr>
+  <tr>
+    <td><code>columnNameOfCorruptRecord</code></td>
+    <td>None</td>
+    <td>allows renaming the new field having malformed string created by 
<code>PERMISSIVE</code> mode. This overrides 
spark.sql.columnNameOfCorruptRecord. If None is set, it uses the value 
specified in <code>spark.sql.columnNameOfCorruptRecord</code>.</td>
+    <td>read</td>
+  </tr>
+  <tr>
+    <td><code>dateFormat</code></td>
+    <td>None</td>
+    <td>sets the string that indicates a date format. Custom date formats follow the formats at <a href="https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html"> datetime pattern</a>. This applies to date type. If None is set, it uses the default value, <code>yyyy-MM-dd</code>.</td>
+    <td>read/write</td>
+  </tr>
+  <tr>
+    <td><code>timestampFormat</code></td>
+    <td>None</td>
+    <td>sets the string that indicates a timestamp format. Custom date formats follow the formats at <a href="https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html"> datetime pattern</a>. This applies to timestamp type. If None is set, it uses the default value, <code>yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]</code>.</td>
+    <td>read/write</td>
+  </tr>
+  <tr>
+    <td><code>multiLine</code></td>
+    <td>None</td>
+    <td>parse one record, which may span multiple lines, per file. If None is 
set, it uses the default value, <code>false</code>.</td>
+    <td>read</td>
+  </tr>
+  <tr>
+    <td><code>allowUnquotedControlChars</code></td>
+    <td>None</td>
+    <td>allows JSON Strings to contain unquoted control characters (ASCII 
characters with value less than 32, including tab and line feed characters) or 
not.</td>
+    <td>read</td>
+  </tr>
+  <tr>
+    <td><code>encoding</code></td>
+    <td>None</td>
+    <td>allows to forcibly set one of standard basic or extended encoding for 
the JSON files. For example UTF-16BE, UTF-32LE. If None is set, the encoding of 
input JSON will be detected automatically when the multiLine option is set to 
<code>true</code>.</td>
+    <td>read</td>

Review comment:
       The scope should be `read/write`.

##########
File path: docs/sql-data-sources-json.md
##########
@@ -94,3 +94,146 @@ SELECT * FROM jsonTable
 </div>
 
 </div>
+
+## Data Source Option
+
+Data source options of JSON can be set via:
+* the `.option`/`.options` methods of `DataFrameReader` or `DataFrameWriter`
+* the `.option`/`.options` methods of `DataStreamReader` or `DataStreamWriter`
+
+<table class="table">
+  <tr><th><b>Property 
Name</b></th><th><b>Default</b></th><th><b>Meaning</b></th><th><b>Scope</b></th></tr>
+  <tr>
+    <td><code>primitivesAsString</code></td>
+    <td>None</td>
+    <td>infers all primitive values as a string type. If None is set, it uses 
the default value, <code>false</code>.</td>
+    <td>read</td>
+  </tr>
+  <tr>
+    <td><code>prefersDecimal</code></td>
+    <td>None</td>
+    <td>infers all floating-point values as a decimal type. If the values do 
not fit in decimal, then it infers them as doubles. If None is set, it uses the 
default value, <code>false</code>.</td>
+    <td>read</td>
+  </tr>
+  <tr>
+    <td><code>allowComments</code></td>
+    <td>None</td>
+    <td>ignores Java/C++ style comment in JSON records. If None is set, it 
uses the default value, <code>false</code></td>
+    <td>read</td>
+  </tr>
+  <tr>
+    <td><code>allowUnquotedFieldNames</code></td>
+    <td>None</td>
+    <td>allows unquoted JSON field names. If None is set, it uses the default 
value, <code>false</code>.</td>
+    <td>read</td>
+  </tr>
+  <tr>
+    <td><code>allowSingleQuotes</code></td>
+    <td>None</td>
+    <td>allows single quotes in addition to double quotes. If None is set, it 
uses the default value, <code>true</code>.</td>
+    <td>read</td>
+  </tr>
+  <tr>
+    <td><code>allowNumericLeadingZero</code></td>
+    <td>None</td>
+    <td>allows leading zeros in numbers (e.g. 00012). If None is set, it uses 
the default value, <code>false</code>.</td>
+    <td>read</td>
+  </tr>
+  <tr>
+    <td><code>allowBackslashEscapingAnyCharacter</code></td>
+    <td>None</td>
+    <td>allows accepting quoting of all character using backslash quoting 
mechanism. If None is set, it uses the default value, <code>false</code>.</td>
+    <td>read</td>
+  </tr>
+  <tr>
+    <td><code>columnNameOfCorruptRecord</code></td>
+    <td>None</td>
+    <td>allows renaming the new field having malformed string created by 
<code>PERMISSIVE</code> mode. This overrides 
spark.sql.columnNameOfCorruptRecord. If None is set, it uses the value 
specified in <code>spark.sql.columnNameOfCorruptRecord</code>.</td>
+    <td>read</td>
+  </tr>
+  <tr>
+    <td><code>dateFormat</code></td>
+    <td>None</td>
+    <td>sets the string that indicates a date format. Custom date formats follow the formats at <a href="https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html"> datetime pattern</a>. This applies to date type. If None is set, it uses the default value, <code>yyyy-MM-dd</code>.</td>
+    <td>read/write</td>
+  </tr>
+  <tr>
+    <td><code>timestampFormat</code></td>
+    <td>None</td>
+    <td>sets the string that indicates a timestamp format. Custom date formats follow the formats at <a href="https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html"> datetime pattern</a>. This applies to timestamp type. If None is set, it uses the default value, <code>yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]</code>.</td>
+    <td>read/write</td>
+  </tr>
+  <tr>
+    <td><code>multiLine</code></td>
+    <td>None</td>
+    <td>parse one record, which may span multiple lines, per file. If None is 
set, it uses the default value, <code>false</code>.</td>
+    <td>read</td>
+  </tr>
+  <tr>
+    <td><code>allowUnquotedControlChars</code></td>
+    <td>None</td>
+    <td>allows JSON Strings to contain unquoted control characters (ASCII 
characters with value less than 32, including tab and line feed characters) or 
not.</td>
+    <td>read</td>
+  </tr>
+  <tr>
+    <td><code>encoding</code></td>
+    <td>None</td>
+    <td>allows to forcibly set one of standard basic or extended encoding for 
the JSON files. For example UTF-16BE, UTF-32LE. If None is set, the encoding of 
input JSON will be detected automatically when the multiLine option is set to 
<code>true</code>.</td>
+    <td>read</td>
+  </tr>
+  <tr>
+    <td><code>lineSep</code></td>
+    <td>None</td>
+    <td>defines the line separator that should be used for parsing. If None is 
set, it covers all <code>\r</code>, <code>\r\n</code> and <code>\n</code>.</td>
+    <td>read</td>

Review comment:
       The scope should be `read/write`.
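
       To illustrate why the scope covers both directions: `encoding` and `lineSep` can be passed to the JSON writer as well as the reader. The following is a minimal sketch only, assuming an existing `SparkSession` and `DataFrame` are passed in; the helper names, default values, and paths are hypothetical and not part of this pull request.

```python
def read_json_with_encoding(spark, path, encoding="UTF-16BE", line_sep="\n"):
    """Read JSON files using an explicit charset and line separator.

    `spark` is assumed to be an existing SparkSession (hypothetical helper).
    """
    return (
        spark.read
        .option("encoding", encoding)   # read-side use of `encoding`
        .option("lineSep", line_sep)    # read-side use of `lineSep`
        .json(path)
    )


def write_json_with_encoding(df, path, encoding="UTF-16BE", line_sep="\n"):
    """Write a DataFrame as JSON using the same two options on the writer."""
    (
        df.write
        .option("encoding", encoding)   # write-side use of `encoding`
        .option("lineSep", line_sep)    # write-side use of `lineSep`
        .json(path)
    )
```

       Both helpers set the same option names, which is why a single row with scope `read/write` should be enough for each option.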

##########
File path: docs/sql-data-sources-json.md
##########
@@ -94,3 +94,146 @@ SELECT * FROM jsonTable
 </div>
 
 </div>
+
+## Data Source Option
+
+Data source options of JSON can be set via:
+* the `.option`/`.options` methods of `DataFrameReader` or `DataFrameWriter`
+* the `.option`/`.options` methods of `DataStreamReader` or `DataStreamWriter`
+
+<table class="table">
+  <tr><th><b>Property 
Name</b></th><th><b>Default</b></th><th><b>Meaning</b></th><th><b>Scope</b></th></tr>
+  <tr>
+    <td><code>primitivesAsString</code></td>
+    <td>None</td>
+    <td>infers all primitive values as a string type. If None is set, it uses 
the default value, <code>false</code>.</td>

Review comment:
       Capitalize

##########
File path: docs/sql-data-sources-json.md
##########
@@ -94,3 +94,146 @@ SELECT * FROM jsonTable
 </div>
 
 </div>
+
+## Data Source Option
+
+Data source options of JSON can be set via:
+* the `.option`/`.options` methods of `DataFrameReader` or `DataFrameWriter`
+* the `.option`/`.options` methods of `DataStreamReader` or `DataStreamWriter`
+
+<table class="table">
+  <tr><th><b>Property 
Name</b></th><th><b>Default</b></th><th><b>Meaning</b></th><th><b>Scope</b></th></tr>
+  <tr>
+    <td><code>primitivesAsString</code></td>
+    <td>None</td>
+    <td>infers all primitive values as a string type. If None is set, it uses 
the default value, <code>false</code>.</td>

Review comment:
       Capitalize it

##########
File path: docs/sql-data-sources-json.md
##########
@@ -94,3 +94,146 @@ SELECT * FROM jsonTable
 </div>
 
 </div>
+
+## Data Source Option
+
+Data source options of JSON can be set via:
+* the `.option`/`.options` methods of `DataFrameReader` or `DataFrameWriter`
+* the `.option`/`.options` methods of `DataStreamReader` or `DataStreamWriter`
+
+<table class="table">
+  <tr><th><b>Property 
Name</b></th><th><b>Default</b></th><th><b>Meaning</b></th><th><b>Scope</b></th></tr>
+  <tr>
+    <td><code>primitivesAsString</code></td>
+    <td>None</td>
+    <td>infers all primitive values as a string type. If None is set, it uses 
the default value, <code>false</code>.</td>
+    <td>read</td>
+  </tr>
+  <tr>
+    <td><code>prefersDecimal</code></td>
+    <td>None</td>
+    <td>infers all floating-point values as a decimal type. If the values do 
not fit in decimal, then it infers them as doubles. If None is set, it uses the 
default value, <code>false</code>.</td>
+    <td>read</td>
+  </tr>
+  <tr>
+    <td><code>allowComments</code></td>
+    <td>None</td>
+    <td>ignores Java/C++ style comment in JSON records. If None is set, it 
uses the default value, <code>false</code></td>
+    <td>read</td>
+  </tr>
+  <tr>
+    <td><code>allowUnquotedFieldNames</code></td>
+    <td>None</td>
+    <td>allows unquoted JSON field names. If None is set, it uses the default 
value, <code>false</code>.</td>
+    <td>read</td>
+  </tr>
+  <tr>
+    <td><code>allowSingleQuotes</code></td>
+    <td>None</td>
+    <td>allows single quotes in addition to double quotes. If None is set, it 
uses the default value, <code>true</code>.</td>
+    <td>read</td>
+  </tr>
+  <tr>
+    <td><code>allowNumericLeadingZero</code></td>
+    <td>None</td>
+    <td>allows leading zeros in numbers (e.g. 00012). If None is set, it uses 
the default value, <code>false</code>.</td>
+    <td>read</td>
+  </tr>
+  <tr>
+    <td><code>allowBackslashEscapingAnyCharacter</code></td>
+    <td>None</td>
+    <td>allows accepting quoting of all character using backslash quoting 
mechanism. If None is set, it uses the default value, <code>false</code>.</td>
+    <td>read</td>
+  </tr>
+  <tr>
+    <td><code>columnNameOfCorruptRecord</code></td>
+    <td>None</td>
+    <td>allows renaming the new field having malformed string created by 
<code>PERMISSIVE</code> mode. This overrides 
spark.sql.columnNameOfCorruptRecord. If None is set, it uses the value 
specified in <code>spark.sql.columnNameOfCorruptRecord</code>.</td>
+    <td>read</td>
+  </tr>
+  <tr>
+    <td><code>dateFormat</code></td>
+    <td>None</td>
+    <td>sets the string that indicates a date format. Custom date formats follow the formats at <a href="https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html"> datetime pattern</a>. This applies to date type. If None is set, it uses the default value, <code>yyyy-MM-dd</code>.</td>
+    <td>read/write</td>
+  </tr>
+  <tr>
+    <td><code>timestampFormat</code></td>
+    <td>None</td>
+    <td>sets the string that indicates a timestamp format. Custom date formats follow the formats at <a href="https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html"> datetime pattern</a>. This applies to timestamp type. If None is set, it uses the default value, <code>yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]</code>.</td>
+    <td>read/write</td>
+  </tr>
+  <tr>
+    <td><code>multiLine</code></td>
+    <td>None</td>
+    <td>parse one record, which may span multiple lines, per file. If None is 
set, it uses the default value, <code>false</code>.</td>
+    <td>read</td>
+  </tr>
+  <tr>
+    <td><code>allowUnquotedControlChars</code></td>
+    <td>None</td>
+    <td>allows JSON Strings to contain unquoted control characters (ASCII 
characters with value less than 32, including tab and line feed characters) or 
not.</td>
+    <td>read</td>
+  </tr>
+  <tr>
+    <td><code>encoding</code></td>
+    <td>None</td>
+    <td>allows to forcibly set one of standard basic or extended encoding for 
the JSON files. For example UTF-16BE, UTF-32LE. If None is set, the encoding of 
input JSON will be detected automatically when the multiLine option is set to 
<code>true</code>.</td>
+    <td>read</td>
+  </tr>
+  <tr>
+    <td><code>lineSep</code></td>
+    <td>None</td>
+    <td>defines the line separator that should be used for parsing. If None is 
set, it covers all <code>\r</code>, <code>\r\n</code> and <code>\n</code>.</td>
+    <td>read</td>
+  </tr>
+  <tr>
+    <td><code>samplingRatio</code></td>
+    <td>None</td>
+    <td>defines fraction of input JSON objects used for schema inferring. If 
None is set, it uses the default value, <code>1.0</code>.</td>
+    <td>read</td>
+  </tr>
+  <tr>
+    <td><code>dropFieldIfAllNull</code></td>
+    <td>None</td>
+    <td>whether to ignore column of all null values or empty array/struct 
during schema inference. If None is set, it uses the default value, 
<code>false</code>.</td>
+    <td>read</td>
+  </tr>
+  <tr>
+    <td><code>locale</code></td>
+    <td>None</td>
+    <td>sets a locale as language tag in IETF BCP 47 format. If None is set, 
it uses the default value, <code>en-US</code>. For instance, 
<code>locale</code> is used while parsing dates and timestamps.</td>
+    <td>read</td>
+  </tr>
+  <tr>
+    <td><code>allowNonNumericNumbers</code></td>
+    <td>None</td>
+    <td>allows JSON parser to recognize set of “Not-a-Number” (NaN) tokens as 
legal floating number values. If None is set, it uses the default value, 
<code>true</code>.<br>
+    <ul>
+      <li><code>+INF</code>: for positive infinity, as well as alias of 
<code>+Infinity</code> and <code>Infinity</code>.</li>
+      <li><code>-INF</code>: for negative infinity, alias 
<code>-Infinity</code>.</li>
+      <li><code>NaN</code>: for other not-a-numbers, like result of division 
by zero.</li>
+    </ul>
+    </td>
+    <td>read</td>
+  </tr>
+  <tr>
+    <td><code>compression</code></td>
+    <td>None</td>
+    <td>compression codec to use when saving to file. This can be one of the 
known case-insensitive shorten names (none, bzip2, gzip, lz4, snappy and 
deflate).</td>
+    <td>write</td>
+  </tr>
+  <tr>
+    <td><code>encoding</code></td>

Review comment:
       Can you combine this with the `encoding` option above?

##########
File path: python/pyspark/sql/readwriter.py
##########
@@ -236,33 +236,9 @@ def json(self, path, schema=None, primitivesAsString=None, 
prefersDecimal=None,
         schema : :class:`pyspark.sql.types.StructType` or str, optional
             an optional :class:`pyspark.sql.types.StructType` for the input 
schema or
             a DDL-formatted string (For example ``col0 INT, col1 DOUBLE``).
-        primitivesAsString : str or bool, optional
-            infers all primitive values as a string type. If None is set,
-            it uses the default value, ``false``.
-        prefersDecimal : str or bool, optional
-            infers all floating-point values as a decimal type. If the values
-            do not fit in decimal, then it infers them as doubles. If None is
-            set, it uses the default value, ``false``.
-        allowComments : str or bool, optional
-            ignores Java/C++ style comment in JSON records. If None is set,
-            it uses the default value, ``false``.
-        allowUnquotedFieldNames : str or bool, optional
-            allows unquoted JSON field names. If None is set,
-            it uses the default value, ``false``.
-        allowSingleQuotes : str or bool, optional
-            allows single quotes in addition to double quotes. If None is
-            set, it uses the default value, ``true``.
-        allowNumericLeadingZero : str or bool, optional
-            allows leading zeros in numbers (e.g. 00012). If None is
-            set, it uses the default value, ``false``.
-        allowBackslashEscapingAnyCharacter : str or bool, optional
-            allows accepting quoting of all character
-            using backslash quoting mechanism. If None is
-            set, it uses the default value, ``false``.
         mode : str, optional

Review comment:
       This mode is an option. `mode` in write is not an option.
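
       The distinction can be sketched as follows. On the read path, `mode` is a data source option (for example `PERMISSIVE`) set through `.option`; on the write path, the save mode is set through `DataFrameWriter.mode` and is not a data source option. This is a minimal sketch, assuming an existing `SparkSession` and `DataFrame` are passed in; the helper names and default values are hypothetical.

```python
def read_json_with_parse_mode(spark, path, parse_mode="PERMISSIVE"):
    """Read JSON, passing the parse mode as a data source option."""
    # On the reader, "mode" is an ordinary option key.
    return spark.read.option("mode", parse_mode).json(path)


def write_json_with_save_mode(df, path, save_mode="overwrite"):
    """Write JSON, choosing the save mode through the writer's own method."""
    # On the writer, the save mode goes through .mode(), not .option().
    df.write.mode(save_mode).json(path)
```

       Only the reader-side `mode` would therefore belong in the data source option table.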

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/functions.scala
##########
@@ -4131,6 +4131,9 @@ object functions {
    * @param schema the schema to use when parsing the json string
    * @param options options to control how the json is parsed. Accepts the same options as the
    *                json data source.
+   *                See
+   *                <a href="http://127.0.0.1:4000/sql-data-sources-json.html#data-source-option">

Review comment:
       `http://127.0.0.1:4000` seems weird






