[GitHub] [spark] HyukjinKwon commented on a change in pull request #32204: [SPARK-34494][SQL][DOCS] Move JSON data source options from Python and Scala into a single page

GitBox Thu, 20 May 2021 18:07:45 -0700


HyukjinKwon commented on a change in pull request #32204:
URL: https://github.com/apache/spark/pull/32204#discussion_r636568884




##########
File path: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
##########
@@ -441,81 +390,13 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
    * This function goes through the input once to determine the input schema. If you know the
    * schema in advance, use the version that specifies the schema to avoid the extra scan.
    *
-   * You can set the following JSON-specific options to deal with non-standard JSON files:
-   * <ul>
-   * <li>`primitivesAsString` (default `false`): infers all primitive values as a string type</li>
-   * <li>`prefersDecimal` (default `false`): infers all floating-point values as a decimal
-   * type. If the values do not fit in decimal, then it infers them as doubles.</li>
-   * <li>`allowComments` (default `false`): ignores Java/C++ style comment in JSON records</li>
-   * <li>`allowUnquotedFieldNames` (default `false`): allows unquoted JSON field names</li>
-   * <li>`allowSingleQuotes` (default `true`): allows single quotes in addition to double quotes
-   * </li>
-   * <li>`allowNumericLeadingZeros` (default `false`): allows leading zeros in numbers
-   * (e.g. 00012)</li>
-   * <li>`allowBackslashEscapingAnyCharacter` (default `false`): allows accepting quoting of all
-   * character using backslash quoting mechanism</li>
-   * <li>`allowUnquotedControlChars` (default `false`): allows JSON Strings to contain unquoted
-   * control characters (ASCII characters with value less than 32, including tab and line feed
-   * characters) or not.</li>
-   * <li>`mode` (default `PERMISSIVE`): allows a mode for dealing with corrupt records
-   * during parsing.
-   *   <ul>
-   *     <li>`PERMISSIVE` : when it meets a corrupted record, puts the malformed string into a
-   *     field configured by `columnNameOfCorruptRecord`, and sets malformed fields to `null`. To
-   *     keep corrupt records, an user can set a string type field named
-   *     `columnNameOfCorruptRecord` in an user-defined schema. If a schema does not have the
-   *     field, it drops corrupt records during parsing. When inferring a schema, it implicitly
-   *     adds a `columnNameOfCorruptRecord` field in an output schema.</li>
-   *     <li>`DROPMALFORMED` : ignores the whole corrupted records.</li>
-   *     <li>`FAILFAST` : throws an exception when it meets corrupted records.</li>
-   *   </ul>
-   * </li>
-   * <li>`columnNameOfCorruptRecord` (default is the value specified in
-   * `spark.sql.columnNameOfCorruptRecord`): allows renaming the new field having malformed string
-   * created by `PERMISSIVE` mode. This overrides `spark.sql.columnNameOfCorruptRecord`.</li>
-   * <li>`dateFormat` (default `yyyy-MM-dd`): sets the string that indicates a date format.
-   * Custom date formats follow the formats at
-   * <a href="https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html">
-   *   Datetime Patterns</a>.
-   * This applies to date type.</li>
-   * <li>`timestampFormat` (default `yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]`): sets the string that
-   * indicates a timestamp format. Custom date formats follow the formats at
-   * <a href="https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html">
-   *   Datetime Patterns</a>.
-   * This applies to timestamp type.</li>
-   * <li>`multiLine` (default `false`): parse one record, which may span multiple lines,
-   * per file</li>
-   * <li>`encoding` (by default it is not set): allows to forcibly set one of standard basic
-   * or extended encoding for the JSON files. For example UTF-16BE, UTF-32LE. If the encoding
-   * is not specified and `multiLine` is set to `true`, it will be detected automatically.</li>
-   * <li>`lineSep` (default covers all `\r`, `\r\n` and `\n`): defines the line separator
-   * that should be used for parsing.</li>
-   * <li>`samplingRatio` (default is 1.0): defines fraction of input JSON objects used
-   * for schema inferring.</li>
-   * <li>`dropFieldIfAllNull` (default `false`): whether to ignore column of all null values or
-   * empty array/struct during schema inference.</li>
-   * <li>`locale` (default is `en-US`): sets a locale as language tag in IETF BCP 47 format.
-   * For instance, this is used while parsing dates and timestamps.</li>
-   * <li>`pathGlobFilter`: an optional glob pattern to only include files with paths matching
-   * the pattern. The syntax follows <code>org.apache.hadoop.fs.GlobFilter</code>.
-   * It does not change the behavior of partition discovery.</li>
-   * <li>`modifiedBefore` (batch only): an optional timestamp to only include files with
-   * modification times  occurring before the specified Time. The provided timestamp
-   * must be in the following form: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00)</li>
-   * <li>`modifiedAfter` (batch only): an optional timestamp to only include files with
-   * modification times occurring after the specified Time. The provided timestamp
-   * must be in the following form: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00)</li>
-   * <li>`recursiveFileLookup`: recursively scan a directory for files. Using this option
-   * disables partition discovery</li>
-   * <li>`allowNonNumericNumbers` (default `true`): allows JSON parser to recognize set of
-   * "Not-a-Number" (NaN) tokens as legal floating number values:
-   *   <ul>
-   *     <li>`+INF` for positive infinity, as well as alias of `+Infinity` and `Infinity`.
-   *     <li>`-INF` for negative infinity), alias `-Infinity`.
-   *     <li>`NaN` for other not-a-numbers, like result of division by zero.
-   *   </ul>
-   * </li>
-   * </ul>
+   * You can find the JSON-specific options for reading JSON files in
+   * <a href="https://spark.apache.org/docs/latest/sql-data-sources-json.html#data-source-option">
+   *   Data Source Option</a> in the version you use.
+   * More general options can be found in
+   * <a href=
+   *   "https://spark.apache.org/docs/latest/sql-data-sources-generic-options.html">
+   *   Generic Files Source Options</a> in the version you use.

Review comment:
       Shall we remove this too?
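
For readers who have not used the `mode` option documented in the removed Scaladoc, the three values can be sketched in plain Python. This is only an illustration of the behaviour the text describes, not Spark code: `parse_records` is a hypothetical helper, and the `_corrupt_record` column name is assumed here to stand in for whatever `columnNameOfCorruptRecord` resolves to.

```python
import json


def parse_records(lines, mode="PERMISSIVE", corrupt_column="_corrupt_record"):
    """Hypothetical stand-in for the three parse modes; not Spark's parser."""
    if mode not in ("PERMISSIVE", "DROPMALFORMED", "FAILFAST"):
        raise ValueError("unknown mode: " + mode)
    rows = []
    for line in lines:
        try:
            row = json.loads(line)
            if not isinstance(row, dict):
                raise ValueError("expected a JSON object")
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            if mode == "FAILFAST":
                raise  # stop at the first corrupt record
            if mode == "DROPMALFORMED":
                continue  # skip the whole corrupt record
            # PERMISSIVE: keep the raw malformed text in the corrupt-record column
            row = {corrupt_column: line}
        rows.append(row)
    return rows


sample = ['{"a": 1}', '{"a": ', '{"a": 3}']
permissive = parse_records(sample)
dropped = parse_records(sample, mode="DROPMALFORMED")
```

With the sample above, the permissive result keeps three rows, the second holding only the raw malformed string, while the dropped result keeps two rows.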

##########
File path: python/pyspark/sql/streaming.py
##########
@@ -507,102 +479,15 @@ def json(self, path, schema=None, primitivesAsString=None, prefersDecimal=None,
         schema : :class:`pyspark.sql.types.StructType` or str, optional
             an optional :class:`pyspark.sql.types.StructType` for the input schema
             or a DDL-formatted string (For example ``col0 INT, col1 DOUBLE``).
-        primitivesAsString : str or bool, optional
-            infers all primitive values as a string type. If None is set,
-            it uses the default value, ``false``.
-        prefersDecimal : str or bool, optional
-            infers all floating-point values as a decimal type. If the values
-            do not fit in decimal, then it infers them as doubles. If None is
-            set, it uses the default value, ``false``.
-        allowComments : str or bool, optional
-            ignores Java/C++ style comment in JSON records. If None is set,
-            it uses the default value, ``false``.
-        allowUnquotedFieldNames : str or bool, optional
-            allows unquoted JSON field names. If None is set,
-            it uses the default value, ``false``.
-        allowSingleQuotes : str or bool, optional
-            allows single quotes in addition to double quotes. If None is
-            set, it uses the default value, ``true``.
-        allowNumericLeadingZero : str or bool, optional
-            allows leading zeros in numbers (e.g. 00012). If None is
-            set, it uses the default value, ``false``.
-        allowBackslashEscapingAnyCharacter : str or bool, optional
-            allows accepting quoting of all character
-            using backslash quoting mechanism. If None is
-            set, it uses the default value, ``false``.
-        mode : str, optional
-            allows a mode for dealing with corrupt records during parsing. If None is
-            set, it uses the default value, ``PERMISSIVE``.
-
-            * ``PERMISSIVE``: when it meets a corrupted record, puts the malformed string \
-              into a field configured by ``columnNameOfCorruptRecord``, and sets malformed \
-              fields to ``null``. To keep corrupt records, an user can set a string type \
-              field named ``columnNameOfCorruptRecord`` in an user-defined schema. If a \
-              schema does not have the field, it drops corrupt records during parsing. \
-              When inferring a schema, it implicitly adds a ``columnNameOfCorruptRecord`` \
-              field in an output schema.
-            *  ``DROPMALFORMED``: ignores the whole corrupted records.
-            *  ``FAILFAST``: throws an exception when it meets corrupted records.
-
-        columnNameOfCorruptRecord : str, optional
-            allows renaming the new field having malformed string
-            created by ``PERMISSIVE`` mode. This overrides
-            ``spark.sql.columnNameOfCorruptRecord``. If None is set,
-            it uses the value specified in
-            ``spark.sql.columnNameOfCorruptRecord``.
-        dateFormat : str, optional
-            sets the string that indicates a date format. Custom date formats
-            follow the formats at
-            `datetime pattern <https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html>`_.  # noqa
-            This applies to date type. If None is set, it uses the
-            default value, ``yyyy-MM-dd``.
-        timestampFormat : str, optional
-            sets the string that indicates a timestamp format.
-            Custom date formats follow the formats at
-            `datetime pattern <https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html>`_.  # noqa
-            This applies to timestamp type. If None is set, it uses the
-            default value, ``yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]``.
-        multiLine : str or bool, optional
-            parse one record, which may span multiple lines, per file. If None is
-            set, it uses the default value, ``false``.
-        allowUnquotedControlChars : str or bool, optional
-            allows JSON Strings to contain unquoted control
-            characters (ASCII characters with value less than 32,
-            including tab and line feed characters) or not.
-        lineSep : str, optional
-            defines the line separator that should be used for parsing. If None is
-            set, it covers all ``\\r``, ``\\r\\n`` and ``\\n``.
-        locale : str, optional
-            sets a locale as language tag in IETF BCP 47 format. If None is set,
-            it uses the default value, ``en-US``. For instance, ``locale`` is used while
-            parsing dates and timestamps.
-        dropFieldIfAllNull : str or bool, optional
-            whether to ignore column of all null values or empty
-            array/struct during schema inference. If None is set, it
-            uses the default value, ``false``.
-        encoding : str or bool, optional
-            allows to forcibly set one of standard basic or extended encoding for
-            the JSON files. For example UTF-16BE, UTF-32LE. If None is set,
-            the encoding of input JSON will be detected automatically
-            when the multiLine option is set to ``true``.
-        pathGlobFilter : str or bool, optional
-            an optional glob pattern to only include files with paths matching
-            the pattern. The syntax follows `org.apache.hadoop.fs.GlobFilter`.
-            It does not change the behavior of
-            `partition discovery <https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#partition-discovery>`_.  # noqa
-        recursiveFileLookup : str or bool, optional
-            recursively scan a directory for files. Using this option
-            disables
-            `partition discovery <https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#partition-discovery>`_.  # noqa
-        allowNonNumericNumbers : str or bool, optional
-            allows JSON parser to recognize set of "Not-a-Number" (NaN)
-            tokens as legal floating number values. If None is set,
-            it uses the default value, ``true``.
 
-                * ``+INF``: for positive infinity, as well as alias of
-                            ``+Infinity`` and ``Infinity``.
-                *  ``-INF``: for negative infinity, alias ``-Infinity``.
-                *  ``NaN``: for other not-a-numbers, like result of division by zero.
+        Other Parameters
+        ----------------
+        Extra options (keyword argument)
+            For the extra options, refer to
+            `Data Source Option <https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#data-source-option>`_  # noqa
+            and
+            `Generic File Source Options <https://spark.apache.org/docs/latest/sql-data-sources-generic-options.html`>_  # noqa
+            in the version you use.

Review comment:
       Shall we remove this too?
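
A side note on the two links in the added docstring lines above: a reStructuredText inline link has the shape `text <url>`_, with the closing backtick after the `>`. The second added link closes the backtick before the `>`, and the first one points at the Parquet page although the docstring belongs to `json()`, which may be unintended. A rough way to check the link shape, assuming a simplified regex (`RST_LINK` and `rst_link_targets` are names made up for this sketch, and the pattern is not a full RST parser):

```python
import re

# Simplified shape of an RST inline link: `text <url>`_
# The closing backtick must come after '>'.
RST_LINK = re.compile(r"`[^`<>]+<(https?://[^`<>\s]+)>`_")


def rst_link_targets(text):
    """Return the URLs of all well-formed inline links found in text."""
    return RST_LINK.findall(text)


# The two links as they appear in the added docstring lines.
first = (
    "`Data Source Option "
    "<https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#data-source-option>`_"
)
second = (
    "`Generic File Source Options "
    "<https://spark.apache.org/docs/latest/sql-data-sources-generic-options.html`>_"
)

first_targets = rst_link_targets(first)    # one URL found
second_targets = rst_link_targets(second)  # empty: backtick is misplaced
```

Under this pattern the first link is recognised and the second is not, which suggests Sphinx would not render the second one as a hyperlink.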

##########
File path: python/pyspark/sql/readwriter.py
##########
@@ -236,112 +190,15 @@ def json(self, path, schema=None, primitivesAsString=None, prefersDecimal=None,
         schema : :class:`pyspark.sql.types.StructType` or str, optional
             an optional :class:`pyspark.sql.types.StructType` for the input schema or
             a DDL-formatted string (For example ``col0 INT, col1 DOUBLE``).
-        primitivesAsString : str or bool, optional
-            infers all primitive values as a string type. If None is set,
-            it uses the default value, ``false``.
-        prefersDecimal : str or bool, optional
-            infers all floating-point values as a decimal type. If the values
-            do not fit in decimal, then it infers them as doubles. If None is
-            set, it uses the default value, ``false``.
-        allowComments : str or bool, optional
-            ignores Java/C++ style comment in JSON records. If None is set,
-            it uses the default value, ``false``.
-        allowUnquotedFieldNames : str or bool, optional
-            allows unquoted JSON field names. If None is set,
-            it uses the default value, ``false``.
-        allowSingleQuotes : str or bool, optional
-            allows single quotes in addition to double quotes. If None is
-            set, it uses the default value, ``true``.
-        allowNumericLeadingZero : str or bool, optional
-            allows leading zeros in numbers (e.g. 00012). If None is
-            set, it uses the default value, ``false``.
-        allowBackslashEscapingAnyCharacter : str or bool, optional
-            allows accepting quoting of all character
-            using backslash quoting mechanism. If None is
-            set, it uses the default value, ``false``.
-        mode : str, optional
-            allows a mode for dealing with corrupt records during parsing. If None is
-                     set, it uses the default value, ``PERMISSIVE``.
-
-            * ``PERMISSIVE``: when it meets a corrupted record, puts the malformed string \
-              into a field configured by ``columnNameOfCorruptRecord``, and sets malformed \
-              fields to ``null``. To keep corrupt records, an user can set a string type \
-              field named ``columnNameOfCorruptRecord`` in an user-defined schema. If a \
-              schema does not have the field, it drops corrupt records during parsing. \
-              When inferring a schema, it implicitly adds a ``columnNameOfCorruptRecord`` \
-              field in an output schema.
-            *  ``DROPMALFORMED``: ignores the whole corrupted records.
-            *  ``FAILFAST``: throws an exception when it meets corrupted records.
-
-        columnNameOfCorruptRecord: str, optional
-            allows renaming the new field having malformed string
-            created by ``PERMISSIVE`` mode. This overrides
-            ``spark.sql.columnNameOfCorruptRecord``. If None is set,
-            it uses the value specified in
-            ``spark.sql.columnNameOfCorruptRecord``.
-        dateFormat : str, optional
-            sets the string that indicates a date format. Custom date formats
-            follow the formats at
-            `datetime pattern <https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html>`_.  # noqa
-            This applies to date type. If None is set, it uses the
-            default value, ``yyyy-MM-dd``.
-        timestampFormat : str, optional
-            sets the string that indicates a timestamp format.
-            Custom date formats follow the formats at
-            `datetime pattern <https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html>`_.  # noqa
-            This applies to timestamp type. If None is set, it uses the
-            default value, ``yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]``.
-        multiLine : str or bool, optional
-            parse one record, which may span multiple lines, per file. If None is
-            set, it uses the default value, ``false``.
-        allowUnquotedControlChars : str or bool, optional
-            allows JSON Strings to contain unquoted control
-            characters (ASCII characters with value less than 32,
-            including tab and line feed characters) or not.
-        encoding : str or bool, optional
-            allows to forcibly set one of standard basic or extended encoding for
-            the JSON files. For example UTF-16BE, UTF-32LE. If None is set,
-            the encoding of input JSON will be detected automatically
-            when the multiLine option is set to ``true``.
-        lineSep : str, optional
-            defines the line separator that should be used for parsing. If None is
-            set, it covers all ``\\r``, ``\\r\\n`` and ``\\n``.
-        samplingRatio : str or float, optional
-            defines fraction of input JSON objects used for schema inferring.
-            If None is set, it uses the default value, ``1.0``.
-        dropFieldIfAllNull : str or bool, optional
-            whether to ignore column of all null values or empty
-            array/struct during schema inference. If None is set, it
-            uses the default value, ``false``.
-        locale : str, optional
-            sets a locale as language tag in IETF BCP 47 format. If None is set,
-            it uses the default value, ``en-US``. For instance, ``locale`` is used while
-            parsing dates and timestamps.
-        pathGlobFilter : str or bool, optional
-            an optional glob pattern to only include files with paths matching
-            the pattern. The syntax follows `org.apache.hadoop.fs.GlobFilter`.
-            It does not change the behavior of
-            `partition discovery <https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#partition-discovery>`_.  # noqa
-        recursiveFileLookup : str or bool, optional
-            recursively scan a directory for files. Using this option
-            disables
-            `partition discovery <https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#partition-discovery>`_.  # noqa
-        allowNonNumericNumbers : str or bool
-            allows JSON parser to recognize set of "Not-a-Number" (NaN)
-            tokens as legal floating number values. If None is set,
-            it uses the default value, ``true``.
-
-                * ``+INF``: for positive infinity, as well as alias of
-                            ``+Infinity`` and ``Infinity``.
-                *  ``-INF``: for negative infinity, alias ``-Infinity``.
-                *  ``NaN``: for other not-a-numbers, like result of division by zero.
-        modifiedBefore : an optional timestamp to only include files with
-            modification times occurring before the specified time. The provided timestamp
-            must be in the following format: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00)
-        modifiedAfter : an optional timestamp to only include files with
-            modification times occurring after the specified time. The provided timestamp
-            must be in the following format: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00)
 
+        Other Parameters
+        ----------------
+        Extra options
+            For the extra options, refer to
+            `Data Source Option <https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#data-source-option>`_  # noqa
+            and
+            `Generic File Source Options <https://spark.apache.org/docs/latest/sql-data-sources-generic-options.html`>_  # noqa
+            in the version you use.

Review comment:
       Shall we remove this too?
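
The `modifiedBefore` / `modifiedAfter` entries in the removed docstring require timestamps of the form YYYY-MM-DDTHH:mm:ss. A small sketch of what that format and filter amount to, using only the standard library: `parse_bound` and `within_bounds` are hypothetical helpers, the timestamps are treated as naive (no time zone), and the strict comparisons are an assumption, since the docstring only says "before" and "after".

```python
from datetime import datetime

# strptime equivalent of YYYY-MM-DDTHH:mm:ss
BOUND_FORMAT = "%Y-%m-%dT%H:%M:%S"


def parse_bound(value):
    """Parse a bound such as '2020-06-01T13:00:00' into a naive datetime."""
    return datetime.strptime(value, BOUND_FORMAT)


def within_bounds(mtime, modified_after=None, modified_before=None):
    """Return True if a file modification time passes both optional bounds."""
    # Assumption: both bounds are exclusive.
    if modified_after is not None and not mtime > parse_bound(modified_after):
        return False
    if modified_before is not None and not mtime < parse_bound(modified_before):
        return False
    return True


bound = parse_bound("2020-06-01T13:00:00")
earlier_file = datetime(2020, 6, 1, 12, 0, 0)
keeps_earlier = within_bounds(earlier_file, modified_before="2020-06-01T13:00:00")
drops_earlier = within_bounds(earlier_file, modified_after="2020-06-01T13:00:00")
```

A value that does not match the format, such as `2020-06-01 13:00:00` with a space instead of `T`, raises `ValueError` in `parse_bound`.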




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

