HyukjinKwon commented on a change in pull request #32204:
URL: https://github.com/apache/spark/pull/32204#discussion_r635793237
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
##########
@@ -441,81 +390,13 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
* This function goes through the input once to determine the input schema. If you know the
* schema in advance, use the version that specifies the schema to avoid the extra scan.
*
- * You can set the following JSON-specific options to deal with non-standard JSON files:
- * <ul>
- * <li>`primitivesAsString` (default `false`): infers all primitive values as a string type</li>
- * <li>`prefersDecimal` (default `false`): infers all floating-point values as a decimal
- * type. If the values do not fit in decimal, then it infers them as doubles.</li>
- * <li>`allowComments` (default `false`): ignores Java/C++ style comments in JSON records</li>
- * <li>`allowUnquotedFieldNames` (default `false`): allows unquoted JSON field names</li>
- * <li>`allowSingleQuotes` (default `true`): allows single quotes in addition to double quotes</li>
- * <li>`allowNumericLeadingZeros` (default `false`): allows leading zeros in numbers (e.g. 00012)</li>
- * <li>`allowBackslashEscapingAnyCharacter` (default `false`): allows accepting quoting of all
- * characters using the backslash quoting mechanism</li>
- * <li>`allowUnquotedControlChars` (default `false`): allows JSON strings to contain unquoted
- * control characters (ASCII characters with value less than 32, including tab and line feed
- * characters)</li>
- * <li>`mode` (default `PERMISSIVE`): sets a mode for dealing with corrupt records during parsing.
- * <ul>
- * <li>`PERMISSIVE`: when it meets a corrupted record, puts the malformed string into a
- * field configured by `columnNameOfCorruptRecord`, and sets malformed fields to `null`. To
- * keep corrupt records, a user can set a string type field named
- * `columnNameOfCorruptRecord` in a user-defined schema. If a schema does not have the
- * field, it drops corrupt records during parsing. When inferring a schema, it implicitly
- * adds a `columnNameOfCorruptRecord` field in the output schema.</li>
- * <li>`DROPMALFORMED`: ignores whole corrupted records.</li>
- * <li>`FAILFAST`: throws an exception when it meets corrupted records.</li>
- * </ul>
- * </li>
- * <li>`columnNameOfCorruptRecord` (default is the value specified in
- * `spark.sql.columnNameOfCorruptRecord`): allows renaming the new field holding the malformed
- * string created by `PERMISSIVE` mode. This overrides `spark.sql.columnNameOfCorruptRecord`.</li>
- * <li>`dateFormat` (default `yyyy-MM-dd`): sets the string that indicates a date format.
- * Custom date formats follow the formats at
- * <a href="https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html">
- * Datetime Patterns</a>. This applies to date type.</li>
- * <li>`timestampFormat` (default `yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]`): sets the string that
- * indicates a timestamp format. Custom date formats follow the formats at
- * <a href="https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html">
- * Datetime Patterns</a>. This applies to timestamp type.</li>
- * <li>`multiLine` (default `false`): parses one record, which may span multiple lines,
- * per file</li>
- * <li>`encoding` (by default it is not set): allows forcibly setting one of the standard basic
- * or extended encodings for the JSON files, for example UTF-16BE or UTF-32LE. If the encoding
- * is not specified and `multiLine` is set to `true`, it is detected automatically.</li>
- * <li>`lineSep` (default covers all of `\r`, `\r\n` and `\n`): defines the line separator
- * that should be used for parsing.</li>
- * <li>`samplingRatio` (default is 1.0): defines the fraction of input JSON objects used
- * for schema inferring.</li>
- * <li>`dropFieldIfAllNull` (default `false`): whether to ignore columns of all null values or
- * empty arrays/structs during schema inference.</li>
- * <li>`locale` (default is `en-US`): sets a locale as a language tag in IETF BCP 47 format.
- * For instance, this is used while parsing dates and timestamps.</li>
- * <li>`pathGlobFilter`: an optional glob pattern to only include files with paths matching
- * the pattern. The syntax follows <code>org.apache.hadoop.fs.GlobFilter</code>.
- * It does not change the behavior of partition discovery.</li>
- * <li>`modifiedBefore` (batch only): an optional timestamp to only include files with
- * modification times occurring before the specified time. The provided timestamp
- * must be in the following form: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00)</li>
- * <li>`modifiedAfter` (batch only): an optional timestamp to only include files with
- * modification times occurring after the specified time. The provided timestamp
- * must be in the following form: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00)</li>
- * <li>`recursiveFileLookup`: recursively scans a directory for files. Using this option
- * disables partition discovery</li>
- * <li>`allowNonNumericNumbers` (default `true`): allows the JSON parser to recognize a set of
- * "Not-a-Number" (NaN) tokens as legal floating-point number values:
- * <ul>
- * <li>`+INF` for positive infinity, as well as the aliases `+Infinity` and `Infinity`.
- * <li>`-INF` for negative infinity, alias `-Infinity`.
- * <li>`NaN` for other not-a-numbers, like the result of division by zero.
- * </ul>
- * </li>
- * </ul>
+ * You can find the JSON-specific options for reading JSON files in
Review comment:
Ditto. It says JSON-specific options, but it mentions "Generic Files Source Options".