Reynold Xin created SPARK-11745:
-----------------------------------
Summary: Enable more JSON parsing options
Key: SPARK-11745
URL: https://issues.apache.org/jira/browse/SPARK-11745
Project: Spark
Issue Type: Improvement
Components: SQL
Reporter: Reynold Xin
As a user, I want to be able to read non-standard JSON files. Jackson itself
includes a few options that we should allow users to specify:
- ALLOW_COMMENTS
- ALLOW_UNQUOTED_FIELD_NAMES
- ALLOW_SINGLE_QUOTES
- ALLOW_NUMERIC_LEADING_ZEROS
- ALLOW_NON_NUMERIC_NUMBERS
After this change, the following options are still unsupported:
- ALLOW_YAML_COMMENTS
- ALLOW_UNQUOTED_CONTROL_CHARS
- ALLOW_BACKSLASH_ESCAPING_ANY_CHARACTER
See the Jackson source code pasted below for the definition of these config
options:
{code}
/**
* Feature that determines whether parser will allow use
* of Java/C++ style comments (both '/'+'*' and
* '//' varieties) within parsed content or not.
*<p>
* Since JSON specification does not mention comments as legal
* construct,
* this is a non-standard feature; however, in the wild
* this is extensively used. As such, feature is
* <b>disabled by default</b> for parsers and must be
* explicitly enabled.
*/
ALLOW_COMMENTS(false),
/**
* Feature that determines whether parser will allow use
* of YAML comments, ones starting with '#' and continuing
* until the end of the line. This commenting style is common
* with scripting languages as well.
*<p>
* Since JSON specification does not mention comments as legal
* construct,
* this is a non-standard feature. As such, feature is
* <b>disabled by default</b> for parsers and must be
* explicitly enabled.
*/
ALLOW_YAML_COMMENTS(false),
/**
* Feature that determines whether parser will allow use
* of unquoted field names (which is allowed by Javascript,
* but not by JSON specification).
*<p>
* Since JSON specification requires use of double quotes for
* field names,
* this is a non-standard feature, and as such disabled by default.
*/
ALLOW_UNQUOTED_FIELD_NAMES(false),
/**
* Feature that determines whether parser will allow use
* of single quotes (apostrophe, character '\'') for
* quoting Strings (names and String values). If so,
* this is in addition to other acceptabl markers.
* but not by JSON specification).
*<p>
* Since JSON specification requires use of double quotes for
* field names,
* this is a non-standard feature, and as such disabled by default.
*/
ALLOW_SINGLE_QUOTES(false),
/**
* Feature that determines whether parser will allow
* JSON Strings to contain unquoted control characters
* (ASCII characters with value less than 32, including
* tab and line feed characters) or not.
* If feature is set false, an exception is thrown if such a
* character is encountered.
*<p>
* Since JSON specification requires quoting for all control characters,
* this is a non-standard feature, and as such disabled by default.
*/
ALLOW_UNQUOTED_CONTROL_CHARS(false),
/**
* Feature that can be enabled to accept quoting of all character
* using backslash qooting mechanism: if not enabled, only characters
* that are explicitly listed by JSON specification can be thus
* escaped (see JSON spec for small list of these characters)
*<p>
* Since JSON specification requires quoting for all control characters,
* this is a non-standard feature, and as such disabled by default.
*/
ALLOW_BACKSLASH_ESCAPING_ANY_CHARACTER(false),
/**
* Feature that determines whether parser will allow
* JSON integral numbers to start with additional (ignorable)
* zeroes (like: 000001). If enabled, no exception is thrown, and extra
* nulls are silently ignored (and not included in textual
representation
* exposed via {@link JsonParser#getText}).
*<p>
* Since JSON specification does not allow leading zeroes,
* this is a non-standard feature, and as such disabled by default.
*/
ALLOW_NUMERIC_LEADING_ZEROS(false),
/**
* Feature that allows parser to recognize set of
* "Not-a-Number" (NaN) tokens as legal floating number
* values (similar to how many other data formats and
* programming language source code allows it).
* Specific subset contains values that
* <a href="http://www.w3.org/TR/xmlschema-2/">XML Schema</a>
* (see section 3.2.4.1, Lexical Representation)
* allows (tokens are quoted contents, not including quotes):
*<ul>
* <li>"INF" (for positive infinity), as well as alias of "Infinity"
* <li>"-INF" (for negative infinity), alias "-Infinity"
* <li>"NaN" (for other not-a-numbers, like result of division by zero)
*</ul>
*<p>
* Since JSON specification does not allow use of such values,
* this is a non-standard feature, and as such disabled by default.
*/
ALLOW_NON_NUMERIC_NUMBERS(false),
{code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]