Reynold Xin created SPARK-11745:
-----------------------------------

             Summary: Enable more JSON parsing options
                 Key: SPARK-11745
                 URL: https://issues.apache.org/jira/browse/SPARK-11745
             Project: Spark
          Issue Type: Improvement
          Components: SQL
            Reporter: Reynold Xin


As a user, I want to be able to read non-standard JSON files. Jackson itself 
includes a few options that we should allow users to specify:

- ALLOW_COMMENTS
- ALLOW_UNQUOTED_FIELD_NAMES
- ALLOW_SINGLE_QUOTES
- ALLOW_NUMERIC_LEADING_ZEROS
- ALLOW_NON_NUMERIC_NUMBERS
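
One way to picture how user-specified options could be resolved into the five features above is the minimal, standard-library-only sketch below. It is not Spark's or Jackson's actual implementation: the class name, the local Feature enum, and the camelCase option keys (for example "allowComments") are all hypothetical and chosen only for illustration. Every feature stays off unless the user sets its key to "true", mirroring the disabled-by-default behaviour described in the Jackson comments.

```java
import java.util.EnumSet;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

final class JsonOptionsSketch {

    /** Local stand-in for the five Jackson feature names listed in this issue. */
    enum Feature {
        ALLOW_COMMENTS("allowComments"),
        ALLOW_UNQUOTED_FIELD_NAMES("allowUnquotedFieldNames"),
        ALLOW_SINGLE_QUOTES("allowSingleQuotes"),
        ALLOW_NUMERIC_LEADING_ZEROS("allowNumericLeadingZeros"),
        ALLOW_NON_NUMERIC_NUMBERS("allowNonNumericNumbers");

        // Hypothetical user-facing option key; an assumption, not taken from the issue.
        final String optionKey;

        Feature(String optionKey) {
            this.optionKey = optionKey;
        }
    }

    /** Returns the features whose option key is set to "true" (case-insensitive). */
    static Set<Feature> enabledFeatures(Map<String, String> options) {
        EnumSet<Feature> enabled = EnumSet.noneOf(Feature.class);
        for (Feature feature : Feature.values()) {
            // A missing key falls back to "false", so every feature is off by default.
            String raw = options.getOrDefault(feature.optionKey, "false");
            if (Boolean.parseBoolean(raw.trim())) {
                enabled.add(feature);
            }
        }
        return enabled;
    }

    public static void main(String[] args) {
        Map<String, String> options = new HashMap<>();
        options.put("allowComments", "true");
        options.put("allowSingleQuotes", "TRUE");
        // prints [ALLOW_COMMENTS, ALLOW_SINGLE_QUOTES]
        System.out.println(enabledFeatures(options));
    }
}
```

In a real integration, each enabled entry would presumably be passed on to the parser factory's own feature switch. The sketch stops short of that so it stays independent of any Jackson version.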

After this change, the following options are still unsupported:
- ALLOW_YAML_COMMENTS
- ALLOW_UNQUOTED_CONTROL_CHARS
- ALLOW_BACKSLASH_ESCAPING_ANY_CHARACTER

See the Jackson source code pasted below for the definition of these config 
options:
{code}

        /**
         * Feature that determines whether parser will allow use
         * of Java/C++ style comments (both '/'+'*' and
         * '//' varieties) within parsed content or not.
         *<p>
         * Since JSON specification does not mention comments as legal
         * construct,
         * this is a non-standard feature; however, in the wild
         * this is extensively used. As such, feature is
         * <b>disabled by default</b> for parsers and must be
         * explicitly enabled.
         */
        ALLOW_COMMENTS(false),

        /**
         * Feature that determines whether parser will allow use
         * of YAML comments, ones starting with '#' and continuing
         * until the end of the line. This commenting style is common
         * with scripting languages as well.
         *<p>
         * Since JSON specification does not mention comments as legal
         * construct,
         * this is a non-standard feature. As such, feature is
         * <b>disabled by default</b> for parsers and must be
         * explicitly enabled.
         */
        ALLOW_YAML_COMMENTS(false),
        
        /**
         * Feature that determines whether parser will allow use
         * of unquoted field names (which is allowed by Javascript,
         * but not by JSON specification).
         *<p>
         * Since JSON specification requires use of double quotes for
         * field names,
         * this is a non-standard feature, and as such disabled by default.
         */
        ALLOW_UNQUOTED_FIELD_NAMES(false),

        /**
         * Feature that determines whether parser will allow use
         * of single quotes (apostrophe, character '\'') for
         * quoting Strings (names and String values). If so,
         * this is in addition to other acceptable markers.
         *<p>
         * Since JSON specification requires use of double quotes for
         * field names,
         * this is a non-standard feature, and as such disabled by default.
         */
        ALLOW_SINGLE_QUOTES(false),

        /**
         * Feature that determines whether parser will allow
         * JSON Strings to contain unquoted control characters
         * (ASCII characters with value less than 32, including
         * tab and line feed characters) or not.
         * If feature is set false, an exception is thrown if such a
         * character is encountered.
         *<p>
         * Since JSON specification requires quoting for all control characters,
         * this is a non-standard feature, and as such disabled by default.
         */
        ALLOW_UNQUOTED_CONTROL_CHARS(false),

        /**
         * Feature that can be enabled to accept quoting of all character
         * using backslash quoting mechanism: if not enabled, only characters
         * that are explicitly listed by JSON specification can be thus
         * escaped (see JSON spec for small list of these characters)
         *<p>
         * Since JSON specification requires quoting for all control characters,
         * this is a non-standard feature, and as such disabled by default.
         */
        ALLOW_BACKSLASH_ESCAPING_ANY_CHARACTER(false),

        /**
         * Feature that determines whether parser will allow
         * JSON integral numbers to start with additional (ignorable) 
         * zeroes (like: 000001). If enabled, no exception is thrown, and extra
         * zeroes are silently ignored (and not included in textual representation
         * exposed via {@link JsonParser#getText}).
         *<p>
         * Since JSON specification does not allow leading zeroes,
         * this is a non-standard feature, and as such disabled by default.
         */
        ALLOW_NUMERIC_LEADING_ZEROS(false),
        
        /**
         * Feature that allows parser to recognize set of
         * "Not-a-Number" (NaN) tokens as legal floating number
         * values (similar to how many other data formats and
         * programming language source code allows it).
         * Specific subset contains values that
         * <a href="http://www.w3.org/TR/xmlschema-2/">XML Schema</a>
         * (see section 3.2.4.1, Lexical Representation)
         * allows (tokens are quoted contents, not including quotes):
         *<ul>
         *  <li>"INF" (for positive infinity), as well as alias of "Infinity"
         *  <li>"-INF" (for negative infinity), alias "-Infinity"
         *  <li>"NaN" (for other not-a-numbers, like result of division by zero)
         *</ul>
         *<p>
         * Since JSON specification does not allow use of such values,
         * this is a non-standard feature, and as such disabled by default.
         */
         ALLOW_NON_NUMERIC_NUMBERS(false),
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
