[jira] [Comment Edited] (SPARK-23194) from_json in FAILFAST mode doesn't fail fast, instead it just returns nulls

Denis Bolshakov (JIRA) Wed, 15 Aug 2018 23:34:38 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-23194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16582025#comment-16582025
 ]


Denis Bolshakov edited comment on SPARK-23194 at 8/16/18 6:33 AM:
------------------------------------------------------------------

[~cloud_fan], [~hyukjin.kwon], do you have any updates on this?

 

Javadoc says:
{code:java}
 * @param options options to control how the json is parsed. accepts the same options and the
 *                json data source.
{code}

In fact, that is not exactly true.
It does not support the `columnNameOfCorruptRecord` and `mode` options.
The `mode` option is not supported because it is overridden in the source code,
so the user's value is simply ignored.
The `columnNameOfCorruptRecord` option is not supported because there is no way
to set PERMISSIVE mode.

See:
http://apache-spark-user-list.1001560.n3.nabble.com/from-json-function-td33209.html
and
https://github.com/apache/spark/blob/e2ab7deae76d3b6f41b9ad4d0ece14ea28db40ce/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L568
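
To illustrate why the user's `mode` is lost, here is a minimal plain-Scala sketch. It assumes the override is done by adding a "mode" entry to the user-supplied options map, which is how the line linked above appears to read. The object and method names are hypothetical and are not Spark's actual code.

```scala
object ModeOverrideSketch {
  // Hypothetical stand-in for the option handling inside from_json;
  // this is NOT Spark's real implementation.
  def effectiveOptions(userOptions: Map[String, String]): Map[String, String] =
    // Adding an entry for a key that already exists replaces the user's value.
    userOptions + ("mode" -> "FAILFAST")

  def main(args: Array[String]): Unit = {
    val userOptions = Map(
      "mode" -> "PERMISSIVE",
      "columnNameOfCorruptRecord" -> "_corrupt_record"
    )
    val effective = effectiveOptions(userOptions)
    println(effective("mode")) // prints FAILFAST
  }
}
```

Under that assumption, whatever `mode` the caller passes, the parser only ever sees FAILFAST, so `columnNameOfCorruptRecord` never comes into play.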

It would be very nice to fix this, or at least to document clearly which
options the from_json function actually supports.


The following snippet can be used to test this (I've checked it on Spark 2.0.2,
2.2.0, 2.3.0, and 2.3.1):
{code}
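    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.types.StructType
    // Assumes a SparkSession named `spark` is in scope (as in spark-shell);
    // needed for toDF and the $"..." column syntax.
    import spark.implicits._

    // One well-formed record and one malformed record.
    val data = Seq(
      "{'number': 1}",
      "{'number': }"
    )

    val schema = new StructType()
      .add($"number".int)
      .add($"_corrupt_record".string)

    val sourceDf = data.toDF("column")

    // Both options are passed explicitly here.
    val jsonedDf = sourceDf
      .select(from_json(
        $"column",
        schema,
        Map("mode" -> "PERMISSIVE", "columnNameOfCorruptRecord" -> "_corrupt_record")
      ) as "data").selectExpr("data.number", "data._corrupt_record")

    jsonedDf.show()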
{code}
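
For comparison, the JSON data source accepts both options when the same strings are read through `DataFrameReader`. This is a minimal sketch, assuming Spark 2.2.0 or later (for `json(Dataset[String])`) and a local session; the app name and value names are only illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

// In spark-shell, getOrCreate() simply returns the existing session.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("json-reader-comparison")
  .getOrCreate()
import spark.implicits._

// Same two records as in the from_json snippet.
val data = Seq("{'number': 1}", "{'number': }")

val schema = new StructType()
  .add("number", IntegerType)
  .add("_corrupt_record", StringType)

// Here the options go to the data source reader rather than to from_json.
val viaReader = spark.read
  .schema(schema)
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .json(data.toDS())

viaReader.show(false)
```

With the reader, the malformed record should end up in the `_corrupt_record` column, which is the behaviour the Javadoc suggests from_json would share.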

Kind regards,
Denis



> from_json in FAILFAST mode doesn't fail fast, instead it just returns nulls
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-23194
>                 URL: https://issues.apache.org/jira/browse/SPARK-23194
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Burak Yavuz
>            Priority: Major
>
> from_json accepts Json parsing options such as being PERMISSIVE to parsing 
> errors or failing fast. It seems from the code that even though the default 
> option is to fail fast, we catch that exception and return nulls.
>  
> In order to not change behavior, we should remove that try-catch block and 
> change the default to permissive, but allow failfast mode to indeed fail.



