[
https://issues.apache.org/jira/browse/SPARK-23194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16582025#comment-16582025
]
Denis Bolshakov edited comment on SPARK-23194 at 8/16/18 6:33 AM:
------------------------------------------------------------------
[~cloud_fan], [~hyukjin.kwon], do you have any updates on this?
Javadoc says:
{code:java}
@param options options to control how the json is parsed. accepts the same
options and the
* json data source.
{code}
In fact, this is not entirely true.
It does not support the `columnNameOfCorruptRecord` and `mode` options.
The `mode` option is not supported because it is overridden in the source code, so
the user's value is simply ignored.
`columnNameOfCorruptRecord` is not supported because there is no way to set
PERMISSIVE mode.
See:
http://apache-spark-user-list.1001560.n3.nabble.com/from-json-function-td33209.html
and
https://github.com/apache/spark/blob/e2ab7deae76d3b6f41b9ad4d0ece14ea28db40ce/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L568
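The override described above can be illustrated without Spark. This is only a sketch: it assumes the linked line merges a fixed "mode" entry into the user's option map before building the parser options, and `ModeOverrideSketch`, `forcedMode` and `effectiveOptions` are hypothetical names, not Spark identifiers.

```scala
// Hypothetical stand-in for the option merging suspected at the linked line;
// the names here are illustrative, not Spark's actual identifiers.
object ModeOverrideSketch {
  // Assumed entry that from_json appends to whatever the caller passed.
  val forcedMode: (String, String) = "mode" -> "FAILFAST"

  // Map `+` replaces the value of an existing key, so the caller's "mode" is lost.
  def effectiveOptions(userOptions: Map[String, String]): Map[String, String] =
    userOptions + forcedMode

  def main(args: Array[String]): Unit = {
    val user = Map(
      "mode" -> "PERMISSIVE",
      "columnNameOfCorruptRecord" -> "_corrupt_record"
    )
    val effective = effectiveOptions(user)
    println(effective("mode"))                      // prints FAILFAST
    println(effective("columnNameOfCorruptRecord")) // prints _corrupt_record
  }
}
```

If the merge really works this way, the corrupt-record column name survives but can never take effect, because the mode it depends on has been replaced.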
It would be very nice to fix this, or at least to document clearly which
options the from_json function supports.
The following snippet can be used to reproduce this (I've checked it on Spark 2.0.2,
2.2.0, 2.3.0 and 2.3.1):
{code}
// Assumes spark-shell, which pre-imports spark.implicits._ (needed for $"..." and toDF).
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._ // StructType

val data = Seq(
  "{'number': 1}", // valid record
  "{'number': }"   // malformed record
)
val schema = new StructType()
  .add($"number".int)
  .add($"_corrupt_record".string)
val sourceDf = data.toDF("column")
val jsonedDf = sourceDf
  .select(from_json(
    $"column",
    schema,
    Map("mode" -> "PERMISSIVE", "columnNameOfCorruptRecord" -> "_corrupt_record")
  ) as "data")
  .selectExpr("data.number", "data._corrupt_record")
jsonedDf.show()
{code}
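For contrast, here is a sketch that reads the same two records through the JSON data source, which the Javadoc quoted above says accepts the same options. It assumes Spark 2.2 or later (for `DataFrameReader.json(Dataset[String])`). The session setup and the names `records` and `viaSource` are illustrative and not part of the original report.

```scala
import org.apache.spark.sql.{Encoders, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

// Reuses the active session in spark-shell; otherwise starts a local one (assumption).
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("json-options-contrast")
  .getOrCreate()

// Same inputs as the from_json snippet: one valid and one malformed record.
val records = spark.createDataset(Seq("{'number': 1}", "{'number': }"))(Encoders.STRING)

val schema = new StructType()
  .add("number", IntegerType)
  .add("_corrupt_record", StringType)

// The same two options, passed to the data source reader instead of from_json.
val viaSource = spark.read
  .schema(schema)
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .json(records)

viaSource.show(false)
```

If the reader honors both options, the malformed record should appear in `_corrupt_record` instead of producing a row of nulls. Comparing the two outputs side by side shows whether from_json diverges from the data source.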
Kind regards,
Denis
> from_json in FAILFAST mode doesn't fail fast, instead it just returns nulls
> ---------------------------------------------------------------------------
>
> Key: SPARK-23194
> URL: https://issues.apache.org/jira/browse/SPARK-23194
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.0
> Reporter: Burak Yavuz
> Priority: Major
>
> from_json accepts Json parsing options such as being PERMISSIVE to parsing
> errors or failing fast. It seems from the code that even though the default
> option is to fail fast, we catch that exception and return nulls.
>
> In order to not change behavior, we should remove that try-catch block and
> change the default to permissive, but allow failfast mode to indeed fail.