[jira] [Updated] (SPARK-15550) Dataset.show() doesn't display inner nested structs properly

Cheng Lian (JIRA) Wed, 25 May 2016 23:18:46 -0700

     [ https://issues.apache.org/jira/browse/SPARK-15550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Cheng Lian updated SPARK-15550:
-------------------------------
    Description: 
The following Spark shell snippet reproduces this issue:

{code}
case class ClassData(a: String, b: Long)
case class NestedStruct(f: ClassData)

val data = Seq(
  "{'f': {'b': 1, 'a': 'foo', 'c': 'extra'}}",
  "{'f': {'b': 2, 'a': 'bar', 'c': 'extra'}}"
)

val df = spark.read.json(sc.parallelize(data))
val ds = df.as[NestedStruct]
{code}

Exception thrown:

{noformat}
org.apache.spark.sql.AnalysisException: Try to map struct<a:string,b:bigint,c:string> to Tuple2, but failed as the number of fields does not line up.
 - Input schema: struct<f:struct<a:string,b:bigint,c:string>>
 - Target schema: struct<f:struct<a:string,b:bigint>>;
  at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.org$apache$spark$sql$catalyst$encoders$ExpressionEncoder$$fail$1(ExpressionEncoder.scala:267)
  at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$$anonfun$validate$3.apply(ExpressionEncoder.scala:311)
  at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$$anonfun$validate$3.apply(ExpressionEncoder.scala:307)
  at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
  at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
  at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
  at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
  at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
  at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.validate(ExpressionEncoder.scala:307)
  at org.apache.spark.sql.Dataset.<init>(Dataset.scala:201)
  at org.apache.spark.sql.Dataset.<init>(Dataset.scala:168)
  at org.apache.spark.sql.Dataset$.apply(Dataset.scala:57)
  at org.apache.spark.sql.Dataset.as(Dataset.scala:360)
  ... 39 elided
{noformat}

We can work around this exception by removing the inner nested field {{c}}:

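{code}
// Same input as above, but without the extra inner field "c".
val data = Seq(
  "{'f': {'b': 1, 'a': 'foo'}}",
  "{'f': {'b': 2, 'a': 'bar'}}"
)

val df = spark.read.json(sc.parallelize(data))
val ds = df.as[NestedStruct]
{code}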

A Dataset is just a "view" of its underlying logical plan. The encoder and the 
logical plan of a Dataset may differ in the number and/or order of their fields. 
This is OK as long as all fields referenced by the encoder exist in the logical 
plan, because encoder de/serializer expressions are resolved by name rather than 
by ordinal.
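
As a minimal sketch of the top-level case, assuming a Spark 2.0 shell where {{spark}} is predefined, a DataFrame whose columns are reordered and include an extra one can still be viewed as a Dataset. The class {{Pair}}, the value {{wide}}, and the sample rows are hypothetical and chosen only for illustration:

```scala
// Intended for spark-shell, where `spark` (a SparkSession) is predefined.
import spark.implicits._

// Hypothetical domain class: it refers to only two of the three columns below.
case class Pair(a: String, b: Long)

// Columns appear in a different order than the case class fields, plus an extra "c".
val wide = Seq((1L, "foo", "extra"), (2L, "bar", "extra")).toDF("b", "a", "c")

// Top-level fields are resolved by name, so neither the column order
// nor the extra column should prevent the conversion.
val pairs = wide.as[Pair]
pairs.collect().foreach(println)
```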

However, as illustrated above, the number of fields in an inner nested struct 
must match the number of fields in the corresponding case class/Java bean.
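
Besides removing {{c}} from the input data, one possible workaround is to project the inner struct down to the fields the case class declares before calling {{as}}. This is a sketch rather than a confirmed fix. It assumes a Spark 2.0 shell where {{spark}} and {{sc}} are predefined, and the name {{trimmed}} is hypothetical:

```scala
// Intended for spark-shell, where `spark` and `sc` are predefined.
import org.apache.spark.sql.functions.struct
import spark.implicits._

case class ClassData(a: String, b: Long)
case class NestedStruct(f: ClassData)

val data = Seq(
  "{'f': {'b': 1, 'a': 'foo', 'c': 'extra'}}",
  "{'f': {'b': 2, 'a': 'bar', 'c': 'extra'}}"
)
val df = spark.read.json(sc.parallelize(data))

// Rebuild `f` with only the fields that ClassData declares, dropping `c`.
val trimmed = df.select(struct($"f.a".as("a"), $"f.b".as("b")).as("f"))

// The inner struct now has exactly two fields, matching ClassData,
// so the field-count check in the encoder should no longer fail.
val ds = trimmed.as[NestedStruct]
ds.collect().foreach(println)
```

The projection keeps the source data untouched and makes the dropped field explicit in the query, at the cost of listing the inner fields by hand.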


  was:
Say we have the following nested case class:

{code}
case class ClassData(a: String, b: Int)
case class NestedStruct(f: ClassData)
{code}

For a Dataset {{ds}} of {{NestedStruct}}, {{ds.show()}} should convert all case 
class instances, including the inner nested {{ClassData}}, into {{Row}} 
instances before displaying them. However, {{ClassData}} instances are just 
displayed using {{toString}}.

{code}
val data =
  s"""{"f": {"b": 1, "a": "foo"}}
     |{"f": {"b": 2, "a": "bar"}}
     |""".stripMargin.trim.split("\n")

val df = spark.read.json(sc.parallelize(data))
val ds = df.as[NestedStruct]
{code}

Actual output:

{noformat}
+----------------+
|               f|
+----------------+
|ClassData(foo,1)|
|ClassData(bar,2)|
+----------------+
{noformat}

Expected output:

{noformat}
+-------+
|      f|
+-------+
|[1,foo]|
|[2,bar]|
+-------+
{noformat}

This is not too big a deal for Scala users, since Scala case classes always come 
with a well-defined default {{toString}} method. Java beans don't.

Another point is that a Dataset is just a view of the underlying logical plan, 
and the domain object type may not refer to all fields defined in the 
underlying logical plan. However, users are still allowed to access these extra 
fields using methods like {{Dataset.col}}. Due to this consideration, we 
decided to let {{Dataset.show()}} directly delegate to 
{{Dataset.toDF().show()}}, which shows all fields defined in the logical plan.



> Dataset.show() doesn't display inner nested structs properly
> ------------------------------------------------------------
>
>                 Key: SPARK-15550
>                 URL: https://issues.apache.org/jira/browse/SPARK-15550
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.1, 2.0.0
>            Reporter: Cheng Lian
>            Assignee: Cheng Lian


