Cheng Lian created SPARK-15550:
----------------------------------
Summary: Dataset.show() doesn't display inner nested structs properly
Key: SPARK-15550
URL: https://issues.apache.org/jira/browse/SPARK-15550
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.6.1, 2.0.0
Reporter: Cheng Lian
Assignee: Cheng Lian
Say we have the following nested case classes:
{code}
case class ClassData(a: String, b: Int)
case class NestedStruct(f: ClassData)
{code}
For a Dataset {{ds}} of {{NestedStruct}}, {{ds.show()}} should convert all case
class instances, including the inner nested {{ClassData}}, into {{Row}}
instances before displaying them. However, {{ClassData}} instances are just
displayed using {{toString}}.
{code}
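// Two JSON records, one per line; each holds a nested struct "f".
val data =
  s"""{"f": {"b": 1, "a": "foo"}}
     |{"f": {"b": 2, "a": "bar"}}
     |""".stripMargin.trim.split("\n")

// Read the records as a DataFrame, then view it as a typed Dataset.
val df = spark.read.json(sc.parallelize(data))
val ds = df.as[NestedStruct]
ds.show()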
{code}
Actual output:
{noformat}
+----------------+
|               f|
+----------------+
|ClassData(foo,1)|
|ClassData(bar,2)|
+----------------+
{noformat}
Expected output:
{noformat}
+-------+
|      f|
+-------+
|[1,foo]|
|[2,bar]|
+-------+
{noformat}
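The recursive conversion asked for above can be illustrated without Spark. This is only a sketch: {{render}} is a hypothetical helper, not Spark's code, and it assumes case classes whose fields are primitives, strings, or other case classes. It lists fields in declaration order, so it prints {{[foo,1]}} where the expected output above shows {{[1,foo]}}.

```scala
// Hypothetical illustration only; this is not how Spark builds Row instances.
object RowLikeRendering {
  case class ClassData(a: String, b: Int)
  case class NestedStruct(f: ClassData)

  // Renders a case class (a scala.Product) field by field, recursing into
  // nested case classes, instead of relying on the default toString.
  def render(value: Any): String = value match {
    case p: Product => p.productIterator.map(render).mkString("[", ",", "]")
    case null       => "null"
    case other      => other.toString
  }

  def main(args: Array[String]): Unit = {
    val rows = Seq(
      NestedStruct(ClassData("foo", 1)),
      NestedStruct(ClassData("bar", 2)))
    // Print the inner struct of each record in bracketed, row-like form.
    rows.foreach(r => println(render(r.f)))
  }
}
```

Note that tuples, {{Option}}, and non-empty {{List}} values are also {{Product}} instances, so a real implementation would need to handle them separately.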
This is not a big problem for Scala users, because Scala case classes always
come with a well-defined default {{toString}} method. Java beans, however, do
not.
Another consideration is that a Dataset is just a view of the underlying
logical plan, and the domain object type may not cover all fields defined in
that plan. Users can nevertheless access these extra fields through methods
like {{Dataset.col}}. For this reason, we decided to let {{Dataset.show()}}
delegate directly to {{Dataset.toDF().show()}}, which shows all fields defined
in the logical plan.
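As a rough sketch of the situation described here, assuming Spark 2.x with a local master: the column name {{extra}}, the object {{ShowViaToDF}}, and the helper {{build}} are made up for illustration. The typed Dataset only declares {{f}}, yet the logical plan still carries {{extra}}, which stays reachable through {{Dataset.col}} and appears when the Dataset is shown via {{toDF()}}.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Case classes from the report, kept at top level so encoders can be derived.
case class ClassData(a: String, b: Int)
case class NestedStruct(f: ClassData)

object ShowViaToDF {
  // Builds a Dataset[NestedStruct] whose logical plan has an additional
  // column, "extra" (a made-up name), that NestedStruct does not declare.
  def build(spark: SparkSession): Dataset[NestedStruct] = {
    import spark.implicits._
    Seq((ClassData("foo", 1), 10), (ClassData("bar", 2), 20))
      .toDF("f", "extra")
      .as[NestedStruct]
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("SPARK-15550-sketch")
      .getOrCreate()
    try {
      val ds = build(spark)
      // The untyped view exposes every column of the plan, including "extra".
      ds.toDF().show()
      // The extra column is still reachable from the typed Dataset.
      ds.select(ds.col("extra")).show()
    } finally spark.stop()
  }
}
```

The exact rendering of the struct column depends on the Spark version, so no output is shown here.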