[
https://issues.apache.org/jira/browse/SPARK-11949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15026495#comment-15026495
]
Yanbo Liang edited comment on SPARK-11949 at 11/25/15 9:38 AM:
---------------------------------------------------------------
This is most likely caused by the Int columns being marked nullable = false. If we filter on
a String column instead, it works as expected:
{code}
scala> cube0.where("room_name IS NULL").show()
+--------+----+------+---------+---------+
| date|hour|minute|room_name|avg(temp)|
+--------+----+------+---------+---------+
| null|null| 36| null| 21.5|
| null|null| 35| null| 20.5|
|20151123| 18| 36| null| 21.5|
|20151123| 18| 35| null| 20.5|
| null| 18| null| null| 21.0|
|20151123|null| 36| null| 21.5|
|20151123|null| 35| null| 20.5|
| null|null| null| null| 21.0|
|20151123| 18| null| null| 21.0|
| null| 18| 36| null| 21.5|
| null| 18| 35| null| 20.5|
|20151123|null| null| null| 21.0|
+--------+----+------+---------+---------+
{code}
I think the cube operator should modify the original nullable attribute of the grouping columns.
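For reference, the nullable flags can be checked directly; a small sketch (it reuses cube0 from the reproduction quoted below):
{code}
// If the diagnosis above is right, the cube result simply inherits the
// nullability of df0's attributes: the Int grouping columns (date, hour,
// minute) report nullable = false even though cube emits nulls for them,
// while room_name (String) reports nullable = true.
cube0.schema.foreach(f => println(s"${f.name}: nullable = ${f.nullable}"))
{code}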
> Query on DataFrame from cube gives wrong results
> ------------------------------------------------
>
> Key: SPARK-11949
> URL: https://issues.apache.org/jira/browse/SPARK-11949
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.5.1
> Reporter: Veli Kerim Celik
> Labels: dataframe, sql
>
> {code:title=Reproduce bug|borderStyle=solid}
> case class fact(date: Int, hour: Int, minute: Int, room_name: String, temp: Double)
> val df0 = sc.parallelize(Seq(
>   fact(20151123, 18, 35, "room1", 18.6),
>   fact(20151123, 18, 35, "room2", 22.4),
>   fact(20151123, 18, 36, "room1", 17.4),
>   fact(20151123, 18, 36, "room2", 25.6)
> )).toDF()
> val cube0 = df0.cube("date", "hour", "minute", "room_name").agg(Map("temp" -> "avg"))
> cube0.where("date IS NULL").show()
> {code}
> The query result is empty. It should not be, because cube0 contains the value
> null several times in column 'date'. The issue arises because the cube
> function reuses the schema information from df0. If I change the type of the
> parameters in the case class to Option[T], the query gives correct results.
> Solution: The cube function should change the schema by setting the nullable
> property to true for the columns (dimensions) specified in the method call
> parameters.
> I am new to Scala and Spark and don't know how to implement this. Somebody
> please do.
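As the description notes, declaring the grouping fields as Option[T] makes the inferred schema nullable up front, which avoids the problem. A minimal sketch of that workaround (factOpt, dfOpt and cubeOpt are hypothetical names, not from the report):
{code}
// Workaround sketch: wrap the fields that will be grouped on in Option[_]
// so the inferred schema marks them nullable = true from the start.
case class factOpt(date: Option[Int], hour: Option[Int], minute: Option[Int],
                   room_name: String, temp: Double)

val dfOpt = sc.parallelize(Seq(
  factOpt(Some(20151123), Some(18), Some(35), "room1", 18.6),
  factOpt(Some(20151123), Some(18), Some(35), "room2", 22.4),
  factOpt(Some(20151123), Some(18), Some(36), "room1", 17.4),
  factOpt(Some(20151123), Some(18), Some(36), "room2", 25.6)
)).toDF()

val cubeOpt = dfOpt.cube("date", "hour", "minute", "room_name").agg(Map("temp" -> "avg"))
// With nullable grouping columns, the null rows produced by cube are returned.
cubeOpt.where("date IS NULL").show()
{code}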