[ https://issues.apache.org/jira/browse/SPARK-7696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14548617#comment-14548617 ]
Apache Spark commented on SPARK-7696:
-------------------------------------

User 'ogirardot' has created a pull request for this issue:
https://github.com/apache/spark/pull/6237

> Aggregate function's result should be nullable only if the input expression is nullable
> ---------------------------------------------------------------------------------------
>
>                 Key: SPARK-7696
>                 URL: https://issues.apache.org/jira/browse/SPARK-7696
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.3.0, 1.3.1
>            Reporter: Haopu Wang
>            Priority: Minor
>
> In SparkSQL, an aggregate function's result is currently always nullable.
> It would make sense to change the behavior as follows: if the input
> expression is nullable, the result is nullable; otherwise, the result is
> non-nullable.
> Please see the following discussion:
> >>>>>>>>>>>>>>>
> From: Olivier Girardot [mailto:ssab...@gmail.com]
> Sent: Tuesday, May 12, 2015 5:12 AM
> To: Reynold Xin
> Cc: Haopu Wang; user
> Subject: Re: [SparkSQL 1.4.0] groupBy columns are always nullable?
>
> I'll look into it - not sure yet what I can get out of exprs :p
>
> On Mon, May 11, 2015 at 22:35, Reynold Xin <r...@databricks.com> wrote:
> Thanks for catching this. I didn't read carefully enough.
>
> It'd make sense to have the udaf result be non-nullable, if the exprs are
> indeed non-nullable.
>
> On Mon, May 11, 2015 at 1:32 PM, Olivier Girardot <ssab...@gmail.com> wrote:
> Hi Haopu,
> actually, `key` is nullable here because that is what your input's schema
> says:
> scala> result.printSchema
> root
>  |-- key: string (nullable = true)
>  |-- SUM(value): long (nullable = true)
> scala> df.printSchema
> root
>  |-- key: string (nullable = true)
>  |-- value: long (nullable = false)
>
> I tried it with a schema where the key is not flagged as nullable, and the
> schema is actually respected. What you can argue, however, is that SUM(value)
> should also be non-nullable, since value is not nullable.
>
> @rxin do you think it would be reasonable to flag the Sum aggregation
> function as nullable (or not) depending on the input expression's schema?
>
> Regards,
>
> Olivier.
> On Mon, May 11, 2015 at 22:07, Reynold Xin <r...@databricks.com> wrote:
> Not by design. Would you be interested in submitting a pull request?
>
> On Mon, May 11, 2015 at 1:48 AM, Haopu Wang <hw...@qilinsoft.com> wrote:
> I am trying to get the result schema of aggregate functions using the
> DataFrame API.
> However, I find that the result fields of groupBy columns are always
> nullable, even when the source field is not nullable.
> I want to know if this is by design, thank you! Below is some simple code
> that shows the issue.
> ======
>   import sqlContext.implicits._
>   import org.apache.spark.sql.functions._
>   case class Test(key: String, value: Long)
>   val df = sc.makeRDD(Seq(Test("k1",2),Test("k1",1))).toDF
>   val result = df.groupBy("key").agg($"key", sum("value"))
>   // From the output, you can see the "key" column is nullable, why??
>   result.printSchema
>   // root
>   //  |-- key: string (nullable = true)
>   //  |-- SUM(value): long (nullable = true)
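
For readers following the thread, the rule proposed in the issue description can be sketched without Spark. What follows is a minimal, hypothetical model in plain Scala: `Expr`, `ColumnRef`, `SumAlwaysNullable` and `SumProposed` are names made up for illustration, not Spark classes, and the sketch makes no claim about how the pull request above actually implements the change.

```scala
// Hypothetical, Spark-free model of the nullability rule discussed above.
// None of these types exist in Spark; they only illustrate the idea.
sealed trait Expr { def nullable: Boolean }

// A column reference carries the nullability declared in the input schema.
final case class ColumnRef(name: String, nullable: Boolean) extends Expr

// Behaviour reported in the issue: the aggregate result is always nullable.
final case class SumAlwaysNullable(child: Expr) extends Expr {
  val nullable: Boolean = true
}

// Behaviour proposed in the issue: the result is nullable only if its input is.
final case class SumProposed(child: Expr) extends Expr {
  val nullable: Boolean = child.nullable
}

object NullabilitySketch {
  def main(args: Array[String]): Unit = {
    // "value" mirrors the non-nullable Long field from the thread;
    // "optionalValue" is an invented nullable column for contrast.
    val value         = ColumnRef("value", nullable = false)
    val optionalValue = ColumnRef("optionalValue", nullable = true)

    println(SumAlwaysNullable(value).nullable)   // prints true
    println(SumProposed(value).nullable)         // prints false
    println(SumProposed(optionalValue).nullable) // prints true
  }
}
```

Under this model, the `SUM(value)` field from the thread would be reported as non-nullable, because the `value` field of the input schema is declared `nullable = false`.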