[
https://issues.apache.org/jira/browse/SPARK-17681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15528639#comment-15528639
]
Ian Hellstrom edited comment on SPARK-17681 at 9/28/16 6:55 AM:
----------------------------------------------------------------
I understand where the logic comes from, especially when you think about a
{{DataFrame}} as a {{List[TupleN]}}:
{code}
List((1, "foo"), (2, "bar")).map(_ => ())  // drop both 'columns' leads to...
// res: List[Unit] = List((), ())          // but .length is still 2
{code}
You could argue that you now have an empty table with rows or a non-empty table
without any rows. Both sound distinctly suspicious to my ears.
What I don't subscribe to is that the order does not matter. You implicitly
assume that dropping and adding columns must be commutative operations, from
which it would follow that the number of rows cannot change. I fail to see why.
Renaming, for instance, is definitely not commutative, i.e. the result depends
on the order in which the operations are performed, at least if you assume
no-op renaming (i.e. renaming is an identity operation when the column you want
to rename does not exist or the target identifier would cause a naming clash).
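To make that non-commutativity concrete, here is a minimal sketch in plain Scala, modelling a row's columns as a {{Map}}; the {{rename}} helper is hypothetical and not Spark API, it merely implements the no-op semantics described above:
{code}
// Hypothetical rename: identity when the source column is missing or the
// target name would clash with an existing column.
def rename(cols: Map[String, Int], from: String, to: String): Map[String, Int] =
  if (!cols.contains(from) || cols.contains(to)) cols
  else (cols - from) + (to -> cols(from))

val t = Map("a" -> 1, "b" -> 2)
val r1 = rename(rename(t, "a", "c"), "b", "a")  // a->c, then b->a
val r2 = rename(rename(t, "b", "a"), "a", "c")  // b->a is a no-op (a exists), then a->c
// r1 == Map(c -> 1, a -> 2) but r2 == Map(c -> 1, b -> 2): order matters
{code}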
[Codd|http://dl.acm.org/citation.cfm?doid=362384.362685] does not explicitly
say how many elements a tuple (i.e. row) is supposed to have (he speaks of
n-tuples but provides no lower bound for n), but the SQL standard (ISO/IEC
13249-1:2007 Part 1: Framework) is quite clear:
> A table is a collection of zero or more rows where each row is a sequence of
> one or more column values.
Most major RDBMSs also do not allow you to drop all columns from a table.
Example: [Oracle|http://psoug.org/oraerror/ORA-12983.htm].
Moreover, I don't think that {{withColumn}} actually ought to increase the
number of rows, because it already doesn't do that when you have a truly empty
data frame:
{code}
import org.apache.spark.sql.functions._
ss.emptyDataFrame.withColumn("x", expr("1")).count  // equals 0
{code}
What it does is add a fixed expression to all rows that are present. If no rows
are present, then it simply adds a placeholder for the column.
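That semantics can be sketched over a plain list (the {{withConst}} helper is hypothetical, not Spark API): appending a constant column maps over whatever rows exist, so an empty collection stays empty:
{code}
// withColumn-like operation on a plain list: pair every existing row with a
// constant value; no rows are created or removed.
def withConst[A](rows: List[A], c: Int): List[(A, Int)] = rows.map(r => (r, c))

val twoRows = withConst(List((1, "foo"), (2, "bar")), 1)  // two rows in, two rows out
val noRows  = withConst(List.empty[Unit], 1)              // empty in, empty out
{code}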
So, my suggestion is that either a) the behaviour changes when all columns are
dropped (leading to an empty data frame or data set), or b) the documentation
clearly states that dropping all columns from a {{DataFrame}} or {{Dataset}}
does not drop the rows, even though they have absolutely no content (i.e. they
are not even {{NULL}}). I fully understand that a {{DataFrame}} or {{Dataset}}
is not a relational table, but since Spark SQL is often used with relational
databases, this is behaviour that people may find surprising; I certainly do.
I am aware that in the case of a change in the behaviour, it may be surprising
to people who think about these data structures in terms of lists (or
multisets) of tuples.
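For illustration only, option a) could look like the following sketch in plain Scala, modelling each row as a {{Map}} of column values; {{dropCol}} is a hypothetical helper, not a proposed Spark API:
{code}
// Drop a column from every row; if that leaves rows with no columns at all,
// drop the rows themselves, yielding a truly empty collection.
def dropCol(rows: List[Map[String, Int]], col: String): List[Map[String, Int]] = {
  val dropped = rows.map(_ - col)
  if (dropped.nonEmpty && dropped.forall(_.isEmpty)) Nil else dropped
}

val df = List(Map("a" -> 1, "b" -> 2))
val result = dropCol(dropCol(df, "a"), "b")  // empty: 0 rows instead of 1
{code}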
> Empty DataFrame with non-zero rows after using drop
> ---------------------------------------------------
>
> Key: SPARK-17681
> URL: https://issues.apache.org/jira/browse/SPARK-17681
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.4.1, 1.6.0, 2.0.0
> Reporter: Ian Hellstrom
>
> It is possible for a {{DataFrame}} with no columns to have a non-zero
> number of rows, even though its contents are empty:
> {code}
> val df = Seq((1,2)).toDF("a", "b")
> df.drop("a").drop("b").count
> {code}
> The problem is also present in 2.0.0:
> {code}
> import org.apache.spark._
> import org.apache.spark.sql._
> val conf = new SparkConf()
> val sc = new SparkContext("local", "demo", conf)
> val ss = SparkSession.builder.getOrCreate()
> import ss.implicits._
> case class Data(a: Int, b: Int)
> val rdd = sc.parallelize(List(Data(1,2)))
> val ds = ss.createDataset(rdd)
> ds.drop("a").drop("b").count
> {code}
> In both the pre-2.0 and 2.0 releases the returned number is 1 instead of 0.