[
https://issues.apache.org/jira/browse/SPARK-17681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15528639#comment-15528639
]
Ian Hellstrom edited comment on SPARK-17681 at 9/28/16 6:55 AM:
----------------------------------------------------------------
I understand where the logic comes from, especially when you think about a
{{DataFrame}} as a {{List[TupleN]}}:
{code}
List((1, "foo"), (2, "bar")).map(_ => ())  // drop both 'columns' leads to...
// res: List[Unit] = List((), ())          // but .length is still 2
{code}
You could argue that you now have an empty table with rows or a non-empty table
without any rows. Both sound distinctly suspicious to my ears.
What I don't subscribe to is that the order does not matter. You implicitly
assume that dropping and adding columns must be commutative operations, from
which it would follow that the number of rows cannot change. I fail to see why.
Renaming, for instance, is definitely not commutative, i.e. the result depends
on the order in which the operations are performed, at least if you assume
no-op renaming (i.e. renaming is an identity operation when the column you want
to rename does not exist or the target identifier would cause a naming clash).
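To make that non-commutativity concrete, here is a minimal sketch in plain Scala, modelling a row's columns as a {{Map}}; the {{rename}} helper is hypothetical and not Spark API, it merely implements the no-op semantics described above:
{code}
// Hypothetical rename: identity when the source column is missing or the
// target name would clash with an existing column.
def rename(cols: Map[String, Int], from: String, to: String): Map[String, Int] =
  if (!cols.contains(from) || cols.contains(to)) cols
  else (cols - from) + (to -> cols(from))

val t = Map("a" -> 1, "b" -> 2)
val r1 = rename(rename(t, "a", "c"), "b", "a")  // a->c, then b->a
val r2 = rename(rename(t, "b", "a"), "a", "c")  // b->a is a no-op (a exists), then a->c
// r1 == Map(c -> 1, a -> 2) but r2 == Map(c -> 1, b -> 2): order matters
{code}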
[Codd|http://dl.acm.org/citation.cfm?doid=362384.362685] does not explicitly
say how many elements a tuple (i.e. row) is supposed to have (he speaks of
n-tuples but provides no lower bound for n), but the SQL standard (ISO/IEC
13249-1:2007 Part 1: Framework) is quite clear:
> A table is a collection of zero or more rows where each row is a sequence of
> one or more column values.
Most major RDBMSs also do not allow you to drop all columns from a table.
Example: [Oracle|http://psoug.org/oraerror/ORA-12983.htm].
Moreover, I don't think that {{withColumn}} actually ought to increase the
number of rows, because it already doesn't do that when you have a truly empty
data frame:
{code}
import org.apache.spark.sql.functions._
ss.emptyDataFrame.withColumn("x", expr("1")).count  // equals 0
{code}
What it does is add a fixed expression to all rows that are present. If no rows
are present, then it simply adds a placeholder for the column.
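That semantics can be sketched over a plain list (the {{withConst}} helper is hypothetical, not Spark API): appending a constant column maps over whatever rows exist, so an empty collection stays empty:
{code}
// withColumn-like operation on a plain list: pair every existing row with a
// constant value; no rows are created or removed.
def withConst[A](rows: List[A], c: Int): List[(A, Int)] = rows.map(r => (r, c))

val twoRows = withConst(List((1, "foo"), (2, "bar")), 1)  // two rows in, two rows out
val noRows  = withConst(List.empty[Unit], 1)              // empty in, empty out
{code}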
So, my suggestion is that either a) the behaviour changes when all columns are
dropped (leading to an empty data frame or data set), or b) the documentation
clearly states that dropping all columns from a {{DataFrame}} or {{Dataset}}
does not drop the rows, even though they have absolutely no content (i.e. they
are not even {{NULL}}). I fully understand that a {{DataFrame}} or {{Dataset}}
is not a relational table, but since Spark SQL is often used with relational
databases, this is behaviour that people may find surprising; I certainly do.
I am aware that in the case of a change in the behaviour, it may be surprising
to people who think about these data structures in terms of lists (or
multisets) of tuples.
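For illustration only, option a) could look like the following sketch in plain Scala, modelling each row as a {{Map}} of column values; {{dropCol}} is a hypothetical helper, not a proposed Spark API:
{code}
// Drop a column from every row; if that leaves rows with no columns at all,
// drop the rows themselves, yielding a truly empty collection.
def dropCol(rows: List[Map[String, Int]], col: String): List[Map[String, Int]] = {
  val dropped = rows.map(_ - col)
  if (dropped.nonEmpty && dropped.forall(_.isEmpty)) Nil else dropped
}

val df = List(Map("a" -> 1, "b" -> 2))
val result = dropCol(dropCol(df, "a"), "b")  // empty: 0 rows instead of 1
{code}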
> Empty DataFrame with non-zero rows after using drop
> ---------------------------------------------------
>
> Key: SPARK-17681
> URL: https://issues.apache.org/jira/browse/SPARK-17681
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.4.1, 1.6.0, 2.0.0
> Reporter: Ian Hellstrom
>
> It is possible for a {{DataFrame}} with no columns to have a non-zero
> number of rows, even though its contents are empty:
> {code}
> val df = Seq((1,2)).toDF("a", "b")
> df.drop("a").drop("b").count
> {code}
> The problem is also present in 2.0.0:
> {code}
> import org.apache.spark._
> import org.apache.spark.sql._
> val conf = new SparkConf()
> val sc = new SparkContext("local", "demo", conf)
> val ss = SparkSession.builder.getOrCreate()
> import ss.implicits._
> case class Data(a: Int, b: Int)
> val rdd = sc.parallelize(List(Data(1,2)))
> val ds = ss.createDataset(rdd)
> ds.drop("a").drop("b").count
> {code}
> In both the pre-2.0 and 2.0 releases the returned number is 1 instead of 0.