[jira] [Commented] (SPARK-21316) Dataset Union output is not consistent with the column sequence

Kaushal Prajapati (JIRA) Thu, 06 Jul 2017 03:44:19 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-21316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16076337#comment-16076337
 ]


Kaushal Prajapati commented on SPARK-21316:
-------------------------------------------

[~dongjoon] It works when the column names are specified in alphabetical order.

What if the order of my column names does not match the schema? In that case 
the output changes again, and it would be too cumbersome to keep track of the 
column sequence and its alphabetical order. If I rename 'name' to 'a' and 
'age' to 'b', the above code works because the selected columns are then in 
alphabetical order.
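
For context on where the alphabetical order likely comes from: as far as I can 
tell, Encoders.bean derives its columns from JavaBeans introspection, and the 
JDK introspector reports properties sorted by name rather than in declaration 
order. The sketch below uses only the JDK (no Spark) to list the property 
order for a bean shaped like Person. The class BeanPropertyOrder and the 
helper propertyNames are hypothetical names introduced for this illustration, 
and the link to Spark's encoder is an assumption, not something confirmed in 
this thread.

```java
import java.beans.BeanInfo;
import java.beans.IntrospectionException;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.util.ArrayList;
import java.util.List;

public class BeanPropertyOrder {
    // Same shape as the Person bean in the report: "name" is declared before "age".
    public static class Person {
        private String name;
        private String age;
        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
        public String getAge() { return age; }
        public void setAge(String age) { this.age = age; }
    }

    // Hypothetical helper: returns the bean property names in the order
    // the JDK introspector reports them.
    public static List<String> propertyNames(Class<?> beanClass) {
        try {
            // Stop at Object so the inherited "class" property is excluded.
            BeanInfo info = Introspector.getBeanInfo(beanClass, Object.class);
            List<String> names = new ArrayList<>();
            for (PropertyDescriptor pd : info.getPropertyDescriptors()) {
                names.add(pd.getName());
            }
            return names;
        } catch (IntrospectionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Compare this order with the header printed by ds1.show() in the report.
        System.out.println(propertyNames(Person.class));
    }
}
```

If that assumption holds, it would explain why ds1.show() lists 'age' before 
'name' even though the class declares 'name' first.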

Also, when both of my datasets have the same schema (Person.class), why should 
the column order be considered at all when taking the union? As I understand 
it, it should not be.

{code:java}
ds1.select("name","age").as(Encoders.bean(Person.class)).union(ds2).show();
{code}
In the above snippet, I create a dataset of rows by selecting columns and then 
convert it back to the Person.class schema, so the order should not matter.
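
To make the reported behaviour concrete: Dataset.union resolves columns by 
position, not by name, so the rows of ds2 are appended under the headers of 
ds1 without being realigned. The sketch below models this in plain Java, with 
no Spark dependency, using the values from the report. The class 
PositionalUnionSketch and the helpers unionByPosition and reorder are 
hypothetical names for this illustration and are not Spark APIs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PositionalUnionSketch {
    // Positional union: right-hand rows are appended unchanged, so their
    // values land under whatever column names the left side carries.
    public static List<List<String>> unionByPosition(List<List<String>> left,
                                                     List<List<String>> right) {
        List<List<String>> out = new ArrayList<>(left);
        out.addAll(right);
        return out;
    }

    // Rearranges every row from column order `from` into column order `to`.
    public static List<List<String>> reorder(List<String> from, List<String> to,
                                             List<List<String>> rows) {
        List<List<String>> out = new ArrayList<>();
        for (List<String> row : rows) {
            List<String> aligned = new ArrayList<>();
            for (String col : to) {
                aligned.add(row.get(from.indexOf(col)));
            }
            out.add(aligned);
        }
        return out;
    }

    public static void main(String[] args) {
        // ds1 after select("name", "age"); ds2 in the order shown by ds2.show().
        List<String> leftSchema = Arrays.asList("name", "age");
        List<String> rightSchema = Arrays.asList("age", "name");
        List<List<String>> left = Arrays.asList(
                Arrays.asList("kaushal", "25"), Arrays.asList("aman", "26"));
        List<List<String>> right = Arrays.asList(
                Arrays.asList("25", "sapan"), Arrays.asList("26", "yati"));

        // Mixed-up result, as in the report: ages end up under "name".
        System.out.println(unionByPosition(left, right));
        // Aligning the right side to the left schema first gives consistent rows.
        System.out.println(unionByPosition(left, reorder(rightSchema, leftSchema, right)));
    }
}
```

In Spark itself, the equivalent of the reorder step would be to select the 
columns of both datasets in the same order before calling union. Later 
releases (2.3.0 onwards, to my knowledge) also provide Dataset.unionByName, 
which matches columns by name; that method is not available in the affected 
2.1.0 version.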

> Dataset Union output is not consistent with the column sequence
> ---------------------------------------------------------------
>
>                 Key: SPARK-21316
>                 URL: https://issues.apache.org/jira/browse/SPARK-21316
>             Project: Spark
>          Issue Type: Bug
>          Components: Optimizer, SQL
>    Affects Versions: 2.1.0
>            Reporter: Kaushal Prajapati
>              Labels: patch
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> If I take the union of two datasets with a similar schema, the output should 
> remain the same even if I change the sequence of columns while creating the 
> dataset. I am attaching a code snippet for details.
> {code:java}
> public class Person{
>   public String name;
>   public String age;
>   public Person(String name, String age) {
>     this.name = name;
>     this.age = age;
>   }
>   public String getName() {return name;}
>   public void setName(String name) {this.name = name;}
>   public String getAge() {return age;}
>   public void setAge(String age) {this.age = age;}
> }
> {code}
> {code:java}
> public class Test {
>   public static void main(String arg[]) throws Exception {
>     SparkSession spark = SparkConnection.getSpark();
>     List<Person> list1 = new ArrayList<>();
>     list1.add(new Person("kaushal", "25"));
>     list1.add(new Person("aman", "26"));
>     List<Person> list2 = new ArrayList<>();
>     list2.add(new Person("sapan", "25"));
>     list2.add(new Person("yati", "26"));
>     Dataset<Person> ds1 = spark.createDataset(list1, Encoders.bean(Person.class));
>     Dataset<Person> ds2 = spark.createDataset(list2, Encoders.bean(Person.class));
>     ds1.show();
>     ds2.show();
>     ds1.select("name","age").as(Encoders.bean(Person.class)).union(ds2).show();
>   }
> }
> {code}
> output :-
> {code:java}
> +---+-------+
> |age|   name|
> +---+-------+
> | 25|kaushal|
> | 26|   aman|
> +---+-------+
> +---+-----+
> |age| name|
> +---+-----+
> | 25|sapan|
> | 26| yati|
> +---+-----+
> +-------+-----+
> |   name|  age|
> +-------+-----+
> |kaushal|   25|
> |   aman|   26|
> |     25|sapan|
> |     26| yati|
> +-------+-----+
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
