[ 
https://issues.apache.org/jira/browse/SPARK-13516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiri Syrovy updated SPARK-13516:
--------------------------------
    Description: 
It seems that the sequence Aggregation + Adding a static column + Union + 
Projection produces an inconsistent DataFrame. 

The problem appears in the following case:

- Given a DataFrame called df, the problem appears after the following 
sequence of steps:

# Aggregate multiple columns of the DataFrame df and store the result as 
result_agg_1
# Run another aggregation of multiple columns, but with one fewer grouping 
column, and store the result as result_agg_2
# Align the result of the second aggregation by adding the missing grouping 
column with the empty value lit("")
# Union result_agg_1 and result_agg_2
# Project "sum(count_column)" to "count_column" for all aggregated columns.

The result is an inconsistent DataFrame in which all data coming from 
result_agg_1 is shifted.

An example of stripped down code and example result can be seen here:

https://gist.github.com/xjrk58/e0c7171287ee9bdc8df8
https://gist.github.com/xjrk58/7a297a42ebb94f300d96
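
To make the expected outcome of the five steps concrete without opening the gists, here is a minimal plain-Java model of the sequence. It is only a sketch and does not use Spark or reproduce the bug: the class AggUnionModel, the column names key1 and key2, and the sample rows are all hypothetical, and rows are modelled as ordered maps rather than DataFrame rows. It shows what a consistent result should look like, namely that every row of the union carries the same columns in the same order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of the reported steps; this is NOT Spark code.
final class AggUnionModel {

    // Hypothetical input row with two grouping columns and one count column.
    static Map<String, Object> row(String key1, String key2, long count) {
        Map<String, Object> r = new LinkedHashMap<>();
        r.put("key1", key1);
        r.put("key2", key2);
        r.put("count_column", count);
        return r;
    }

    // Steps 1 and 2: sum count_column per combination of the grouping keys.
    static List<Map<String, Object>> aggregate(List<Map<String, Object>> df, List<String> keys) {
        Map<List<Object>, Long> sums = new LinkedHashMap<>();
        for (Map<String, Object> r : df) {
            List<Object> group = new ArrayList<>();
            for (String key : keys) {
                group.add(r.get(key));
            }
            sums.merge(group, (Long) r.get("count_column"), Long::sum);
        }
        List<Map<String, Object>> out = new ArrayList<>();
        for (Map.Entry<List<Object>, Long> e : sums.entrySet()) {
            Map<String, Object> r = new LinkedHashMap<>();
            for (int i = 0; i < keys.size(); i++) {
                r.put(keys.get(i), e.getKey().get(i));
            }
            r.put("sum(count_column)", e.getValue());
            out.add(r);
        }
        return out;
    }

    // Step 3: rebuild each row in the target column order, using "" for a missing column.
    static List<Map<String, Object>> align(List<Map<String, Object>> rows, List<String> columns) {
        List<Map<String, Object>> out = new ArrayList<>();
        for (Map<String, Object> r : rows) {
            Map<String, Object> aligned = new LinkedHashMap<>();
            for (String c : columns) {
                aligned.put(c, r.containsKey(c) ? r.get(c) : "");
            }
            out.add(aligned);
        }
        return out;
    }

    // Step 5: rename one column while keeping the column order unchanged.
    static List<Map<String, Object>> rename(List<Map<String, Object>> rows, String from, String to) {
        List<Map<String, Object>> out = new ArrayList<>();
        for (Map<String, Object> r : rows) {
            Map<String, Object> renamed = new LinkedHashMap<>();
            for (Map.Entry<String, Object> e : r.entrySet()) {
                renamed.put(e.getKey().equals(from) ? to : e.getKey(), e.getValue());
            }
            out.add(renamed);
        }
        return out;
    }

    static List<Map<String, Object>> run() {
        List<Map<String, Object>> df = Arrays.asList(
                row("a", "x", 1L), row("a", "y", 2L), row("b", "x", 3L));

        List<Map<String, Object>> resultAgg1 = aggregate(df, Arrays.asList("key1", "key2"));
        List<Map<String, Object>> resultAgg2 = aggregate(df, Arrays.asList("key1"));
        List<Map<String, Object>> aligned =
                align(resultAgg2, Arrays.asList("key1", "key2", "sum(count_column)"));

        // Step 4: the union is modelled as a simple concatenation of rows.
        List<Map<String, Object>> union = new ArrayList<>(resultAgg1);
        union.addAll(aligned);

        return rename(union, "sum(count_column)", "count_column");
    }

    public static void main(String[] args) {
        for (Map<String, Object> r : run()) {
            System.out.println(r);
        }
    }
}
```

In this model the rows originating from result_agg_1 keep their values under key1, key2 and count_column after the union and the rename. The report above describes the Spark result as differing from this in that those rows come out shifted.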

  was:
Seems that subsequent Aggregation + Adding static column + Union + Projection 
causes DataFrame inconsistency. 

The problem appears int  the following case:

- Let's have some DataFrame called df.

1) Aggregation of multiple columns on the Dataframe df and store result as 
result_agg_1
2) Do another aggregation of multiple columns, but on one less grouping columns 
and store the result as result_agg_2
3) Align the result of second aggregation by adding missing grouping column 
with value empty lit("")
4) Union result_agg_1 and result_agg_2
5) Do the projection from "sum(count_column)" to "count_column" for all 
aggregated columns.

The result is structurally inconsistent DataFrame that has all the data coming 
from result_agg_1 shifted.

An example of stripped down code and example result can be seen here:

https://gist.github.com/xjrk58/e0c7171287ee9bdc8df8
https://gist.github.com/xjrk58/7a297a42ebb94f300d96


> Dataframe inconsistency after aggregation+union+projection.
> -----------------------------------------------------------
>
>                 Key: SPARK-13516
>                 URL: https://issues.apache.org/jira/browse/SPARK-13516
>             Project: Spark
>          Issue Type: Bug
>          Components: Java API, SQL
>    Affects Versions: 1.6.0
>         Environment: Local mode, java version 1.8.0_45
>            Reporter: Jiri Syrovy


