[ 
https://issues.apache.org/jira/browse/SPARK-25225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuriy Davygora updated SPARK-25225:
-----------------------------------
    Description: 
At the moment, Spark DataFrame ArrayType columns require all elements of 
the array to be of the same data type.

At our company, we are currently rewriting old MapReduce code with Spark. One 
of the frequent use cases is aggregating data into time series:

Example input:
{noformat}
ID      date            data
1       2017-01-01      data_1_1
1       2018-02-02      data_1_2
2       2017-03-03      data_2_1
2       2018-04-04      data_2_2
...
{noformat}

Expected output:
{noformat}
ID      timeseries
1       [[2017-01-01, data_1_1],[2018-02-02, data_1_2]]
2       [[2017-03-03, data_2_1],[2018-04-04, data_2_2]]
...
{noformat}
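The aggregation above is essentially a group-by-ID collect of [date, data] pairs. A minimal plain-Python sketch of the intended result (row values taken from the example; the PySpark equivalent, roughly {{df.groupBy("ID").agg(collect_list(array("date", "data")))}}, only works when both columns share one data type):

```python
from itertools import groupby

# Input rows from the example above.
rows = [
    ("1", "2017-01-01", "data_1_1"),
    ("1", "2018-02-02", "data_1_2"),
    ("2", "2017-03-03", "data_2_1"),
    ("2", "2018-04-04", "data_2_2"),
]

# Group by ID and collect [date, data] pairs into a timeseries per ID.
timeseries = {
    key: [[d, v] for _, d, v in grp]
    for key, grp in groupby(sorted(rows), key=lambda r: r[0])
}
```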

Here, the values in the data column of the input are, in most cases, not 
primitive, but, for example, lists, dicts, or nested lists. Spark, however, 
does not support creating an array column out of a string column and a 
non-string column.

We would like to kindly ask you to implement one of the following:

1. Extend ArrayType to support elements of different data types

2. Introduce a new container type (ListType?) which would support elements of 
different types

UPDATE: The background here is that I want to be able to parse JSON arrays of 
differently-typed elements into Spark DataFrame columns, as well as create JSON 
arrays from such columns. See also [[SPARK-25226]] and [[SPARK-25227]]



> Add support for "List"-Type columns
> -----------------------------------
>
>                 Key: SPARK-25225
>                 URL: https://issues.apache.org/jira/browse/SPARK-25225
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark, Spark Core
>    Affects Versions: 2.3.1
>            Reporter: Yuriy Davygora
>            Priority: Minor
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
