[jira] [Commented] (SPARK-25225) Add support for "List"-Type columns

Yuriy Davygora (JIRA) Mon, 27 Aug 2018 01:07:36 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-25225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593282#comment-16593282
 ]


Yuriy Davygora commented on SPARK-25225:
----------------------------------------

[~maropu] Sorry, I was not quite clear on the ultimate purpose, which is to 
convert such columns to JSON arrays and to convert JSON arrays into Spark 
DataFrame columns.

[~hyukjin.kwon] Suppose that, in the above example, data is just an integer. I 
could cast it to a string, but when I later convert it to JSON it would be 
wrapped in quotation marks. I don't want that: whatever client software reads 
this JSON should read an integer, not a string.

Besides, in most cases, 'data' is not a primitive typed column. For example, in 
the code that I am working on right now, it is an array of arrays of integers, 
which cannot be cast to a string that easily.
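
The quoting problem can be reproduced without Spark. The following is a minimal 
sketch in plain Python using only the standard json module; the row contents 
are hypothetical values chosen for illustration and are not taken from the 
actual job.

```python
import json

# Hypothetical row: `data` is an array of arrays of integers.
row = {"date": "2017-01-01", "data": [[1, 2], [3]]}

# Casting `data` to a string first makes the encoder treat it as text,
# so the value ends up wrapped in quotation marks.
as_string = json.dumps([row["date"], str(row["data"])])

# Keeping the native types yields a nested JSON array instead.
as_native = json.dumps([row["date"], row["data"]])

print(as_string)  # ["2017-01-01", "[[1, 2], [3]]"]
print(as_native)  # ["2017-01-01", [[1, 2], [3]]]
```

A client parsing the first result gets a string as the second element, while 
the second result parses back to a list of lists of integers.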

> Add support for "List"-Type columns
> -----------------------------------
>
>                 Key: SPARK-25225
>                 URL: https://issues.apache.org/jira/browse/SPARK-25225
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark, Spark Core
>    Affects Versions: 2.3.1
>            Reporter: Yuriy Davygora
>            Priority: Minor
>
> At the moment, Spark DataFrame ArrayType columns require all elements of 
> the array to be of the same data type.
> At our company, we are currently rewriting old MapReduce code with Spark. One 
> of the frequent use-cases is aggregating data into timeseries:
> Example input:
> {noformat}
> ID    date            data
> 1     2017-01-01      data_1_1
> 1     2018-02-02      data_1_2
> 2     2017-03-03      data_2_1
> 2     2018-04-04      data_2_2
> ...
> {noformat}
> Expected output:
> {noformat}
> ID    timeseries
> 1     [[2017-01-01, data_1_1],[2018-02-02, data_1_2]]
> 2     [[2017-03-03, data_2_1],[2018-04-04, data_2_2]]
> ...
> {noformat}
> Here, the values in the data column of the input are, in most cases, not 
> primitive but, for example, lists, dicts, or nested lists. Spark, however, 
> does not support creating an array column from a string column and a 
> non-string column.
> We would like to kindly ask you to implement one of the following:
> 1. Extend ArrayType to support elements of different data types
> 2. Introduce a new container type (ListType?) that would support elements of 
> different types
> UPDATE: The background here is that I want to be able to parse JSON arrays 
> of differently typed elements into Spark DataFrame columns, as well as create 
> JSON arrays from such columns. See also [[SPARK-25226]] and [[SPARK-25227]].
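
To make the requested output shape concrete, here is a sketch in plain Python 
rather than Spark. It groups rows by ID into [date, data] pairs and encodes 
each group as a JSON array. The rows and the helper name build_timeseries are 
hypothetical and only illustrate the target format; this is not a proposed 
Spark API.

```python
import json
from collections import defaultdict

# Hypothetical rows mirroring the example input: (ID, date, data).
# The `data` values are made up and deliberately differ in type.
rows = [
    (1, "2017-01-01", 10),
    (1, "2018-02-02", [1, 2]),
    (2, "2017-03-03", {"k": "v"}),
    (2, "2018-04-04", [[1], [2, 3]]),
]

def build_timeseries(rows):
    """Group rows by ID into lists of [date, data] pairs."""
    grouped = defaultdict(list)
    for row_id, date, data in rows:
        # Each pair mixes a string with a non-string element.
        grouped[row_id].append([date, data])
    return dict(grouped)

timeseries = build_timeseries(rows)
encoded = {row_id: json.dumps(pairs) for row_id, pairs in timeseries.items()}

print(encoded[1])  # [["2017-01-01", 10], ["2018-02-02", [1, 2]]]
```

Each inner pair holds a date string next to a value of another type, which is 
the combination that a single-element-type ArrayType column cannot hold.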



