[jira] [Updated] (SPARK-20470) Invalid json converting RDD row with Array of struct to json

Philip Adetiloye (JIRA) Wed, 26 Apr 2017 02:18:36 -0700

     [ https://issues.apache.org/jira/browse/SPARK-20470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Philip Adetiloye updated SPARK-20470:
-------------------------------------
    Description: 
Converting a PySpark RDD row that contains an array of structs to JSON does not 
produce the expected output. It looks trivial, but I can't get correct JSON out.

I read the JSON below into a DataFrame:
{code}
{
  "feature": "feature_id_001",
  "histogram": [
    {
      "start": 1.9796095151877942,
      "y": 968.0,
      "width": 0.1564485056196041
    },
    {
      "start": 2.1360580208073983,
      "y": 892.0,
      "width": 0.1564485056196041
    },
    {
      "start": 2.2925065264270024,
      "y": 814.0,
      "width": 0.15644850561960366
    },
    {
      "start": 2.448955032046606,
      "y": 690.0,
      "width": 0.1564485056196041
    }]
}
{code}

The DataFrame schema looks correct:

{code}
 root
  |-- feature: string (nullable = true)
  |-- histogram: array (nullable = true)
  |    |-- element: struct (containsNull = true)
  |    |    |-- start: double (nullable = true)
  |    |    |-- width: double (nullable = true)
  |    |    |-- y: double (nullable = true)
{code}

I now need to convert each row to JSON and save it to HBase:
{code}
    rdd1 = rdd.map(lambda row: Row(x = json.dumps(row.asDict())))
{code}

Output JSON (wrong: each struct is serialized as a bare array, so the field names are lost):
{code}
{
  "feature": "feature_id_001",
  "histogram": [
    [
      1.9796095151877942,
      968.0,
      0.1564485056196041
    ],
    [
      2.1360580208073983,
      892.0,
      0.1564485056196041
    ],
    [
      2.2925065264270024,
      814.0,
      0.15644850561960366
    ],
    [
      2.448955032046606,
      690.0,
      0.1564485056196041
    ]
  ]
}
{code}
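
The output above is consistent with a shallow dict conversion: the top-level row becomes a dict, but each nested struct is still a tuple-like object, and json.dumps writes any tuple as a JSON array. The sketch below reproduces that mechanism in plain Python. The namedtuples Bin and Record are stand-ins for pyspark.sql.Row, and to_plain is a hypothetical helper; none of these names come from the report or from Spark.

```python
import json
from collections import namedtuple

# Stand-ins for pyspark.sql.Row (an assumption for illustration): like Row,
# a namedtuple is a tuple subclass that also carries field names.
Bin = namedtuple("Bin", ["start", "width", "y"])
Record = namedtuple("Record", ["feature", "histogram"])


def to_plain(value):
    """Recursively turn named tuples into dicts so json.dumps keeps field names."""
    if isinstance(value, tuple) and hasattr(value, "_fields"):
        return {name: to_plain(v) for name, v in zip(value._fields, value)}
    if isinstance(value, (list, tuple)):
        return [to_plain(v) for v in value]
    if isinstance(value, dict):
        return {k: to_plain(v) for k, v in value.items()}
    return value


record = Record(
    "feature_id_001",
    [Bin(1.9796095151877942, 0.1564485056196041, 968.0)],
)

# Shallow conversion: only the outer record becomes a dict, so the nested
# Bin is still a tuple and is written as a JSON array.
shallow = json.dumps(record._asdict())

# Recursive conversion: nested structs become dicts and keep their keys.
deep = json.dumps(to_plain(record))

print(shallow)
print(deep)
```

In PySpark itself, Row.asDict accepts a recursive flag, so row.asDict(recursive=True) may be enough, and DataFrame.toJSON() returns an RDD of JSON strings directly. Both suggestions are untested against 1.6.3.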




> Invalid json converting RDD row with Array of struct to json
> ------------------------------------------------------------
>
>                 Key: SPARK-20470
>                 URL: https://issues.apache.org/jira/browse/SPARK-20470
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.6.3
>            Reporter: Philip Adetiloye
>



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
