Re: How to convert Array of Json rows into Dataset of specific columns in Spark 2.2.0?

2018-01-10 Thread Selvam Raman
I just followed Hien Luu's approach:

val empExplode = empInfoStrDF.select(explode(from_json('emp_info_str,
  empInfoSchema)).as("emp_info_withexplode"))


empExplode.show(false)

+-------------------------------------------+
|emp_info_withexplode                       |
+-------------------------------------------+
|[foo,[CA,USA],WrappedArray([english,2016])]|
|[bar,[OH,USA],WrappedArray([math,2017])]   |
+-------------------------------------------+

empExplode.select($"emp_info_withexplode.name").show(false)


+----+
|name|
+----+
|foo |
|bar |
+----+

empExplode.select($"emp_info_withexplode.address.state").show(false)

+-----+
|state|
+-----+
|CA   |
|OH   |
+-----+

empExplode.select($"emp_info_withexplode.docs.subject").show(false)

+---------+
|subject  |
+---------+
|[english]|
|[math]   |
+---------+
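
To get the name | address | docs layout from the original question in a single
select, here is a small untested sketch building on the empExplode DataFrame
above (same spark-shell session and schema):

val flattened = empExplode.select(
  $"emp_info_withexplode.name".as("name"),
  $"emp_info_withexplode.address".as("address"),
  $"emp_info_withexplode.docs".as("docs"))

flattened.show(false)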


@Kant kodali, is that helpful for you? If not, please let me know what
changes you are expecting.




On Sun, Jan 7, 2018 at 12:16 AM, Jules Damji  wrote:

> Here are a couple of tutorials that show how to extract structured nested
> data
>
> https://databricks.com/blog/2017/06/27/4-sql-high-order-lambda-functions-examine-complex-structured-data-databricks.html
>
> https://databricks.com/blog/2017/06/13/five-spark-sql-utility-functions-extract-explore-complex-data-types.html
>
> Sent from my iPhone
> Pardon the dumb thumb typos :)
>
> On Jan 6, 2018, at 11:42 AM, Hien Luu  wrote:
>
> Hi Kant,
>
> I am not sure whether you had come up with a solution yet, but the
> following
> works for me (in Scala)
>
> val emp_info = """
>   [
>     {"name": "foo", "address": {"state": "CA", "country": "USA"}, "docs": [{"subject": "english", "year": 2016}]},
>     {"name": "bar", "address": {"state": "OH", "country": "USA"}, "docs": [{"subject": "math", "year": 2017}]}
>   ]"""
>
> import org.apache.spark.sql.types._
>
> val addressSchema = new StructType().add("state", StringType).add("country", StringType)
> val docsSchema = ArrayType(new StructType().add("subject", StringType).add("year", IntegerType))
> val employeeSchema = new StructType().add("name", StringType).add("address", addressSchema).add("docs", docsSchema)
>
> val empInfoSchema = ArrayType(employeeSchema)
>
> empInfoSchema.json
>
> val empInfoStrDF = Seq((emp_info)).toDF("emp_info_str")
> empInfoStrDF.printSchema
> empInfoStrDF.show(false)
>
> val empInfoDF = empInfoStrDF.select(from_json('emp_info_str, empInfoSchema).as("emp_info"))
> empInfoDF.printSchema
>
> empInfoDF.select(struct("*")).show(false)
>
> empInfoDF.select("emp_info.name", "emp_info.address", "emp_info.docs").show(false)
>
> empInfoDF.select(explode('emp_info.getItem("name"))).show
>
>
>
>
>
>


-- 
Selvam Raman
"லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"


Re: How to convert Array of Json rows into Dataset of specific columns in Spark 2.2.0?

2018-01-06 Thread Jules Damji
Here are a couple of tutorials that show how to extract structured nested data:

https://databricks.com/blog/2017/06/27/4-sql-high-order-lambda-functions-examine-complex-structured-data-databricks.html

https://databricks.com/blog/2017/06/13/five-spark-sql-utility-functions-extract-explore-complex-data-types.html

Sent from my iPhone
Pardon the dumb thumb typos :)

> On Jan 6, 2018, at 11:42 AM, Hien Luu  wrote:
> 
> Hi Kant,
> 
> I am not sure whether you had come up with a solution yet, but the following
> works for me (in Scala)
> 
> val emp_info = """
>   [
>     {"name": "foo", "address": {"state": "CA", "country": "USA"}, "docs": [{"subject": "english", "year": 2016}]},
>     {"name": "bar", "address": {"state": "OH", "country": "USA"}, "docs": [{"subject": "math", "year": 2017}]}
>   ]"""
>
> import org.apache.spark.sql.types._
>
> val addressSchema = new StructType().add("state", StringType).add("country", StringType)
> val docsSchema = ArrayType(new StructType().add("subject", StringType).add("year", IntegerType))
> val employeeSchema = new StructType().add("name", StringType).add("address", addressSchema).add("docs", docsSchema)
>
> val empInfoSchema = ArrayType(employeeSchema)
>
> empInfoSchema.json
>
> val empInfoStrDF = Seq((emp_info)).toDF("emp_info_str")
> empInfoStrDF.printSchema
> empInfoStrDF.show(false)
>
> val empInfoDF = empInfoStrDF.select(from_json('emp_info_str, empInfoSchema).as("emp_info"))
> empInfoDF.printSchema
>
> empInfoDF.select(struct("*")).show(false)
>
> empInfoDF.select("emp_info.name", "emp_info.address", "emp_info.docs").show(false)
>
> empInfoDF.select(explode('emp_info.getItem("name"))).show
> 
> 
> 
> 
> 


RE: How to convert Array of Json rows into Dataset of specific columns in Spark 2.2.0?

2018-01-06 Thread Hien Luu
Hi Kant,

I am not sure whether you have come up with a solution yet, but the following
works for me (in Scala):

val emp_info = """
  [
    {"name": "foo", "address": {"state": "CA", "country": "USA"}, "docs": [{"subject": "english", "year": 2016}]},
    {"name": "bar", "address": {"state": "OH", "country": "USA"}, "docs": [{"subject": "math", "year": 2017}]}
  ]"""

import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.{from_json, explode, struct}
import spark.implicits._  // for toDF and the 'symbol column syntax (already in scope in spark-shell)

val addressSchema = new StructType().add("state", StringType).add("country", StringType)
val docsSchema = ArrayType(new StructType().add("subject", StringType).add("year", IntegerType))
val employeeSchema = new StructType().add("name", StringType).add("address", addressSchema).add("docs", docsSchema)

val empInfoSchema = ArrayType(employeeSchema)

empInfoSchema.json

val empInfoStrDF = Seq((emp_info)).toDF("emp_info_str")
empInfoStrDF.printSchema
empInfoStrDF.show(false)

val empInfoDF = empInfoStrDF.select(from_json('emp_info_str, empInfoSchema).as("emp_info"))
empInfoDF.printSchema

empInfoDF.select(struct("*")).show(false)

empInfoDF.select("emp_info.name", "emp_info.address", "emp_info.docs").show(false)

empInfoDF.select(explode('emp_info.getItem("name"))).show
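
To reproduce the name | address | docs layout Kant asked for, a short untested
follow-on sketch (same session as above) explodes the parsed array and then
star-selects the fields of the resulting struct:

empInfoDF
  .select(explode('emp_info).as("emp"))
  .select("emp.*")    // expands to the columns name, address, docs
  .show(false)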







RE: How to convert Array of Json rows into Dataset of specific columns in Spark 2.2.0?

2017-10-10 Thread JG Perrin
Something along the lines of:

Dataset<Row> df = spark.read().json(jsonDf); ?
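
A minimal Scala sketch of that idea, assuming ds is the DataFrame with the
single string column "value" from Kant's example and spark is the
SparkSession. DataFrameReader.json accepts a Dataset[String] from Spark 2.2
onwards; how it handles strings whose root element is a JSON array should be
verified, so the from_json/explode approach elsewhere in this thread is the
safer route for that shape:

import spark.implicits._

val jsonStrings = ds.select("value").as[String]  // Dataset[String], one JSON document per row
val inferred    = spark.read.json(jsonStrings)   // schema inferred from the JSON itself
inferred.printSchema()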


From: kant kodali [mailto:kanth...@gmail.com]
Sent: Saturday, October 07, 2017 2:31 AM
To: user @spark 
Subject: How to convert Array of Json rows into Dataset of specific columns in 
Spark 2.2.0?


I have a Dataset ds which consists of json rows.

Sample Json Row (This is just an example of one row in the dataset)

[

{"name": "foo", "address": {"state": "CA", "country": "USA"}, 
"docs":[{"subject": "english", "year": 2016}]}

{"name": "bar", "address": {"state": "OH", "country": "USA"}, 
"docs":[{"subject": "math", "year": 2017}]}



]

ds.printSchema()

root

 |-- value: string (nullable = true)

Now I want to convert into the following dataset using Spark 2.2.0

name  | address                            | docs
--------------------------------------------------------------------------------
"foo" | {"state": "CA", "country": "USA"}  | [{"subject": "english", "year": 2016}]
"bar" | {"state": "OH", "country": "USA"}  | [{"subject": "math", "year": 2017}]

Preferably Java but Scala is also fine as long as there are functions available 
in Java API


Re: How to convert Array of Json rows into Dataset of specific columns in Spark 2.2.0?

2017-10-09 Thread kant kodali
https://issues.apache.org/jira/browse/SPARK-8

On Sun, Oct 8, 2017 at 11:58 AM, kant kodali  wrote:

> I have the following so far
>
> private StructType getSchema() {
> return new StructType()
> .add("name", StringType)
> .add("address", StringType)
> .add("docs", StringType);
> }
>
>
> ds.select(explode_outer(from_json(ds.col("value"),
> ArrayType.apply(getSchema()))).as("result")).selectExpr("result.*");
>
> This didn't quite work for me so just to clarify I have Json array of
> documents as my input string
> and I am trying to keep the values of my name, address, docs columns as a
> string as well except
> my input array string is flattened out by explode function.
> Any suggestions will be great
> Thanks!
>
>
> On Sat, Oct 7, 2017 at 10:00 AM, Jules Damji  wrote:
>
>> You might find these blogs helpful to parse & extract data from complex
>> structures:
>>
>> https://databricks.com/blog/2017/06/27/4-sql-high-order-lambda-functions-examine-complex-structured-data-databricks.html
>>
>> https://databricks.com/blog/2017/06/13/five-spark-sql-utility-functions-extract-explore-complex-data-types.html
>>
>> Cheers
>> Jules
>>
>>
>> Sent from my iPhone
>> Pardon the dumb thumb typos :)
>>
>> On Oct 7, 2017, at 12:30 AM, kant kodali  wrote:
>>
>> I have a Dataset ds which consists of json rows.
>>
>> *Sample Json Row (This is just an example of one row in the dataset)*
>>
>> [
>> {"name": "foo", "address": {"state": "CA", "country": "USA"}, 
>> "docs":[{"subject": "english", "year": 2016}]}
>> {"name": "bar", "address": {"state": "OH", "country": "USA"}, 
>> "docs":[{"subject": "math", "year": 2017}]}
>>
>> ]
>>
>> ds.printSchema()
>>
>> root
>>  |-- value: string (nullable = true)
>>
>> Now I want to convert into the following dataset using Spark 2.2.0
>>
>> name  | address   |  docs
>> --
>> "foo" | {"state": "CA", "country": "USA"} | [{"subject": "english", "year": 
>> 2016}]
>> "bar" | {"state": "OH", "country": "USA"} | [{"subject": "math", "year": 
>> 2017}]
>>
>> Preferably Java but Scala is also fine as long as there are functions
>> available in Java API
>>
>>
>


Re: How to convert Array of Json rows into Dataset of specific columns in Spark 2.2.0?

2017-10-08 Thread kant kodali
I have the following so far

private StructType getSchema() {
return new StructType()
.add("name", StringType)
.add("address", StringType)
.add("docs", StringType);
}


ds.select(explode_outer(from_json(ds.col("value"),
    ArrayType.apply(getSchema()))).as("result")).selectExpr("result.*");

This didn't quite work for me, so just to clarify: I have a JSON array of
documents as my input string, and I am trying to keep the values of my name,
address, and docs columns as strings as well, except that my input array
string is flattened out by the explode function.
Any suggestions would be great.
Thanks!
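
One way to keep name, address, and docs as plain strings after parsing is to
use the full nested schema first and then turn the nested pieces back into
strings with to_json. A minimal untested Scala sketch, assuming ds has the
single string column "value" and spark.implicits._ is in scope (as in
spark-shell); to_json support for array columns should be verified on 2.2.0:

import org.apache.spark.sql.functions.{explode, from_json, to_json}
import org.apache.spark.sql.types._

// Same nested schema as in Hien Luu's message in this thread.
val empInfoSchema = ArrayType(new StructType()
  .add("name", StringType)
  .add("address", new StructType().add("state", StringType).add("country", StringType))
  .add("docs", ArrayType(new StructType().add("subject", StringType).add("year", IntegerType))))

val parsed = ds.select(explode(from_json($"value", empInfoSchema)).as("emp"))

val asStrings = parsed.select(
  $"emp.name".as("name"),
  to_json($"emp.address").as("address"),  // struct back to a JSON string
  to_json($"emp.docs").as("docs"))        // array of structs back to a JSON string

asStrings.printSchema()  // all three columns should now be strings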


On Sat, Oct 7, 2017 at 10:00 AM, Jules Damji  wrote:

> You might find these blogs helpful to parse & extract data from complex
> structures:
>
> https://databricks.com/blog/2017/06/27/4-sql-high-order-lambda-functions-examine-complex-structured-data-databricks.html
>
> https://databricks.com/blog/2017/06/13/five-spark-sql-utility-functions-extract-explore-complex-data-types.html
>
> Cheers
> Jules
>
>
> Sent from my iPhone
> Pardon the dumb thumb typos :)
>
> On Oct 7, 2017, at 12:30 AM, kant kodali  wrote:
>
> I have a Dataset ds which consists of json rows.
>
> *Sample Json Row (This is just an example of one row in the dataset)*
>
> [
> {"name": "foo", "address": {"state": "CA", "country": "USA"}, 
> "docs":[{"subject": "english", "year": 2016}]}
> {"name": "bar", "address": {"state": "OH", "country": "USA"}, 
> "docs":[{"subject": "math", "year": 2017}]}
>
> ]
>
> ds.printSchema()
>
> root
>  |-- value: string (nullable = true)
>
> Now I want to convert into the following dataset using Spark 2.2.0
>
> name  | address   |  docs
> --
> "foo" | {"state": "CA", "country": "USA"} | [{"subject": "english", "year": 
> 2016}]
> "bar" | {"state": "OH", "country": "USA"} | [{"subject": "math", "year": 
> 2017}]
>
> Preferably Java but Scala is also fine as long as there are functions
> available in Java API
>
>


Re: How to convert Array of Json rows into Dataset of specific columns in Spark 2.2.0?

2017-10-07 Thread Jules Damji
You might find these blogs helpful to parse & extract data from complex 
structures:

https://databricks.com/blog/2017/06/27/4-sql-high-order-lambda-functions-examine-complex-structured-data-databricks.html

https://databricks.com/blog/2017/06/13/five-spark-sql-utility-functions-extract-explore-complex-data-types.html

Cheers 
Jules


Sent from my iPhone
Pardon the dumb thumb typos :)

> On Oct 7, 2017, at 12:30 AM, kant kodali  wrote:
> 
> I have a Dataset ds which consists of json rows.
> 
> Sample Json Row (This is just an example of one row in the dataset)
> 
> [ 
> {"name": "foo", "address": {"state": "CA", "country": "USA"}, 
> "docs":[{"subject": "english", "year": 2016}]}
> {"name": "bar", "address": {"state": "OH", "country": "USA"}, 
> "docs":[{"subject": "math", "year": 2017}]}
> 
> ]
> ds.printSchema()
> 
> root
>  |-- value: string (nullable = true)
> Now I want to convert into the following dataset using Spark 2.2.0
> 
> name  | address   |  docs 
> --
> "foo" | {"state": "CA", "country": "USA"} | [{"subject": "english", "year": 
> 2016}]
> "bar" | {"state": "OH", "country": "USA"} | [{"subject": "math", "year": 
> 2017}]
> Preferably Java but Scala is also fine as long as there are functions 
> available in Java API


Re: How to convert Array of Json rows into Dataset of specific columns in Spark 2.2.0?

2017-10-07 Thread Matteo Cossu
Hello,
I think you should use from_json from spark.sql.functions to parse the JSON
string and convert it to a StructType. Afterwards, you can create a new
Dataset by selecting the columns you want.
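
For example, a minimal untested Scala sketch of that idea, assuming the single
string column is named "value" as in Kant's printSchema output, that the
schema below matches the sample rows, and that spark.implicits._ is in scope
(as in spark-shell):

import org.apache.spark.sql.functions.{explode, from_json}
import org.apache.spark.sql.types._

val schema = ArrayType(new StructType()
  .add("name", StringType)
  .add("address", new StructType().add("state", StringType).add("country", StringType))
  .add("docs", ArrayType(new StructType().add("subject", StringType).add("year", IntegerType))))

// Parse each JSON array, explode it into one row per employee, then select the columns.
val result = ds
  .select(explode(from_json($"value", schema)).as("emp"))
  .select("emp.name", "emp.address", "emp.docs")

result.show(false)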

On 7 October 2017 at 09:30, kant kodali  wrote:

> I have a Dataset ds which consists of json rows.
>
> *Sample Json Row (This is just an example of one row in the dataset)*
>
> [
> {"name": "foo", "address": {"state": "CA", "country": "USA"}, 
> "docs":[{"subject": "english", "year": 2016}]}
> {"name": "bar", "address": {"state": "OH", "country": "USA"}, 
> "docs":[{"subject": "math", "year": 2017}]}
>
> ]
>
> ds.printSchema()
>
> root
>  |-- value: string (nullable = true)
>
> Now I want to convert into the following dataset using Spark 2.2.0
>
> name  | address   |  docs
> --
> "foo" | {"state": "CA", "country": "USA"} | [{"subject": "english", "year": 
> 2016}]
> "bar" | {"state": "OH", "country": "USA"} | [{"subject": "math", "year": 
> 2017}]
>
> Preferably Java but Scala is also fine as long as there are functions
> available in Java API
>


How to convert Array of Json rows into Dataset of specific columns in Spark 2.2.0?

2017-10-07 Thread kant kodali
I have a Dataset ds which consists of json rows.

*Sample Json Row (This is just an example of one row in the dataset)*

[
  {"name": "foo", "address": {"state": "CA", "country": "USA"},
   "docs": [{"subject": "english", "year": 2016}]},
  {"name": "bar", "address": {"state": "OH", "country": "USA"},
   "docs": [{"subject": "math", "year": 2017}]}
]

ds.printSchema()

root
 |-- value: string (nullable = true)

Now I want to convert into the following dataset using Spark 2.2.0

name  | address                            | docs
--------------------------------------------------------------------------------
"foo" | {"state": "CA", "country": "USA"}  | [{"subject": "english", "year": 2016}]
"bar" | {"state": "OH", "country": "USA"}  | [{"subject": "math", "year": 2017}]

Preferably Java, but Scala is also fine as long as the functions are
available in the Java API.