Re: Spark streaming RDDs to Parquet records

2014-06-19 Thread Anita Tailor
I have a similar case, where I have an RDD[(List[Any], List[Long])] and want
to save it as a Parquet file.
My understanding is that only an RDD of case classes can be converted to a
SchemaRDD. So is there any way I can save this RDD as a Parquet file without
using Avro?

Thanks in advance
Anita





Re: Spark streaming RDDs to Parquet records

2014-06-19 Thread contractor
Unfortunately, I couldn’t figure it out without involving Avro.

Here is something that may be useful since it uses Avro generic records (so no 
case classes needed) and transforms to Parquet.

http://blog.cloudera.com/blog/2014/05/how-to-convert-existing-data-into-parquet/

HTH,
Mahesh
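The approach in that Cloudera post boils down to building Avro GenericRecords (no generated classes or case classes) and feeding them to an AvroParquetWriter. A rough sketch, using the pre-Apache `parquet.avro` package names of that era; the schema, field names, and paths are invented for illustration:

```scala
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericRecord}
import org.apache.hadoop.fs.Path
import parquet.avro.AvroParquetWriter  // from parquet-avro

// Hypothetical Avro schema: no Scala case class involved anywhere.
val schemaJson = """
  {"type": "record", "name": "Event",
   "fields": [{"name": "id", "type": "long"},
              {"name": "payload", "type": "string"}]}
"""
val schema = new Schema.Parser().parse(schemaJson)

// Writer appends one record at a time to a Parquet file on HDFS.
val writer = new AvroParquetWriter[GenericRecord](
  new Path("hdfs:///tmp/events.parquet"), schema)
try {
  val rec = new GenericData.Record(schema)
  rec.put("id", 1L)
  rec.put("payload", "hello")
  writer.write(rec)
} finally {
  writer.close()
}
```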



Re: Spark streaming RDDs to Parquet records

2014-06-19 Thread Anita Tailor
Thanks Mahesh,

I came across this example; it looks like it might give us some direction.

https://github.com/Parquet/parquet-mr/tree/master/parquet-hadoop/src/main/java/parquet/hadoop/example

Thanks
Anita
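That example is the `Group`/`GroupWriteSupport` path in parquet-mr, which avoids Avro (and case classes) entirely. A sketch of how it might be wired up from Spark, using the pre-Apache `parquet.*` package names that 2014 code used; the schema, field names, and output path are made up:

```scala
import org.apache.hadoop.mapreduce.Job
import parquet.example.data.Group
import parquet.example.data.simple.SimpleGroupFactory
import parquet.hadoop.example.{ExampleOutputFormat, GroupWriteSupport}
import parquet.schema.MessageTypeParser

// Declare the Parquet schema directly, with no Avro in the picture.
val schema = MessageTypeParser.parseMessageType(
  "message pair { required int64 num; required binary tag; }")

val job = new Job()
GroupWriteSupport.setSchema(schema, job.getConfiguration)
val factory = new SimpleGroupFactory(schema)

// rdd: RDD[(Long, String)] built upstream; Parquet's output format
// takes (key, value) pairs with a null key.
val groups = rdd.map { case (n, t) =>
  (null, factory.newGroup().append("num", n).append("tag", t))
}
groups.saveAsNewAPIHadoopFile(
  "hdfs:///tmp/pairs.parquet",
  classOf[Void], classOf[Group],
  classOf[ExampleOutputFormat], job.getConfiguration)
```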



Spark streaming RDDs to Parquet records

2014-06-17 Thread maheshtwc
Hello,

Is there an easy way to convert RDDs within a DStream into Parquet records?
Here is some incomplete pseudo code:

// Create streaming context
val ssc = new StreamingContext(...)

// Obtain a DStream of events
val ds = KafkaUtils.createStream(...)

// Get Spark context to get to the SQL context
val sc = ds.context.sparkContext

val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// For each RDD
ds.foreachRDD((rdd: RDD[Array[Byte]]) => {

// What do I do next?
})
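One way the body could be filled in with the Spark 1.0 SQL API, assuming each `Array[Byte]` deserializes into some flat record; the `Event` case class and `decode` helper here are placeholders for your real format:

```scala
import org.apache.spark.rdd.RDD

// Hypothetical record type and decoder; substitute your own.
case class Event(id: Long, payload: String)
def decode(bytes: Array[Byte]): Event = ???

ds.foreachRDD((rdd: RDD[Array[Byte]]) => {
  // Implicitly converts RDD[Event] into a SchemaRDD.
  import sqlContext.createSchemaRDD
  val events = rdd.map(decode)
  // Each streaming batch becomes its own Parquet output directory.
  events.saveAsParquetFile(s"hdfs:///tmp/events-${System.currentTimeMillis}")
})
```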

Thanks,
Mahesh



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-streaming-RDDs-to-Parquet-records-tp7762.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Spark streaming RDDs to Parquet records

2014-06-17 Thread Krishna Sankar
Mahesh,

   - One direction could be: create a Parquet schema, then convert & save
   the records to HDFS.
   - This might help:
   https://github.com/massie/spark-parquet-example/blob/master/src/main/scala/com/zenfractal/SparkParquetExample.scala

Cheers
k/
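The linked example pairs Spark's `saveAsNewAPIHadoopFile` with Parquet's Avro output format. Roughly, sketched from the shape of that example against the pre-Apache parquet-avro API; the schema and paths are placeholders:

```scala
import org.apache.avro.generic.GenericRecord
import org.apache.hadoop.mapreduce.Job
import parquet.avro.{AvroParquetOutputFormat, AvroWriteSupport}
import parquet.hadoop.ParquetOutputFormat

// Tell the Parquet output format how to translate Avro records.
val job = new Job()
ParquetOutputFormat.setWriteSupportClass(job, classOf[AvroWriteSupport])
AvroParquetOutputFormat.setSchema(job, schema)  // an Avro Schema built elsewhere

// records: RDD[GenericRecord]; the output format expects (key, value)
// pairs, with the key left null.
records.map(r => (null, r))
  .saveAsNewAPIHadoopFile(
    "hdfs:///tmp/out.parquet",
    classOf[Void], classOf[GenericRecord],
    classOf[ParquetOutputFormat[GenericRecord]], job.getConfiguration)
```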





Re: Spark streaming RDDs to Parquet records

2014-06-17 Thread contractor
Thanks Krishna. Seems like you have to use Avro and then convert that to 
Parquet. I was hoping to directly convert RDDs to Parquet files. I’ll look into 
this some more.

Thanks,
Mahesh




This E-mail and any of its attachments may contain Time Warner Cable 
proprietary information, which is privileged, confidential, or subject to 
copyright belonging to Time Warner Cable. This E-mail is intended solely for 
the use of the individual or entity to which it is addressed. If you are not 
the intended recipient of this E-mail, you are hereby notified that any 
dissemination, distribution, copying, or action taken in relation to the 
contents of and attachments to this E-mail is strictly prohibited and may be 
unlawful. If you have received this E-mail in error, please notify the sender 
immediately and permanently delete the original and any copy of this E-mail and 
any printout.


Re: Spark streaming RDDs to Parquet records

2014-06-17 Thread Michael Armbrust
If you convert the data to a SchemaRDD you can save it as Parquet:
http://spark.apache.org/docs/latest/sql-programming-guide.html#using-parquet
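In Spark 1.0 terms that looks roughly like the following; the `Person` case class and paths are just for illustration:

```scala
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// Implicitly turns an RDD of case classes into a SchemaRDD.
import sqlContext.createSchemaRDD

case class Person(name: String, age: Int)
val people = sc.parallelize(Seq(Person("alice", 30), Person("bob", 25)))

// The SchemaRDD gained via the implicit conversion can write Parquet directly.
people.saveAsParquetFile("hdfs:///tmp/people.parquet")

// Reading back yields a SchemaRDD that can be registered and queried.
val loaded = sqlContext.parquetFile("hdfs:///tmp/people.parquet")
loaded.registerAsTable("people")
```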

