Re: Spark streaming RDDs to Parquet records
I have a similar case: I have an RDD[(List[Any], List[Long])] and want to save it as a Parquet file. My understanding is that only an RDD of case classes can be converted to a SchemaRDD. Is there any way to save this RDD as a Parquet file without using Avro?

Thanks in advance,
Anita

On 18 June 2014 05:03, Michael Armbrust <mich...@databricks.com> wrote:
> If you convert the data to a SchemaRDD you can save it as Parquet:
> http://spark.apache.org/docs/latest/sql-programming-guide.html#using-parquet
Re: Spark streaming RDDs to Parquet records
Unfortunately, I couldn’t figure it out without involving Avro. Here is something that may be useful, since it uses Avro generic records (so no case classes are needed) and converts them to Parquet:
http://blog.cloudera.com/blog/2014/05/how-to-convert-existing-data-into-parquet/

HTH,
Mahesh

From: Anita Tailor [via Apache Spark User List]
Date: Thursday, June 19, 2014 at 12:53 PM
To: Mahesh Padmanabhan
Subject: Re: Spark streaming RDDs to Parquet records

> I have a similar case: I have an RDD[(List[Any], List[Long])] and want to save it as a Parquet file. My understanding is that only an RDD of case classes can be converted to a SchemaRDD. Is there any way to save this RDD as a Parquet file without using Avro?
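For reference, the generic-record approach from that blog post can be sketched roughly as follows. This is a minimal sketch, not the post's exact code: it assumes the 2014-era parquet-mr artifact with `parquet.avro.AvroParquetWriter`, and the `Event` schema, field names, and output path are invented for illustration.

```scala
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericRecord}
import org.apache.hadoop.fs.Path
import parquet.avro.AvroParquetWriter

object AvroToParquetSketch {
  // A hypothetical Avro schema declared as JSON; no case class is required.
  val schemaJson =
    """{"type": "record", "name": "Event",
       | "fields": [
       |   {"name": "id",   "type": "long"},
       |   {"name": "body", "type": "string"}
       | ]}""".stripMargin
  val schema: Schema = new Schema.Parser().parse(schemaJson)

  def main(args: Array[String]): Unit = {
    // AvroParquetWriter derives the Parquet schema from the Avro schema.
    val writer =
      new AvroParquetWriter[GenericRecord](new Path("/tmp/events.parquet"), schema)
    try {
      val record = new GenericData.Record(schema)
      record.put("id", 1L)
      record.put("body", "hello")
      writer.write(record)
    } finally {
      writer.close() // closing flushes the Parquet footer
    }
  }
}
```

The appeal of this route is that the schema is data, not a compiled case class, so it can be built at runtime; the cost is carrying the Avro dependency.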
Re: Spark streaming RDDs to Parquet records
Thanks Mahesh. I came across this example; it looks like it might give us some direction:
https://github.com/Parquet/parquet-mr/tree/master/parquet-hadoop/src/main/java/parquet/hadoop/example

Thanks,
Anita

On 20 June 2014 09:03, maheshtwc <mahesh.padmanab...@twc-contractor.com> wrote:
> Unfortunately, I couldn’t figure it out without involving Avro. Here is something that may be useful, since it uses Avro generic records (so no case classes are needed) and converts them to Parquet:
> http://blog.cloudera.com/blog/2014/05/how-to-convert-existing-data-into-parquet/
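For anyone landing on this thread: the example classes Anita links (`Group`, `ExampleOutputFormat`) can in principle be driven from Spark through the Hadoop output-format API, which avoids both Avro and case classes. A rough sketch under those assumptions — the package and class names are from 2014-era parquet-mr, and the schema, RDD contents, and paths are invented for illustration:

```scala
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.{SparkConf, SparkContext}
import parquet.example.data.Group
import parquet.example.data.simple.SimpleGroupFactory
import parquet.hadoop.example.{ExampleOutputFormat, GroupWriteSupport}
import parquet.schema.MessageTypeParser

object GroupToParquetSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("group-to-parquet"))

    // Declare the Parquet schema directly, as a string; no Avro involved.
    val schema = MessageTypeParser.parseMessageType(
      "message event { required int64 id; required binary body; }")

    val job = new Job(sc.hadoopConfiguration)
    GroupWriteSupport.setSchema(schema, job.getConfiguration)

    // Build Group records from ordinary tuples.
    val factory = new SimpleGroupFactory(schema)
    val groups = sc.parallelize(Seq(1L -> "a", 2L -> "b")).map { case (id, body) =>
      val g = factory.newGroup().append("id", id).append("body", body)
      (null: Void, g) // the key is ignored by ParquetOutputFormat
    }

    groups.saveAsNewAPIHadoopFile(
      "/tmp/events-parquet",
      classOf[Void], classOf[Group],
      classOf[ExampleOutputFormat],
      job.getConfiguration)
  }
}
```

Treat this as a direction rather than a recipe: the `example` package is demo code inside parquet-mr, so its API may shift between versions.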
Spark streaming RDDs to Parquet records
Hello,

Is there an easy way to convert RDDs within a DStream into Parquet records? Here is some incomplete pseudo code:

// Create streaming context
val ssc = new StreamingContext(...)

// Obtain a DStream of events
val ds = KafkaUtils.createStream(...)

// Get the Spark context to get to the SQL context
val sc = ds.context.sparkContext
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// For each RDD
ds.foreachRDD((rdd: RDD[Array[Byte]]) => {
  // What do I do next?
})

Thanks,
Mahesh

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-streaming-RDDs-to-Parquet-records-tp7762.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Spark streaming RDDs to Parquet records
Mahesh,

- One direction could be: create a Parquet schema, then convert and save the records to HDFS.
- This might help: https://github.com/massie/spark-parquet-example/blob/master/src/main/scala/com/zenfractal/SparkParquetExample.scala

Cheers,
k/

On Tue, Jun 17, 2014 at 12:52 PM, maheshtwc <mahesh.padmanab...@twc-contractor.com> wrote:
> Hello, is there an easy way to convert RDDs within a DStream into Parquet records?
Re: Spark streaming RDDs to Parquet records
Thanks Krishna. It seems like you have to use Avro and then convert that to Parquet. I was hoping to convert RDDs directly to Parquet files. I’ll look into this some more.

Thanks,
Mahesh

From: Krishna Sankar <ksanka...@gmail.com>
Reply-To: user@spark.apache.org
Date: Tuesday, June 17, 2014 at 2:41 PM
To: user@spark.apache.org
Subject: Re: Spark streaming RDDs to Parquet records

> Mahesh,
> * One direction could be: create a Parquet schema, then convert and save the records to HDFS.
> * This might help: https://github.com/massie/spark-parquet-example/blob/master/src/main/scala/com/zenfractal/SparkParquetExample.scala

This E-mail and any of its attachments may contain Time Warner Cable proprietary information, which is privileged, confidential, or subject to copyright belonging to Time Warner Cable. This E-mail is intended solely for the use of the individual or entity to which it is addressed. If you are not the intended recipient of this E-mail, you are hereby notified that any dissemination, distribution, copying, or action taken in relation to the contents of and attachments to this E-mail is strictly prohibited and may be unlawful. If you have received this E-mail in error, please notify the sender immediately and permanently delete the original and any copy of this E-mail and any printout.
Re: Spark streaming RDDs to Parquet records
If you convert the data to a SchemaRDD you can save it as Parquet:
http://spark.apache.org/docs/latest/sql-programming-guide.html#using-parquet

On Tue, Jun 17, 2014 at 11:47 PM, Padmanabhan, Mahesh (contractor) <mahesh.padmanab...@twc-contractor.com> wrote:
> Thanks Krishna. It seems like you have to use Avro and then convert that to Parquet. I was hoping to convert RDDs directly to Parquet files. I’ll look into this some more.
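To make this concrete in the streaming setting of the original question, here is a rough sketch against the Spark 1.0-era API. The `createSchemaRDD` implicit and `SchemaRDD.saveAsParquetFile` come from the SQL programming guide linked above; the `Event` case class, the socket source standing in for `KafkaUtils.createStream(...)`, the line format, and the paths are all invented for illustration.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

// A case class gives Spark SQL the schema it needs for the implicit conversion.
case class Event(id: Long, body: String)

object StreamToParquetSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext("local[2]", "stream-to-parquet", Seconds(10))

    // Stand-in source: one comma-separated event per line.
    val ds = ssc.socketTextStream("localhost", 9999)

    val sqlContext = new org.apache.spark.sql.SQLContext(ssc.sparkContext)
    import sqlContext.createSchemaRDD // implicit RDD[Event] => SchemaRDD

    ds.foreachRDD { (rdd: RDD[String]) =>
      // Parse each batch into the case class, then write it out as Parquet.
      val events = rdd.map { line =>
        val Array(id, body) = line.split(",", 2)
        Event(id.toLong, body)
      }
      events.saveAsParquetFile(s"/tmp/events/batch-${System.currentTimeMillis}")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Each micro-batch lands in its own directory; whether small per-batch Parquet files are acceptable (or need a downstream compaction step) depends on the batch interval and volume.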