Thanks Mahesh, I came across this example; it looks like it might give us some direction.
https://github.com/Parquet/parquet-mr/tree/master/parquet-hadoop/src/main/java/parquet/hadoop/example

Thanks
Anita

On 20 June 2014 09:03, maheshtwc <mahesh.padmanab...@twc-contractor.com> wrote:

> Unfortunately, I couldn’t figure it out without involving Avro.
>
> Here is something that may be useful since it uses Avro generic records
> (so no case classes needed) and transforms to Parquet.
>
> http://blog.cloudera.com/blog/2014/05/how-to-convert-existing-data-into-parquet/
>
> HTH,
> Mahesh
>
> From: "Anita Tailor [via Apache Spark User List]" <[hidden email]>
> Date: Thursday, June 19, 2014 at 12:53 PM
> To: Mahesh Padmanabhan <[hidden email]>
> Subject: Re: Spark streaming RDDs to Parquet records
>
> I have a similar case where I have RDD [List[Any], List[Long] ] and want to
> save it as a Parquet file.
> My understanding is that only an RDD of case classes can be converted to a
> SchemaRDD. So is there any way I can save this RDD as a Parquet file without
> using Avro?
>
> Thanks in advance
> Anita
>
> On 18 June 2014 05:03, Michael Armbrust <[hidden email]> wrote:
>
>> If you convert the data to a SchemaRDD you can save it as Parquet:
>> http://spark.apache.org/docs/latest/sql-programming-guide.html#using-parquet
>>
>> On Tue, Jun 17, 2014 at 11:47 PM, Padmanabhan, Mahesh (contractor) <[hidden email]> wrote:
>>
>>> Thanks Krishna. Seems like you have to use Avro and then convert that to
>>> Parquet. I was hoping to directly convert RDDs to Parquet files. I’ll look
>>> into this some more.
>>>
>>> Thanks,
>>> Mahesh
>>>
>>> From: Krishna Sankar <[hidden email]>
>>> Reply-To: "[hidden email]" <[hidden email]>
>>> Date: Tuesday, June 17, 2014 at 2:41 PM
>>> To: "[hidden email]" <[hidden email]>
>>> Subject: Re: Spark streaming RDDs to Parquet records
>>>
>>> Mahesh,
>>>
>>> - One direction could be: create a Parquet schema, then convert and save
>>>   the records to HDFS.
>>> - This might help:
>>>   https://github.com/massie/spark-parquet-example/blob/master/src/main/scala/com/zenfractal/SparkParquetExample.scala
>>>
>>> Cheers
>>> <k/>
>>>
>>> On Tue, Jun 17, 2014 at 12:52 PM, maheshtwc <[hidden email]> wrote:
>>>
>>>> Hello,
>>>>
>>>> Is there an easy way to convert RDDs within a DStream into Parquet
>>>> records?
>>>> Here is some incomplete pseudo code:
>>>>
>>>> // Create streaming context
>>>> val ssc = new StreamingContext(...)
>>>>
>>>> // Obtain a DStream of events
>>>> val ds = KafkaUtils.createStream(...)
>>>>
>>>> // Get Spark context to get to the SQL context
>>>> val sc = ds.context.sparkContext
>>>>
>>>> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
>>>>
>>>> // For each RDD
>>>> ds.foreachRDD((rdd: RDD[Array[Byte]]) => {
>>>>
>>>>   // What do I do next?
>>>> })
>>>>
>>>> Thanks,
>>>> Mahesh
>>>>
>>>> --
>>>> View this message in context:
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-streaming-RDDs-to-Parquet-records-tp7762.html
>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
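
For the case-class route Michael mentions, a minimal sketch might look like the following. It assumes the Spark 1.0-era API discussed in this thread (SQLContext, SchemaRDD, saveAsParquetFile), which later Spark releases replaced. The Event case class, the UTF-8 decoding, the saveAsParquet helper, and the per-batch output path are all hypothetical and not taken from the thread; it also does not cover the RDD of List[Any] case, which has no case class to infer a schema from.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.Time
import org.apache.spark.streaming.dstream.DStream

// Hypothetical record type: the thread does not say what the events contain.
// It is defined at top level so Spark SQL can infer a schema from it.
case class Event(payload: String)

object StreamToParquet {
  // Hypothetical helper: writes each batch of the stream to its own Parquet directory.
  def saveAsParquet(ds: DStream[Array[Byte]], outputDir: String): Unit = {
    val sqlContext = new SQLContext(ds.context.sparkContext)
    // Implicit conversion from an RDD of case classes to a SchemaRDD (Spark 1.0 to 1.2).
    import sqlContext.createSchemaRDD

    ds.foreachRDD { (rdd: RDD[Array[Byte]], time: Time) =>
      // Assumption: each message is a UTF-8 string; real events need their own decoder.
      val events: RDD[Event] = rdd.map(bytes => Event(new String(bytes, "UTF-8")))
      // saveAsParquetFile fails if the path already exists, so use one directory per batch.
      events.saveAsParquetFile(s"$outputDir/batch-${time.milliseconds}")
    }
  }
}
```

With the pseudo code above, this would be called as StreamToParquet.saveAsParquet(ds, outputDir) before ssc.start(), where outputDir is whatever HDFS location you choose.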