what's the way to access the last element from another partition

2015-12-08 Thread Zhiliang Zhu
From within a given partition, it seems difficult to access the last element of another partition, but in my application I need to do exactly that. How should it be done? Just by repartitioning/shuffling the RDD into one partition and getting the specific "last" element? Will this change the previous order
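One possible approach (a sketch of my own, not from the thread): collect the last element of each partition on the driver with mapPartitionsWithIndex, then broadcast that small map so every partition can look up its neighbour's last element without shuffling everything into one partition. All names below are illustrative.

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("last-element-sketch"))
    val rdd = sc.parallelize(1 to 100, 4)

    // (partitionIndex, lastElement) for every non-empty partition
    val lastPerPartition = rdd.mapPartitionsWithIndex { (idx, it) =>
      if (it.hasNext) Iterator((idx, it.reduceLeft((_, b) => b))) else Iterator.empty
    }.collect().toMap

    // Broadcast the small driver-side map back to the executors
    val lastBc = sc.broadcast(lastPerPartition)

    rdd.mapPartitionsWithIndex { (idx, it) =>
      val prevLast = lastBc.value.get(idx - 1) // last element of the previous partition
      it.map(x => (x, prevLast))
    }.take(5).foreach(println)

This avoids repartitioning to a single partition, though whether partition order matches the original record order depends on how the RDD was produced.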

Re: Spark with MapDB

2015-12-08 Thread Fengdong Yu
What's your data format? ORC or CSV or others? val keys = sqlContext.read.orc("your previous batch data path").select($"uniq_key").collect val broadcastKeys = sc.broadcast(keys) val rdd = your_current_batch_data rdd.filter(line => !broadcastKeys.value.contains(line.key)) > On Dec 8, 2015, at 4:44
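A cleaned-up version of the same idea (a sketch only; the path and the `key` field are placeholders, and this assumes the previous batch was written as ORC):

    // Collect the previous batch's keys on the driver and broadcast them as a Set
    val previousKeys = sqlContext.read.orc("/path/to/previous/batch")
      .select("uniq_key")
      .map(_.getString(0))
      .collect()
      .toSet

    val keysBc = sc.broadcast(previousKeys)

    // Keep only records whose key was not seen in the previous batch
    val filtered = currentBatchRdd.filter(record => !keysBc.value.contains(record.key))

This works as long as the set of previous keys is small enough to fit in driver and executor memory.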

Re: Spark with MapDB

2015-12-08 Thread Ramkumar V
I'm running a spark batch job in cluster mode every hour and it runs for 15 minutes. I have certain unique keys in the dataset. I don't want to process those keys during my next hour's batch. *Thanks*, On Tue, Dec 8, 2015 at 1:42 PM, Fengdong Yu

Re: Spark with MapDB

2015-12-08 Thread Ramkumar V
Pipe-separated values. I know broadcast and join work, but I would like to know whether MapDB works or not? *Thanks*, On Tue, Dec 8, 2015 at 2:22 PM, Fengdong Yu wrote: > > what's your data format? ORC or CSV or others? > > val keys

Re: parquet file doubts

2015-12-08 Thread Cheng Lian
Cc'd Parquet dev list. At first I expected to discuss this issue on Parquet dev list but sent to the wrong mailing list. However, I think it's OK to discuss it here since lots of Spark users are using Parquet and this information should be generally useful here. Comments inlined. On 12/7/15

bad performance on PySpark - big text file

2015-12-08 Thread patcharee
Hi, I am very new to PySpark. I have a PySpark app working on text files of different sizes (100M - 100G). Each task handles the same size of input split, but workers spend much longer on some input splits, especially when the input splits belong to a big file. See the

Re: Spark with MapDB

2015-12-08 Thread Fengdong Yu
Can you detail your question? What do your previous batch and the current batch look like? > On Dec 8, 2015, at 3:52 PM, Ramkumar V wrote: > > Hi, > > I'm running java over spark in cluster mode. I want to apply filter on > javaRDD based on some previous batch

Re: Spark with MapDB

2015-12-08 Thread Jörn Franke
You may want to use a bloom filter for this, but make sure that you understand how it works > On 08 Dec 2015, at 09:44, Ramkumar V wrote: > > Im running spark batch job in cluster mode every hour and it runs for 15 > minutes. I have certain unique keys in the dataset.
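A sketch of the Bloom-filter variant (my own illustration, not from the thread), using Guava's BloomFilter built on the driver and broadcast to the executors; the key collection, the RDD, and the sizing numbers are placeholders:

    import java.nio.charset.StandardCharsets
    import com.google.common.hash.{BloomFilter, Funnels}

    // Build the filter on the driver from the previous batch's keys
    val bloom = BloomFilter.create[CharSequence](
      Funnels.stringFunnel(StandardCharsets.UTF_8),
      10000000,  // expected number of keys -- tune for your data
      0.01)      // acceptable false-positive probability
    previousKeys.foreach(k => bloom.put(k))

    // The filter is compact and serializable, so it broadcasts cheaply
    val bloomBc = sc.broadcast(bloom)
    val filtered = currentBatchRdd.filter(record => !bloomBc.value.mightContain(record.key))

The trade-off is that false positives will occasionally drop a genuinely new key, which is why understanding how the filter works matters.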

Logging spark output to hdfs file

2015-12-08 Thread sunil m
Hi! I configured the log4j.properties file in the conf folder of Spark with the following values... log4j.appender.file.File=hdfs:// I expected all log files to log output to the file in HDFS. Instead the files are created locally. Has anybody tried logging to HDFS by configuring log4j.properties? Warm

Re: HiveContext creation failed with Kerberos

2015-12-08 Thread Steve Loughran
On 8 Dec 2015, at 06:52, Neal Yin > wrote: 15/12/08 04:12:28 ERROR transport.TSaslTransport: SASL negotiation failure javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism

Re: Logging spark output to hdfs file

2015-12-08 Thread Jörn Franke
This would require a special HDFS log4j appender. Alternatively try the flume log4j appender > On 08 Dec 2015, at 13:00, sunil m <260885smanik...@gmail.com> wrote: > > Hi! > I configured log4j.properties file in conf folder of spark with following > values... > >
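For the Flume route, the log4j.properties entries would look roughly like this (host and port are placeholders, the flume-ng-log4jappender jar must be on the classpath, and a Flume agent with an HDFS sink does the actual write to HDFS); treat this as an unverified sketch:

    log4j.rootLogger=INFO, console, flume
    log4j.appender.flume=org.apache.flume.clients.log4jappender.Log4jAppender
    log4j.appender.flume.Hostname=flume-agent-host
    log4j.appender.flume.Port=41414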

Comparisons between Ganglia and Graphite for monitoring the Streaming Cluster?

2015-12-08 Thread SRK
Hi, How do Ganglia and Graphite compare for monitoring the Streaming Cluster? Which one has more advantages over the other? Thanks! -- View this message in context:

is repartition very cost

2015-12-08 Thread Zhiliang Zhu
Hi All, I need to optimize an objective function with some linear constraints using a genetic algorithm. I would like to get as much parallelism as possible for it with Spark. repartition / shuffle may be used sometimes in it; however, is the repartition API very costly? Thanks in advance! Zhiliang

Re: Spark with MapDB

2015-12-08 Thread Ramkumar V
Yes, I agree, but the data is in the form of an RDD and I'm running in cluster mode, so the data is distributed across all machines in the cluster. But a bloom filter or MapDB is not distributed. How will it work in this case? *Thanks*,

PySpark reading from Postgres tables with UUIDs

2015-12-08 Thread Chris Elsmore
Hi All, I’m currently having some issues getting Spark to read from Postgres tables which have uuid type columns through a PySpark shell. I can connect and see tables which do not have a uuid column but get the error "java.sql.SQLException: Unsupported type " when I try to get a table
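A workaround that is often suggested for this (an untested sketch, not from the thread, shown in Scala although the same options apply from PySpark): push a cast to text into the query that Spark sends to Postgres, so the JDBC source never sees the uuid type. Table, column, and connection details below are placeholders.

    // Wrap the table in a subquery that casts the uuid column to text
    val query = "(SELECT id::text AS id, name, created_at FROM my_table) AS t"

    val df = sqlContext.read.format("jdbc").options(Map(
      "url"     -> "jdbc:postgresql://dbhost:5432/mydb?user=me&password=secret",
      "dbtable" -> query,
      "driver"  -> "org.postgresql.Driver"
    )).load()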

Re: Can not see any spark metrics on ganglia-web

2015-12-08 Thread SRK
Hi, Should the gmond be installed in all the Spark nodes? What should the host and port be? Should it be the host and port of gmetad? Enable GangliaSink for all instances *.sink.ganglia.class=org.apache.spark.metrics.sink.GangliaSink *.sink.ganglia.name=hadoop_cluster1

flatMap function in Spark

2015-12-08 Thread Sateesh Karuturi
Guys... I am new to Spark. Could anyone please explain how the flatMap function works, with a little sample example? Thanks in advance...
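A small illustration (my own example, not from the thread): map produces exactly one output element per input element, while flatMap produces zero or more and flattens the results into a single collection.

    val lines = sc.parallelize(Seq("hello world", "", "spark is fun"))

    // map: one output per input line -> RDD[Array[String]] with 3 elements
    val arrays = lines.map(line => line.split(" "))

    // flatMap: the arrays are flattened -> RDD[String] with the 5 individual words
    val words = lines.flatMap(line => line.split(" ").filter(_.nonEmpty))

    words.collect().foreach(println)  // hello, world, spark, is, fun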

Re: flatMap function in Spark

2015-12-08 Thread Gerard Maas
http://stackoverflow.com/search?q=%5Bapache-spark%5D+flatmap -kr, Gerard. On Tue, Dec 8, 2015 at 12:04 PM, Sateesh Karuturi < sateesh.karutu...@gmail.com> wrote: > Guys... I am new to Spark.. > Please anyone please explain me how flatMap function works with a little > sample example... > Thanks

Re: Spark SQL - saving to multiple partitions in parallel - FileNotFoundException on _temporary directory possible bug?

2015-12-08 Thread Jiří Syrový
Hi, I have a very similar issue on a standalone SQL context, but when using save() instead. I guess it might be related to https://issues.apache.org/jira/browse/SPARK-8513. Also it usually happens after using groupBy. Regards, Jiri 2015-12-08 0:16 GMT+01:00 Deenar Toraskar

Need to maintain the consumer offset by myself when using spark streaming kafka direct approach?

2015-12-08 Thread Tao Li
I am using the spark streaming kafka direct approach these days. I found that when I start the application, it always starts consuming from the latest offset. I would like the application, when it starts, to consume from the offset where the last run with the same kafka consumer group left off. It means I have to

RE: Need to maintain the consumer offset by myself when using spark streaming kafka direct approach?

2015-12-08 Thread Singh, Abhijeet
You need to maintain the offset yourself and rightly so in something like ZooKeeper. From: Tao Li [mailto:litao.bupt...@gmail.com] Sent: Tuesday, December 08, 2015 5:36 PM To: user@spark.apache.org Subject: Need to maintain the consumer offset by myself when using spark streaming kafka direct

Re: understanding and disambiguating CPU-core related properties

2015-12-08 Thread Manolis Sifalakis1
Thanks a lot for the pointer! Helpful even though a bit layman's style. (On the nagging end, this information, as usual with Spark, is not where it is expected to be: neither in the book nor in the Spark docs) m. From: Leonidas Patouchas To: Manolis Sifalakis1

Re: Need to maintain the consumer offset by myself when using spark streaming kafka direct approach?

2015-12-08 Thread Dibyendu Bhattacharya
With the direct stream, the checkpoint location is not recoverable if you modify your driver code. So if you just rely on checkpointing to commit offsets, you can possibly lose messages if you modify the driver code and you select the offset from the "largest" offset. If you do not want to lose messages, you need to

Re: hive thriftserver and fair scheduling

2015-12-08 Thread Deenar Toraskar
Thanks Michael, I'll try it out. Another quick/important question: How do I make udfs available to all of the hive thriftserver users? Right now, when I launch a spark-sql client, I notice that it reads the ~/.hiverc file and all udfs get picked up but this doesn't seem to be working in hive

groupByKey()

2015-12-08 Thread Yasemin Kaya
Hi, Sorry for the long inputs but this is my situation. I have two lists and I want to groupByKey them, but some values of the list disappear. I can't understand this. (8867989628612931721,[1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

epoch date time problem to load data into in spark

2015-12-08 Thread Soni spark
Hi Friends, I have written a spark streaming program in Java to access twitter tweets and it is working fine. I am able to copy the twitter feeds to an HDFS location batch-wise. For each batch, it creates a folder with an epoch time stamp. For example, if I give the HDFS location as

Re: How to unpersist RDDs generated by ALS/MatrixFactorizationModel

2015-12-08 Thread Ewan Higgs
Sean, Thanks. It's a developer API and doesn't appear to be exposed. Ewan On 07/12/15 15:06, Sean Owen wrote: I'm not sure if this is available in Python but from 1.3 on you should be able to call ALS.setFinalRDDStorageLevel with level "none" to ask it to unpersist when it is done. On Mon,

actors and async communication between driver and workers/executors

2015-12-08 Thread Manolis Sifalakis1
I've been looking around for some examples or information on how the driver and the executors can exchange information asynchronously, but have not found much apart from the ActorWordCount.scala streaming example that uses Akka. Is there any "in-band" (within Spark) method that such

Re: Need to maintain the consumer offset by myself when using spark streaming kafka direct approach?

2015-12-08 Thread PhuDuc Nguyen
Kafka Receiver-based approach: This will maintain the consumer offsets in ZK for you. Kafka Direct approach: You can use checkpointing and that will maintain consumer offsets for you. You'll want to checkpoint to a highly available file system like HDFS or S3.
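If you manage offsets yourself instead of relying on checkpoints, the usual pattern with the 1.x direct API looks roughly like the sketch below: load the saved offsets at startup, pass them as fromOffsets, and persist each batch's offset ranges after processing. The offset store (ZooKeeper, a database, ...) and the ssc/topic/broker names are placeholders.

    import kafka.common.TopicAndPartition
    import kafka.message.MessageAndMetadata
    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

    // Offsets loaded from your own store (placeholder values)
    val fromOffsets: Map[TopicAndPartition, Long] =
      Map(TopicAndPartition("mytopic", 0) -> 0L, TopicAndPartition("mytopic", 1) -> 0L)

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092", "group.id" -> "mygroup")
    val messageHandler = (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message)

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder,
      (String, String)](ssc, kafkaParams, fromOffsets, messageHandler)

    stream.foreachRDD { rdd =>
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // ... process rdd ...
      offsetRanges.foreach { o =>
        // persist o.topic, o.partition and o.untilOffset to your own store
      }
    }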

Re: epoch date format to normal date format while loading the files to HDFS

2015-12-08 Thread Andy Davidson
Hi Sonia, I believe you are using Java? Take a look at Java Date; I am sure you will find lots of examples of how to format dates. Enjoy, Andy /** * saves tweets to disk. This is a replacement for * @param tweets * @param outputURI */ private static void
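For example, something along these lines turns an epoch timestamp into a readable folder name (a sketch in Scala; the thread itself uses Java, where SimpleDateFormat works the same way):

    import java.text.SimpleDateFormat
    import java.util.Date

    // Epoch time in milliseconds -> "2015-12-08_10-30-00"
    // (if the stamp is in seconds, multiply by 1000 first)
    val fmt = new SimpleDateFormat("yyyy-MM-dd_HH-mm-ss")
    val folderName = fmt.format(new Date(1449569400000L))

The formatted string can then be appended to the HDFS output path instead of the raw epoch value.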

SparkR read.df failed to read file from local directory

2015-12-08 Thread Boyu Zhang
Hello everyone, I tried to run the example data-manipulation.R, and can't get it to read the flights.csv file that is stored in my local fs. I don't want to store big files in my hdfs, so reading from a local fs (lustre fs) is the desired behavior for me. I tried the following: flightsDF <-

Associating spark jobs with logs

2015-12-08 Thread sunil m
Hello Spark experts! I was wondering if somebody has solved the problem which we are facing. We want to achieve the following: Given a spark job id fetch all the logs generated by that job. We looked at spark job server it seems to be lacking such a feature. Any ideas, suggestions are

Re: Can't create UDF's in spark 1.5 while running using the hive thrift service

2015-12-08 Thread Deenar Toraskar
Hi Trystan I am facing the same issue. It only appears with the thrift server, the same call works fine via the spark-sql shell. Do you have any workarounds and have you filed a JIRA/bug for the same? Regards Deenar On 12 October 2015 at 18:01, Trystan Leftwich wrote: >

Re: NoSuchMethodError: com.fasterxml.jackson.databind.ObjectMapper.enable

2015-12-08 Thread Sunil Tripathy
Thanks Fengdong. I still have the same exception. Exception in thread "main" java.lang.NoSuchMethodError: com.fasterxml.jackson.databind.ObjectMapper.enable([Lcom/fasterxml/jackson/core/JsonParser$Feature;)Lcom/fasterxml/jackson/databind/ObjectMapper; at

Graph visualization tool for GraphX

2015-12-08 Thread Lin, Hao
Hi, Can anyone recommend a great graph visualization tool for GraphX that can handle truly large data (~ TB)? Thanks so much Hao

Re: Graph visualization tool for GraphX

2015-12-08 Thread Jörn Franke
I am not sure about your use case. How should a human interpret many terabytes of data in one large visualization?? You have to be more specific, what part of the data needs to be visualized, what kind of visualization, what navigation do you expect within the visualisation, how many users,

Re: can i write only RDD transformation into hdfs or any other storage system

2015-12-08 Thread Ted Yu
Can you clarify your use case ? Apart from hdfs, S3 (and possibly others) can be used. Cheers On Tue, Dec 8, 2015 at 9:40 AM, prateek arora wrote: > Hi > > Is it possible into spark to write only RDD transformation into hdfs or any > other storage system ? > >

RE: Graph visualization tool for GraphX

2015-12-08 Thread Lin, Hao
Hello Jorn, Thank you for the reply and for being tolerant of my oversimplified question. I should've been more specific. Though it is ~TB of data, there will be about billions of records (edges) and 100,000 nodes. We need to visualize the social network graph, like what can be done with Gephi, which has

Merge rows into csv

2015-12-08 Thread Krishna
Hi, what is the most efficient way to perform a group-by operation in Spark and merge rows into csv? Here is the current RDD - ID STATE - 1 TX, 1 NY, 1 FL, 2 CA, 2 OH - This is the required output:

can i write only RDD transformation into hdfs or any other storage system

2015-12-08 Thread prateek arora
Hi Is it possible in Spark to write only an RDD transformation into hdfs or any other storage system? Regards Prateek -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/can-i-write-only-RDD-transformation-into-hdfs-or-any-other-storage-system-tp25637.html

Re: SparkR read.df failed to read file from local directory

2015-12-08 Thread Boyu Zhang
Thanks for the comment Felix, I tried giving "/home/myuser/test_data/sparkR/flights.csv", but it tried to search the path in hdfs, and gave errors: 15/12/08 12:47:10 ERROR r.RBackendHandler: loadDF on org.apache.spark.sql.api.r.SQLUtils failed Error in invokeJava(isStatic = TRUE, className,

Re: Graph visualization tool for GraphX

2015-12-08 Thread andy petrella
Hello Lin, This is indeed a tough scenario when you have many vertices and (even worse) many edges... So, a two-fold answer: First, technically, there is graph plotting support in the spark notebook (https://github.com/andypetrella/spark-notebook/ → check this notebook:

Re: SparkR read.df failed to read file from local directory

2015-12-08 Thread Felix Cheung
Have you tried flightsDF <- read.df(sqlContext, "/home/myuser/test_data/sparkR/flights.csv", source = "com.databricks.spark.csv", header = "true")     _ From: Boyu Zhang Sent: Tuesday, December 8, 2015 8:47 AM Subject: SparkR read.df

Spark metrics not working

2015-12-08 Thread Jesse F Chen
v1.5.1. Trying to enable CsvSink for metrics collecting, but I get the following error as soon as kicking off a 'spark-submit' app: 15/12/08 11:24:02 INFO storage.BlockManagerMaster: Registered BlockManager 15/12/08 11:24:02 ERROR metrics.MetricsSystem: Sink class

Re: Associating spark jobs with logs

2015-12-08 Thread Ted Yu
Have you looked at the REST API section of: https://spark.apache.org/docs/latest/monitoring.html FYI On Tue, Dec 8, 2015 at 8:57 AM, sunil m <260885smanik...@gmail.com> wrote: > Hello Spark experts! > > I was wondering if somebody has solved the problem which we are facing. > > We want to

Re: About Spark On Hbase

2015-12-08 Thread censj
Can you give me an example? I want to update base data. > On Dec 9, 2015, at 15:19, Fengdong Yu wrote: > > https://github.com/nerdammer/spark-hbase-connector > > > This is better and easy to use. > > > > > >> On Dec 9,

Re: About Spark On Hbase

2015-12-08 Thread fightf...@163.com
I don't think it really needs a CDH component. Just use the API fightf...@163.com From: censj Sent: 2015-12-09 15:31 To: fightf...@163.com Cc: user@spark.apache.org Subject: Re: About Spark On Hbase But this depends on CDH. I did not install CDH. On Dec 9, 2015, at 15:18, fightf...@163.com wrote: Actually

Re: About Spark On Hbase

2015-12-08 Thread censj
So, how do I get this jar? I use an sbt package project; I did not find it in the sbt libs. > On Dec 9, 2015, at 15:42, fightf...@163.com wrote: > > I don't think it really needs a CDH component. Just use the API > > fightf...@163.com > > From: censj > Sent: 2015-12-09

Re: is repartition very cost

2015-12-08 Thread Zhiliang Zhu
Thanks very much for Yong's help. Sorry, one more issue: is it the case that different partitions must be on different nodes? That is, would each node only have one partition, in cluster mode ...  On Wednesday, December 9, 2015 6:41 AM, "Young, Matthew T"

Re: spark-defaults.conf optimal configuration

2015-12-08 Thread nsalian
Hi Chris, Thank you for posting the question. Tuning spark configurations is a tricky task since there are a lot of factors to consider. The configurations that you listed cover most of them. To understand the situation so that it can guide you in making a decision about tuning: 1) What kind of spark

Re: Local Mode: Executor thread leak?

2015-12-08 Thread Shixiong Zhu
Could you send a PR to fix it? Thanks! Best Regards, Shixiong Zhu 2015-12-08 13:31 GMT-08:00 Richard Marscher : > Alright I was able to work through the problem. > > So the owning thread was one from the executor task launch worker, which > at least in local mode runs

RE: is repartition very cost

2015-12-08 Thread Young, Matthew T
Shuffling large amounts of data over the network is expensive, yes. The cost is lower if you are just using a single node where no networking needs to be involved to do the repartition (using Spark as a multithreading engine). In general you need to do performance testing to see if a
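A small illustration of the cost difference (generic sketch, not from the thread): repartition always triggers a full shuffle, while coalescing down to fewer partitions can merge partitions locally without one.

    val rdd = sc.textFile("hdfs:///some/large/input")  // say 200 partitions

    // Full shuffle: every record may move across the network
    val wider = rdd.repartition(400)

    // No shuffle: existing partitions are merged where possible
    val narrower = rdd.coalesce(50)

    // coalesce(n, shuffle = true) behaves like repartition(n)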

RE: Graph visualization tool for GraphX

2015-12-08 Thread Lin, Hao
Thanks Andy, I certainly will give a try to your suggestion. From: andy petrella [mailto:andy.petre...@gmail.com] Sent: Tuesday, December 08, 2015 1:21 PM To: Lin, Hao; Jörn Franke Cc: user@spark.apache.org Subject: Re: Graph visualization tool for GraphX Hello Lin, This is indeed a tough

RE: Executor metrics in spark application

2015-12-08 Thread SRK
Hi, Were you able to setup custom metrics in GangliaSink? If so, how did you register the custom metrics? Thanks! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Executor-metrics-in-spark-application-tp188p25647.html Sent from the Apache Spark User List

RE: SparkR read.df failed to read file from local directory

2015-12-08 Thread Sun, Rui
Hi, Boyu, Does the local file “/home/myuser/test_data/sparkR/flights.csv” really exist? I just tried, and had no problem creating a DataFrame from a local CSV file. From: Boyu Zhang [mailto:boyuzhan...@gmail.com] Sent: Wednesday, December 9, 2015 1:49 AM To: Felix Cheung Cc:

Re: Associating spark jobs with logs

2015-12-08 Thread sunil m
Thanks for replying ... Yes I did. I am not seeing the application-ids for jobs submitted to YARN when I query http://MY_HOST:18080/api/v1/applications/ When I query http://MY_HOST:18080/api/v1/applications/application_1446812769803_0011 it does not understand the application_id since it belongs

Re: About Spark On Hbase

2015-12-08 Thread fightf...@163.com
Actually you can refer to https://github.com/cloudera-labs/SparkOnHBase Also, HBASE-13992 already integrates that feature into the hbase side, but that feature has not been released. Best, Sun. fightf...@163.com From: censj Date: 2015-12-09 15:04 To: user@spark.apache.org Subject: About

Re: About Spark On Hbase

2015-12-08 Thread censj
But this depends on CDH. I did not install CDH. > On Dec 9, 2015, at 15:18, fightf...@163.com wrote: > > Actually you can refer to https://github.com/cloudera-labs/SparkOnHBase > > Also, HBASE-13992 >

Re: Can not see any spark metrics on ganglia-web

2015-12-08 Thread SRK
Hi, Where does the *.sink.csv.directory directory get created? I cannot see any metrics in the logs. How did you verify consoleSink and csvSink? Thanks! -- View this message in context:

set up spark 1.4.1 as default spark engine in HDP 2.2/2.3

2015-12-08 Thread Divya Gehlot
Hi, As per my requirements I need to use Spark 1.4.1, but HDP doesn't come with Spark 1.4.1. As instructed in this hortonworks page I am able to set up Spark 1.4 in HDP, but when I run the spark shell it

Re: Spark metrics for ganglia

2015-12-08 Thread swetha kasireddy
Hi, How to verify whether the GangliaSink directory got created? Thanks, Swetha On Mon, Dec 15, 2014 at 11:29 AM, danilopds wrote: > Thanks tsingfu, > > I used this configuration based in your post: (with ganglia unicast mode) > # Enable GangliaSink for all instances >

How to get custom metrics using Ganglia Sink?

2015-12-08 Thread SRK
Hi, How do I configure custom metrics using Ganglia Sink? Thanks! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-get-custom-metrics-using-Ganglia-Sink-tp25645.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: set up spark 1.4.1 as default spark engine in HDP 2.2/2.3

2015-12-08 Thread Saisai Shao
Please make sure the spark shell script you're running points to /bin/spark-shell. Just following the instructions to correctly configure your Spark 1.4.1 and executing the correct script is enough. On Wed, Dec 9, 2015 at 11:28 AM, Divya Gehlot wrote: > Hi, > As per

Re: About Spark On Hbase

2015-12-08 Thread Fengdong Yu
https://github.com/nerdammer/spark-hbase-connector This is better and easy to use. > On Dec 9, 2015, at 3:04 PM, censj wrote: > > hi all, > now I using spark,but I not found spark operation hbase open source. Do > any one tell me? >
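Basic usage, roughly as the project's README describes it (quoted from memory, so verify against the project page; the table, column, and column family names are placeholders):

    import it.nerdammer.spark.hbase._

    // Write an RDD of tuples: the first tuple element becomes the row key
    val data = sc.parallelize(1 to 100).map(i => (i.toString, i * 2, "row " + i))
    data.toHBaseTable("mytable")
        .toColumns("value", "label")
        .inColumnFamily("cf")
        .save()

    // Read it back as typed tuples
    val readBack = sc.hbaseTable[(String, Int, String)]("mytable")
      .select("value", "label")
      .inColumnFamily("cf")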

Re: Spark Java.lang.NullPointerException

2015-12-08 Thread michael_han
Hi Sarala, Thanks for your reply. But it doesn't work. I tried the following 2 commands: *<1>* spark-submit --master local --name "SparkTest App" --class com.qad.SparkTest1 target/Spark-Test-1.0.jar;c:/spark-1.5.2-bin-hadoop2.6/lib/spark-assembly-1.5.2-hadoop2.6.0.jar with error:

About Spark On Hbase

2015-12-08 Thread censj
hi all, I am using Spark now, but I have not found an open-source project for operating on HBase from Spark. Can anyone tell me of one?

Re: Can't create UDF's in spark 1.5 while running using the hive thrift service

2015-12-08 Thread Jeff Zhang
It is fixed in 1.5.3 https://issues.apache.org/jira/browse/SPARK-11191 On Wed, Dec 9, 2015 at 12:58 AM, Deenar Toraskar wrote: > Hi Trystan > > I am facing the same issue. It only appears with the thrift server, the > same call works fine via the spark-sql shell. Do

Re: Can not see any spark metrics on ganglia-web

2015-12-08 Thread SRK
Hi, I cannot see any metrics either. How did you verify that ConsoleSink and CSVSink work OK? Where does *.sink.csv.directory get created? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Can-not-see-any-spark-metrics-on-ganglia-web-tp14981p25644.html Sent

How to use collections inside foreach block

2015-12-08 Thread Madabhattula Rajesh Kumar
Hi, I have the below query. Please help me to solve this. I have 2 ids. I want to join these ids to a table. This table contains some blob data, so I cannot join these 2000 ids to this table in one step. I'm planning to join this table in chunks. For example, in each step I will join 5000 ids.
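One way to do the chunking (a rough sketch, assuming the ids live in a driver-side collection and the big table is available as an RDD; all names are placeholders):

    // Split the driver-side id list into batches of 5000
    val batches = allIds.grouped(5000).toSeq

    val perBatch = batches.map { batch =>
      val batchSet = sc.broadcast(batch.toSet)
      // Filter the big table against just this batch of ids
      bigTableRdd.filter(row => batchSet.value.contains(row.id))
    }

    // Combine the per-batch results into a single RDD
    val combined = sc.union(perBatch)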

Re: Exception in Spark-sql insertIntoJDBC command

2015-12-08 Thread kali.tumm...@gmail.com
Hi All, I have the same error in spark 1.5; is there any solution to get around this? I also tried using sourcedf.write.mode("append") but still no luck. val sourcedfmode=sourcedf.write.mode("append") sourcedfmode.jdbc(TargetDBinfo.url,TargetDBinfo.table,targetprops) Thanks Sri --

spark-defaults.conf optimal configuration

2015-12-08 Thread cjrumble
I am seeking help with a Spark configuration running queries against a cluster of 6 machines. Each machine has Spark 1.5.1 with slaves started on 6 and 1 acting as master/thriftserver. I query from Beeline 2 tables that have 300M and 31M rows respectively. Results from my queries thus far return

INotifyDStream - where to find it?

2015-12-08 Thread octagon blue
Hi All, I am using pyspark streaming to ETL raw data files as they land on HDFS. While researching this topic I found this great presentation about Spark and Spark Streaming at Uber (http://www.slideshare.net/databricks/spark-meetup-at-uber), where they mention this INotifyDStream that sounds

Re: Local Mode: Executor thread leak?

2015-12-08 Thread Richard Marscher
Alright I was able to work through the problem. So the owning thread was one from the executor task launch worker, which at least in local mode runs the task and the related user code of the task. After judiciously naming every thread in the pools in the user code (with a custom ThreadFactory) I
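For reference, naming the threads of a pool looks something like this (a generic sketch, not Richard's actual code):

    import java.util.concurrent.{Executors, ThreadFactory}
    import java.util.concurrent.atomic.AtomicInteger

    class NamedThreadFactory(prefix: String) extends ThreadFactory {
      private val counter = new AtomicInteger(0)
      override def newThread(r: Runnable): Thread = {
        val t = new Thread(r, s"$prefix-${counter.incrementAndGet()}")
        t.setDaemon(true)  // daemon threads will not keep the JVM alive on their own
        t
      }
    }

    // Threads created through this factory show up with recognizable names in thread dumps
    val pool = Executors.newFixedThreadPool(8, new NamedThreadFactory("my-user-code"))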

Re: Merge rows into csv

2015-12-08 Thread ayan guha
reduceByKey would be a perfect fit for you On Wed, Dec 9, 2015 at 4:47 AM, Krishna wrote: > Hi, > > what is the most efficient way to perform a group-by operation in Spark > and merge rows into csv? > > Here is the current RDD > - > ID STATE >
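A sketch of that suggestion against the sample data from the original post (assuming the desired output is one comma-separated line of states per ID):

    val rows = sc.parallelize(Seq((1, "TX"), (1, "NY"), (1, "FL"), (2, "CA"), (2, "OH")))

    // Concatenate the state values per ID without materializing whole groups in memory
    val merged = rows.reduceByKey((a, b) => a + "," + b)

    merged.map { case (id, states) => s"$id,$states" }
          .saveAsTextFile("hdfs:///output/merged_states")

    // produces lines like "1,TX,NY,FL" and "2,CA,OH"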