Re: difference between ++ and Union of a RDD

2015-12-29 Thread Ted Yu
From RDD.scala: def ++(other: RDD[T]): RDD[T] = withScope { this.union(other) } They should be the same. On Tue, Dec 29, 2015 at 10:41 AM, email2...@gmail.com wrote: > Hello All - > > tried a couple of operations using ++ and union on RDDs but realized that the

difference between ++ and Union of a RDD

2015-12-29 Thread email2...@gmail.com
Hello All - I tried a couple of operations using ++ and union on RDDs but realized that the end results are the same. Do you know of any differences? val odd_partA = List(1,3,5,7,9,11,1,3,5,7,9,11,1,3,5,7,9,11) odd_partA: List[Int] = List(1, 3, 5, 7, 9, 11, 1, 3, 5, 7, 9, 11, 1, 3, 5, 7, 9, 11) val
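For illustration, a minimal sketch (assuming a live SparkContext sc, as in the spark-shell) confirming the two calls behave identically, since ++ just delegates to union:

    val odd = sc.parallelize(List(1, 3, 5, 7, 9, 11))
    val even = sc.parallelize(List(2, 4, 6, 8, 10, 12))
    // ++ is defined as this.union(other): both keep duplicates and trigger no shuffle
    val viaPlusPlus = (odd ++ even).collect()
    val viaUnion = odd.union(even).collect()
    println(viaPlusPlus.sameElements(viaUnion)) // true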

Re: difference between ++ and Union of a RDD

2015-12-29 Thread Gokula Krishnan D
Ted - Thanks for the updates. Then it's the same case with sc.parallelize() and sc.makeRDD(), right? Thanks & Regards, Gokula Krishnan* (Gokul)* On Tue, Dec 29, 2015 at 1:43 PM, Ted Yu wrote: > From RDD.scala : > > def ++(other: RDD[T]): RDD[T] = withScope { >

Re: Can't submit job to stand alone cluster

2015-12-29 Thread Andrew Or
Hi Greg, It's actually intentional for standalone cluster mode to not upload jars. One of the reasons why YARN takes at least 10 seconds before running any simple application is because there's a lot of random overhead (e.g. putting jars in HDFS). If this missing functionality is not documented

SparkSQL Hive orc snappy table

2015-12-29 Thread Dawid Wysakowicz
Hi, I have a table in Hive stored as ORC with compression = snappy. I try to execute a query on that table and it fails (I previously ran it on the table in orc-zlib format and in parquet, so it is not a matter of the query). I managed to execute this query with Hive on Tez on those tables. The exception i

Re: Stuck with DataFrame df.select("select * from table");

2015-12-29 Thread Annabel Melongo
Eugene, The example I gave you was in Python. I used it on my end and it works fine. Sorry, I don't know Scala. Thanks On Tuesday, December 29, 2015 5:24 AM, Eugene Morozov wrote: Annabel,  That might work in Scala, but I use Java. Three quotes just don't

Task hang problem

2015-12-29 Thread Darren Govoni
Hi, I've had this nagging problem where a task will hang and the entire job hangs. Using pyspark, Spark 1.5.1. The job output looks like this, and hangs after the last task: .. 15/12/29 17:00:38 INFO BlockManagerInfo: Added broadcast_0_piece0 in

Re: Task hang problem

2015-12-29 Thread Ted Yu
Can you log onto 10.65.143.174, find task 31, and take a stack trace? Thanks On Tue, Dec 29, 2015 at 9:19 AM, Darren Govoni wrote: > Hi, > I've had this nagging problem where a task will hang and the entire job > hangs. Using pyspark. Spark 1.5.1 > > The job output

Re: Spark submit does automatically upload the jar to cluster?

2015-12-29 Thread jiml
And for more clarification on this: for non-YARN installs, this bug has been filed to make the Spark driver upload jars. The point of confusion that I, along with other newcomers, commonly suffer from is this. In non-YARN installs: *The

Zip data frames

2015-12-29 Thread Daniel Siegmann
RDD has methods to zip with another RDD or with an index, but there's no equivalent for data frames. Anyone know a good way to do this? I thought I could just convert to RDD, do the zip, and then convert back, but ... 1. I don't see a way (outside developer API) to convert RDD[Row]
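One possible approach, sketched under the assumptions that both frames (hypothetically df1 and df2) have identical partitioning and element counts per partition (a hard requirement of RDD.zip) and no colliding column names:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.StructType
    // zip the underlying RDD[Row]s, then merge each pair of rows into a single row
    val zippedRows = df1.rdd.zip(df2.rdd).map { case (r1, r2) => Row.merge(r1, r2) }
    // concatenate the two schemas and rebuild a DataFrame
    val mergedSchema = StructType(df1.schema.fields ++ df2.schema.fields)
    val zippedDf = sqlContext.createDataFrame(zippedRows, mergedSchema)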

Re: Problem with WINDOW functions?

2015-12-29 Thread Chris Fregly
On quick glance, it appears that you're calling collect() in there, which brings a huge amount of data down to the single Driver. This is why, when you allocated more memory to the Driver, a different error emerged, most definitely related to stop-the-world GC causing the node to

RE: Problem with WINDOW functions?

2015-12-29 Thread Cheng, Hao
Which version are you using? Have you tried 1.6? From: Vadim Tkachenko [mailto:apache...@gmail.com] Sent: Wednesday, December 30, 2015 10:17 AM To: Cheng, Hao Cc: user@spark.apache.org Subject: Re: Problem with WINDOW functions? When I allocate 200g to executor, it is able to make better

Re: trouble understanding data frame memory usage "java.io.IOException: Unable to acquire memory"

2015-12-29 Thread Andy Davidson
Hi Michael https://github.com/apache/spark/archive/v1.6.0.tar.gz On both 1.6.0 and 1.5.2 my unit test works when I call repartition(1) before saving output. Coalesce still fails. Coalesce(1) spark-1.5.2 Caused by: java.io.IOException: Unable to acquire 33554432 bytes of memory

Re: trouble understanding data frame memory usage "java.io.IOException: Unable to acquire memory"

2015-12-29 Thread Davies Liu
Hi Andy, Could you change the logging level to INFO and post some logs here? There will be some logging about the memory usage of a task when it OOMs. In 1.6, the memory for a task is: (HeapSize - 300M) * 0.75 / number of tasks. Is it possible that the heap is too small? Davies -- Davies Liu
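To make that formula concrete, a rough worked example (the 8g heap and 8 concurrent tasks are assumed values, not taken from Andy's setup):

    val heapMb = 8 * 1024                         // executor heap, assumed 8g
    val tasks = 8                                 // concurrently running tasks, assumed
    val perTaskMb = (heapMb - 300) * 0.75 / tasks // roughly 740 MB per task
    // the failed allocation in Andy's trace is 33554432 bytes = 32 MB, so the budget
    // only drops below that with a far smaller heap or far more concurrent tasks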

Re: Spark 1.5.2 compatible spark-cassandra-connector

2015-12-29 Thread fightf...@163.com
Hi, Vivek M I have tried the 1.5.x spark-cassandra-connector and indeed encountered some classpath issues, mainly with the guava dependency. I believe that can be solved by some Maven config, but I have not tried that yet. Best, Sun. fightf...@163.com From: vivek.meghanat...@wipro.com Date:

Problem with WINDOW functions?

2015-12-29 Thread vadimtk
Hi, I can't successfully execute a query with a WINDOW function. The statements are the following: val orcFile = sqlContext.read.parquet("/data/flash/spark/dat14sn").filter("upper(project)='EN'") orcFile.registerTempTable("d1") sqlContext.sql("SELECT day,page,dense_rank() OVER (PARTITION BY day

RE: Problem with WINDOW functions?

2015-12-29 Thread Cheng, Hao
Can you try to write the result into another file instead? Let's see if there is any issue on the executor side. sqlContext.sql("SELECT day,page,dense_rank() OVER (PARTITION BY day ORDER BY pageviews DESC) as rank FROM d1").filter("rank <= 20").sort($"day",$"rank").write.parquet("/path/to/file")

Re: Problem with WINDOW functions?

2015-12-29 Thread Vadim Tkachenko
When I allocate 200g to the executor, it is able to make better progress; that is, I see 189 tasks executed instead of 169 previously. But eventually it fails with the same error. On Tue, Dec 29, 2015 at 5:58 PM, Cheng, Hao wrote: > Is there any improvement if you set a bigger

Re: Can't submit job to stand alone cluster

2015-12-29 Thread Daniel Valdivia
That makes things more clear! Thanks Issue resolved Sent from my iPhone > On Dec 29, 2015, at 2:43 PM, Annabel Melongo > wrote: > > Thanks Andrew for this awesome explanation > > > On Tuesday, December 29, 2015 5:30 PM, Andrew Or >

RE: Problem with WINDOW functions?

2015-12-29 Thread Cheng, Hao
Is there any improvement if you set more memory for the executors? -Original Message- From: va...@percona.com [mailto:va...@percona.com] On Behalf Of Vadim Tkachenko Sent: Wednesday, December 30, 2015 9:51 AM To: Cheng, Hao Cc: user@spark.apache.org Subject: Re: Problem with WINDOW

Re: Timestamp datatype in dataframe + Spark 1.4.1

2015-12-29 Thread Divya Gehlot
Hello Community Users, I was able to resolve the issue. The issue was the input data format: by default Excel writes dates as 2001/01/09 whereas Spark SQL takes the 2001-01-09 format. Here is the sample code below. SQL context available as sqlContext. scala> import
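The difference is easy to reproduce with the JDK parser that, per Hyukjin's note elsewhere in this thread, Spark relies on:

    import java.sql.Date
    Date.valueOf("2001-01-09") // fine: yyyy-[m]m-[d]d is the only accepted form
    Date.valueOf("2001/01/09") // throws java.lang.IllegalArgumentException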

Re: Spark 1.5.2 compatible spark-cassandra-connector

2015-12-29 Thread mwy
2.10-1.5.0-M3 & spark 1.5.2 work for me. The jar is built by sbt-assembly. Just for reference. From: "fightf...@163.com" Date: Wednesday, December 30, 2015 at 10:22 To: "vivek.meghanat...@wipro.com" , user Subject: Re: Spark
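For reference, a minimal build.sbt fragment for that combination (the guava exclusion is an assumption based on the classpath issue mentioned earlier in this thread, not something every setup needs):

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "1.5.2" % "provided",
      ("com.datastax.spark" %% "spark-cassandra-connector" % "1.5.0-M3")
        .exclude("com.google.guava", "guava")
    )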

Re: Can't submit job to stand alone cluster

2015-12-29 Thread Andrew Or
http://spark.apache.org/docs/latest/spark-standalone.html#launching-spark-applications 2015-12-29 11:48 GMT-08:00 Annabel Melongo : > Greg, > > Can you please send me a doc describing the standalone cluster mode? > Honestly, I never heard about it. > > The three

Re: Executor deregistered after 2mins (mesos, 1.6.0-rc4)

2015-12-29 Thread Ted Yu
Have you searched the log for 'f02cb67a-3519-4655-b23a-edc0dd082bf1-S1/4'? In the snippet you posted, I don't see the registration of this executor. Cheers On Tue, Dec 29, 2015 at 12:43 PM, Adrian Bridgett wrote: > We're seeing an "Executor is not registered" error on a Spark

Re: Can't submit job to stand alone cluster

2015-12-29 Thread Annabel Melongo
Thanks Andrew for this awesome explanation  On Tuesday, December 29, 2015 5:30 PM, Andrew Or wrote: Let me clarify a few things for everyone: There are three cluster managers: standalone, YARN, and Mesos. Each cluster manager can run in two deploy modes, client

Re: Can't submit job to stand alone cluster

2015-12-29 Thread Andrew Or
> > The confusion here is the expression "standalone cluster mode". Either > it's stand-alone or it's cluster mode but it can't be both. @Annabel That's not true. There *is* a standalone cluster mode where the driver runs on one of the workers instead of on the client machine. What you're describing

Re: Task hang problem

2015-12-29 Thread Darren Govoni
Here's the executor trace. Thread 58: Executor task launch worker-3 (RUNNABLE) java.net.SocketInputStream.socketRead0(Native Method) java.net.SocketInputStream.read(SocketInputStream.java:152)

Re: Can't submit job to stand alone cluster

2015-12-29 Thread Annabel Melongo
Greg, The confusion here is the expression "standalone cluster mode". Either it's stand-alone or it's cluster mode, but it can't be both. With this in mind, here's how jars are uploaded: 1. Spark stand-alone mode: client and driver run on the same machine; use the --packages option to submit a

Re: Opening Dynamic Scaling Executors on Yarn

2015-12-29 Thread Andrew Or
> > External shuffle service is backward compatible, so if you deployed 1.6 > shuffle service on NM, it could serve both 1.5 and 1.6 Spark applications. Actually, it just happens to be backward compatible because we didn't change the shuffle file formats. This may not necessarily be the case

Re: Can't submit job to stand alone cluster

2015-12-29 Thread Annabel Melongo
Greg, Can you please send me a doc describing the standalone cluster mode? Honestly, I never heard about it. The three different modes I've listed appear in the last paragraph of this doc: Running Spark Applications --class The FQCN of the

Re: Can't submit job to stand alone cluster

2015-12-29 Thread Annabel Melongo
Andrew, Now I see where the confusion lies. Standalone cluster mode, your link, is nothing but a combination of client mode and standalone mode, my link, without YARN. But I'm confused by this paragraph in your link: If your application is launched through Spark submit, then the

Re: difference between ++ and Union of a RDD

2015-12-29 Thread Ted Yu
bq. same case with sc.parallelize() or sc.makeRDD() I think so. On Tue, Dec 29, 2015 at 10:50 AM, Gokula Krishnan D wrote: > Ted - Thanks for the updates. Then its the same case with sc.parallelize() > or sc.makeRDD() right. > > Thanks & Regards, > Gokula Krishnan*
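For reference, a sketch of the seq-based overload in SparkContext.scala, which simply forwards to parallelize (the second makeRDD overload, which takes preferred locations per element, has no parallelize equivalent):

    def makeRDD[T: ClassTag](seq: Seq[T], numSlices: Int = defaultParallelism): RDD[T] = withScope {
      parallelize(seq, numSlices)
    }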

Executor deregistered after 2mins (mesos, 1.6.0-rc4)

2015-12-29 Thread Adrian Bridgett
We're seeing an "Executor is not registered" error on a Spark (1.6.0-rc4, mesos-0.26) cluster. It seems as if the logic in MesosExternalShuffleService.scala isn't working for some reason (new in 1.6, I believe). The spark application sees this: ... 15/12/29 18:49:41 INFO

Re: Can't submit job to stand alone cluster

2015-12-29 Thread Andrew Or
Let me clarify a few things for everyone: There are three *cluster managers*: standalone, YARN, and Mesos. Each cluster manager can run in two *deploy modes*, client or cluster. In client mode, the driver runs on the machine that submitted the application (the client). In cluster mode, the driver

RE: Problem with WINDOW functions?

2015-12-29 Thread Cheng, Hao
It's not released yet; probably you need to compile it yourself. In the meantime, can you increase the partition number by setting "spark.sql.shuffle.partitions" to a bigger value? More details about your cluster size, partition size, yarn/standalone, executor resources etc. will be
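A minimal sketch of that suggestion (800 is an arbitrary example value; the default is 200):

    // raise the number of post-shuffle partitions before running the window query
    sqlContext.setConf("spark.sql.shuffle.partitions", "800")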

Re: Re: how to use sparkR or spark MLlib to load a csv file on hdfs then calculate covariance

2015-12-29 Thread Sourav Mazumder
Alternatively you can also try the ML library from System ML ( http://systemml.apache.org/) for covariance computation on Spark. Regards, Sourav On Mon, Dec 28, 2015 at 11:29 PM, Sun, Rui wrote: > Spark does not support computing cov matrix now. But there is a PR for > it.
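If staying within stock MLlib is an option, RowMatrix already exposes a covariance computation; a minimal sketch (the HDFS path and the all-numeric, header-less CSV layout are assumptions):

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix
    // parse each CSV line on HDFS into a dense vector of doubles
    val rows = sc.textFile("hdfs:///path/to/data.csv")
      .map(line => Vectors.dense(line.split(",").map(_.toDouble)))
    // computeCovariance() returns a local covariance Matrix on the driver
    val cov = new RowMatrix(rows).computeCovariance()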

RE: Does Spark SQL support rollup like HQL

2015-12-29 Thread Cheng, Hao
Hi, currently the simple SQL parser of SQLContext is quite weak and doesn't support rollup, but you can check the code at https://github.com/apache/spark/pull/5080/, which aimed to add the support, in case you want to patch it into your own branch. In Spark 2.0, the simple SQL parser will

Does Spark SQL support rollup like HQL

2015-12-29 Thread Yi Zhang
Hi guys, As we know, hiveContext supports rollup like this: hiveContext.sql("select a, b, sum(c) from t group by a, b with rollup") And I also know that DataFrame provides a rollup function to support it: dataframe.rollup($"a", $"b").agg(Map("c" -> "sum")) But in my scenario, I'd better use sql
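A self-contained sketch of the DataFrame route in the meantime (the toy data and column names a, b, c follow the example above):

    import org.apache.spark.sql.functions.sum
    import sqlContext.implicits._
    val df = Seq(("x", "p", 1), ("x", "q", 2), ("y", "p", 3)).toDF("a", "b", "c")
    // rollup yields subtotals for (a, b) and (a), plus the grand total row
    df.rollup($"a", $"b").agg(sum($"c")).show()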

Re: Does Spark SQL support rollup like HQL

2015-12-29 Thread Davies Liu
Just sent out a PR [1] to support cube/rollup as a function; it works with both SQLContext and HiveContext. [1] https://github.com/apache/spark/pull/10522/files On Tue, Dec 29, 2015 at 9:35 PM, Yi Zhang wrote: > Hi Hao, > > Thanks. I'll take a look at it. > > > On

Re: Does Spark SQL support rollup like HQL

2015-12-29 Thread Yi Zhang
Hi Hao, Thanks. I'll take a look at it. On Wednesday, December 30, 2015 12:47 PM, "Cheng, Hao" wrote:

RE: Spark 1.5.2 compatible spark-cassandra-connector

2015-12-29 Thread vivek.meghanathan
Thank you mwy and Sun for your response. Yes, basic things are working for me using this connector (the guava issue was encountered earlier, but with proper exclusion of the old version we have resolved it). The current issue is a strange one: we have a kafka-spark-cassandra streaming job in Spark. The

Re: Timestamp datatype in dataframe + Spark 1.4.1

2015-12-29 Thread Hyukjin Kwon
I see. As far as I know, the Spark CSV datasource does not support custom date formats, only standard ones such as "2015-08-20 15:57:00". Internally it uses Timestamp.valueOf() and Date.valueOf() to parse them. As I see it, you can 1. modify and build the library by yourself for custom date
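A workaround that avoids touching the library is to read the column as a plain string and convert it with a UDF; a minimal sketch (the yyyy/MM/dd pattern and the column names are assumptions):

    import java.sql.Date
    import java.text.SimpleDateFormat
    import org.apache.spark.sql.functions.{col, udf}
    // parse an Excel-style yyyy/MM/dd string into java.sql.Date
    val toDate = udf { s: String =>
      new Date(new SimpleDateFormat("yyyy/MM/dd").parse(s).getTime)
    }
    val fixed = df.withColumn("event_date", toDate(col("event_date_str")))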

Re: [Spakr1.4.1] StuctField for date column in CSV file while creating custom schema

2015-12-29 Thread Raghavendra Pandey
You can use the date type... On Dec 29, 2015 9:02 AM, "Divya Gehlot" wrote: > Hi, > I am a newbie to Spark. > My apologies for such a naive question. > I am using Spark 1.4.1 and writing code in Scala. I have input data as > a CSV file which I am parsing using the spark-csv package
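A minimal sketch of such a schema with spark-csv (the file path, column names, and header option are assumptions; note the Timestamp datatype thread's caveat that the date strings must be in a form Date.valueOf() accepts):

    import org.apache.spark.sql.types._
    val schema = StructType(Seq(
      StructField("id", IntegerType, nullable = true),
      StructField("created", DateType, nullable = true)))
    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .schema(schema)
      .load("hdfs:///path/to/input.csv")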

Re: ClassNotFoundException when executing spark jobs in standalone/cluster mode on Spark 1.5.2

2015-12-29 Thread Prem Spark
You need to make sure this class is accessible to all servers, since in cluster mode the driver can be on any of the worker nodes. On Fri, Dec 25, 2015 at 5:57 PM, Saiph Kappa wrote: > Hi, > > I'm submitting a spark job like this: > >

Re: ClassNotFoundException when executing spark jobs in standalone/cluster mode on Spark 1.5.2

2015-12-29 Thread Saiph Kappa
I found out that by commenting out this line in the application code: sparkConf.set("spark.executor.extraJavaOptions", " -XX:+UseCompressedOops -XX:+UseConcMarkSweepGC -XX:+AggressiveOpts -XX:FreqInlineSize=300 -XX:MaxInlineSize=300 ") the exception does not occur anymore. Not entirely sure why, but

Spark 1.5.2 compatible spark-cassandra-connector

2015-12-29 Thread vivek.meghanathan
All, What is the compatible spark-cassandra-connector for Spark 1.5.2? I can only find the latest connector version, spark-cassandra-connector_2.10-1.5.0-M3, which has a dependency on Spark 1.5.1. Can we use the same for 1.5.2? Are there any classpath issues that need to be handled or any jars that need to be

Re: Stuck with DataFrame df.select("select * from table");

2015-12-29 Thread Eugene Morozov
Annabel, That might work in Scala, but I use Java. Triple quotes just don't compile =) If your example is in Scala then, I believe, the semicolon is not required. -- Be well! Jean Morozov On Mon, Dec 28, 2015 at 8:49 PM, Annabel Melongo wrote: > Jean, > > Try this: > >

Re: Can't submit job to stand alone cluster

2015-12-29 Thread Greg Hill
On 12/28/15, 5:16 PM, "Daniel Valdivia" wrote: >Hi, > >I'm trying to submit a job to a small spark cluster running in stand >alone mode, however it seems like the jar file I'm submitting to the >cluster is "not found" by the worker nodes. > >I might have understood