Re: Filter RDD

2015-10-19 Thread Ted Yu
See the filter() method: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L334 Cheers On Mon, Oct 19, 2015 at 4:27 PM, Shepherd wrote: > Hi all, > I have a very simple question. > I have a RDD, saying r1, which contains 5
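
For anyone landing on this thread from the archive, a minimal filter() sketch in spark-shell (the contents of r1 and the predicate are invented, since the original question is truncated):

    val r1 = sc.parallelize(Seq(1, 2, 3, 4, 5))
    // filter() keeps only the elements for which the predicate returns true; it is lazy
    val kept = r1.filter(x => x > 3)
    kept.collect()   // Array(4, 5)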

Re: callUdf("percentile_approx",col("mycol"),lit(0.25)) does not compile spark 1.5.1 source but it does work in spark 1.5.1 bin

2015-10-18 Thread Ted Yu
the JIRA right? > > On Sun, Oct 18, 2015 at 9:20 PM, Ted Yu <yuzhih...@gmail.com> wrote: > >> The udf is defined in GenericUDAFPercentileApprox of hive. >> >> When spark-shell runs, it has access to the above class which is packaged >> in assembly/target/sc

Re: callUdf("percentile_approx",col("mycol"),lit(0.25)) does not compile spark 1.5.1 source but it does work in spark 1.5.1 bin

2015-10-18 Thread Ted Yu
ew thread following old thread looks like code for compiling > callUdf("percentile_approx",col("mycol"),lit(0.25)) is not merged in spark > 1.5.1 source but I dont understand why this function call works in Spark > 1.5.1 spark-shell/bin. Please guide. > > ------

Re: our spark gotchas report while creating batch pipeline

2015-10-18 Thread Ted Yu
Interesting reading material. bq. transformations that loose partitioner: lose partitioner. bq. Spark looses the partitioner: loses the partitioner. bq. Tunning number of partitions: should be tuning. bq. or increase shuffle fraction bq. ShuffleMemoryManager: Thread 61 ... Hopefully SPARK-1

Re: No suitable Constructor found while compiling

2015-10-18 Thread Ted Yu
I see two argument ctor. e.g. /** Construct an RDD with just a one-to-one dependency on one parent */ def this(@transient oneParent: RDD[_]) = this(oneParent.context , List(new OneToOneDependency(oneParent))) Looks like Tuple in your code is T in the following: abstract class RDD[T:

Re: Convert SchemaRDD to RDD

2015-10-16 Thread Ted Yu
bq. type mismatch found String required Serializable See line 110: http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/lang/String.java#109 Can you pastebin the complete stack trace for the error you encountered ? Cheers On Fri, Oct 16, 2015 at 8:01 AM, satish

Re: HBase Spark Streaming giving error after restore

2015-10-16 Thread Ted Yu
Can you show the complete stack trace ? Subclass of Mutation is expected. Put is a subclass. Have you tried replacing BoxedUnit with Put in your code ? Cheers On Fri, Oct 16, 2015 at 6:02 AM, Amit Singh Hora wrote: > Hi All, > > I am using below code to stream data from

Re: Convert SchemaRDD to RDD

2015-10-16 Thread Ted Yu
Have you seen this thread ? http://search-hadoop.com/m/q3RTt9YBFr17u8j8=Scala+Limitation+Case+Class+definition+with+more+than+22+arguments On Fri, Oct 16, 2015 at 7:41 AM, satish chandra j wrote: > Hi All, > To convert SchemaRDD to RDD below snipped is working if SQL

Re: Application not found in Spark historyserver in yarn-client mode

2015-10-14 Thread Ted Yu
Which Spark release are you using ? Thanks On Wed, Oct 14, 2015 at 4:20 PM, Anfernee Xu wrote: > Hi, > > Here's the problem I'm facing, I have a standalone java application which > is periodically submit Spark jobs to my yarn cluster, btw I'm not using > 'spark-submit'

Re: writing to hive

2015-10-14 Thread Ted Yu
Can you show your query ? Thanks > On Oct 13, 2015, at 12:29 AM, Hafiz Mujadid wrote: > > hi! > > I am following this > > > tutorial to read and write from hive. But i am facing

Re: unresolved dependency: org.apache.spark#spark-streaming_2.10;1.5.0: not found

2015-10-14 Thread Ted Yu
This might be related : http://search-hadoop.com/m/q3RTta8AxS1UjMSI=Cannot+get+spark+streaming_2+10+1+5+0+pom+from+the+maven+repository > On Oct 12, 2015, at 11:30 PM, Akhil Das wrote: > > You need to add "org.apache.spark" % "spark-streaming_2.10" % "1.5.0" to the

Re: Building with SBT and Scala 2.11

2015-10-14 Thread Ted Yu
Adrian: Likely you were using maven. Jakob's report was with sbt. Cheers On Tue, Oct 13, 2015 at 10:05 PM, Adrian Tanase wrote: > Do you mean hadoop-2.4 or 2.6? not sure if this is the issue but I'm also > compiling the 1.5.1 version with scala 2.11 and hadoop 2.6 and it

Re: How to calculate percentile of a column of DataFrame?

2015-10-13 Thread Ted Yu
wrote: > Hi Ted if fix went after 1.5.1 release then how come it's working with > 1.5.1 binary in spark-shell. > On Oct 13, 2015 1:32 PM, "Ted Yu" <yuzhih...@gmail.com> wrote: > >> Looks like the fix went in after 1.5.1 was released. >> >> You may verify

Re: How to calculate percentile of a column of DataFrame?

2015-10-13 Thread Ted Yu
works using 1.5.1 but it doesn't compile in Java using 1.5.1 > maven libraries it still complains same that callUdf can have string and > column types only. Please guide. > >> On Oct 13, 2015 12:34 AM, "Ted Yu" <yuzhih...@gmail.com> wrote: >> SQL context availa

Re: How to calculate percentile of a column of DataFrame?

2015-10-13 Thread Ted Yu
;umesh.ka...@gmail.com> wrote: >>> >>> Hi Ted, thanks much I tried using percentile_approx in Spark-shell like you >>> mentioned it works using 1.5.1 but it doesn't compile in Java using 1.5.1 >>> maven libraries it still complains same that callUdf can have s

Re: How to calculate percentile of a column of DataFrame?

2015-10-13 Thread Ted Yu
callUDF("percentile_approx",col("mycol"), lit(0.25))) >> >> I am using Intellij editor java and maven dependencies of spark core >> spark sql spark hive version 1.5.1 >> On Oct 13, 2015 18:21, "Ted Yu" <yuzhih...@gmail.com> wrote: >> &

Re: How to calculate percentile of a column of DataFrame?

2015-10-13 Thread Ted Yu
line doesn't compile in my spark job > > sourceframe.select(callUDF("percentile_approx",col("mycol"), lit(0.25))) > > I am using Intellij editor java and maven dependencies of spark core spark > sql spark hive version 1.5.1 > On Oct 13, 2015 18:21, "Ted

Re: Building with SBT and Scala 2.11

2015-10-13 Thread Ted Yu
See this thread: http://search-hadoop.com/m/q3RTtY7aX22B44dB On Tue, Oct 13, 2015 at 5:53 PM, Jakob Odersky wrote: > I'm having trouble compiling Spark with SBT for Scala 2.11. The command I > use is: > > dev/change-version-to-2.11.sh > build/sbt -Pyarn -Phadoop-2.11

Re: Cannot get spark-streaming_2.10-1.5.0.pom from the maven repository

2015-10-13 Thread Ted Yu
Still 404 as of a moment ago. On Mon, Oct 12, 2015 at 9:04 PM, Ted Yu <yuzhih...@gmail.com> wrote: > I checked commit history of streaming/pom.xml > > There should be no difference between 1.5.0 and 1.5.1 > > You can download 1.5.1's pom.xml and rename it so that you get

Re: TaskMemoryManager. cleanUpAllAllocatedMemory -> Memory leaks ???

2015-10-12 Thread Ted Yu
Please note the block where cleanUpAllAllocatedMemory() is called: } finally { val freedMemory = taskMemoryManager.cleanUpAllAllocatedMemory() if (freedMemory > 0) { I think the intention is that allocated memory should have been freed by the time we reach the finally

Re: How to calculate percentile of a column of DataFrame?

2015-10-12 Thread Ted Yu
Umesh: Have you tried calling callUdf without the lit() parameter ? Cheers On Mon, Oct 12, 2015 at 6:27 AM, Umesh Kacha wrote: > Hi if you can help it would be great as I am stuck don't know how to > remove compilation error in callUdf when we pass three parameters

Re: Spark job is running infinitely

2015-10-12 Thread Ted Yu
Do you have monitoring put in place to detect 'no space left' scenario ? By 'way to kill job', do you mean automatic kill ? Please include the release of Spark, command line for 'spark-submit' in your reply. Thanks On Mon, Oct 12, 2015 at 10:07 AM, Saurav Sinha wrote:

Re: How to calculate percentile of a column of DataFrame?

2015-10-12 Thread Ted Yu
Using spark-shell, I did the following exercise (master branch) : SQL context available as sqlContext. scala> val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id", "value") df: org.apache.spark.sql.DataFrame = [id: string, value: int] scala> sqlContext.udf.register("simpleUDF", (v: Int,
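
The transcript above is cut off by the archive. A hedged reconstruction of the same spark-shell pattern discussed in this thread, registering a UDF and invoking it through callUDF with a lit() argument (the UDF body is an assumption, not the original):

    import org.apache.spark.sql.functions.{callUDF, col, lit}
    import sqlContext.implicits._
    val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id", "value")
    // hypothetical two-argument UDF; the original message is truncated at this point
    sqlContext.udf.register("simpleUDF", (v: Int, w: Int) => v * w)
    df.select(col("id"), callUDF("simpleUDF", col("value"), lit(2))).show()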

Re: Spark job is running infinitely

2015-10-12 Thread Ted Yu
; Saurav > > On Mon, Oct 12, 2015 at 10:46 PM, Ted Yu <yuzhih...@gmail.com> wrote: > >> Do you have monitoring put in place to detect 'no space left' scenario ? >> >> By 'way to kill job', do you mean automatic kill ? >> >> Please include the release of

Re: Spark job is running infinitely

2015-10-12 Thread Ted Yu
service for me. > > Thanks, > Saurav > > On Mon, Oct 12, 2015 at 11:47 PM, Ted Yu <yuzhih...@gmail.com> wrote: > >> I would suggest you install monitoring service. >> 'no space left' condition would affect other services, not just Spark. >> >> For the s

Re: How to calculate percentile of a column of DataFrame?

2015-10-12 Thread Ted Yu
.select("col1").groupby("col1").agg(callUdf("percentile_approx",col("col1"),lit(0.25)));//compile >> error >> >> //compile error because callUdf() takes String and Column* as arguments. >> >> Please guide. Thanks much. >> >> O

Re: How to calculate percentile of a column of DataFrame?

2015-10-12 Thread Ted Yu
I use Spark 1.5.1 I cant see any partitions files orc files getting > created in HDFS I can see empty partitions directory under Hive table along > with many staging files created by spark. > > On Tue, Oct 13, 2015 at 12:34 AM, Ted Yu <yuzhih...@gmail.com> wrote: > >> S

Re: Cannot get spark-streaming_2.10-1.5.0.pom from the maven repository

2015-10-12 Thread Ted Yu
I got 404 as well. BTW 1.5.1 has been released. I was able to access: http://central.maven.org/maven2/org/apache/spark/spark-streaming_2.10/1.5.1/spark-streaming_2.10-1.5.1.pom FYI On Mon, Oct 12, 2015 at 8:09 PM, y wrote: > When I access the following URL, I often

Re: TaskMemoryManager. cleanUpAllAllocatedMemory -> Memory leaks ???

2015-10-12 Thread Ted Yu
Please take a look at the design doc attached to SPARK-1 The answer is on page 2 of that doc. On Mon, Oct 12, 2015 at 8:55 AM, Ted Yu <yuzhih...@gmail.com> wrote: > Please note the block where cleanUpAllAllocatedMemory() is called: > } finally { >

Re: Cannot get spark-streaming_2.10-1.5.0.pom from the maven repository

2015-10-12 Thread Ted Yu
gt; > Yes, I know 1.5.1 is available but I need to use 1.5.0 because I need to > run Spark applications on Cloud Dataproc ( > https://cloud.google.com/dataproc/ ) which supports only 1.5.0. > > On Tue, Oct 13, 2015 at 12:13 PM, Ted Yu <yuzhih...@gmail.com> wrote: > >> I

Re: Spark 1.5 - How to join 3 RDDs in a SQL DF?

2015-10-11 Thread Ted Yu
Some weekend reading: http://stackoverflow.com/questions/20022196/are-left-outer-joins-associative Cheers On Sun, Oct 11, 2015 at 5:32 PM, Cheng, Hao wrote: > A join B join C === (A join B) join C > > Semantically they are equivalent, right? > > > > *From:* Richard Eggert

Re: Why dataframe.persist(StorageLevels.MEMORY_AND_DISK_SER) hangs for long time?

2015-10-10 Thread Ted Yu
bq. all sort of optimizations like Tungsten For Tungsten, please use 1.5.1 release. On Sat, Oct 10, 2015 at 6:24 PM, Alex Rovner wrote: > How many executors are you running with? How many nodes in your cluster? > > > On Thursday, October 8, 2015, unk1102

Re: Cache in Spark

2015-10-09 Thread Ted Yu
For RDD, I found this method: def getStorageLevel: StorageLevel = storageLevel FYI On Fri, Oct 9, 2015 at 2:46 AM, vinod kumar wrote: > Thanks Natu, > > If so,Can you please share me the Spark SQL query to check whether the > given table is cached or not? if you
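
Two hedged ways to check cache status from spark-shell, based on the method above (the table name is a placeholder): getStorageLevel on the RDD side, and SQLContext.isCached for a registered table.

    import org.apache.spark.storage.StorageLevel
    val rdd = sc.parallelize(1 to 10).cache()
    // anything other than StorageLevel.NONE means the RDD has been marked for caching
    rdd.getStorageLevel != StorageLevel.NONE   // true
    // for a table registered with the SQL context
    sqlContext.isCached("my_table")            // "my_table" is a hypothetical name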

Re: How to handle the UUID in Spark 1.3.1

2015-10-09 Thread Ted Yu
This is related: SPARK-10501 On Fri, Oct 9, 2015 at 7:28 AM, java8964 wrote: > Hi, Sparkers: > > In this case, I want to use Spark as an ETL engine to load the data from > Cassandra, and save it into HDFS. > > Here is the environment specified information: > > Spark 1.3.1

Re: Datastore or DB for spark

2015-10-09 Thread Ted Yu
There are connectors for hbase, Cassandra, etc. Which data store do you use now ? Cheers > On Oct 9, 2015, at 3:10 AM, Rahul Jeevanandam wrote: > > Hi Guys, > > I wanted to know what is the databases that you associate with spark? > > -- > Regards, > Rahul J

Re: OutOfMemoryError

2015-10-09 Thread Ted Yu
You can add it in conf/spark-defaults.conf # spark.executor.extraJavaOptions -XX:+PrintGCDetails FYI On Fri, Oct 9, 2015 at 3:07 AM, Ramkumar V wrote: > How to increase the Xmx of the workers ? > > *Thanks*, > > > > On
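
A hedged spark-defaults.conf sketch of the above; the memory value is arbitrary, and spark.executor.memory (rather than a raw -Xmx flag) is the usual knob for executor heap size:

    # conf/spark-defaults.conf (illustrative values)
    spark.executor.memory             4g
    spark.executor.extraJavaOptions   -XX:+PrintGCDetails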

Re: Re: Re: Error in load hbase on spark

2015-10-09 Thread Ted Yu
gt; > > At 2015-10-09 11:04:35, "Ted Yu" <yuzhih...@gmail.com> wrote: > > The second code snippet is similar to: > examples//src/main/scala/org/apache/spark/examples/HBaseTest.scala > > See the comment in HBaseTest.scala : > // please ensure HBASE_CONF_

Re: Error in load hbase on spark

2015-10-09 Thread Ted Yu
Work for hbase-spark module is still ongoing https://issues.apache.org/jira/browse/HBASE-14406 > On Oct 9, 2015, at 6:18 AM, Guru Medasani wrote: > > Hi Roy, > > Here is a cloudera-labs project SparkOnHBase that makes it really simple to > read HBase data into Spark. > >

Re: How to handle the UUID in Spark 1.3.1

2015-10-09 Thread Ted Yu
I guess that should work :-) On Fri, Oct 9, 2015 at 10:46 AM, java8964 wrote: > Thanks, Ted. > > Does this mean I am out of luck for now? If I use HiveContext, and cast > the UUID as string, will it work? > > Yong > > -- > Date: Fri, 9 Oct 2015

Re: Applicative logs on Yarn

2015-10-08 Thread Ted Yu
This question seems better suited for u...@hadoop.apache.org FYI On Thu, Oct 8, 2015 at 10:37 AM, wrote: > Hello, > I submit spark streaming inside Yarn, I have configured yarn to generate > custom logs. > It works fine and yarn aggregate very well the logs inside HDFS, >

Re: ValueError: can not serialize object larger than 2G

2015-10-08 Thread Ted Yu
To fix the problem, consider increasing number of partitions for your job. Showing code snippet would help us understand your use case better. Cheers On Thu, Oct 8, 2015 at 1:39 PM, Ted Yu <yuzhih...@gmail.com> wrote: > See the comment of FramedSerializer() in seria

Re: ValueError: can not serialize object larger than 2G

2015-10-08 Thread Ted Yu
See the comment of FramedSerializer() in serializers.py : Serializer that writes objects as a stream of (length, data) pairs, where C{length} is a 32-bit integer and data is C{length} bytes. Hence the limit on the size of object. On Thu, Oct 8, 2015 at 12:56 PM, XIANDI

Re: unsubscribe

2015-10-08 Thread Ted Yu
Take a look at the first section of: http://spark.apache.org/community On Thu, Oct 8, 2015 at 2:10 PM, Jürgen Fey wrote: > >

Re: How to increase Spark partitions for the DataFrame?

2015-10-08 Thread Ted Yu
bq. contains 12 files/blocks Looks like you hit the limit of parallelism these files can provide. If you have larger dataset, you would have more partitions. On Thu, Oct 8, 2015 at 12:21 PM, Umesh Kacha wrote: > Hi Lan thanks for the reply. I have tried to do the
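
When more parallelism is wanted than the 12 input blocks provide, repartition() can force a higher partition count at the cost of a shuffle. A sketch, assuming df is the DataFrame from the thread and 100 is an arbitrary target:

    val repartitioned = df.repartition(100)   // triggers a shuffle
    repartitioned.rdd.partitions.length       // 100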

Re: Re: Error in load hbase on spark

2015-10-08 Thread Ted Yu
new SparkConf().setAppName("HBaseIntoSpark"); > JavaSparkContext sc = new JavaSparkContext(sparkConf); > Configuration conf = HBaseConfiguration.create(); > String tableName = "SecuMain"; > conf.set(TableInputFormat.INPUT_TABLE, tableName); > > also can't wok! > >

Re: Error in load hbase on spark

2015-10-08 Thread Ted Yu
One possibility was that hbase config, including hbase.zookeeper.quorum, was not passed to your job. hbase-site.xml should be on the classpath. Can you show snippet of your code ? Looks like you were running against hbase 1.x Cheers On Thu, Oct 8, 2015 at 7:29 PM, Roy Wang

Re: Asking about the trend of increasing latency, hbase spikes.

2015-10-07 Thread Ted Yu
This question should be directed to user@. Can you use a third-party site for the images? They didn't go through. On Wed, Oct 7, 2015 at 5:35 PM, UI-JIN LIM wrote: > Hi. This is Ui Jin, Lim in Korea, LG CNS > > > > We had setup and are operating hbase 0.98.13 on our customer,

Re: GenericMutableRow and Row mismatch on Spark 1.5?

2015-10-07 Thread Ted Yu
Hemant: Can you post the code snippet to the mailing list - other people would be interested. On Wed, Oct 7, 2015 at 5:50 AM, Hemant Bhanawat wrote: > Will send you the code on your email id. > > On Wed, Oct 7, 2015 at 4:37 PM, Ophir Cohen wrote: > >>

Re: does KafkaCluster can be public ?

2015-10-06 Thread Ted Yu
Or maybe annotate with @DeveloperApi Cheers On Tue, Oct 6, 2015 at 7:24 AM, Cody Koeninger wrote: > I personally think KafkaCluster (or the equivalent) should be made > public. When I'm deploying spark I just sed out the private[spark] and > rebuild. > > There's a general

Re: compatibility issue with Jersey2

2015-10-06 Thread Ted Yu
Maybe build Spark with -Djersey.version=2.9 ? Cheers On Tue, Oct 6, 2015 at 5:57 AM, oggie wrote: > I have some jersey compatibility issues when I tried to upgrade from 1.3.1 > to > 1.4.1.. > > We have a Java app written with spark 1.3.1. That app also uses Jersey 2.9 >

Re: ORC files created by Spark job can't be accessed using hive table

2015-10-06 Thread Ted Yu
See this thread: http://search-hadoop.com/m/q3RTtwwjNxXvPEe1 A brief search in Spark JIRAs didn't find anything opened on this subject. On Tue, Oct 6, 2015 at 8:51 AM, unk1102 wrote: > Hi I have a spark job which creates ORC files in partitions using the > following code

Re: API to run spark Jobs

2015-10-06 Thread Ted Yu
Please take a look at: org.apache.spark.deploy.rest.RestSubmissionClient which is used by core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala FYI On Tue, Oct 6, 2015 at 10:08 AM, shahid qadri wrote: > hi Jeff > Thanks > More specifically i need the Rest api

Re: Spark 1.3.1 on Yarn not using all given capacity

2015-10-06 Thread Ted Yu
Consider posting the question on the vendor's forum. HDP 2.3 comes with Spark 1.4 if I remember correctly. On Tue, Oct 6, 2015 at 9:05 AM, czoo wrote: > Hi, > > This post might be a duplicate with updates from another one (by me), sorry > in advance > > I have an HDP 2.3

Re: Exception: "You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt assembly"

2015-10-05 Thread Ted Yu
-1.5.0-bin-hadoop2.6" (windows > 7). > To launch spark i use the prompt command (dos): > bin\pyspark --jars "my_path_to_mysql_jdbc.jar" > > This command starts a notebook pyspark without errors. > > > 2015-10-05 18:29 GMT+02:00 Ted Yu <yuzhih...@gmail.com&g

Re: Error: could not find function "includePackage"

2015-10-05 Thread Ted Yu
includePackage is defined in R/pkg/R/context.R FYI On Mon, Oct 5, 2015 at 6:46 AM, jayendra.par...@yahoo.in < jayendra.par...@yahoo.in> wrote: > As mentioned on the website that “includePackage” command can be used to > include existing R packages, but when I am using this command R is giving >

Re: Spark on YARN using Java 1.8 fails

2015-10-05 Thread Ted Yu
YARN 2.7.1 (running on the cluster) was built with Java 1.8, I assume. Have you used the following command to retrieve / inspect logs ? yarn logs -applicationId Cheers On Mon, Oct 5, 2015 at 8:41 AM, mvle wrote: > Hi, > > I have successfully run pyspark on Spark 1.5.1 on YARN

Re: Exception: "You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt assembly"

2015-10-05 Thread Ted Yu
What command did you use to build Spark 1.5.0 ? bq. Export 'SPARK_HIVE=true' and run build/sbt assembly Please follow the above. BTW 1.5.1 has been released which is more stable. Please use 1.5.1 Cheers On Mon, Oct 5, 2015 at 9:25 AM, cherah30 wrote: > I work

Re: How to install a Spark Package?

2015-10-04 Thread Ted Yu
Are you talking about a package which is listed on http://spark-packages.org ? The package should come with installation instructions, right ? > On Oct 4, 2015, at 8:55 PM, jeff saremi wrote: > > So that it is available even in offline mode? I can't seem to be able to find

Re: java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-10-04 Thread Ted Yu
1.2.0 is quite old. You may want to try 1.5.1 which was released in the past week. Cheers > On Oct 4, 2015, at 4:26 AM, t_ras wrote: > > I get java.lang.OutOfMemoryError: GC overhead limit exceeded when trying > count action on a file. > > The file is a CSV file

Re: Mini projects for spark novice

2015-10-04 Thread Ted Yu
See https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark FYI On Sun, Oct 4, 2015 at 7:06 AM, Rahul Jeevanandam wrote: > I am currently learning Spark and I wanna solidify my knowledge on Spark, > hence I wanna do some projects on it. Can you suggest me

Re: saveAsTextFile creates an empty folder in HDFS

2015-10-03 Thread Ted Yu
bq. val dist = sc.parallelize(l) Following the above, can you call, e.g. count() on dist before saving ? Cheers On Fri, Oct 2, 2015 at 1:21 AM, jarias wrote: > Dear list, > > I'm experimenting a problem when trying to write any RDD to HDFS. I've > tried > with minimal
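
A minimal sketch of the suggested check, with a placeholder output path:

    val l = Seq(1, 2, 3, 4)
    val dist = sc.parallelize(l)
    dist.count()                                    // forces evaluation; confirms the RDD is non-empty
    dist.saveAsTextFile("hdfs:///tmp/dist-output")  // hypothetical path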

Re: Contribution in Apche Spark

2015-10-03 Thread Ted Yu
Please show more of your code snippet and the complete error. See also python/pyspark/tests.py for examples. Cheers On Fri, Oct 2, 2015 at 11:56 PM, Chintan Bhatt < chintanbhatt...@charusat.ac.in> wrote: > While typing following line into Hortonworks terminal, I'm getting *Syntax > Error:invalid

Re: WARN NettyRpcEndpointRef: Error sending message [message = Heartbeat(driver,...

2015-10-03 Thread Ted Yu
Did you use spark-shell ? In spark-shell, there can only be one running SparkContext which is created automatically. Cheers On Sat, Oct 3, 2015 at 11:27 AM, Jacek Laskowski wrote: > Hi, > > The following WARN happens in Spark built from today's sources. There > were some

Re: How to make sense of Spark log entries

2015-10-03 Thread Ted Yu
Every commonly seen error has been discussed multiple times. Meaning, you can find related discussions / JIRAs using indexing services, such as: http://search-hadoop.com/ Here is one related talk: http://www.slideshare.net/Hadoop_Summit/why-your-spark-job-is-failing FYI On Sat, Oct 3, 2015 at

Re: how to broadcast huge lookup table?

2015-10-02 Thread Ted Yu
Have you considered using external storage such as hbase for storing the look up table ? Cheers On Fri, Oct 2, 2015 at 11:50 AM, wrote: > I tried broadcasting a key-value rdd, but then I cannot perform any > rdd-actions inside a map/foreach function of another
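
The underlying issue is that an RDD cannot be referenced inside another RDD's map/foreach. Besides an external store such as HBase, a common pattern when the lookup table fits in driver memory is to collect it to a Map and broadcast it. A sketch with invented data:

    val lookupRdd = sc.parallelize(Seq(("k1", 1), ("k2", 2)))   // stand-in for the real lookup table
    val lookup = sc.broadcast(lookupRdd.collectAsMap())
    val dataRdd = sc.parallelize(Seq("k1", "k2", "k3"))
    // lookup.value is a plain Map, so it is safe to use inside the closure
    dataRdd.map(k => (k, lookup.value.get(k))).collect()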

Re: Reading JSON in Pyspark throws scala.MatchError

2015-10-02 Thread Ted Yu
I got the following when parsing your input with master branch (Python version 2.6.6): http://pastebin.com/1w8WM3tz FYI On Fri, Oct 2, 2015 at 1:42 PM, balajikvijayan wrote: > Running Windows 8.1, Python 2.7.x, Scala 2.10.5, Spark 1.4.1. > > I'm trying to read in a

Re: How to connect HadoopHA from spark

2015-10-01 Thread Ted Yu
Have you setup HADOOP_CONF_DIR in spark-env.sh correctly ? Cheers On Thu, Oct 1, 2015 at 5:22 AM, Vinoth Sankar wrote: > Hi, > > How do i connect HadoopHA from SPARK. I tried overwriting hadoop > configurations from sparkCong. But Still I'm getting UnknownHostException >
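
A conf/spark-env.sh sketch; the path is an example and should point at the directory containing the core-site.xml and hdfs-site.xml that define the HA nameservice:

    # conf/spark-env.sh
    export HADOOP_CONF_DIR=/etc/hadoop/conf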

Re: Pyspark: "Error: No main class set in JAR; please specify one with --class"

2015-10-01 Thread Ted Yu
In your second command, have you tried changing the comma to colon ? Cheers On Thu, Oct 1, 2015 at 8:56 AM, YaoPau wrote: > I'm trying to add multiple SerDe jars to my pyspark session. > > I got the first one working by changing my PYSPARK_SUBMIT_ARGS to: > > "--master

Re: How to access lost executor log file

2015-10-01 Thread Ted Yu
Can you go to YARN RM UI to find all the attempts for this Spark Job ? The two lost executors should be found there. On Thu, Oct 1, 2015 at 10:30 AM, Lan Jiang wrote: > Hi, there > > When running a Spark job on YARN, 2 executors somehow got lost during the > execution. The

Re: python version in spark-submit

2015-10-01 Thread Ted Yu
PYSPARK_PYTHON determines what the worker uses. PYSPARK_DRIVER_PYTHON is for driver. See the comment at the beginning of bin/pyspark FYI On Thu, Oct 1, 2015 at 1:56 PM, roy wrote: > Hi, > > We have python2.6 (default) on cluster and also we have installed > python2.7. > > I
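
An illustrative way to pin both the workers and the driver to python2.7 before submitting (the interpreter path and script name are assumptions about the cluster layout):

    export PYSPARK_PYTHON=/usr/bin/python2.7          # interpreter used by the executors/workers
    export PYSPARK_DRIVER_PYTHON=/usr/bin/python2.7   # interpreter used by the driver
    spark-submit my_job.py                            # my_job.py is a placeholder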

Re: How to access lost executor log file

2015-10-01 Thread Ted Yu
the application overview section. When I click it, it > brings me to the spark history server UI, where I cannot find the lost > exectuors. The only logs link I can find one the YARN RM site is the > ApplicationMaster log, which is not what I need. Did I miss something? > > Lan > > On Thu, O

Re: How to tell Spark not to use /tmp for snappy-unknown-***-libsnappyjava.so

2015-09-30 Thread Ted Yu
See the tail of this: https://bugzilla.redhat.com/show_bug.cgi?id=1005811 FYI > On Sep 30, 2015, at 5:54 AM, Dmitry Goldenberg > wrote: > > Is there a way to ensure Spark doesn't write to /tmp directory? > > We've got spark.local.dir specified in the

Re: ThrowableSerializationWrapper: Task exception could not be deserialized / ClassNotFoundException: org.apache.solr.common.SolrException

2015-09-30 Thread Ted Yu
ue > though it now may be harder to reproduce it. Thanks for the suggestion. > > On Tue, Sep 29, 2015 at 8:03 PM, Ted Yu <yuzhih...@gmail.com> wrote: > >> Have you tried the following ? >> --conf spark.driver.userClassPathFirst=true --conf spark.executor. >> user

Re: Hive alter table is failing

2015-09-29 Thread Ted Yu
Have you seen this thread ? http://search-hadoop.com/m/q3RTtGwP431AQ2B41 Plug in the metastore version for your deployment. Cheers > On Sep 29, 2015, at 5:20 AM, Ophir Cohen wrote: > Hi, > > I'm using Spark on top of Hive. > As I want to keep old tables I store the DataFrame

Re: "Method json([class java.util.HashMap]) does not exist" when reading JSON

2015-09-29 Thread Ted Yu
sqlContext.read.json() expects Path to the JSON file. FYI On Tue, Sep 29, 2015 at 7:23 AM, Fernando Paladini wrote: > Hello guys, > > I'm very new to Spark and I'm having some troubles when reading a JSON to > dataframe on PySpark. > > I'm getting a JSON object from an
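
A minimal path-based sketch (the file location is hypothetical); passing an already-parsed map/dict instead of a path is what triggers the error in this thread:

    // pass a path, not an in-memory object; Spark infers the schema from the JSON records
    val df = sqlContext.read.json("hdfs:///data/events.json")
    df.printSchema()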

Re: Executor Lost Failure

2015-09-29 Thread Ted Yu
Can you list the spark-submit command line you used ? Thanks On Tue, Sep 29, 2015 at 9:02 AM, Anup Sawant wrote: > Hi all, > Any idea why I am getting 'Executor heartbeat timed out' ? I am fairly new > to Spark so I have less knowledge about the internals of it. The

Re: SparkContext._active_spark_context returns None

2015-09-29 Thread Ted Yu
r reply. The sc works at driver, but how can I reach the > JVM in rdd.map ? > > 2015-09-29 11:26 GMT+08:00 Ted Yu <yuzhih...@gmail.com>: > >>>> sc._jvm.java.lang.Integer.valueOf("12") > > 12 > > > > FYI > > > > On Mon, Sep 28, 2015

Re: Does pyspark in cluster mode need python on individual executor nodes ?

2015-09-29 Thread Ted Yu
I think the answer is yes. Code packaged in pyspark.zip needs python to execute. On Tue, Sep 29, 2015 at 2:08 PM, Ranjana Rajendran < ranjana.rajend...@gmail.com> wrote: > Hi, > > Does a python spark program (which makes use of pyspark ) submitted in > cluster mode need python on the executor

Re: ThrowableSerializationWrapper: Task exception could not be deserialized / ClassNotFoundException: org.apache.solr.common.SolrException

2015-09-29 Thread Ted Yu
Mind providing a bit more information: release of Spark command line for running Spark job Cheers On Tue, Sep 29, 2015 at 1:37 PM, Dmitry Goldenberg wrote: > We're seeing this occasionally. Granted, this was caused by a wrinkle in > the Solr schema but this bubbled

Re: ThrowableSerializationWrapper: Task exception could not be deserialized / ClassNotFoundException: org.apache.solr.common.SolrException

2015-09-29 Thread Ted Yu
re's some class loading pattern here > where some classes may not get loaded out of the consumer jar and therefore > have to have their respective jars added to the executor extraClassPath? > > Or is this a serialization problem for SolrException as Divya > Ravichandran sugges

Re: input file from tar.gz

2015-09-29 Thread Ted Yu
The syntax using '#' is not supported by hdfs natively. YARN resource localization supports such notion. See http://hadoop.apache.org/docs/r2.7.1/hadoop-mapreduce-client/hadoop-mapreduce-client-core/DistributedCacheDeploy.html Not sure about Spark. On Tue, Sep 29, 2015 at 11:39 AM, Peter

Re: How to set System environment variables in Spark

2015-09-29 Thread Ted Yu
Please see 'spark.executorEnv.[EnvironmentVariableName]' in https://spark.apache.org/docs/latest/configuration.html#runtime-environment FYI On Tue, Sep 29, 2015 at 12:29 PM, swetha wrote: > > Hi, > > How to set System environment variables when submitting a job?
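
A hedged SparkConf sketch of the same setting; the variable name is invented, and the equivalent spark-submit form is --conf spark.executorEnv.MY_ENV_VAR=some-value:

    import org.apache.spark.SparkConf
    // sets MY_ENV_VAR in the environment of every executor
    val conf = new SparkConf().setExecutorEnv("MY_ENV_VAR", "some-value")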

Re: SparkContext._active_spark_context returns None

2015-09-28 Thread Ted Yu
>>> sc._jvm.java.lang.Integer.valueOf("12") 12 FYI On Mon, Sep 28, 2015 at 8:08 PM, YiZhi Liu wrote: > Hi, > > I'm doing some data processing on pyspark, but I failed to reach JVM > in workers. Here is what I did: > > $ bin/pyspark > >>> data = sc.parallelize(["123",

Re: Spark SQL: Implementing Custom Data Source

2015-09-28 Thread Ted Yu
See this thread: http://search-hadoop.com/m/q3RTttmiYDqGc202 And: http://spark.apache.org/docs/latest/sql-programming-guide.html#data-sources > On Sep 28, 2015, at 8:22 PM, Jerry Lam wrote: > > Hi spark users and developers, > > I'm trying to learn how implement a

Re: java.lang.ClassCastException (org.apache.spark.scheduler.ResultTask cannot be cast to org.apache.spark.scheduler.Task)

2015-09-28 Thread Ted Yu
Please see SPARK-8142 On Mon, Sep 28, 2015 at 1:45 PM, amitra123 wrote: > Hello All, > > I am trying to write a very simply Spark Streaming example problem and I m > getting this exception. I am new to Spark and I am not quite sure why this > exception is thrown. Wondering

Re: Update cassandra rows problem

2015-09-28 Thread Ted Yu
Please consider posting on DataStax's mailing list for question w.r.t. spark cassandra connector On Mon, Sep 28, 2015 at 6:59 AM, amine_901 wrote: > Hello all, > i'm using spark 1.2 with spark cassandra connector 1.2.3, > i'm trying to update somme rows of table:

Re: HDFS is undefined

2015-09-28 Thread Ted Yu
Please post the question on vendor's forum. > On Sep 25, 2015, at 7:13 AM, Angel Angel wrote: > > hello, > I am running the spark application. > > I have installed the cloudera manager. > it includes the spark version 1.2.0 > > > But now i want to use spark version

Re: CassandraSQLContext throwing NullPointer Exception

2015-09-28 Thread Ted Yu
Which Spark release are you using ? Can you show the snippet of your code around CassandraSQLContext#sql() ? Thanks On Mon, Sep 28, 2015 at 6:21 AM, Priya Ch wrote: > Hi All, > > I am trying to use dataframes (which contain data from cassandra) in > rdd.foreach.

Re: queup jobs in spark cluster

2015-09-26 Thread Ted Yu
Related thread: http://search-hadoop.com/m/q3RTt31EUSYGOj82 Please see: https://spark.apache.org/docs/latest/security.html FYI On Sat, Sep 26, 2015 at 4:03 PM, manish ranjan wrote: > Dear All, > > I have a small spark cluster for academia purpose and would like it to be

Re: java.io.NotSerializableException: org.apache.avro.Schema$RecordSchema

2015-09-25 Thread Ted Yu
Is the Schema.parse() call expensive ? Can you call it in the closure ? On Fri, Sep 25, 2015 at 10:06 AM, Daniel Haviv < daniel.ha...@veracity-group.com> wrote: > Hi, > I'm getting a NotSerializableException even though I'm creating all the my > objects from within the closure: > import
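
One common workaround, sketched below with a made-up schema string and stand-in input, is to parse the schema inside mapPartitions so that no Avro Schema object has to be serialized from the driver; whether re-parsing per partition is cheap enough depends on the schema size:

    import org.apache.avro.Schema
    val schemaJson =
      """{"type":"record","name":"Rec","fields":[{"name":"id","type":"string"}]}"""  // placeholder schema
    val rdd = sc.parallelize(Seq("a", "b"))   // stand-in input
    rdd.mapPartitions { iter =>
      // parsed once per partition, on the executor, so the Schema never crosses the wire
      val schema = new Schema.Parser().parse(schemaJson)
      iter.map(line => schema.getName + ":" + line)
    }.collect()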

Re: No space left on device when running graphx job

2015-09-24 Thread Ted Yu
Andy: Can you show complete stack trace ? Have you checked there are enough free inode on the .129 machine ? Cheers > On Sep 23, 2015, at 11:43 PM, Andy Huang wrote: > > Hi Jack, > > Are you writing out to disk? Or it sounds like Spark is spilling to disk (RAM >

Re: Spark ClosureCleaner or java serializer OOM when trying to grow

2015-09-24 Thread Ted Yu
Please decrease spark.serializer.objectStreamReset for your queries. The default value is 100. I logged SPARK-10787 for improvement. Cheers On Wed, Sep 23, 2015 at 6:59 PM, jluan wrote: > I have been stuck on this problem for the last few days: > > I am attempting to run
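
A sketch of lowering the setting (10 is an arbitrary value below the default of 100):

    import org.apache.spark.SparkConf
    // resets the JavaSerializer's object stream more often, releasing back-references sooner
    val conf = new SparkConf().set("spark.serializer.objectStreamReset", "10")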

Re: How to control spark.sql.shuffle.partitions per query

2015-09-23 Thread Ted Yu
Please take a look at the following for example: sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2Suites.scala Search for spark.sql.shuffle.partitions and SQLConf.SHUFFLE_PARTITIONS.key FYI On Wed, Sep 23, 2015 at 12:42 AM, tridib
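
In application code the setting can also be flipped per query through setConf, e.g. (table name and values are illustrative):

    sqlContext.setConf("spark.sql.shuffle.partitions", "10")   // for the next, smaller query
    val small = sqlContext.sql("SELECT key, count(*) FROM t GROUP BY key")
    sqlContext.setConf("spark.sql.shuffle.partitions", "200")  // restore the default afterwards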

Re: How to turn off Jetty Http stack errors on Spark web

2015-09-23 Thread Ted Yu
Have you read this ? http://stackoverflow.com/questions/2246074/how-do-i-hide-stack-traces-in-the-browser-using-jetty On Wed, Sep 23, 2015 at 6:56 AM, Rafal Grzymkowski wrote: > Hi, > > Is it possible to disable Jetty stack trace with errors on Spark > master:8080 ? > When I trigger

Re: Spark as standalone or with Hadoop stack.

2015-09-23 Thread Ted Yu
HDFS on Mesos framework is still being developed. What I said previously reflected current deployment practice. Things may change in the future. On Tue, Sep 22, 2015 at 4:02 PM, Jacek Laskowski <ja...@japila.pl> wrote: > On Tue, Sep 22, 2015 at 10:03 PM, Ted Yu <yuzhih...@gmai

Re: How to obtain the key in updateStateByKey

2015-09-23 Thread Ted Yu
def updateStateByKey[S: ClassTag]( updateFunc: (Iterator[(K, Seq[V], Option[S])]) => Iterator[(K, S)], updateFunc is given an iterator. You can access the key with _1 on the iterator. On Wed, Sep 23, 2015 at 3:01 PM, swetha wrote: > Hi, > > How to obtain the
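
A hedged sketch of that overload (which also takes a partitioner and a rememberPartitioner flag), assuming a DStream[(String, Int)] named pairDStream and a checkpointed StreamingContext; the update logic itself is invented:

    import org.apache.spark.HashPartitioner
    // each tuple carries the key in _1, so the key is visible to the update logic
    val updateFunc = (iter: Iterator[(String, Seq[Int], Option[Int])]) =>
      iter.map { case (key, values, state) => (key, values.sum + state.getOrElse(0)) }
    val stateStream = pairDStream.updateStateByKey[Int](
      updateFunc, new HashPartitioner(4), true)   // true = remember the partitioner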

Re: Spark as standalone or with Hadoop stack.

2015-09-22 Thread Ted Yu
bq. it's relatively harder to use it with HBase I agree with Sean. I work on HBase. To my knowledge, no one runs HBase on top of Mesos. On Tue, Sep 22, 2015 at 12:31 PM, Sean Owen wrote: > Who told you Mesos would make Spark 100x faster? does it make sense > that just the

Re: Scala Limitation - Case Class definition with more than 22 arguments

2015-09-22 Thread Ted Yu
Can you switch to 2.11 ? The following has been fixed in 2.11: https://issues.scala-lang.org/browse/SI-7296 Otherwise consider packaging related values into a case class of their own. On Tue, Sep 22, 2015 at 8:48 PM, satish chandra j wrote: > HI All, > Do we have any
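
A sketch of the second suggestion, with invented field names: group related columns into nested case classes so that no single case class needs more than 22 fields on Scala 2.10:

    // instead of one flat case class with 25+ fields
    case class Name(first: String, last: String)
    case class Address(street: String, city: String, zip: String)
    case class Customer(id: Long, name: Name, address: Address)   // further groups can be added the same way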

Re: Heap Space Error

2015-09-22 Thread Ted Yu
Have you tried suggestions given in this thread ? http://stackoverflow.com/questions/26256061/using-itext-java-lang-outofmemoryerror-requested-array-size-exceeds-vm-limit Can you pastebin complete stack trace ? What release of Spark are you using ? Cheers > On Sep 22, 2015, at 4:28 AM, Yusuf

Re: Why is 1 executor overworked and other sit idle?

2015-09-22 Thread Ted Yu
Have you tried using repartition to spread the load ? Cheers > On Sep 22, 2015, at 4:22 AM, Chirag Dewan wrote: > > Hi, > > I am using Spark to access around 300m rows in Cassandra. > > My job is pretty simple as I am just mapping my row into a CSV format and >
