Re: Spark SQL create table

2016-01-18 Thread Ted Yu
Please take a look at sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveDataFrameSuite.scala On Mon, Jan 18, 2016 at 9:57 AM, raghukiran wrote: > Is creating a table using the SparkSQLContext currently supported? > > Regards, > Raghu > > > > -- > View this message in

Re: Spark SQL create table

2016-01-18 Thread Ted Yu
c Regards On Mon, Jan 18, 2016 at 10:28 AM, Raghu Ganti <raghuki...@gmail.com> wrote: > This requires Hive to be installed and uses HiveContext, right? > > What is the SparkSQLContext useful for? > > On Mon, Jan 18, 2016 at 1:27 PM, Ted Yu <yuzhih...@gmail.com> wrot
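
For reference, persisting a table in Spark 1.x goes through HiveContext rather than the plain SQLContext; HiveContext needs a Hive-enabled Spark build but not necessarily a full Hive install (it can use the built-in local metastore). A minimal sketch of the standard pattern (the table definition is made up):

  import org.apache.spark.sql.hive.HiveContext

  val hiveContext = new HiveContext(sc)
  hiveContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")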

Re: rdd.foreach return value

2016-01-18 Thread Ted Yu
Here is signature for foreach: def foreach(f: T => Unit): Unit = withScope { I don't think you can return elements in the way shown in the snippet. On Mon, Jan 18, 2016 at 7:34 PM, charles li wrote: > code snippet > > > > the 'print' actually print info on the worker
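
Since foreach returns Unit, anything printed inside it happens on the executors and nothing flows back. A minimal sketch of collecting results to the driver instead (assumes an existing SparkContext named sc):

  val rdd = sc.parallelize(Seq(1, 2, 3))
  // map runs on the executors; collect() ships the results back to the driver
  val doubled = rdd.map(_ * 2).collect()
  doubled.foreach(println) // prints on the driver, not in the worker logs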

Re: rdd.foreach return value

2016-01-18 Thread Ted Yu
Great thanks again > > > > > >> On Tue, Jan 19, 2016 at 11:44 AM, Ted Yu <yuzhih...@gmail.com> wrote: >> Here is signature for foreach: >> def foreach(f: T => Unit): Unit = withScope { >> >> I don't think you can return elements in the way

Re: using spark context in map function Task not serializable error

2016-01-18 Thread Ted Yu
Can you pass the properties which are needed for accessing Cassandra without going through SparkContext ? SparkContext isn't designed to be used in the way illustrated below. Cheers On Mon, Jan 18, 2016 at 12:29 PM, gpatcham wrote: > Hi, > > I have a use case where I need
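
A sketch of the suggested workaround: capture only plain serializable values in the closure and build any client connection per partition on the executor (rdd is assumed to exist; the Session class below is a stand-in, not a real Cassandra client):

  // stand-in for a real client, for illustration only
  class Session(host: String) {
    def write(s: String): Unit = println(s"[$host] $s")
    def close(): Unit = ()
  }

  val host = "127.0.0.1" // a plain String serializes fine; SparkContext does not
  rdd.foreachPartition { rows =>
    val session = new Session(host) // created on the executor, once per partition
    rows.foreach(r => session.write(r.toString))
    session.close()
  }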

Re: building spark 1.6 throws error Rscript: command not found

2016-01-18 Thread Ted Yu
Please see: http://www.jason-french.com/blog/2013/03/11/installing-r-in-linux/ On Mon, Jan 18, 2016 at 1:22 PM, Mich Talebzadeh wrote: > ./make-distribution.sh --name custom-spark --tgz -Psparkr -Phadoop-2.6 > -Phive -Phive-thriftserver -Pyarn > > > > > > INFO] ---

Re: using spark context in map function Task not serializable error

2016-01-18 Thread Ted Yu
<gpatc...@gmail.com> wrote: > >> I'm using spark cassandra connector to do this and the way we access >> cassandra table is >> >> sc.cassandraTable("keySpace", "tableName") >> >> Thanks >> Giri >> >> On Mon, Jan 18, 2016

Re: using spark context in map function Task not serializable error

2016-01-18 Thread Ted Yu
, Jan 18, 2016 at 1:44 PM, Giri P <gpatc...@gmail.com> wrote: > yes I tried doing that but that doesn't work. > > I'm looking at using SQLContext and dataframes. Is SQLCOntext serializable? > > On Mon, Jan 18, 2016 at 1:29 PM, Ted Yu <yuzhih...@gmail.com> wrote: >

Re: SQL UDF problem (with re to types)

2016-01-17 Thread Ted Yu
wrote: >>> >>>> So, when I try BigDecimal, it works. But, should it not parse based on >>>> what the UDF defines? Am I missing something here? >>>> >>>> On Wed, Jan 13, 2016 at 4:57 PM, Ted Yu <yuzhih...@gmail.com> wrote: >>>> >>>>

Re: How to tune my spark application.

2016-01-17 Thread Ted Yu
In sampleArray(), there is a loop: for (i <- 0 until ARRAY_SAMPLE_SIZE) { ARRAY_SAMPLE_SIZE is a constant (100). Not clear how the amount of computation in sampleArray() can be reduced. Which Spark release are you using ? Thanks On Sun, Jan 17, 2016 at 6:22 AM, 张峻

Re: How to tune my spark application.

2016-01-17 Thread Ted Yu
; BR > > Julian Zhang > > On Jan 17, 2016, at 23:10, Ted Yu <yuzhih...@gmail.com> wrote: > > In sampleArray(), there is a loop: > for (i <- 0 until ARRAY_SAMPLE_SIZE) { > > ARRAY_SAMPLE_SIZE is a constant (100). > > Not clear how the amount of computation in samp

Re: Sending large objects to specific RDDs

2016-01-16 Thread Ted Yu
that I should consider should I approach it this way? > > Thank you for your help, > > Daniel > > On Fri, Jan 15, 2016 at 5:30 PM Ted Yu <yuzhih...@gmail.com> wrote: > >> My knowledge of XSEDE is limited - I visited the website. >> >> If there i

Re: spark job server

2016-01-16 Thread Ted Yu
Which distro are you using ? From the error message, compute-classpath.sh was not found. I searched Spark 1.6 built for hadoop 2.6 but didn't find either compute-classpath.sh or server_start.sh Cheers On Sat, Jan 16, 2016 at 5:33 AM, Madabhattula Rajesh Kumar <mrajaf...@gmail.com> wrote: >

Re: spark source Intellij

2016-01-15 Thread Ted Yu
See: http://search-hadoop.com/m/q3RTtZbuxxp9p6N1=Re+Best+IDE+Configuration > On Jan 15, 2016, at 2:19 AM, Sanjeev Verma wrote: > > I want to configure spark source code into Intellij IDEA Is there any > document available / known steps which can guide me to

Re: Spark App -Yarn-Cluster-Mode ===> Hadoop_conf_**.zip file.

2016-01-15 Thread Ted Yu
bq. check application tracking page:http://slave1:8088/proxy/application_1452763526769_0011/ Then , ... Have you done the above to see what error was in each attempt ? Which Spark / hadoop release are you using ? Thanks On Fri, Jan

Re: Serialization stack error

2016-01-15 Thread Ted Yu
Here is signature for Get: public class Get extends Query implements Row, Comparable { It is not Serializable. FYI On Fri, Jan 15, 2016 at 6:37 AM, beeshma r wrote: > HI i am trying to get data from Solr server . > > This is my code > > /*input is JavaRDD li >
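
Because the closure is serialized from the driver, any Get captured in it must be Serializable. A common workaround (a sketch against the standard HBase client API; connection/table setup omitted) is to construct the Get inside the task:

  import org.apache.hadoop.hbase.client.Get
  import org.apache.hadoop.hbase.util.Bytes

  // ids is an RDD[String] of row keys; Gets never cross the driver/executor boundary
  ids.foreachPartition { it =>
    // open the HBase table here, once per partition
    it.foreach { id =>
      val get = new Get(Bytes.toBytes(id))
      // table.get(get) ...
    }
  }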

Re: Spark App -Yarn-Cluster-Mode ===> Hadoop_conf_**.zip file.

2016-01-15 Thread Ted Yu
a:105) > > at > time.series.wo.agg.InputStreamSpark.main(InputStreamSpark.java:38) > > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native > Method) > > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > >

Re: Serialization stack error

2016-01-15 Thread Ted Yu
Exception > { > Get get = null; > > for (SolrDocument doc : si) { > get = new Get(Bytes.toBytes(((String) > doc.getFieldValue("id"; > > } > > return get; > >

Re: Sending large objects to specific RDDs

2016-01-15 Thread Ted Yu
.@gmail.com> wrote: > >> Thank you Ted! That sounds like it would probably be the most efficient >> (with the least overhead) way of handling this situation. >> >> On Wed, Jan 13, 2016 at 11:36 AM Ted Yu <yuzhih...@gmail.com> wrote: >> >>> Another a

Re: Compiling only MLlib?

2016-01-15 Thread Ted Yu
Looks like you didn't have zinc running. Take a look at install_zinc() in build/mvn, around line 83. You can use build/mvn instead of running mvn directly. I normally use the following command line: build/mvn clean -Phive -Phive-thriftserver -Pyarn -Phadoop-2.4 -Dhadoop.version=2.7.0 package

Re: Executor initialize before all resources are ready

2016-01-15 Thread Ted Yu
Which Spark release are you using ? Thanks On Fri, Jan 15, 2016 at 7:08 PM, Byron Wang wrote: > Hi, I am building metrics system for Spark Streaming job, in the system, > the > metrics are collected in each executor, so a metrics source (a class used > to > collect metrics)

Re: Spark and HBase RDD join/get

2016-01-14 Thread Ted Yu
For #1, yes it is possible. You can find some examples in the hbase-spark module of HBase, where HBase as a DataSource is provided. e.g. https://github.com/apache/hbase/blob/master/hbase-spark/src/main/scala/org/apache/hadoop/hbase/spark/HBaseRDDFunctions.scala Cheers On Thu, Jan 14, 2016 at 5:04 AM,

Re: code hangs in local master mode

2016-01-14 Thread Ted Yu
Can you capture one or two stack traces of the local master process and pastebin them ? Thanks On Thu, Jan 14, 2016 at 6:01 AM, Kai Wei wrote: > Hi list, > > I ran into an issue which I think could be a bug. > > I have a Hive table stored as parquet files. Let's say it's

Re: Concurrent Read of Accumulator's Value

2016-01-13 Thread Ted Yu
One option is to use a NoSQL data store, such as hbase, for the two actions to exchange status information. Write to data store in action 1 and read from action 2. Cheers On Wed, Jan 13, 2016 at 2:20 AM, Kira wrote: > Hi, > > So i have an action on one RDD that is

Re: SQL UDF problem (with re to types)

2016-01-13 Thread Ted Yu
Looks like BigDecimal was passed to your call() method. Can you modify your udf to see if using BigDecimal works ? Cheers On Wed, Jan 13, 2016 at 11:58 AM, raghukiran wrote: > While registering and using SQL UDFs, I am running into the following > problem: > > UDF
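
For context, DecimalType columns arrive in a UDF as java.math.BigDecimal, which is why a parameter declared as Double fails. A minimal sketch, assuming a SQLContext named sqlContext (function and column names are made up):

  sqlContext.udf.register("toDouble", (v: java.math.BigDecimal) => v.doubleValue)
  // usage: sqlContext.sql("SELECT toDouble(price) FROM items")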

Re: Sending large objects to specific RDDs

2016-01-13 Thread Ted Yu
Another approach is to store the objects in NoSQL store such as HBase. Looking up object should be very fast. Cheers On Wed, Jan 13, 2016 at 11:29 AM, Daniel Imberman wrote: > I'm looking for a way to send structures to pre-determined partitions so > that > they

Re: SQL UDF problem (with re to types)

2016-01-13 Thread Ted Yu
Please take a look at sql/hive/src/test/java/org/apache/spark/sql/hive/aggregate/MyDoubleSum.java which shows a UserDefinedAggregateFunction that works on DoubleType column. sql/hive/src/test/java/org/apache/spark/sql/hive/JavaDataFrameSuite.java shows how it is registered. Cheers On Wed, Jan

Re: How to get the working directory in executor

2016-01-13 Thread Ted Yu
Can you place metrics.properties and datainsights-metrics-source-assembly-1.0.jar on hdfs ? Cheers On Wed, Jan 13, 2016 at 8:01 AM, Byron Wang wrote: > I am using the following command to submit Spark job, I hope to send jar > and > config files to each executor and load it

Re: How to get the working directory in executor

2016-01-13 Thread Ted Yu
In a bit more detail: You upload the files using 'hdfs dfs -copyFromLocal' command Then specify hdfs location of the files on the command line. Cheers On Wed, Jan 13, 2016 at 8:05 AM, Ted Yu <yuzhih...@gmail.com> wrote: > Can you place metrics.properties and > datainsights-me

Re: yarn-client: SparkSubmitDriverBootstrapper not found in yarn client mode (1.6.0)

2016-01-13 Thread Ted Yu
Can you show the complete stack trace for the error ? I searched 1.6.0 code base but didn't find the class SparkSubmitDriverBootstrapper Thanks On Wed, Jan 13, 2016 at 9:31 AM, Lin Zhao wrote: > My job runs fine in yarn cluster mode but I have reason to use client mode >

Re: failure to parallelize an RDD

2016-01-12 Thread Ted Yu
Which release of Spark are you using ? Can you turn on DEBUG logging to see if there is more clue ? Thanks On Tue, Jan 12, 2016 at 6:37 PM, AlexG wrote: > I transpose a matrix (colChunkOfA) stored as a 200-by-54843210 as an array > of > rows in Array[Array[Float]] format

Re: Windows driver cannot run job on Linux cluster

2016-01-11 Thread Ted Yu
Which release of Spark are you using ? Can you pastebin stack trace of executor(s) so that we can have some more clue ? Thanks On Mon, Jan 11, 2016 at 1:10 PM, Andrew Wooster wrote: > I have a very simple program that runs fine on my Linux server that runs > Spark

Re: partitioning RDD

2016-01-11 Thread Ted Yu
Hi, Please use proper subject when sending email to user@ In your example below, what do the values inside curly braces represent ? I assume not the keys since values for same key should go to the same partition. Cheers On Mon, Jan 11, 2016 at 10:51 AM, Daniel Imberman

Re: Best IDE Configuration

2016-01-10 Thread Ted Yu
For python, there is https://gist.github.com/bigaidream/40fe0f8267a80e7c9cf8 which was mentioned in http://search-hadoop.com/m/q3RTt2Eu941D9H9t1 FYI On Sat, Jan 9, 2016 at 11:24 AM, Ted Yu <yuzhih...@gmail.com> wrote: > Please take a look at: > https://cwiki.apache.org/confluence/d

Re: How to merge two large table and remove duplicates?

2016-01-09 Thread Ted Yu
s LZO. Is it LZO or LZ4? >> >> https://github.com/Cyan4973/lz4 >> >> Based on this benchmark, they differ quite a lot. >> >> >>> On Fri, Jan 8, 2016 at 9:55 PM, Ted Yu <yuzhih...@gmail.com> wrote: >>> gzip is relatively

Re: Best IDE Configuration

2016-01-09 Thread Ted Yu
Please take a look at: https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-IDESetup On Sat, Jan 9, 2016 at 11:16 AM, Jorge Machado wrote: > Hello everyone, > > > I'm just wondering how do you guys develop for spark. > > For example I

Re: Do we need to enabled Tungsten sort in Spark 1.6?

2016-01-08 Thread Ted Yu
ot;) FYI On Fri, Jan 8, 2016 at 12:59 PM, Umesh Kacha <umesh.ka...@gmail.com> wrote: > ok thanks so it will be enabled by default always if yes then in > documentation why default shuffle manager is mentioned as sort? > > On Sat, Jan 9, 2016 at

Re: Do we need to enabled Tungsten sort in Spark 1.6?

2016-01-08 Thread Ted Yu
From sql/core/src/main/scala/org/apache/spark/sql/execution/commands.scala: case Some((SQLConf.Deprecated.TUNGSTEN_ENABLED, Some(value))) => val runFunc = (sqlContext: SQLContext) => { logWarning( s"Property ${SQLConf.Deprecated.TUNGSTEN_ENABLED} is deprecated and "

Re: How to merge two large table and remove duplicates?

2016-01-08 Thread Ted Yu
Is your Parquet data source partitioned by date ? Can you dedup within partitions ? Cheers On Fri, Jan 8, 2016 at 2:10 PM, Gavin Yue wrote: > I tried on Three day's data. The total input is only 980GB, but the > shuffle write Data is about 6.2TB, then the job failed
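
If the Parquet layout is per day, one hedged sketch is to dedup each day's slice before unioning, so most duplicates are dropped before the large shuffle (paths and the key column are made up):

  val day1 = sqlContext.read.parquet("/events/2016-01-07").dropDuplicates(Seq("EventKey"))
  val day2 = sqlContext.read.parquet("/events/2016-01-08").dropDuplicates(Seq("EventKey"))
  val merged = day1.unionAll(day2).dropDuplicates(Seq("EventKey"))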

Re: How to merge two large table and remove duplicates?

2016-01-08 Thread Ted Yu
Every day's incoming Event data having duplicates among each >>> other. One same event could show up in Day1 and Day2 and probably Day3. >>> >>> I only want to keep single Event table and each day it come so many >>> duplicates. >>> >>> Is there

Re: How to merge two large table and remove duplicates?

2016-01-08 Thread Ted Yu
performance > between gzip and snappy? And why parquet is using gzip by default? > > Thanks. > > > On Fri, Jan 8, 2016 at 6:39 PM, Ted Yu <yuzhih...@gmail.com> wrote: > >> Cycling old bits: >> http://search-hadoop.com/m/q3RTtRuvrm1CGzBJ >> >> Gavin:

Re: How to merge two large table and remove duplicates?

2016-01-08 Thread Ted Yu
>>>> >>>>> hey Ted, >>>>> >>>>> Event table is like this: UserID, EventType, EventKey, TimeStamp, >>>>> MetaData. I just parse it from Json and save as Parquet, did not change >>>>> the partition. >>>>

Re: How to merge two large table and remove duplicates?

2016-01-08 Thread Ted Yu
the number of reducers for joins and >groupbys: Currently in Spark SQL, you need to control the degree of >parallelism post-shuffle using “SET >spark.sql.shuffle.partitions=[num_tasks];”. > > Thanks. > > Gavin > > > > > On Fri, Jan 8, 2016 at 6:25 PM, Ted Yu

Re: write new data to mysql

2016-01-08 Thread Ted Yu
Which Spark release are you using ? For case #2, was there any error / clue in the logs ? Cheers On Fri, Jan 8, 2016 at 7:36 AM, Yasemin Kaya wrote: > Hi, > > I want to write dataframe existing mysql table, but when i use >

Re: Kryo serializer Exception during serialization: java.io.IOException: java.lang.IllegalArgumentException:

2016-01-08 Thread Ted Yu
bq. try adding scala.collection.mutable.WrappedArray But the hint said registering scala.collection.mutable.WrappedArray$ofRef.class , right ? On Fri, Jan 8, 2016 at 8:52 AM, jiml wrote: > (point of post is to see if anyone has ideas about errors at end of post) > >
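
A sketch of registering exactly the class the Kryo hint names; since WrappedArray$ofRef has no direct Scala-level name, Class.forName is one way to reference it:

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .registerKryoClasses(Array(
      Class.forName("scala.collection.mutable.WrappedArray$ofRef")))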

Re: Window Functions importing issue in Spark 1.4.0

2016-01-07 Thread Ted Yu
Please take a look at the following for sample on how rowNumber is used: https://github.com/apache/spark/pull/9050 BTW 1.4.0 was an old release. Please consider upgrading. On Thu, Jan 7, 2016 at 3:04 AM, satish chandra j wrote: > HI All, > Currently using Spark 1.4.0

Re: spark ui security

2016-01-07 Thread Ted Yu
According to https://spark.apache.org/docs/latest/security.html#web-ui , web UI is covered. FYI On Thu, Jan 7, 2016 at 6:35 AM, Kostiantyn Kudriavtsev < kudryavtsev.konstan...@gmail.com> wrote: > hi community, > > do I understand correctly that spark.ui.filters property sets up filters > only

Re: spark ui security

2016-01-07 Thread Ted Yu
configurable per job, so I assume to >> protect WebUI the different place must be used, isn’t it? >> >> On Jan 7, 2016, at 10:28 AM, Ted Yu <yuzhih...@gmail.com> wrote: >> >> According to https://spark.apache.org/docs/latest/security.html#web-ui , >> web UI is cov

Re: problem building spark on centos

2016-01-06 Thread Ted Yu
w.r.t. the second error, have you read this ? http://www.captaindebug.com/2013/03/mavens-non-resolvable-parent-pom-problem.html#.Vo1fFGSrSuo On Wed, Jan 6, 2016 at 9:49 AM, Jade Liu <jade@nor1.com> wrote: > I’m using 3.3.9. Thanks! > > Jade > > From: Ted Yu <yuz

Re: How to insert df in HBASE

2016-01-06 Thread Ted Yu
Cycling prior discussion: http://search-hadoop.com/m/q3RTtX7POh17hqdj1 On Wed, Jan 6, 2016 at 3:07 AM, Sadaf wrote: > HI, > > I need to insert a Dataframe in to hbase using scala code. > Can anyone guide me how to achieve this? > > Any help would be much appreciated. >

Re: What should be the ideal value(unit) for spark.memory.offheap.size

2016-01-06 Thread Ted Yu
Try "git grep -i spark.memory.offheap.size"... > > On Wed, Jan 6, 2016 at 2:45 PM, Ted Yu <yuzhih...@gmail.com> wrote: > > Maybe I looked in the wrong files - I searched *.scala and *.java files > (in > > latest Spark 1.6.0 RC) for '.offheap.' but didn't fin

Re: What should be the ideal value(unit) for spark.memory.offheap.size

2016-01-06 Thread Ted Yu
Maybe I looked in the wrong files - I searched *.scala and *.java files (in latest Spark 1.6.0 RC) for '.offheap.' but didn't find the config. Can someone enlighten me ? Thanks On Wed, Jan 6, 2016 at 2:35 PM, Jakob Odersky wrote: > Check the configuration guide for a

Re: org.apache.spark.storage.BlockNotFoundException in Spark1.5.2+Tachyon0.7.1

2016-01-06 Thread Ted Yu
Have you seen this thread ? http://search-hadoop.com/m/q3RTtAiQta22XrCI On Wed, Jan 6, 2016 at 8:41 PM, Jia Zou wrote: > Dear all, > > I am using Spark1.5.2 and Tachyon0.7.1 to run KMeans with > inputRDD.persist(StorageLevel.OFF_HEAP()). > > I've set tired storage for

Re: Spark Token Expired Exception

2016-01-06 Thread Ted Yu
Which Spark / hadoop release are you using ? Thanks On Wed, Jan 6, 2016 at 12:16 PM, Nikhil Gs wrote: > Hello Team, > > > Thank you for your time in advance. > > > Below are the log lines of my spark job which is used for consuming the > messages from Kafka Instance

Re: Is there a way to use parallelize function in sparkR spark version (1.6.0)

2016-01-05 Thread Ted Yu
Please take a look at the following for examples: R/pkg/R/RDD.R R/pkg/R/pairRDD.R Cheers On Tue, Jan 5, 2016 at 2:36 AM, Chandan Verma wrote: > >

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Ted Yu
+1 > On Jan 5, 2016, at 10:49 AM, Davies Liu wrote: > > +1 > > On Tue, Jan 5, 2016 at 5:45 AM, Nicholas Chammas > wrote: >> +1 >> >> Red Hat supports Python 2.6 on REHL 5 until 2020, but otherwise yes, Python >> 2.6 is ancient history and

Re: aggregateByKey vs combineByKey

2016-01-05 Thread Ted Yu
Looking at PairRDDFunctions.scala : def aggregateByKey[U: ClassTag](zeroValue: U, partitioner: Partitioner)(seqOp: (U, V) => U, combOp: (U, U) => U): RDD[(K, U)] = self.withScope { ... combineByKeyWithClassTag[U]((v: V) => cleanedSeqOp(createZero(), v), cleanedSeqOp, combOp,
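
To make the relationship concrete, a sketch computing a per-key (sum, count) with both operators; they produce the same result:

  val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))

  // aggregateByKey: a zero value plus two merge functions
  val agg = pairs.aggregateByKey((0, 0))(
    (acc, v) => (acc._1 + v, acc._2 + 1),
    (x, y) => (x._1 + y._1, x._2 + y._2))

  // combineByKey: the zero value is replaced by a createCombiner function
  val comb = pairs.combineByKey(
    (v: Int) => (v, 1),
    (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),
    (x: (Int, Int), y: (Int, Int)) => (x._1 + y._1, x._2 + y._2))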

Re: Negative Number of Active Tasks in Spark UI

2016-01-05 Thread Ted Yu
Which version of Spark do you use ? This might be related: https://issues.apache.org/jira/browse/SPARK-8560 Do you use dynamic allocation ? Cheers > On Jan 4, 2016, at 10:05 PM, Prasad Ravilla wrote: > > I am seeing negative active tasks in the Spark UI. > > Is anyone

Re: problem building spark on centos

2016-01-05 Thread Ted Yu
Which version of maven are you using ? It should be 3.3.3+ On Tue, Jan 5, 2016 at 4:54 PM, Jade Liu wrote: > Hi, All: > > I’m trying to build spark 1.5.2 from source using maven with the following > command: > > ./make-distribution.sh --tgz -Phadoop-2.6 -Pyarn

Re: How to concat few rows into a new column in dataframe

2016-01-05 Thread Ted Yu
Something like the following: val zeroValue = collection.mutable.Set[String]() val aggredated = data.aggregateByKey (zeroValue)((set, v) => set += v, (setOne, setTwo) => setOne ++= setTwo) On Tue, Jan 5, 2016 at 2:46 PM, Gavin Yue wrote: > Hey, > > For example, a table
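
Applied to a toy pair RDD, the snippet above collects all values per key into a set (key and value names are made up):

  val data = sc.parallelize(Seq(("u1", "click"), ("u1", "view"), ("u2", "click")))
  val zeroValue = collection.mutable.Set[String]()
  val aggregated = data.aggregateByKey(zeroValue)(
    (set, v) => set += v,
    (s1, s2) => s1 ++= s2)
  // aggregated.collect(): ("u1", Set(click, view)), ("u2", Set(click))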

Re: HiveThriftServer fails to quote strings

2016-01-04 Thread Ted Yu
bq. without any of the escape characters: Did you intend to show some sample ? As far as I can tell, there was no sample or image in previous email. FYI On Mon, Jan 4, 2016 at 11:36 AM, sclyon wrote: > Hello all, > > I've got a nested JSON structure in parquet format

Re: Is Spark 1.6 released?

2016-01-04 Thread Ted Yu
Please refer to the following: https://spark.apache.org/docs/latest/sql-programming-guide.html#datasets https://spark.apache.org/docs/latest/sql-programming-guide.html#creating-datasets https://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets Cheers On Mon, Jan 4, 2016 at

Re: Monitor Job on Yarn

2016-01-04 Thread Ted Yu
Please look at history server related content under: https://spark.apache.org/docs/latest/running-on-yarn.html Note spark.yarn.historyServer.address FYI On Mon, Jan 4, 2016 at 2:49 PM, Daniel Valdivia wrote: > Hello everyone, happy new year, > > I submitted an app to

Re: groupByKey does not work?

2016-01-04 Thread Ted Yu
Can you give a bit more information ? Release of Spark you're using Minimal dataset that shows the problem Cheers On Mon, Jan 4, 2016 at 3:55 PM, Arun Luthra wrote: > I tried groupByKey and noticed that it did not group all values into the > same group. > > In my test

Re: Apparent bug in KryoSerializer

2015-12-31 Thread Ted Yu
For your second question, bq. Class is not registered: scala.Tuple3[] The above IllegalArgumentException names the class Kryo expects to be registered. In other words, the type of the components in the tuple is insignificant. BTW what Spark release are you using ? Cheers On Thu, Dec 31, 2015 at

Re: pass custom spark-conf

2015-12-31 Thread Ted Yu
Check out --conf option for spark-submit bq. to configure different hdfs-site.xml What config parameters do you plan to change in hdfs-site.xml ? If the parameter only affects hdfs NN / DN, passing hdfs-site.xml wouldn't take effect, right ? Cheers On Thu, Dec 31, 2015 at 10:48 AM, KOSTIANTYN

Re: difference between ++ and Union of a RDD

2015-12-29 Thread Ted Yu
From RDD.scala: def ++(other: RDD[T]): RDD[T] = withScope { this.union(other) They should be the same. On Tue, Dec 29, 2015 at 10:41 AM, email2...@gmail.com wrote: > Hello All - > > tried couple of operations by using ++ and union on RDD's but realized that > the
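
A quick check of the equivalence; note that neither operation removes duplicates:

  val a = sc.parallelize(Seq(1, 2))
  val b = sc.parallelize(Seq(2, 3))
  (a ++ b).collect().sorted   // Array(1, 2, 2, 3)
  a.union(b).collect().sorted // Array(1, 2, 2, 3)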

Re: Task hang problem

2015-12-29 Thread Ted Yu
Can you log onto 10.65.143.174 , find task 31 and take a stack trace ? Thanks On Tue, Dec 29, 2015 at 9:19 AM, Darren Govoni wrote: > Hi, > I've had this nagging problem where a task will hang and the entire job > hangs. Using pyspark. Spark 1.5.1 > > The job output

Re: Executor deregistered after 2mins (mesos, 1.6.0-rc4)

2015-12-29 Thread Ted Yu
Have you searched log for 'f02cb67a-3519-4655-b23a-edc0dd082bf1-S1/4' ? In the snippet you posted, I don't see registration of this Executor. Cheers On Tue, Dec 29, 2015 at 12:43 PM, Adrian Bridgett wrote: > We're seeing an "Executor is not registered" error on a Spark

Re: difference between ++ and Union of a RDD

2015-12-29 Thread Ted Yu
; Gokula Krishnan* (Gokul)* > > On Tue, Dec 29, 2015 at 1:43 PM, Ted Yu <yuzhih...@gmail.com> wrote: > >> From RDD.scala : >> >> def ++(other: RDD[T]): RDD[T] = withScope { >> this.union(other) >> >> They should be the same. >> >&g

Re: Can't submit job to stand alone cluster

2015-12-28 Thread Ted Yu
Have you verified that the following file does exist ? /home/hadoop/git/scalaspark/./target/scala-2.10/cluster-incidents_2.10-1.0.jar Thanks On Mon, Dec 28, 2015 at 3:16 PM, Daniel Valdivia wrote: > Hi, > > I'm trying to submit a job to a small spark cluster running

Re: Inconsistent behavior of randomSplit in YARN mode

2015-12-28 Thread Ted Yu
bq. the train and test have overlap in the numbers being outputted Can the call to repartition explain the above ? Which release of Spark are you using ? Thanks On Sun, Dec 27, 2015 at 9:56 PM, Gaurav Kumar wrote: > Hi, > > I noticed an inconsistent behavior when

Re: Pattern type is incompatible with expected type

2015-12-27 Thread Ted Yu
Have you tried declaring RDD[ChildTypeOne] and writing separate functions for each sub-type ? Cheers On Sun, Dec 27, 2015 at 10:08 AM, pkhamutou wrote: > Hello, > > I have a such situation: > > abstract class SuperType {...} > case class ChildTypeOne(x: String) extends
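
A sketch of that suggestion, reusing the types from the question (the second subtype is hypothetical); each function is statically typed to one subtype, so no pattern match on the RDD's element type is needed:

  import org.apache.spark.rdd.RDD

  abstract class SuperType
  case class ChildTypeOne(x: String) extends SuperType
  case class ChildTypeTwo(y: Int) extends SuperType

  def processOne(rdd: RDD[ChildTypeOne]): Unit = rdd.foreach(c => println(c.x))
  def processTwo(rdd: RDD[ChildTypeTwo]): Unit = rdd.foreach(c => println(c.y))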

Re: partitioning json data in spark

2015-12-27 Thread Ted Yu
Is upgrading to 1.5.x a possibility for you ? Cheers On Sun, Dec 27, 2015 at 9:28 AM, Նարեկ Գալստեան wrote: > > http://spark.apache.org/docs/1.4.1/api/scala/index.html#org.apache.spark.sql.DataFrameWriter > I did try but it all was in vain. > It is also explicitly

Re: DataFrame Save is writing just column names while saving

2015-12-27 Thread Ted Yu
Can you confirm that file1df("COLUMN2") and file2df("COLUMN10") appeared in the output of joineddf.collect.foreach(println) ? Thanks On Sun, Dec 27, 2015 at 6:32 PM, Divya Gehlot wrote: > Hi, > I am trying to join two dataframes and able to display the results in the

Re: ERROR server.TThreadPoolServer: Error occurred during processing of message

2015-12-26 Thread Ted Yu
Have you seen this ? http://stackoverflow.com/questions/30705576/python-cannot-connect-hiveserver2 On Sat, Dec 26, 2015 at 9:09 PM, Dasun Hegoda wrote: > I'm running apache-hive-1.2.1-bin and spark-1.5.1-bin-hadoop2.6. spark as > the hive engine. When I try to connect

Re: error while defining custom schema in Spark 1.5.0

2015-12-25 Thread Ted Yu
The error was due to blank field being defined twice. On Tue, Dec 22, 2015 at 12:03 AM, Divya Gehlot wrote: > Hi, > I am new bee to Apache Spark ,using CDH 5.5 Quick start VM.having spark > 1.5.0. > I working on custom schema and getting error > > import

Re: Stuck with DataFrame df.select("select * from table");

2015-12-25 Thread Ted Yu
DataFrame uses different syntax from SQL query. I searched unit tests but didn't find any in the form of df.select("select ...") Looks like you should use sqlContext as other people suggested. On Fri, Dec 25, 2015 at 8:29 AM, Eugene Morozov wrote: > Thanks for the

Re: error in spark cassandra connector

2015-12-24 Thread Ted Yu
Mind providing a bit more detail ? Release of Spark version of Cassandra connector How job was submitted complete stack trace Thanks On Thu, Dec 24, 2015 at 2:06 AM, Vijay Kandiboyina wrote: > java.lang.NoClassDefFoundError: >

Re: How to contribute by picking up starter bugs

2015-12-24 Thread Ted Yu
You can send out pull request for the JIRA you're interested in. Start the title of pull request with: [SPARK-XYZ] ... where XYZ is the JIRA number. The pull request would be posted on the JIRA. After pull request is reviewed, tested by QA and merged, the committer would assign your name to the

Re: rdd split into new rdd

2015-12-23 Thread Ted Yu
bq. {a=1, b=1, c=2, d=2} Can you elaborate your criteria a bit more ? The above seems to be a Set, not a Map. Cheers On Wed, Dec 23, 2015 at 7:11 AM, Yasemin Kaya wrote: > Hi, > > I have data > *JavaPairRDD> *format. In example: > > *(1610,

Re: error creating custom schema

2015-12-23 Thread Ted Yu
Looks like a comma was missing after "C1" Cheers > On Dec 23, 2015, at 1:47 AM, Divya Gehlot wrote: > > Hi, > I am trying to create custom schema but its throwing below error > > >> scala> import org.apache.spark.sql.hive.HiveContext >> import

Re: Classification model method not found

2015-12-22 Thread Ted Yu
Looks like you should define ctor for ExtendedLR which accepts String (the uid). Cheers On Tue, Dec 22, 2015 at 1:04 PM, njoshi wrote: > Hi, > > I have a custom extended LogisticRegression model which I want to test > against a parameter grid search. I am running as
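
A sketch of that fix, assuming ExtendedLR extends ml's LogisticRegression: grid search copies the estimator via copy(), which instantiates the concrete class through its String (uid) constructor:

  import org.apache.spark.ml.classification.LogisticRegression
  import org.apache.spark.ml.util.Identifiable

  class ExtendedLR(override val uid: String) extends LogisticRegression(uid) {
    // the no-arg constructor delegates to the uid constructor
    def this() = this(Identifiable.randomUID("extendedLR"))
  }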

Re: Can SqlContext be used inside mapPartitions

2015-12-22 Thread Ted Yu
bq. be able to lookup from inside MapPartitions based on a key Please describe your use case in bit more detail. One possibility is to use NoSQL database such as HBase. There're several choices for Spark HBase connector. Cheers On Tue, Dec 22, 2015 at 4:51 PM, Zhan Zhang

Re: Which Hive version should be used with Spark 1.5.2?

2015-12-22 Thread Ted Yu
Please see SPARK-8064 On Tue, Dec 22, 2015 at 6:17 PM, Arthur Chan wrote: > Hi, > > I plan to upgrade from 1.4.1 (+ Hive 1.1.0) to 1.5.2, is there any > upgrade document available about the upgrade especially which Hive version > should be upgraded too? > > Regards >

Re: Stand Alone Cluster - Strange issue

2015-12-22 Thread Ted Yu
This should be related: https://issues.apache.org/jira/browse/SPARK-4170 On Tue, Dec 22, 2015 at 9:34 AM, Madabhattula Rajesh Kumar < mrajaf...@gmail.com> wrote: > Hi, > > I have a standalone cluster. One Master + One Slave. I'm getting below > "NULL POINTER" exception. > > Could you please help

Re: Regarding spark in memory

2015-12-22 Thread Ted Yu
If I understand your question correctly, the answer is yes. You can retrieve rows of the rdd which are distributed across the nodes. > On Dec 22, 2015, at 7:34 PM, Gaurav Agarwal wrote: > > If I have 3 more cluster and spark is running there .if I load the records >

Re: driver OOM due to io.netty.buffer items not getting finalized

2015-12-22 Thread Ted Yu
This might be related but the jmap output there looks different: http://stackoverflow.com/questions/32537965/huge-number-of-io-netty-buffer-poolthreadcachememoryregioncacheentry-instances On Tue, Dec 22, 2015 at 2:59 AM, Antony Mayi wrote: > I have streaming app

Re: driver OOM due to io.netty.buffer items not getting finalized

2015-12-22 Thread Ted Yu
ess options and so far it > seems this has dramatically improved, the finalization looks to be keeping > up and the heap is stable. > > Any input is still welcome! > > > On Tuesday, 22 December 2015, 12:17, Ted Yu <yuzhih...@gmail.com> wrote: > > > > This might be

Re: Stand Alone Cluster - Strange issue

2015-12-22 Thread Ted Yu
Which Spark release are you using ? Cheers On Tue, Dec 22, 2015 at 9:34 AM, Madabhattula Rajesh Kumar < mrajaf...@gmail.com> wrote: > Hi, > > I have a standalone cluster. One Master + One Slave. I'm getting below > "NULL POINTER" exception. > > Could you please help me on this issue. > > >

Re: rdd only with one partition

2015-12-21 Thread Ted Yu
Have you tried the following method ? * Note: With shuffle = true, you can actually coalesce to a larger number * of partitions. This is useful if you have a small number of partitions, * say 100, potentially with a few partitions being abnormally large. Calling * coalesce(1000,
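
That note is quoted from the scaladoc of RDD.coalesce. A short sketch of growing the partition count, which only works with shuffle = true (the path is made up):

  val rdd = sc.textFile("hdfs:///tmp/input", 1) // may end up with very few partitions
  val wider = rdd.coalesce(100, shuffle = true) // shuffle = true allows increasing the count
  // rdd.repartition(100) is shorthand for the same call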

Re: Writing output fails when spark.unsafe.offHeap is enabled

2015-12-21 Thread Ted Yu
w.r.t. at org.apache.spark.sql.execution.UnsafeExternalRowSorter$RowComparator.compare(UnsafeExternalRowSorter.java:202) I looked at UnsafeExternalRowSorter.java in 1.6.0 which only has 192 lines of code. Can you run with latest RC of 1.6.0 and paste the stack trace ? Thanks On Thu, Dec 17,

Re: rdd only with one partition

2015-12-21 Thread Ted Yu
also lose spark > parallelism benefit . > > Best Wishes! > Zhiliang > > > > > On Monday, December 21, 2015 11:17 PM, Ted Yu <yuzhih...@gmail.com> wrote: > > > Have you tried the following method ? > >* Note: With shuffle = true, you can actually

Re: How to convert and RDD to DF?

2015-12-20 Thread Ted Yu
See the comment for createDataFrame(rowRDD: RDD[Row], schema: StructType) method: * Creates a [[DataFrame]] from an [[RDD]] containing [[Row]]s using the given schema. * It is important to make sure that the structure of every [[Row]] of the provided RDD matches * the provided schema.
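
A minimal sketch of that contract, where each Row lines up with the schema (assumes sc and sqlContext already exist):

  import org.apache.spark.sql.Row
  import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

  val rowRDD = sc.parallelize(Seq(Row("alice", 30), Row("bob", 25)))
  val schema = StructType(Seq(
    StructField("name", StringType, nullable = true),
    StructField("age", IntegerType, nullable = true)))
  val df = sqlContext.createDataFrame(rowRDD, schema)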

Re: Getting an error in insertion to mysql through sparkcontext in java..

2015-12-20 Thread Ted Yu
Was there stack trace following the error ? Which Spark release are you using ? Cheers > On Dec 19, 2015, at 10:43 PM, Sree Eedupuganti wrote: > > i had 9 rows in my Mysql table > > > options.put("dbtable", "(select * from employee"); >options.put("lowerBound",

Re: spark 1.5.2 memory leak? reading JSON

2015-12-19 Thread Ted Yu
The 'Failed to parse a value' was the cause for execution failure. Can you disclose the structure of your json file ? Maybe try latest 1.6.0 RC to see if the problem goes away. Thanks On Sat, Dec 19, 2015 at 1:55 PM, Eran Witkon wrote: > Hi, > I tried the following code

Re: how to fetch all of data from hbase table in spark java

2015-12-19 Thread Ted Yu
Please take a look at: examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala There're various hbase connectors (search for 'apache spark hbase connector') In hbase 2.0, there would be hbase-spark module which provides hbase connector. FYI On Fri, Dec 18, 2015 at 11:56 PM, Sateesh
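
The HBaseTest example boils down to newAPIHadoopRDD with TableInputFormat; a trimmed sketch (the table name is made up):

  import org.apache.hadoop.hbase.HBaseConfiguration
  import org.apache.hadoop.hbase.client.Result
  import org.apache.hadoop.hbase.io.ImmutableBytesWritable
  import org.apache.hadoop.hbase.mapreduce.TableInputFormat

  val hbaseConf = HBaseConfiguration.create()
  hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table")
  val hBaseRDD = sc.newAPIHadoopRDD(hbaseConf, classOf[TableInputFormat],
    classOf[ImmutableBytesWritable], classOf[Result])
  println(hBaseRDD.count())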

Re: Does calling sqlContext.cacheTable("oldTableName") remove the cached contents of the oldTable

2015-12-18 Thread Ted Yu
CacheManager#cacheQuery() is called where: * Caches the data produced by the logical representation of the given [[Queryable]]. ... val planToCache = query.queryExecution.analyzed if (lookupCachedData(planToCache).nonEmpty) { Is the schema for dfNew different from that of dfOld ?
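
A sketch of the pattern that check implies: when re-registering a different DataFrame under the same name, uncache the old entry explicitly first (dfOld/dfNew are illustrative):

  dfOld.registerTempTable("events")
  sqlContext.cacheTable("events")
  // ... later, replacing the table ...
  sqlContext.uncacheTable("events") // drop the old cached plan explicitly
  dfNew.registerTempTable("events")
  sqlContext.cacheTable("events")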

Re: HiveContext Self join not reading from cache

2015-12-18 Thread Ted Yu
>>>> table1.registerTempTable("table1") >>>> table1.cache() >>>> table1.count() >>>> >>>> and if I do a self join on table1 things are quite fine >>>> >>>> But in case we have something like this: >>>> table1 =

Re: Spark with log4j

2015-12-18 Thread Ted Yu
See this thread: http://search-hadoop.com/m/q3RTtEor1vYWbsW which mentioned: SPARK-11105 Disitribute the log4j.properties files from the client to the executors FYI On Fri, Dec 18, 2015 at 7:23 AM, Kalpesh Jadhav < kalpesh.jad...@citiustech.com> wrote: > Hi all, > > > > I am new to spark, I am
