Re: Spark SQL: org.apache.spark.sql.AnalysisException: cannot resolve "some columns" given input columns.

2016-06-07 Thread Ted Yu
Please see: [SPARK-13953][SQL] Specifying the field name for corrupted record via option at JSON datasource FYI On Tue, Jun 7, 2016 at 10:18 AM, Jerry Wong wrote: > Hi, > > Two JSON files but one of them miss some columns, like > > {"firstName": "Jack", "lastName":
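
For anyone hitting the same error, one workaround is to read all of the JSON files with an explicit schema, so that records missing a field come back as null instead of breaking later column resolution. A minimal sketch against the Spark 1.x API, using the field names from the question (the input path is hypothetical):

    import org.apache.spark.sql.types._

    // Declare every expected column up front; files that lack a field
    // simply yield null for it instead of dropping the column.
    val schema = StructType(Seq(
      StructField("firstName", StringType, nullable = true),
      StructField("lastName", StringType, nullable = true)))
    val people = sqlContext.read.schema(schema).json("/data/people/*.json")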

Re: Integrating spark source in an eclipse project?

2016-06-07 Thread Ted Yu
Please see: https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-IDESetup Use proper branch. FYI On Tue, Jun 7, 2016 at 9:04 AM, Cesar Flores wrote: > > I created a spark application in Eclipse by including the >

Re: Spark_Usecase

2016-06-07 Thread Ted Yu
bq. load the data from edge node to hdfs Does the loading involve accessing sqlserver ? Please take a look at https://spark.apache.org/docs/latest/sql-programming-guide.html On Tue, Jun 7, 2016 at 7:19 AM, Marco Mistroni wrote: > Hi > how about > > 1. have a process that
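
If the edge-node data does live in SQL Server, the JDBC data source covered in that guide can read it straight into a DataFrame before landing it on HDFS. A sketch with made-up connection details:

    val events = sqlContext.read.format("jdbc")
      .option("url", "jdbc:sqlserver://dbhost:1433;databaseName=sales") // hypothetical host/database
      .option("dbtable", "dbo.events")
      .option("user", "etl")
      .option("password", "secret")
      .load()
    events.write.parquet("hdfs:///landing/events") // then persist to HDFS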

Re: Apache Spark security.NosuchAlgorithm exception on changing from java 7 to java 8

2016-06-06 Thread Ted Yu
ems to occur at the time of starting Jetty HTTPServer. > > Can you please point me to resources that help me understand how security > is managed in Spark and how changing from java 7 to 8 can mess up these > configurations? > > > Thank you! > > On Mon, Jun 6, 2016 at

Re: Apache Spark security.NosuchAlgorithm exception on changing from java 7 to java 8

2016-06-06 Thread Ted Yu
Have you seen this ? http://stackoverflow.com/questions/22423063/java-exception-on-sslsocket-creation On Mon, Jun 6, 2016 at 12:31 PM, verylucky Man wrote: > Hi, > > I have a cluster (Hortonworks supported system) running Apache spark on > 1.5.2 on Java 7, installed by

Re: groupByKey returns an emptyRDD

2016-06-06 Thread Ted Yu
Can you give us a bit more information ? how you packaged the code into jar command you used for execution version of Spark related log snippet Thanks On Mon, Jun 6, 2016 at 10:43 AM, Daniel Haviv < daniel.ha...@veracity-group.com> wrote: > Hi, > I'm wrapped the following code into a jar: > >

Re: Basic question on using one's own classes in the Scala app

2016-06-05 Thread Ted Yu
; > and what do I need to add to the top of scala app like > > import java.io.File > import org.apache.log4j.Logger > import org.apache.log4j.Level > import ? > > Thanks > > > > > > On Sunday, 5 June 2016, 15:21, Ted Yu <yuzhih...@gmail.com> wrote: >

Re: Basic question on using one's own classes in the Scala app

2016-06-05 Thread Ted Yu
At compilation time, you need to declare the dependence on getCheckpointDirectory. At runtime, you can use '--jars utilities-assembly-0.1-SNAPSHOT.jar' to pass the jar. Cheers On Sun, Jun 5, 2016 at 3:06 AM, Ashok Kumar wrote: > Hi all, > > Appreciate any advice
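
Concretely, those are two separate steps: the build declares the dependency so the compiler can resolve getCheckpointDirectory, and spark-submit ships the jar to the executors. A sketch assuming an sbt build with illustrative coordinates:

    // build.sbt -- compile-time dependency (coordinates are made up)
    libraryDependencies += "com.example" %% "utilities" % "0.1-SNAPSHOT"

    // run time -- ship the same code with the application:
    //   spark-submit --class example.Main --jars utilities-assembly-0.1-SNAPSHOT.jar app.jar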

Re: Scheduler Delay Time

2016-06-03 Thread Ted Yu
Mind using a different site for your images ? I clicked on each of the 3 links but none of them shows up. FYI On Fri, Jun 3, 2016 at 9:36 AM, alvarobrandon wrote: > Hello: > > I'm doing some instrumentation in Spark and I've realised that some of my > tasks take

Re: np.unique and collect

2016-06-03 Thread Ted Yu
Where is np defined ? Thanks On Fri, Jun 3, 2016 at 6:07 AM, pseudo oduesp wrote: > Hi , > why np.unique return list instead of list in this function ? > > def unique_item_df(df,list_var): > > l = df.select(list_var).distinct().collect() > return np.unique(l) >

Re: Stream reading from database using spark streaming

2016-06-02 Thread Ted Yu
http://www.sparkexpert.com/2015/03/28/loading-database-data-into-spark-using-data-sources-api/ https://spark.apache.org/docs/1.6.1/api/scala/index.html#org.apache.spark.rdd.JdbcRDD FYI On Thu, Jun 2, 2016 at 6:26 AM, Zakaria Hili wrote: > I want to use spark streaming to
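
For reference, the JdbcRDD linked above splits a bounded query across partitions using two required ? placeholders. A minimal sketch (driver URL, table, and bounds are made up):

    import java.sql.{DriverManager, ResultSet}
    import org.apache.spark.rdd.JdbcRDD

    val rows = new JdbcRDD(
      sc,
      () => DriverManager.getConnection("jdbc:mysql://dbhost/db", "user", "pass"),
      "SELECT id, payload FROM events WHERE ? <= id AND id <= ?", // both ? marks are mandatory
      1L,      // lower bound of id
      100000L, // upper bound of id
      4,       // number of partitions
      (r: ResultSet) => (r.getLong(1), r.getString(2)))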

Re: Container preempted by scheduler - Spark job error

2016-06-02 Thread Ted Yu
ched the error file to this > mail, please have a look at it. > > Thanks > > On Thu, Jun 2, 2016 at 11:51 AM, Ted Yu <yuzhih...@gmail.com> wrote: > >> Can you show the error in a bit more detail ? >> >> Which release of Hadoop / Spark are you using ? >>

Re: Container preempted by scheduler - Spark job error

2016-06-02 Thread Ted Yu
Can you show the error in a bit more detail ? Which release of Hadoop / Spark are you using ? Is CapacityScheduler being used ? Thanks On Thu, Jun 2, 2016 at 1:32 AM, Prabeesh K. wrote: > Hi I am using the below command to run a spark job and I get an error like >

Re: Switching broadcast mechanism from torrent

2016-06-01 Thread Ted Yu
I found spark.broadcast.blockSize but no parameter to switch broadcast method. Can you describe the issues with torrent broadcast in more detail ? Which version of Spark are you using ? Thanks On Wed, Jun 1, 2016 at 7:48 AM, Daniel Haviv < daniel.ha...@veracity-group.com> wrote: > Hi, > Our

Re: Map tuple to case class in Dataset

2016-05-31 Thread Ted Yu
Using spark-shell of 1.6.1:

    scala> case class Test(a: Int)
    defined class Test

    scala> Seq(1,2).toDS.map(t => Test(t)).show
    +---+
    |  a|
    +---+
    |  1|
    |  2|
    +---+

FYI On Tue, May 31, 2016 at 7:35 PM, Tim Gautier wrote: > 1.6.1 The exception is a null pointer exception.

Re: Splitting RDD to exact number of partitions

2016-05-31 Thread Ted Yu
8, 768, 768, 768, 768, 768, 768, 768, 768, 768, 768, 768, >> 768, 768, 768, 768, 768, 768, 768, 768, 768, 768, 768, 768, 768, 768, 768, >> 768, 768, 768, 768, 768, 768, 768, 768, 768, 828, 896, 896, 896, 896, 896, >> 896, 896, 896, 896, 896, 896, 896, 850, 786, 768, 768, 768,

Re: Splitting RDD to exact number of partitions

2016-05-31 Thread Ted Yu
Value for shuffle is false by default. Have you tried setting it to true ? Which Spark release are you using ? On Tue, May 31, 2016 at 6:13 AM, Maciej Sokołowski wrote: > Hello Spark users and developers. > > I read file and want to ensure that it has exact number of
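
The flag in question is the second argument of coalesce: with shuffle enabled, the data is redistributed evenly instead of only merging co-located partitions. A quick sketch:

    val rdd = sc.textFile("input.txt")           // partition count follows the input splits
    val exact = rdd.coalesce(8, shuffle = true)  // forces a shuffle, giving evenly sized partitions
    val same = rdd.repartition(8)                // shorthand for coalesce(8, shuffle = true)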

Re: Reply: G1 GC takes too much time

2016-05-29 Thread Ted Yu
nosticVMOptions > -XX:G1SummarizeConcMark > -XX:InitiatingHeapOccupancyPercent=35 > spark.executor.memory=4G > > ---------- > *From:* Ted Yu <yuzhih...@gmail.com> > *Sent:* May 30, 2016 9:47:05 > *To:* condor join > *Cc:* user@spark.apache.org > *Subject

Re: Bulk loading Serialized RDD into Hbase throws KryoException - IndexOutOfBoundsException

2016-05-29 Thread Ted Yu
//table.setWriteBufferSize(8388608) > > itr.grouped(100).foreach(table.put(_)) // << Exception happens at > this point > > table.close() > > } > > } > > > > I am using hbase 0.98.12 mapr di

Re: Accessing s3a files from Spark

2016-05-29 Thread Ted Yu
Have you seen this thread ? http://search-hadoop.com/m/q3RTthWU8o1MbFC2=Re+Forbidded+Error+Code+403 On Sun, May 29, 2016 at 2:55 PM, Mayuresh Kunjir wrote: > I'm running into permission issues while accessing data in S3 bucket > stored using s3a file system from a local

Re: Bulk loading Serialized RDD into Hbase throws KryoException - IndexOutOfBoundsException

2016-05-29 Thread Ted Yu
bq. at com.mycorpt.myprojjobs.spark.jobs.hbase.HbaseUtils$$ anonfun$writeRddToHBase2$1.apply(HbaseUtils.scala:80) Can you reveal related code from HbaseUtils.scala ? Which hbase version are you using ? Thanks On Sun, May 29, 2016 at 4:26 PM, Nirav Patel wrote: > Hi, >

Re: join function in a loop

2016-05-28 Thread Ted Yu
> >>> >>> >>> http://talebzadehmich.wordpress.com >>> >>> >>> >>> On 29 May 2016 at 00:26, heri wijayanto <heri0...@gmail.com> wrote: >>> >>>> I implement spark with join function for processing in around 250 >

Re: join function in a loop

2016-05-28 Thread Ted Yu
Can you let us know your case ? When the join failed, what was the error (consider pastebin) ? Which release of Spark are you using ? Thanks > On May 28, 2016, at 3:27 PM, heri wijayanto wrote: > > Hi everyone, > I perform join function in a loop, and it is failed. I

Re: local Vs Standalonecluster production deployment

2016-05-28 Thread Ted Yu
Sujeet: Please also see: https://spark.apache.org/docs/latest/spark-standalone.html On Sat, May 28, 2016 at 9:19 AM, Mich Talebzadeh wrote: > Hi Sujeet, > > if you have a single machine then it is Spark standalone mode. > > In Standalone cluster mode Spark allocates

Re: Undocumented left join constraint?

2016-05-27 Thread Ted Yu
Which release did you use ? I tried your example in the master branch:

    scala> val test2 = Seq(Test(2), Test(3), Test(4)).toDS
    test2: org.apache.spark.sql.Dataset[Test] = [id: int]

    scala> test1.as("t1").joinWith(test2.as("t2"), $"t1.id" === $"t2.id", "left_outer").show
    +---+--+
    | _1|_2|

Re: I'm pretty sure this is a Dataset bug

2016-05-27 Thread Ted Yu
have >>>> anything to do with the data itself. It has to do with how the Dataset was >>>> created. Both datasets have exactly the same data in them, but the one >>>> created from a sql query fails where the one created from a Seq works. The >>>>

Re: I'm pretty sure this is a Dataset bug

2016-05-27 Thread Ted Yu
Which release of Spark are you using ? Is it possible to come up with fake data that shows what you described ? Thanks On Fri, May 27, 2016 at 8:24 AM, Tim Gautier wrote: > Unfortunately I can't show exactly the data I'm using, but this is what > I'm seeing: > > I have

Re: Pros and Cons

2016-05-27 Thread Ted Yu
Teng: Why not try out the 2.0 SNAPSHOT build ? Thanks > On May 27, 2016, at 7:44 AM, Teng Qiu wrote: > > ah, yes, the version is another mess!... no vendor's product > > i tried hadoop 2.6.2, hive 1.2.1 with spark 1.6.1, doesn't work. > > hadoop 2.6.2, hive 2.0.1 with

Re: Spark Job Execution halts during shuffle...

2016-05-26 Thread Ted Yu
Priya: Have you checked the executor logs on hostname1 and hostname2 ? Cheers On Thu, May 26, 2016 at 8:00 PM, Takeshi Yamamuro wrote: > Hi, > > If you get stuck in job fails, one of best practices is to increase > #partitions. > Also, you'd better off using DataFrame

Re: Subtract two DataFrames is not working

2016-05-26 Thread Ted Yu
Can you be a bit more specific about how they didn't work ? BTW 1.4.1 seems to be an old release. Please try 1.6.1 if possible. Cheers On Thu, May 26, 2016 at 9:44 AM, Gurusamy Thirupathy wrote: > I have to subtract two dataframes, I tried with except method but it's not

Re: save RDD of Avro GenericRecord as parquet throws UnsupportedOperationException

2016-05-26 Thread Ted Yu
Have you seen this thread ? http://search-hadoop.com/m/q3RTtWmyYB5fweR=Re+Best+way+to+store+Avro+Objects+as+Parquet+using+SPARK On Thu, May 26, 2016 at 6:55 AM, Govindasamy, Nagarajan < ngovindas...@turbine.com> wrote: > Hi, > > I am trying to save RDD of Avro GenericRecord as parquet. I am

Re: Using Java in Spark shell

2016-05-25 Thread Ted Yu
I found this: :javap disassemble a file or class name But no direct interpretation of Java code. On Tue, May 24, 2016 at 10:11 PM, Ashok Kumar wrote: > Hello, > > A newbie question. > > Is it possible to use java code directly in spark shell

Re: How does Spark set task indexes?

2016-05-24 Thread Ted Yu
Have you taken a look at SPARK-14915 ? On Tue, May 24, 2016 at 1:00 PM, Adrien Mogenet < adrien.moge...@contentsquare.com> wrote: > Hi, > > I'm wondering how Spark is setting the "index" of task? > I'm asking this question because we have a job that constantly fails at > task index = 421. > >

Re: Spark job is failing with kerberos error while creating hive context in yarn-cluster mode (through spark-submit)

2016-05-23 Thread Ted Yu
Can you describe the kerberos issues in more detail ? Which release of YARN are you using ? Cheers On Mon, May 23, 2016 at 4:41 AM, Chandraprakash Bhagtani < cpbhagt...@gmail.com> wrote: > Hi, > > My Spark job is failing with kerberos issues while creating hive context > in yarn-cluster mode.

Re: Handling Empty RDD

2016-05-22 Thread Ted Yu
You mean when rdd.isEmpty() returned false, saveAsTextFile still produced empty file ? Can you show code snippet that demonstrates this ? Cheers On Sun, May 22, 2016 at 5:17 AM, Yogesh Vyas wrote: > Hi, > I am reading files using textFileStream, performing some action
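
A typical guard for the streaming case looks like the following sketch, assuming a DStream named stream and a hypothetical output directory:

    stream.foreachRDD { rdd =>
      if (!rdd.isEmpty()) { // skip batches with no data so no empty files are written
        rdd.saveAsTextFile(s"/output/batch-${System.currentTimeMillis()}")
      }
    }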

Re: How to set the degree of parallelism in Spark SQL?

2016-05-21 Thread Ted Yu
Looks like an equal sign is missing between partitions and 200. On Sat, May 21, 2016 at 8:31 PM, SRK wrote: > Hi, > > How to set the degree of parallelism in Spark SQL? I am using the following > but it somehow seems to allocate only two executors at a time. > >
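
The property in question is spark.sql.shuffle.partitions; it can be set on the context or at submit time. A sketch for the 1.x API:

    // partitions used for joins and aggregations in Spark SQL
    sqlContext.setConf("spark.sql.shuffle.partitions", "200")
    // or on the command line: --conf spark.sql.shuffle.partitions=200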

Re: set spark 1.6 with Hive 0.14 ?

2016-05-21 Thread Ted Yu
In spark-shell:

    scala> import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.sql.hive.HiveContext

    scala> var hc: HiveContext = new HiveContext(sc)

FYI On Sat, May 21, 2016 at 8:11 AM, Sri wrote: > Hi , > > You mean hive-site.xml file right ?,I did

Re: Spark Streaming S3 Error

2016-05-21 Thread Ted Yu
Maybe more than one version of jets3t-xx.jar was on the classpath. FYI On Fri, May 20, 2016 at 8:31 PM, Benjamin Kim wrote: > I am trying to stream files from an S3 bucket using CDH 5.7.0’s version of > Spark 1.6.0. It seems not to work. I keep getting this error. > >

Re: Can not set spark dynamic resource allocation

2016-05-20 Thread Ted Yu
Since yarn-site.xml was cited, I assume the cluster runs YARN. On Fri, May 20, 2016 at 12:30 PM, Rodrick Brown wrote: > Is this Yarn or Mesos? For the later you need to start an external shuffle > service. > > Get Outlook for iOS > > > > >

Re: Can not set spark dynamic resource allocation

2016-05-20 Thread Ted Yu
Can you retrieve the log for application_1463681113470_0006 and pastebin it ? Thanks On Fri, May 20, 2016 at 11:48 AM, Cui, Weifeng wrote: > Hi guys, > > > > Our team has a hadoop 2.6.0 cluster with Spark 1.6.1. We want to set > dynamic resource allocation for spark and we

Re: Tar File: On Spark

2016-05-19 Thread Ted Yu
See http://memect.co/call-java-from-python-so You can also use Py4J On Thu, May 19, 2016 at 3:20 PM, ayan guha wrote: > Hi > > Thanks for the input. Can it be possible to write it in python? I think I > can use FileUti.untar from hdfs jar. But can I do it from python? > On

Re: Latency experiment without losing executors

2016-05-19 Thread Ted Yu
=drive_web> > > > Geet Kumar > DataSys Laboratory, CS/IIT > Linguistic Cognition Laboratory, CS/IIT > Department of Computer Science, Illinois Institute of Technology (IIT) > Email: gkum...@hawk.iit.edu > > > On Thu, May 19, 2016 at 3:23 AM, Ted Yu <yuzhih...@gm

Re: Spark Streaming Application run on yarn-clustor mode

2016-05-19 Thread Ted Yu
Yes. See https://spark.apache.org/docs/latest/streaming-programming-guide.html On Thu, May 19, 2016 at 7:24 AM, wrote: > Hi Friends, > > Is spark streaming job will run on yarn-cluster mode? > > Thanks > Raj > > > Sent from Yahoo Mail. Get the app

Re: Latency experiment without losing executors

2016-05-19 Thread Ted Yu
I didn't see the code snippet. Were you using picture(s) ? Please pastebin the code. It would be better if you pastebin executor log for the killed executor. Thanks On Wed, May 18, 2016 at 9:41 PM, gkumar7 wrote: > I would like to test the latency (tasks/s) perceived in

Re: Is there a way to run a jar built for scala 2.11 on spark 1.6.1 (which is using 2.10?)

2016-05-18 Thread Ted Yu
Depending on the version of Hadoop you use, you may find a tarball prebuilt with Scala 2.11: https://s3.amazonaws.com/spark-related-packages FYI On Wed, May 18, 2016 at 3:35 PM, Koert Kuipers wrote: > no but you can trivially build spark 1.6.1 for scala 2.11 > > On Wed, May

Re: Can Pyspark access Scala API?

2016-05-18 Thread Ted Yu
Not sure if you have seen this (for 2.0): [SPARK-15087][CORE][SQL] Remove AccumulatorV2.localValue and keep only value Can you tell us your use case ? On Tue, May 17, 2016 at 9:16 PM, Abi wrote: > Can Pyspark access Scala API? The accumulator in pysPark does not

Re: Managed memory leak detected.SPARK-11293 ?

2016-05-18 Thread Ted Yu
Switching from snappy to lzf helped me: > > *spark.io.compression.codec=lzf* > > Do you know why? :) I can't find exact explanation... > > > > 2016-05-18 15:41 GMT+02:00 Ted Yu <yuzhih...@gmail.com>: > >> Please increase the number of partitions. >> &

Re: spark udf can not change a json string to a map

2016-05-18 Thread Ted Yu
se my string is maybe a map with an array nested in its value. > for example, map<string, Array>. > I think it can not work fine in my situation. > > Cheers > > -- Original Message -- > *From:* "喜之郎";<251922...@qq.com>; > *Sent:* May 16, 2016

Re: Managed memory leak detected.SPARK-11293 ?

2016-05-18 Thread Ted Yu
Please increase the number of partitions. Cheers On Wed, May 18, 2016 at 4:17 AM, Serega Sheypak wrote: > Hi, please have a look at log snippet: > 16/05/18 03:27:16 INFO spark.MapOutputTrackerWorker: Doing the fetch; > tracker endpoint = >

Re: SPARK - DataFrame for BulkLoad

2016-05-18 Thread Ted Yu
Please see HBASE-14150 The hbase-spark module would be available in the upcoming hbase 2.0 release. On Tue, May 17, 2016 at 11:48 PM, Takeshi Yamamuro wrote: > Hi, > > Have you checked this? > >

Re: File not found exception while reading from folder using textFileStream

2016-05-18 Thread Ted Yu
The following should handle the situation you encountered:

    diff --git a/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala b/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala
    index ed93058..f79420b 100644
    ---

Re: How to change output mode to Update

2016-05-17 Thread Ted Yu
Have you tried adding: .mode(SaveMode.Overwrite) On Tue, May 17, 2016 at 8:55 PM, Todd wrote: > scala> records.groupBy("name").count().write.trigger(ProcessingTime("30 > seconds")).option("checkpointLocation", >
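
For the batch writer, the suggested call sits in the write chain as shown below. This is a minimal sketch of SaveMode usage, not of the structured-streaming output modes the question was ultimately about:

    import org.apache.spark.sql.SaveMode

    df.write
      .mode(SaveMode.Overwrite) // replace existing output instead of failing
      .parquet("/tmp/counts")   // hypothetical destination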

Re: Code Example of Structured Streaming of 2.0

2016-05-17 Thread Ted Yu
Please take a look at: [SPARK-13146][SQL] Management API for continuous queries [SPARK-14555] Second cut of Python API for Structured Streaming On Mon, May 16, 2016 at 11:46 PM, Todd wrote: > Hi, > > Are there code examples about how to use the structured streaming feature? >

Re: Will spark swap memory out to disk if the memory is not enough?

2016-05-16 Thread Ted Yu
Have you seen this thread ? http://search-hadoop.com/m/q3RTtRbEiIXuOOS=Re+PySpark+issue+with+sortByKey+IndexError+list+index+out+of+range+ which led to SPARK-4384 On Mon, May 16, 2016 at 8:09 PM, kramer2...@126.com wrote: > I know the cache operation can cache data in

Re: Why spark 1.6.1 master can not monitor and start a auto stop worker?

2016-05-16 Thread Ted Yu
I guess 2.0 would be released before Spark Summit. On Mon, May 16, 2016 at 7:19 PM, sunday2000 <2314476...@qq.com> wrote: > Hi, > I found the bug status is : Solved, then when will the solved version be > released? > > > -- Original Message ----------

Re: Why spark 1.6.1 master can not monitor and start a auto stop worker?

2016-05-16 Thread Ted Yu
Please take a look at this JIRA: [SPARK-13604][CORE] Sync worker's state after registering with master On Mon, May 16, 2016 at 6:54 PM, sunday2000 <2314476...@qq.com> wrote: > Hi, > > A client worker stopped, and has this error message, do you know why this > happened? > > 16/05/17 03:42:20 INFO

Re: sbt for Spark build with Scala 2.11

2016-05-16 Thread Ted Yu
For 2.0, I believe that is the case. Jenkins jobs have been running against Scala 2.11: [INFO] --- scala-maven-plugin:3.2.2:testCompile (scala-test-compile-first) @ java8-tests_2.11 --- FYI On Mon, May 16, 2016 at 2:45 PM, Eric Richardson wrote: > On Thu, May 12,

Re: Debug spark core and streaming programs in scala

2016-05-16 Thread Ted Yu
From https://spark.apache.org/docs/latest/monitoring.html#metrics : - JmxSink: Registers metrics for viewing in a JMX console. FYI On Sun, May 15, 2016 at 11:54 PM, Mich Talebzadeh wrote: > Have you tried Spark GUI on 4040. This will show jobs being executed by

Re: pyspark.zip and py4j-0.9-src.zip

2016-05-15 Thread Ted Yu
For py4j, adjust version according to your need:

    <dependency>
      <groupId>net.sf.py4j</groupId>
      <artifactId>py4j</artifactId>
      <version>0.10.1</version>
    </dependency>

FYI On Sun, May 15, 2016 at 11:55 AM, satish saley wrote: > Hi, > Is there any way to pull in pyspark.zip and py4j-0.9-src.zip in maven > project? > > >

Re: spark udf can not change a json string to a map

2016-05-15 Thread Ted Yu
Can you let us know more about your use case ? I wonder if you can structure your udf by not returning Map. Cheers On Sun, May 15, 2016 at 9:18 AM, 喜之郎 <251922...@qq.com> wrote: > Hi, all. I want to implement a udf which is used to change a json string > to a map. > But some

Re: orgin of error

2016-05-15 Thread Ted Yu
odeProtocolTranslatorPB.java:443) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > > > > > > 2016-05-15 17:58 GMT+02:00 pseudo oduesp <pseudo20...@gmai

Re: orgin of error

2016-05-15 Thread Ted Yu
bq. ExecutorLostFailure (executor 4 lost) Can you check executor log for more clue ? Which Spark release are you using ? Cheers On Sun, May 15, 2016 at 8:47 AM, pseudo oduesp wrote: > someone can help me about this issues > > > > py4j.protocol.Py4JJavaError: An error

Re: Executors and Cores

2016-05-15 Thread Ted Yu
For the last question, have you looked at: https://spark.apache.org/docs/latest/configuration.html#dynamic-allocation FYI On Sun, May 15, 2016 at 5:19 AM, Mail.com wrote: > Hi , > > I have seen multiple videos on spark tuning which shows how to determine # > cores,
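
The page linked above boils down to a handful of properties. A sketch of enabling dynamic allocation programmatically (the min/max values are illustrative):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.shuffle.service.enabled", "true") // external shuffle service is required
      .set("spark.dynamicAllocation.minExecutors", "2")
      .set("spark.dynamicAllocation.maxExecutors", "20")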

Re: System memory 186646528 must be at least 4.718592E8.

2016-05-13 Thread Ted Yu
Here is related code:

    val executorMemory = conf.getSizeAsBytes("spark.executor.memory")
    if (executorMemory < minSystemMemory) {
      throw new IllegalArgumentException(s"Executor memory $executorMemory must be at least " +

On Fri, May 13, 2016 at 12:47 PM, satish saley

Re: strange behavior when I chain data frame transformations

2016-05-13 Thread Ted Yu
In the structure shown, tag is under element. I wonder if that was a factor. On Fri, May 13, 2016 at 11:49 AM, Andy Davidson < a...@santacruzintegration.com> wrote: > I am using spark-1.6.1. > > I create a data frame from a very complicated JSON file. I would assume > that query planer would

Re: Executor memory requirement for reduceByKey

2016-05-13 Thread Ted Yu
Have you taken a look at SPARK-11293 ? Consider using repartition to increase the number of partitions. FYI On Fri, May 13, 2016 at 12:14 PM, Sung Hwan Chung wrote: > Hello, > > I'm using Spark version 1.6.0 and have trouble with memory when trying to > do

Re: Tracking / estimating job progress

2016-05-13 Thread Ted Yu
Have you looked at core/src/main/scala/org/apache/spark/ui/jobs/JobProgressListener.scala ? Cheers On Fri, May 13, 2016 at 10:05 AM, Dood@ODDO wrote: > I provide a RESTful API interface from scalatra for launching Spark jobs - > part of the functionality is tracking these

Re: Spark 2.0.0-snapshot: IllegalArgumentException: requirement failed: chunks must be non-empty

2016-05-13 Thread Ted Yu
Is it possible to come up with code snippet which reproduces the following ? Thanks On Fri, May 13, 2016 at 8:13 AM, Raghava Mutharaju < m.vijayaragh...@gmail.com> wrote: > I am able to run my application after I compiled Spark source in the > following way > > ./dev/change-scala-version.sh

Re: Will the HiveContext cause memory leak ?

2016-05-12 Thread Ted Yu
The link below doesn't refer to a specific bug. Can you send the correct link ? Thanks > On May 12, 2016, at 6:50 PM, "kramer2...@126.com" wrote: > > It seems we hit the same issue. > > There was a bug on 1.5.1 about memory leak. But I am using 1.6.1 > > Here is the link

Re: How to get and save core dump of native library in executors

2016-05-12 Thread Ted Yu
Which OS are you using ? See http://en.linuxreviews.org/HOWTO_enable_core-dumps On Thu, May 12, 2016 at 2:23 PM, prateek arora wrote: > Hi > > I am running my spark application with some third party native libraries . > but it crashes some time and show error "

Re: kryo

2016-05-12 Thread Ted Yu
e.DateTimeZone.convertUTCToLocal(DateTimeZone.java:925) > > > > > > Any ideas? > > > > Thanks > > > > > > *From:* Ted Yu [mailto:yuzhih...@gmail.com] > *Sent:* May-11-16 5:32 PM > *To:* Younes Naguib > *Cc:* user@spark.apache.org > *Subject:* Re: kr

Re: kryo

2016-05-11 Thread Ted Yu
Have you seen this thread ? http://search-hadoop.com/m/q3RTtpO0qI3cp06/JodaDateTimeSerializer+spark=Re+NPE+when+using+Joda+DateTime On Wed, May 11, 2016 at 2:18 PM, Younes Naguib < younes.nag...@tritondigital.com> wrote: > Hi all, > > I'm trying to get to use spark.serializer. > I set it in the

Re: Datasets is extremely slow in comparison to RDD in standalone mode WordCount examlpe

2016-05-11 Thread Ted Yu
Which release are you using ? You can use the following to disable UI: --conf spark.ui.enabled=false On Wed, May 11, 2016 at 10:59 AM, Amit Sela wrote: > I've ran a simple WordCount example with a very small List as > input lines and ran it in standalone (local[*]), and

Re: Save DataFrame to HBase

2016-05-11 Thread Ted Yu
Please note: the name of the hbase table is specified in:

    def writeCatalog = s"""{
        |"table":{"namespace":"default", "name":"table1"},

not by: HBaseTableCatalog.newTable -> "5" FYI On T

Re: dataframe udf functioin will be executed twice when filter on new column created by withColumn

2016-05-11 Thread Ted Yu
In master branch, behavior is the same. Suggest opening a JIRA if you haven't done so. On Wed, May 11, 2016 at 6:55 AM, Tony Jin wrote: > Hi guys, > > I have a problem about spark DataFrame. My spark version is 1.6.1. > Basically, i used udf and df.withColumn to create a

Re: java.lang.ClassCastException: org.apache.spark.util.SerializableConfiguration cannot be cast to [B

2016-05-11 Thread Ted Yu
Looks like the exception was thrown from this line: ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader) Comment for taskBinary says: * @param taskBinary broadcasted version of the serialized RDD and the function to apply on each * partition

Re: Will the HiveContext cause memory leak ?

2016-05-10 Thread Ted Yu
Which Spark release are you using ? I assume executor crashed due to OOME. Did you have a chance to capture jmap on the executor before it crashed ? Have you tried giving more memory to the executor ? Thanks On Tue, May 10, 2016 at 8:25 PM, kramer2...@126.com wrote: > I

Re: Save DataFrame to HBase

2016-05-10 Thread Ted Yu
rk module allow for creating tables in Spark SQL that > reference the hbase tables underneath? In this way, users can query using > just SQL. > > Thanks, > Ben > > On Apr 28, 2016, at 3:09 AM, Ted Yu <yuzhih...@gmail.com> wrote: > > Hbase 2.0 release likely would come after Spar

Re: Accessing Cassandra data from Spark Shell

2016-05-09 Thread Ted Yu
bq. Can you use HiveContext for Cassandra data? Most likely the above cannot be done. On Mon, May 9, 2016 at 9:08 PM, Cassa L wrote: > Hi, > Has anyone tried accessing Cassandra data using SparkShell? How do you do > it? Can you use HiveContext for Cassandra data? I'm using

Re: Is it a bug?

2016-05-09 Thread Ted Yu
p.com> wrote: > How come that for the first() function it calculates an updated value and > for collect it doesn't ? > > > > On Sun, May 8, 2016 at 4:17 PM, Ted Yu <yuzhih...@gmail.com> wrote: > >> I don't think so. >> RDD is immutable. >> >> >

Re: java.lang.NoClassDefFoundError: kafka/api/TopicMetadataRequest

2016-05-09 Thread Ted Yu
NoClassDefFoundError is different from the class simply being missing from the classpath. From my experience, there is usually an earlier error before this one which gives a better idea of the cause. You can also check whether another version of Kafka is embedded in any of the jars listed below.

Re: How to verify if spark is using kryo serializer for shuffle

2016-05-08 Thread Ted Yu
sing yarn client mode hence I specified am settings too. > What you mean akka is moved out of picture? I am using spark 2.5.1 > > Sent from my iPhone > > On May 8, 2016, at 6:39 AM, Ted Yu <yuzhih...@gmail.com> wrote: > > Are you using YARN client mode ? > > See >

Re: How to verify if spark is using kryo serializer for shuffle

2016-05-08 Thread Ted Yu
le operation but not >>> actually doing anything within its own system that will cause memory issue. >>> Can you explain in what circumstances I could see this error in driver >>> logs? I don't do any collect or any other driver operation that would cause >>> this. It

Re: Is it a bug?

2016-05-08 Thread Ted Yu
I don't think so. RDD is immutable. > On May 8, 2016, at 2:14 AM, Sisyphuss wrote: > > > > > > -- > View this message in context: >

Re: How to verify if spark is using kryo serializer for shuffle

2016-05-07 Thread Ted Yu
bq. at akka.serialization.JavaSerializer.toBinary(Serializer.scala:129) It was Akka which uses JavaSerializer Cheers On Sat, May 7, 2016 at 11:13 AM, Nirav Patel wrote: > Hi, > > I thought I was using kryo serializer for shuffle. I could verify it from > spark UI -
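
Kryo is configured on the SparkConf and covers shuffle data, but it does not replace what Akka uses internally for RPC, which is where the quoted JavaSerializer frame comes from. A sketch, with a stand-in application class:

    import org.apache.spark.SparkConf

    case class MyRecord(id: Int) // hypothetical application class

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .registerKryoClasses(Array(classOf[MyRecord]))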

Re: Updating Values Inside Foreach Rdd loop

2016-05-06 Thread Ted Yu
f memory). > 2. Using Hive tables and update the same table after each iteration. > > Please suggest,which one of the methods listed above will be good to use , or > is there are more better ways to accomplish it. > > >> On Fri, 6 May 2016, 7:09 p.m. Ted Yu, <yuzhih...@gma

Re: Adhoc queries on Spark 2.0 with Structured Streaming

2016-05-06 Thread Ted Yu
I was reading StructuredStreamingProgrammingAbstractionSemanticsandAPIs-ApacheJIRA.pdf attached to SPARK-8360. On page 12, there was a mention of .format("kafka"), but I searched the codebase and didn't find any occurrence. FYI On Fri, May 6, 2016 at 1:06 PM, Michael Malak <

Re: getting NullPointerException while doing left outer join

2016-05-06 Thread Ted Yu
Is it possible to write a short test which exhibits this problem ? For Spark 2.0, this part of code has changed: [SPARK-4819] Remove Guava's "Optional" from public API FYI On Fri, May 6, 2016 at 6:57 AM, Adam Westerman wrote: > Hi, > > I’m attempting to do a left outer

Re: Updating Values Inside Foreach Rdd loop

2016-05-06 Thread Ted Yu
Please see the doc at the beginning of RDD class: * A Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an immutable, * partitioned collection of elements that can be operated on in parallel. This class contains the * basic operations available on all RDDs, such
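
In other words, "updating" an RDD inside a loop really means deriving a new RDD each time, for example:

    var current = sc.parallelize(1 to 10)
    for (_ <- 1 to 3) {
      current = current.map(_ * 2) // map returns a new RDD; the previous one is unchanged
    }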

Re: [Spark 1.5.2 ]-how to set and get Storage level for Dataframe

2016-05-05 Thread Ted Yu
I am afraid there is no such API. When persisting, you can specify StorageLevel : def persist(newLevel: StorageLevel): this.type = { Can you tell us your use case ? Thanks On Thu, May 5, 2016 at 8:06 PM, Divya Gehlot wrote: > Hi, > How can I get and set storage
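
The setter side looks like the sketch below; since the DataFrame API in 1.5 exposes no getter, the caller has to keep track of the level it passed in:

    import org.apache.spark.storage.StorageLevel

    val level = StorageLevel.MEMORY_AND_DISK_SER
    df.persist(level) // df is a hypothetical DataFrame; remember `level` yourself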

Re: DeepSpark: where to start

2016-05-04 Thread Ted Yu
Did you notice the date of the blog :-) ? On Wed, May 4, 2016 at 7:42 PM, Joice Joy wrote: > I am trying to find info on deepspark. I read the article on databricks > blog which doesnt mention a git repo but does say its open source. > Help me find the git repo for this.

Re: SqlContext parquet read OutOfMemoryError: Requested array size exceeds VM limit error

2016-05-04 Thread Ted Yu
Have you seen this thread ? http://search-hadoop.com/m/q3RTtyXr2N13hf9O=java+lang+OutOfMemoryError+Requested+array+size+exceeds+VM+limit On Wed, May 4, 2016 at 2:44 PM, Bijay Kumar Pathak wrote: > Hi, > > I am reading the parquet file around 50+ G which has 4013 partitions

Re: Performance with Insert overwrite into Hive Table.

2016-05-04 Thread Ted Yu
Looks like you were hitting HIVE-11940 On Wed, May 4, 2016 at 10:02 AM, Bijay Kumar Pathak wrote: > Hello, > > I am writing Dataframe of around 60+ GB into partitioned Hive Table using > hiveContext in parquet format. The Spark insert overwrite jobs completes in > a reasonable

Re: IS spark have CapacityScheduler?

2016-05-04 Thread Ted Yu
Cycling old bits: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-scheduling-with-Capacity-scheduler-td10038.html On Wed, May 4, 2016 at 7:44 AM, 开心延年 wrote: > Scheduling Within an Application > > I found FairScheduler, but is there some example implementation like yarn >

Re: restrict my spark app to run on specific machines

2016-05-04 Thread Ted Yu
Please refer to: https://spark.apache.org/docs/latest/running-on-yarn.html You can setup spark.yarn.am.nodeLabelExpression and spark.yarn.executor.nodeLabelExpression corresponding to the 2 machines. On Wed, May 4, 2016 at 3:03 AM, Shams ul Haque wrote: > Hi, > > I have a
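
A sketch of pinning the application master and executors with those two properties (the label name is made up, and the target machines must already carry that label in YARN):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.yarn.am.nodeLabelExpression", "sparkNodes")
      .set("spark.yarn.executor.nodeLabelExpression", "sparkNodes")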

Re: Spark Select Statement

2016-05-04 Thread Ted Yu
Please take a look at sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/session/HiveSessionImpl.java :

    } else if (key.startsWith("use:")) {
      SessionState.get().setCurrentDatabase(entry.getValue());

bq. no such table winbox_prod_action_logs_1 The above doesn't match

Re: spark 1.6.1 build failure of : scala-maven-plugin

2016-05-03 Thread Ted Yu
Final Memory: 392M/1520M > [INFO] > > > > -- Original Message -- > *From:* "sunday2000";<2314476...@qq.com>; > *Sent:* May 3, 2016 (Tuesday) 11:41 AM > *To:* "Ted Yu"<yuzhih...@gmail.com>; > *Cc:* "user"<user@spark.apache.org>

Re: spark 1.6.1 build failure of : scala-maven-plugin

2016-05-02 Thread Ted Yu
(TM) 64-Bit Server VM (build 25.91-b14, mixed mode) > > maven version: > spark-1.6.1/build/apache-maven-3.3.3/bin/mvn > > > > -- Original Message -- > *From:* "Ted Yu";<yuzhih...@gmail.com>; > *Sent:* May 3, 2016 (Tuesday) 10:43 AM > *To:*

Re: spark 1.6.1 build failure of : scala-maven-plugin

2016-05-02 Thread Ted Yu
Looks like this was a continuation of your previous query. If that is the case, please use the original thread so that people have more context. Have you tried disabling the Zinc server ? What versions of Java / Maven are you using ? Are you behind a proxy ? Finally, the 1.6.1 artifacts are

Re: Spark build failure with com.oracle:ojdbc6:jar:11.2.0.1.0

2016-05-02 Thread Ted Yu
From the output of dependency:tree on the master branch: [INFO] [INFO] Building Spark Project Docker Integration Tests 2.0.0-SNAPSHOT [INFO] [WARNING] The
