Re: Spark Streaming with Tachyon : Data Loss on Receiver Failure due to WAL error

2015-09-25 Thread Dibyendu Bhattacharya
Hi, Recently I was working on a PR to use Tachyon as OFF_HEAP store for Spark Streaming and make sure Spark Streaming can recover from Driver failure and recover the blocks from Tachyon. The motivation for this PR is: If Streaming application stores the blocks OFF_HEAP, it may not need any

Re: Spark Streaming with Tachyon : Data Loss on Receiver Failure due to WAL error

2015-09-25 Thread N B
Hi Dibyendu, How does one go about configuring spark streaming to use tachyon as its place for storing checkpoints? Also, can one do this with tachyon running on a completely different node than where spark processes are running? Thanks Nikunj On Thu, May 21, 2015 at 8:35 PM, Dibyendu Bhattacha

Re: How to properly set conf/spark-env.sh for spark to run on yarn

2015-09-25 Thread Gavin Yue
It is working, We are doing the same thing everyday. But the remote server needs to be able to talk with the ResourceManager. If you are using Spark-submit, you will also specify the hadoop conf directory in your env variable. Spark would rely on that to locate where the cluster's resource manager is.

Re: Networking issues with Spark on EC2

2015-09-25 Thread SURAJ SHETH
Hi, Nopes. I was trying to use EC2 (due to a few constraints) where I faced the problem. With EMR, it works flawlessly. But, I would like to go back and use EC2 if I can fix this issue. Has anybody set up a spark cluster using plain EC2 machines? What steps did you follow? Thanks and Regards, Suraj

Re: Networking issues with Spark on EC2

2015-09-25 Thread Natu Lauchande
Hi, Are you using EMR ? Natu On Sat, Sep 26, 2015 at 6:55 AM, SURAJ SHETH wrote: > Hi Ankur, > Thanks for the reply. > This is already done. > If I wait for a long amount of time(10 minutes), a few tasks get > successful even on slave nodes. Sometime, a fraction of the tasks(20%) are > complet

Re: Networking issues with Spark on EC2

2015-09-25 Thread SURAJ SHETH
Hi Ankur, Thanks for the reply. This is already done. If I wait for a long amount of time (10 minutes), a few tasks get successful even on slave nodes. Sometimes, a fraction of the tasks (20%) are completed on all the machines in the initial 5 seconds and then, it slows down drastically. Thanks and R

What is this Input Size in Spark Application Detail UI?

2015-09-25 Thread Chirag Dewan
Hi All, I was wondering what does the Input Size in Application UI mean? For my 3 node Cassandra Cluster, with 3 node Spark Cluster this size is 32GB. For my 15 node Cassandra Cluster, with 15 node Spark Cluster this size reaches 172GB. Though the data in both clusters is about the same volume. C

Re: Weird worker usage

2015-09-25 Thread N B
Bryan, By any chance, are you calling SparkConf.setMaster("local[*]") inside your application code? Nikunj On Fri, Sep 25, 2015 at 9:56 AM, Bryan Jeffrey wrote: > Looking at this further, it appears that my Spark Context is not correctly > setting the Master name. I see the following in logs:

Re: How to properly set conf/spark-env.sh for spark to run on yarn

2015-09-25 Thread Alexis Gillain
I think Gavin wants you to print the env variables from the remote machine. Can you provide the spark-submit command line? Are you able to run the repl from the remote machine? ./bin/spark-shell --master yarn-client 2015-09-26 10:11 GMT+08:00 Zhiliang Zhu : > Hi Yue, > > Thanks very much for yo

Re: How to properly set conf/spark-env.sh for spark to run on yarn

2015-09-25 Thread Zhiliang Zhu
Hi Yue, Thanks very much for your kind reply. I would like to submit the spark job remotely on another machine outside the cluster, and the job will run on yarn, similar to how the hadoop job is already done; could you confirm it could exactly work for spark... Do you mean that I would print those variables on

Re: How to properly set conf/spark-env.sh for spark to run on yarn

2015-09-25 Thread Gavin Yue
Print out your env variables and check first Sent from my iPhone > On Sep 25, 2015, at 18:43, Zhiliang Zhu wrote: > > Hi All, > > I would like to submit spark job on some another remote machine outside the > cluster, > I also copied hadoop/spark conf files under the remote machine, then hado

How to properly set conf/spark-env.sh for spark to run on yarn

2015-09-25 Thread Zhiliang Zhu
Hi All, I would like to submit spark job on some another remote machine outside the cluster, I also copied hadoop/spark conf files under the remote machine, then hadoop job would be submitted, but spark job would not. In spark-env.sh, it may be due to that SPARK_LOCAL_IP is not properly set, or for

Re: Spark for Oracle sample code

2015-09-25 Thread Michael Armbrust
In most cases predicates that you add to jdbcDF will be pushed down into Oracle, preventing the whole table from being sent over. df.where("column = 1") Another common pattern is to save the table to parquet or something for repeat querying. Michael On Fri, Sep 25, 2015 at 3:13 PM, Cui Lin wrote
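
A minimal sketch of the two patterns mentioned above (the Oracle URL, driver, table and column names are placeholders, not from the thread):

    import org.apache.spark.sql.SQLContext

    val sqlContext: SQLContext = ???   // assumed to exist in the session

    val jdbcDF = sqlContext.read.format("jdbc").options(Map(
      "url"     -> "jdbc:oracle:thin:@//dbhost:1521/ORCL",   // placeholder URL
      "dbtable" -> "schema.tablename",
      "driver"  -> "oracle.jdbc.OracleDriver"
    )).load()

    // The filter is translated into a WHERE clause sent to the database,
    // so the whole table is not shipped over the wire.
    val subset = jdbcDF.where("column = 1")

    // Persist once as Parquet for cheap repeat querying.
    subset.write.parquet("/tmp/tablename_parquet")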

Re: Spark SQL: Native Support for LATERAL VIEW EXPLODE

2015-09-25 Thread Michael Armbrust
The SQL parser without HiveContext is really simple, which is why I generally recommend users use HiveContext. However, you can do it with dataframes: import org.apache.spark.sql.functions._ table("purchases").select(explode(df("purchase_items")).as("item")) On Fri, Sep 25, 2015 at 4:21 PM, Je
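
A self-contained sketch of the DataFrame explode approach described above (the "purchases" table and "purchase_items" array column are assumptions taken from the snippet):

    import org.apache.spark.sql.functions._

    val purchases = sqlContext.table("purchases")   // assumed registered table
    val items = purchases.select(explode(purchases("purchase_items")).as("item"))
    items.show()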

Error in starting sparkR: Error in socketConnection(port = monitorPort) :

2015-09-25 Thread Jonathan Yue
I have been trying to start up sparkR but always get the error about monitorPort: export LD_LIBRARY_PATH=$HOME/jaguar/lib:$HOME/opt/hadoop/lib/native JDBCJAR=$HOME/jaguar/lib/jaguar-jdbc-2.0.jar sparkR \ --driver-class-path $JDBCJAR \ --driver-library-path $HOME/jaguar/lib \ --conf spark.executor.

Re: Spark for Oracle sample code

2015-09-25 Thread Jonathan Yue
In your dbtable you can insert "select ..." instead of a table name. I never tried it, but saw an example on the web. Best regards, Jonathan From: Cui Lin To: user Sent: Friday, September 25, 2015 4:12 PM Subject: Spark for Oracle sample code Hello, All, I found the examples for JDBC
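
A hedged sketch of that suggestion: "dbtable" can carry a parenthesised subquery with an alias instead of a plain table name, so only the selected rows come back. The query, URL and column names below are illustrative:

    val query = "(select id, amount from schema.tablename where amount > 100) t"
    val df = sqlContext.read.format("jdbc").options(Map(
      "url"     -> "jdbc:oracle:thin:@//dbhost:1521/ORCL",   // placeholder URL
      "dbtable" -> query                                     // subquery instead of a table name
    )).load()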

Spark SQL: Native Support for LATERAL VIEW EXPLODE

2015-09-25 Thread Jerry Lam
Hi sparkers, Anyone knows how to do LATERAL VIEW EXPLODE without HiveContext? I don't want to start up a metastore and derby just because I need LATERAL VIEW EXPLODE. I have been trying but I always get the exception like this: Name: java.lang.RuntimeException Message: [1.68] failure: ``union''

Spark for Oracle sample code

2015-09-25 Thread Cui Lin
Hello, All, I found that the examples for JDBC connections mostly read the whole table and then do operations like joining. val jdbcDF = sqlContext.read.format("jdbc").options( Map("url" -> "jdbc:postgresql:dbserver", "dbtable" -> "schema.tablename")).load() Sometimes it is not practical sinc

Fwd: Spark for Oracle sample code

2015-09-25 Thread Cui Lin
Hello, All, I found that the examples for JDBC connections mostly read the whole table and then do operations like joining. val jdbcDF = sqlContext.read.format("jdbc").options( Map("url" -> "jdbc:postgresql:dbserver", "dbtable" -> "schema.tablename")).load() Sometimes it is not practical sinc

GraphX create graph with multiple node attributes

2015-09-25 Thread JJ
Hi, I am new to Spark and GraphX, so thanks in advance for your patience. I want to create a graph with multiple node attributes. Here is my code: But I receive error: Can someone help? Thanks! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/GraphX-c

[SPARK-SQL] Requested array size exceeds VM limit

2015-09-25 Thread Sadhan Sood
I am trying to run a query on a month of data. The volume of data is not much, but we have a partition per hour and per day. The table schema is heavily nested with total of 300 leaf fields. I am trying to run a simple select count(*) query on the table and running into this exception: SELECT

Re: Is this a Spark issue or Hive issue that Spark cannot read the string type data in the Parquet generated by Hive

2015-09-25 Thread Cheng Lian
BTW, just checked that this bug should have been fixed since Hive 0.14.0. So the SQL option I mentioned is mostly used for reading legacy Parquet files generated by older versions of Hive. Cheng On 9/25/15 2:42 PM, Cheng Lian wrote: Please set the the SQL option spark.sql.parquet.binaryAsStrin

Re: Is this a Spark issue or Hive issue that Spark cannot read the string type data in the Parquet generated by Hive

2015-09-25 Thread Cheng Lian
Please set the SQL option spark.sql.parquet.binaryAsString to true when reading Parquet files containing strings generated by Hive. This is actually a bug of parquet-hive. When generating the Parquet schema for a string field, Parquet requires a "UTF8" annotation, something like: message hive
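
A minimal sketch of the workaround suggested above (the option name is from the message; the table path is a placeholder):

    sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")
    val df = sqlContext.read.parquet("/user/hive/warehouse/some_table")   // placeholder path
    df.printSchema()   // binary columns written by older Hive now come back as string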

Re: Distance metrics in KMeans

2015-09-25 Thread sethah
It looks like the distance metric is hard coded to the L2 norm (euclidean distance) in MLlib. As you may expect, you are not the first person to desire other metrics and there has been some prior effort. Please reference this PR: https://github.com/apache/spark/pull/2634 And corresponding JIRA:

how to control timeout in node failure for spark task ?

2015-09-25 Thread roy
Hi, We are running Spark 1.3 on CDH 5.4.1 on top of YARN. We want to know how to control the task timeout when a node fails, so that tasks running on it are restarted on another node. At present the job waits approximately 10 min to restart the tasks that were running on the failed node. http://spark.apache.

Is this a Spark issue or Hive issue that Spark cannot read the string type data in the Parquet generated by Hive

2015-09-25 Thread java8964
Hi, Spark Users: I have a problem related to Spark cannot recognize the string type in the Parquet schema generated by Hive. Version of all components: Spark 1.3.1, Hive 0.12.0, Parquet 1.3.2. I generated a detail low level table in the Parquet format using MapReduce java code. This table can be read

Distance metrics in KMeans

2015-09-25 Thread bobtreacy
Is it possible to use other distance metrics than Euclidean (e.g. Tanimoto, Manhattan) with MLlib KMeans? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Distance-metrics-in-KMeans-tp24823.html Sent from the Apache Spark User List mailing list archive at Nab

Re: hive on spark query error

2015-09-25 Thread Marcelo Vanzin
Seems like you have "hive.server2.enable.doAs" enabled; you can either disable it, or configure hs2 so that the user running the service ("hadoop" in your case) can impersonate others. See: https://hadoop.apache.org/docs/r2.7.0/hadoop-project-dist/hadoop-common/Superusers.html On Fri, Sep 25, 201

Re: Kafka & Spark Streaming

2015-09-25 Thread Neelesh
Thanks. Ill keep an eye on this. Our implementation of the DStream basically accepts a function to compute current offsets. The implementation of the function fetches list of topics from zookeeper once in while. It then adds consumer offsets for newly added topics with the currentOffsets thats in

Re: Kafka & Spark Streaming

2015-09-25 Thread Cody Koeninger
Yes, the partition IDs are the same. As far as the failure / subclassing goes, you may want to keep an eye on https://issues.apache.org/jira/browse/SPARK-10320 , not sure if the suggestions in there will end up going anywhere. On Fri, Sep 25, 2015 at 3:01 PM, Neelesh wrote: > For the 1-1 mappin
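
A hedged sketch of the 1-1 mapping discussed here: with the direct stream, partition i of each batch RDD corresponds to offsetRanges(i), so TaskContext can look up the range inside foreachPartition. `stream` is assumed to be the DirectKafkaInputDStream from the thread; commit logic is omitted:

    import org.apache.spark.TaskContext
    import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

    stream.foreachRDD { rdd =>
      // Capture the ranges on the driver, before any transformation changes partitioning.
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd.foreachPartition { iter =>
        // 1-1 mapping: partition i of the batch RDD corresponds to offsetRanges(i).
        val range: OffsetRange = offsetRanges(TaskContext.get.partitionId)
        // process `iter`; on success, record range.untilOffset for (range.topic, range.partition)
      }
    }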

Re: kafka direct streaming with checkpointing

2015-09-25 Thread Neelesh
As Cody says, to achieve true exactly once, the book keeping has to happen in the sink data system, that too assuming its a transactional store. Wherever possible, we try to make the application idempotent (upsert in HBase, ignore-on-duplicate for MySQL etc), but there are still cases (analytics, c

Re: Kafka & Spark Streaming

2015-09-25 Thread Neelesh
For the 1-1 mapping case, can I use TaskContext.get().partitionId as an index in to the offset ranges? For the failure case, yes, I'm subclassing of DirectKafkaInputDStream. As for failures, different partitions in the same batch may be talking to different RDBMS servers due to multitenancy - a spa

Re: kafka direct streaming with checkpointing

2015-09-25 Thread Cody Koeninger
Spark's checkpointing system is not a transactional database, and it doesn't really make sense to try and turn it into one. On Fri, Sep 25, 2015 at 2:15 PM, Radu Brumariu wrote: > Wouldn't the same case be made for checkpointing in general ? > What I am trying to say, is that this particular sit

Re: Kafka & Spark Streaming

2015-09-25 Thread Cody Koeninger
Your success case will work fine, it is a 1-1 mapping as you said. To handle failures in exactly the way you describe, you'd need to subclass or modify DirectKafkaInputDStream and change the way compute() works. Unless you really are going to have very fine-grained failures (why would only a give

Re: spark.streaming.concurrentJobs

2015-09-25 Thread Atul Kulkarni
Can someone please help, either by explaining or pointing to documentation, the relationship between the # of executors needed and how to let the concurrent jobs created by the above parameter run in parallel? On Thu, Sep 24, 2015 at 11:56 PM, Atul Kulkarni wrote: > Hi Folks, > > I am trying to s

how to handle OOMError from groupByKey

2015-09-25 Thread Elango Cheran
Hi everyone, I have an RDD of the format (user: String, timestamp: Long, state: Boolean). My task involves converting the states, where on/off is represented as true/false, into intervals of 'on' of the format (beginTs: Long, endTs: Long). So this task requires me, per user, to line up all of the

Re: Convert Vector to RDD[Double]

2015-09-25 Thread Sourigna Phetsarath
import org.apache.spark.mllib.linalg._ val v = Vectors.dense(1.0,2.0) val rdd = sc.parallelize(v.toArray) On Fri, Sep 25, 2015 at 2:46 PM, Yusuf Can Gürkan wrote: > How can i convert a Vector to RDD[Double]. For example: > > val vector = Vectors.dense(1.0,2.0) > val rdd // i need sc.parallel

Re: executor-cores setting does not work under Yarn

2015-09-25 Thread Gavin Yue
I think I found the problem. Have to change the yarn capacity scheduler to use DominantResourceCalculator Thanks! On Fri, Sep 25, 2015 at 4:54 AM, Akhil Das wrote: > Which version of spark are you having? Can you also check whats set in > your conf/spark-defaults.conf file? > > Thanks > Best

Re: kafka direct streaming with checkpointing

2015-09-25 Thread Radu Brumariu
Wouldn't the same case be made for checkpointing in general ? What I am trying to say, is that this particular situation is part of the general checkpointing use case, not an edge case. I would like to understand why shouldn't the checkpointing mechanism, already existent in Spark, handle this situ

Re: Reading Hive Tables using SQLContext

2015-09-25 Thread Michael Armbrust
Eventually I'd like to eliminate HiveContext, but for now I just recommend that most users use it instead of SQLContext. On Thu, Sep 24, 2015 at 5:41 PM, Sathish Kumaran Vairavelu < vsathishkuma...@gmail.com> wrote: > Thanks Michael. Just want to check if there is a roadmap to include Hive > tabl

Re: Kafka & Spark Streaming

2015-09-25 Thread Neelesh
Thanks Petr, Cody. This is a reasonable place to start for me. What I'm trying to achieve: stream.foreachRDD { rdd => rdd.foreachPartition { p => Try(myFunc(...)) match { case Success(s) => update watermark for this partition // of course, the expectation is that it will work only if the

Convert Vector to RDD[Double]

2015-09-25 Thread Yusuf Can Gürkan
How can i convert a Vector to RDD[Double]. For example: val vector = Vectors.dense(1.0,2.0) val rdd // i need sc.parallelize(Array(1.0,2.0)) - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-

Broadcast to executors with multiple cores

2015-09-25 Thread Jeff Palmucci
So I have a large data structure that I want to broadcast to my executors. It is so large that it makes sense to share access to the object between multiple tasks, so I create my executors with multiple cores. Unfortunately, it looks like the object is not shared between threads, but is copied o

Re: Kafka & Spark Streaming

2015-09-25 Thread Cody Koeninger
http://spark.apache.org/docs/latest/streaming-kafka-integration.html#approach-2-direct-approach-no-receivers also has an example of how to close over the offset ranges so they are available on executors. On Fri, Sep 25, 2015 at 12:50 PM, Neelesh wrote: > Hi, >We are using DirectKafkaInputDS

Re: Kafka & Spark Streaming

2015-09-25 Thread Petr Novak
You can have offsetRanges on workers, e.g. object Something { var offsetRanges = Array[OffsetRange]() def create[F : ClassTag](stream: InputDStream[Array[Byte]]) (implicit codec: Codec[F]): DStream[F] = { stream transform { rdd => offsetRanges = rdd.asInstanceOf[HasOffsetRanges]

Kafka & Spark Streaming

2015-09-25 Thread Neelesh
Hi, We are using DirectKafkaInputDStream and store completed consumer offsets in Kafka (0.8.2). However, some of our use case require that offsets be not written if processing of a partition fails with certain exceptions. This allows us to build various backoff strategies for that partition, ins

Re: Weird worker usage

2015-09-25 Thread N B
Hi Akhil, I do have 25 partitions being created. I have set the spark.default.parallelism property to 25. Batch size is 30 seconds and block interval is 1200 ms which also gives us roughly 25 partitions from the input stream. I can see 25 partitions being created and used in the Spark UI also. Its

Re: Using Map and Basic Operators yield java.lang.ClassCastException (Parquet + Hive + Spark SQL 1.5.0 + Thrift)

2015-09-25 Thread Cheng Lian
Thanks for the clarification. Could you please provide the full schema of your table and query plans of your query? You may obtain them via: hiveContext.table("your_table").printSchema() and hiveContext.sql("your query").explain(extended = true) You also mentioned "Thrift" in the subject, did

RE: hive on spark query error

2015-09-25 Thread Garry Chen
Yes you are right. Make the change and also link hive-site.xml into spark conf directory. Rerun the sql getting error in hive.log 2015-09-25 13:31:14,750 INFO [HiveServer2-Handler-Pool: Thread-125]: client.SparkClientImpl (SparkClientImpl.java:startDriver(375)) - Attempting impersonation of

Re: java.io.NotSerializableException: org.apache.avro.Schema$RecordSchema

2015-09-25 Thread Daniel Haviv
I tried but I'm getting the same error (task not serializable) > On 25 Sep 2015, at 20:10, Ted Yu wrote: > > Is the Schema.parse() call expensive ? > > Can you call it in the closure ? > >> On Fri, Sep 25, 2015 at 10:06 AM, Daniel Haviv >> wrote: >> Hi, >> I'm getting a NotSerializableExce

Re: hive on spark query error

2015-09-25 Thread Marcelo Vanzin
On Fri, Sep 25, 2015 at 10:05 AM, Garry Chen wrote: > In spark-defaults.conf the spark.master is spark://hostname:7077. From > hive-site.xml > spark.master > hostname > That's not a valid value for spark.master (as the error indicates). You should set it to "spark://hostname:7077"
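
Hedged illustration only: the actual fix in this thread belongs in hive-site.xml / spark-defaults.conf, but the equivalent programmatic setting looks like this (host/port are placeholders):

    val conf = new org.apache.spark.SparkConf()
      .setMaster("spark://hostname:7077")   // a full Spark URL, not a bare hostname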

Re: java.io.NotSerializableException: org.apache.avro.Schema$RecordSchema

2015-09-25 Thread Ted Yu
Is the Schema.parse() call expensive ? Can you call it in the closure ? On Fri, Sep 25, 2015 at 10:06 AM, Daniel Haviv < daniel.ha...@veracity-group.com> wrote: > Hi, > I'm getting a NotSerializableException even though I'm creating all my > objects from within the closure: > import org.apa
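
A hedged sketch of that suggestion: parse the Avro schema on the executors (once per partition) so the non-serializable Schema/GenericDatumReader never cross the wire. Shipping the schema JSON as a plain String is an assumption, and the record decoding itself is omitted:

    import org.apache.avro.Schema
    import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
    import org.apache.spark.rdd.RDD

    val schemaJson: String = ???          // a plain String is serializable
    val rdd: RDD[Array[Byte]] = ???       // assumed: serialized Avro records

    val decoded = rdd.mapPartitions { iter =>
      // Parse the schema on the executor, once per partition.
      val schema = new Schema.Parser().parse(schemaJson)
      val reader = new GenericDatumReader[GenericRecord](schema)
      iter.map { bytes =>
        // deserialize `bytes` with `reader` here (DecoderFactory etc.), omitted
        bytes.length
      }
    }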

java.io.NotSerializableException: org.apache.avro.Schema$RecordSchema

2015-09-25 Thread Daniel Haviv
Hi, I'm getting a NotSerializableException even though I'm creating all my objects from within the closure: import org.apache.avro.generic.GenericDatumReader import java.io.File import org.apache.avro._ val orig_schema = Schema.parse(new File("/home/wasabi/schema")) val READER = new Gen

RE: hive on spark query error

2015-09-25 Thread Garry Chen
In spark-defaults.conf the spark.master is spark://hostname:7077. From hive-site.xml: spark.master = hostname From: Jimmy Xiang [mailto:jxi...@cloudera.com] Sent: Friday, September 25, 2015 1:00 PM To: Garry Chen Cc: user@spark.apache.org Subject: Re: hive on spark query error >

Re: --class has to be always specified in spark-submit either it is defined in jar manifest?

2015-09-25 Thread Petr Novak
I'm sorry. Both approaches actually work. It was something else wrong with my cluster. Petr On Fri, Sep 25, 2015 at 4:53 PM, Petr Novak wrote: > Either setting it programatically doesn't work: > sparkConf.setIfMissing("class", "...Main") > > In my current setting moving main to another package r

Re: hive on spark query error

2015-09-25 Thread Jimmy Xiang
> Error: Master must start with yarn, spark, mesos, or local What's your setting for spark.master? On Fri, Sep 25, 2015 at 9:56 AM, Garry Chen wrote: > Hi All, > > I am following > https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started? > to setup hive

hive on spark query error

2015-09-25 Thread Garry Chen
Hi All, I am following https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started? to setup hive on spark. After setup/configuration, everything starts up and I am able to show tables, but when executing a sql statement within beeline I got an error. Please help and

Re: Weird worker usage

2015-09-25 Thread Bryan Jeffrey
Looking at this further, it appears that my Spark Context is not correctly setting the Master name. I see the following in logs: 15/09/25 16:45:42 INFO DriverRunner: Launch Command: "/usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java" "-cp" "/spark/spark-1.4.1/sbin/../conf/:/spark/spark-1.4.1/assembl

Handle null/NaN values in mllib classifier

2015-09-25 Thread matd
Hi folks, I have a set of categorical columns (strings), that I'm parsing and converting into Vectors of features to pass to a mllib classifier (random forest). In my input data, some columns have null values. Say, in one of those columns, I have p values + a null value : How should I build my f
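
A hedged sketch of one common approach to the question above: treat null as its own category, so a column with p observed values gets indices 0..p (index p meaning "missing"). The column values below are illustrative:

    val categories = Seq("red", "green", "blue")                    // p = 3 observed values
    val index: Map[String, Double] =
      categories.zipWithIndex.map { case (v, i) => v -> i.toDouble }.toMap

    def encode(value: String): Double =
      if (value == null) categories.size.toDouble                   // extra "missing" bucket
      else index.getOrElse(value, categories.size.toDouble)

    // The resulting doubles can then be declared categorical (e.g. via
    // categoricalFeaturesInfo) when training the MLlib random forest.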

Re: Generic DataType in UDAF

2015-09-25 Thread Ritesh Agrawal
hi Yin, I have written a simple UDAF to generate N samples for each group. I am using the reservoir sampling algorithm for this. In this case the input data type doesn't matter, as I am not doing any kind of processing on the input data but just selecting them at random and building an array an

Re: Generic DataType in UDAF

2015-09-25 Thread Yin Huai
Hi Ritesh, Right now, we only allow specific data types defined in the inputSchema. Supporting abstract types (e.g. NumericType) may cause the logic of a UDAF be more complex. It will be great to understand the use cases first. What kinds of possible input data types that you want to support and d

Re: kafka direct streaming with checkpointing

2015-09-25 Thread Cody Koeninger
Storing passbacks transactionally with results in your own data store, with a schema that makes sense for you, is the optimal solution. On Fri, Sep 25, 2015 at 11:05 AM, Radu Brumariu wrote: > Right, I understand why the exceptions happen. > However, it seems less useful to have a checkpointing

Re: spark.mesos.coarse impacts memory performance on mesos

2015-09-25 Thread Tim Chen
Hi Utkarsh, What is your job placement like when you run fine grain mode? You said coarse grain mode only ran with one node right? And when the job is running could you open the Spark webui and get stats about the heap size and other java settings? Tim On Thu, Sep 24, 2015 at 10:56 PM, Utkarsh

Re: Weird worker usage

2015-09-25 Thread Bryan Jeffrey
I am seeing a similar issue when reading from Kafka. I have a single Kafka broker with 1 topic and 10 partitions on a separate machine. I have a three-node spark cluster, and verified that all workers are registered with the master. I'm initializing Kafka using a similar method to this article:

Re: Receiver and Parallelization

2015-09-25 Thread Adrian Tanase
Good catch, I was not aware of this setting. I’m wondering though if it also generates a shuffle or if the data is still processed by the node on which it’s ingested - so that you’re not gated by the number of cores on one machine. -adrian On 9/25/15, 5:27 PM, "Silvio Fiorito" wrote: >One

Re: Java Heap Space Error

2015-09-25 Thread Yusuf Can Gürkan
Hello, It worked like a charm. Thank you very much. Some userid’s were null that’s why many records go to userid ’null’. When i put a where clause: userid != ‘null’, it solved problem. > On 24 Sep 2015, at 22:43, java8964 wrote: > > I can understand why your first query will finish without OO

Re: kafka direct streaming with checkpointing

2015-09-25 Thread Radu Brumariu
Right, I understand why the exceptions happen. However, it seems less useful to have a checkpointing that only works in the case of an application restart. IMO, code changes happen quite often, and not being able to pick up where the previous job left off is quite a bit of a hinderance. The soluti

Re: ERROR BoundedByteBufferReceive: OOME with size 352518400

2015-09-25 Thread Sourabh Chandak
Thanks Cody I was able to find out the issue yesterday after sending the last email. On Friday, September 25, 2015, Cody Koeninger wrote: > So you're still having a problem getting partitions or offsets from kafka > when creating the stream. You can try each of those kafka operations > individu

Generic DataType in UDAF

2015-09-25 Thread Ritesh Agrawal
Hi all, I am trying to learn about UDAF and implemented a simple reservoir sample UDAF. It's working fine. However I am not able to figure out what DataType I should use so that it can deal with all DataTypes (simple and complex). For instance, currently I have defined my input schema as def inp

Re: --class has to be always specified in spark-submit either it is defined in jar manifest?

2015-09-25 Thread Petr Novak
Setting it programmatically doesn't work either: sparkConf.setIfMissing("class", "...Main") In my current setting, moving main to another package requires propagating the change to deploy scripts. Doesn't matter, I will find some other way. Petr On Fri, Sep 25, 2015 at 4:40 PM, Petr Novak wrote: > Or

--class has to be always specified in spark-submit either it is defined in jar manifest?

2015-09-25 Thread Petr Novak
Otherwise it seems it tries to load from a checkpoint which I have deleted and cannot be found. Or it should work and I have something else wrong. The documentation doesn't mention the option with a jar manifest, so I assume it doesn't work this way. Many thanks, Petr

Re: Receiver and Parallelization

2015-09-25 Thread Silvio Fiorito
One thing you should look at is your batch duration and spark.streaming.blockInterval Those 2 things control how many partitions are generated for each RDD (batch) of the DStream when using a receiver (vs direct approach). So if you have a 2 second batch duration and the default blockInterval o
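
A sketch of the arithmetic described above, with illustrative values: a 2 s batch with the default 200 ms blockInterval yields ~10 blocks, i.e. ~10 partitions per batch per receiver:

    val batchDurationMs = 2000L
    val blockIntervalMs = 200L    // spark.streaming.blockInterval default
    val partitionsPerBatch = batchDurationMs / blockIntervalMs    // = 10

    // To get more partitions without a shuffle, lower the block interval, e.g.:
    // sparkConf.set("spark.streaming.blockInterval", "100ms")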

Re: ERROR BoundedByteBufferReceive: OOME with size 352518400

2015-09-25 Thread Cody Koeninger
So you're still having a problem getting partitions or offsets from kafka when creating the stream. You can try each of those kafka operations individually (getPartitions / getLatestLeaderOffsets) checkErrors should be dealing with an arraybuffer of throwables, not just a single one. Is that the

Re: Receiver and Parallelization

2015-09-25 Thread Adrian Tanase
1) yes, just use .repartition on the inbound stream, this will shuffle data across your whole cluster and process in parallel as specified. 2) yes, although I’m not sure how to do it for a totally custom receiver. Does this help as a starting point? http://spark.apache.org/docs/latest/streaming-
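
A minimal sketch of point (1): repartition the single-receiver DStream so each batch is spread across the whole cluster. The receiver, context and partition count are assumptions, not from the thread:

    import org.apache.spark.streaming.StreamingContext
    import org.apache.spark.streaming.receiver.Receiver

    val ssc: StreamingContext = ???          // assumed existing streaming context
    val jmsReceiver: Receiver[String] = ???  // the custom JMS receiver from the question

    // A single receiver ingests on one executor...
    val jmsStream = ssc.receiverStream(jmsReceiver)
    // ...but repartitioning each batch spreads processing across the cluster.
    val distributed = jmsStream.repartition(24)   // ~ total cores, illustrative number
    distributed.foreachRDD { rdd => /* per-message work now runs on all nodes */ }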

HDFS is undefined

2015-09-25 Thread Angel Angel
hello, I am running the spark application. I have installed the cloudera manager. It includes spark version 1.2.0, but now I want to use spark version 1.4.0; it's also working fine. But when I try to access HDFS in spark 1.4.0 in eclipse I am getting the following error. "Exception in t

Re: Unreachable dead objects permanently retained on heap

2015-09-25 Thread Saurav Sinha
Hi Spark Users, I am running some spark jobs which run every hour. After running for 12 hours the master is getting killed, giving the exception *java.lang.OutOfMemoryError: GC overhead limit exceeded* It looks like there is some memory issue in the spark master. The same kind of issue I noticed with sp

Best practices for scheduling Spark jobs on "shared" YARN cluster using Autosys

2015-09-25 Thread unk1102
Hi I have 5 Spark jobs which need to be run in parallel to speed up the process; they take around 6-8 hours together. I have 93 container nodes with 8 cores each and memory capacity of around 2.8 TB. Now I run each job with around 30 executors with 2 cores and 20 GB each. Each of my jobs processes around 1

Receiver and Parallelization

2015-09-25 Thread nibiau
Hello, I used a custom receiver in order to receive JMS messages from MQ Servers. I want to benefit from the Yarn cluster; my questions are: - Is it possible to have only one node receiving JMS messages and parallelize the RDD over all the cluster nodes? - Is it possible to parallelize also the messa

Re: LogisticRegression models consumes all driver memory

2015-09-25 Thread Eugene Zhulenev
Problem turned out to be in too high 'spark.default.parallelism', BinaryClassificationMetrics are doing combineByKey which internally shuffle train dataset. Lower parallelism + cutting train set RDD history with save/read into parquet solved the problem. Thanks for hint! On Wed, Sep 23, 2015 at 11

Transformation pipeling and parallelism in Spark

2015-09-25 Thread Zhongmiao Li
Hello all, I have a question regarding the pipelining and parallelism of transformations in Spark. I couldn’t find any documentation about it and I would really appreciate your help if you could help me with it. I just started using and reading Spark, so I guess my description may not be very

Re: sometimes No event logs found for application using same JavaSparkSQL example

2015-09-25 Thread our...@cnsuning.com
https://issues.apache.org/jira/browse/SPARK-10832 From: our...@cnsuning.com Sent: 2015-09-25 20:36 To: user Cc: 494165115 Subject: sometimes No event logs found for application using same JavaSparkSQL example hi all, when using the JavaSparkSQL example, the code was submitted many times as followin

sometimes No event logs found for application using same JavaSparkSQL example

2015-09-25 Thread our...@cnsuning.com
hi all, when using the JavaSparkSQL example, the code was submitted many times as following: /home/spark/software/spark/bin/spark-submit --deploy-mode cluster --class org.apache.spark.examples.sql.JavaSparkSQL hdfs://SuningHadoop2/user/spark/lib/spark-examples-1.4.0-hadoop2.4.0.jar unfortunately, s

Re: Setting Spark TMP Directory in Cluster Mode

2015-09-25 Thread Akhil Das
Try with spark.local.dir in the spark-defaults.conf or SPARK_LOCAL_DIR in the spark-env.sh file. Thanks Best Regards On Fri, Sep 25, 2015 at 2:14 PM, mufy wrote: > Faced with an issue where Spark temp files get filled under > /opt/spark-1.2.1/tmp on the local filesystem on the worker nodes. Whi
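
A tiny sketch of the equivalent programmatic setting (the path is a placeholder); in conf/spark-defaults.conf the same line would read: spark.local.dir /data/spark-tmp

    val conf = new org.apache.spark.SparkConf()
      .set("spark.local.dir", "/data/spark-tmp")   // placeholder directory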

Re: Error: Asked to remove non-existent executor

2015-09-25 Thread Akhil Das
What do you mean by you are behind a NAT? Does it mean you are submitting your jobs to a remote spark cluster from your local machine? If that's the case then you need to take care of a few ports (in the NAT) http://spark.apache.org/docs/latest/configuration.html#networking which assume random as defaul

Re: Weird worker usage

2015-09-25 Thread Akhil Das
Parallel tasks totally depends on the # of partitions that you are having, if you are not receiving sufficient partitions (partitions > total # cores) then try to do a .repartition. Thanks Best Regards On Fri, Sep 25, 2015 at 1:44 PM, N B wrote: > Hello all, > > I have a Spark streaming applica

Re: How to set spark envoirnment variable SPARK_LOCAL_IP in conf/spark-env.sh

2015-09-25 Thread Zhiliang Zhu
On Friday, September 25, 2015 7:46 PM, Zhiliang Zhu wrote: Hi all, The spark job will run on yarn. While I do not set SPARK_LOCAL_IP at all, or just set it as export SPARK_LOCAL_IP=localhost    #or set as the specific node ip on the specific spark install directory It will work well

Re: Using Spark for portfolio manager app

2015-09-25 Thread Adrian Tanase
Just use the official connector from DataStax https://github.com/datastax/spark-cassandra-connector Your solution is very similar. Let’s assume the state is case class UserState(amount: Int, updates: Seq[Int]) And your user has 100 - If your user does not see an update, you can emit Some(UserS
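
A hedged sketch of the state-update pattern described above, using updateStateByKey; UserState mirrors the case class from the message, while the stream shape and starting amount of 100 are taken as illustrative assumptions:

    import org.apache.spark.streaming.dstream.DStream

    case class UserState(amount: Int, updates: Seq[Int])

    // (userId, delta) pairs arriving in each batch; requires ssc.checkpoint(...) to be set.
    val updatesByUser: DStream[(String, Int)] = ???

    val state: DStream[(String, UserState)] =
      updatesByUser.updateStateByKey[UserState] {
        (deltas: Seq[Int], prev: Option[UserState]) =>
          val current = prev.getOrElse(UserState(100, Seq.empty))   // 100 as in the example
          if (deltas.isEmpty) Some(current)                         // no update: carry state over
          else Some(UserState(current.amount + deltas.sum, deltas))
      }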

Spark task error

2015-09-25 Thread madhvi.gupta
Hi, My configurations are as follows: SPARK_EXECUTOR_INSTANCES=4 SPARK_EXECUTOR_MEMORY=1G But on my spark UI it shows: Alive Workers: 1, Cores in use: 4 Total, 0 Used, Memory in use: 6.7 GB Total, 0.0 B Used. Also while running a program in java for spark I am getting the following er

Re: executor-cores setting does not work under Yarn

2015-09-25 Thread Akhil Das
Which version of spark are you having? Can you also check whats set in your conf/spark-defaults.conf file? Thanks Best Regards On Fri, Sep 25, 2015 at 1:58 AM, Gavin Yue wrote: > Running Spark app over Yarn 2.7 > > Here is my sparksubmit setting: > --master yarn-cluster \ > --num-executors 100

How to set spark envoirnment variable SPARK_LOCAL_IP in conf/spark-env.sh

2015-09-25 Thread Zhiliang Zhu
Hi all, The spark job will run on yarn. While I do not set SPARK_LOCAL_IP at all, or just set it as export SPARK_LOCAL_IP=localhost    #or set as the specific node ip on the specific spark install directory It will work well to submit spark job on master node of cluster, however, it will fail by way

Unreachable dead objects permanently retained on heap

2015-09-25 Thread James Aley
Hi, We have an application that submits several thousands jobs within the same SparkContext, using a thread pool to run about 50 in parallel. We're running on YARN using Spark 1.4.1 and seeing a problem where our driver is killed by YARN due to running beyond physical memory limits (no Java OOM st

Troubles interacting with different version of Hive metastore

2015-09-25 Thread Ferran Galí
Hello, I'm trying to start the SparkSQL thriftserver over YARN, connecting it to the 1.1.0-cdh5.4.3 hive metastore that we already have in production. I downloaded the latest version of Spark (1.5.0), I just followed the instructions from the documentation

Re: how to submit the spark job outside the cluster

2015-09-25 Thread Zhiliang Zhu
It seems that is due to the spark SPARK_LOCAL_IP setting. export SPARK_LOCAL_IP=localhost will not work. Then, how should it be set? Thank you all~~ On Friday, September 25, 2015 5:57 PM, Zhiliang Zhu wrote: Hi Steve, Thanks a lot for your reply. That is, some commands could work on

Re: Why Checkpoint is throwing "actor.OneForOneStrategy: NullPointerException"

2015-09-25 Thread Uthayan Suthakar
Thank you Tathagata and Therry for your response. You guys were absolutely correct that I created a dummy Dstream (to prevent the Flume channel filling up) and counted the messages but I didn't output (print) them, hence why it reported that error. Since I called print(), the error is no longer being

Re: Checkpoint files are saved before stream is saved to file (rdd.toDF().write ...)?

2015-09-25 Thread Petr Novak
Many thanks Cody, it explains quite a bit. I had couple of problems with checkpointing and graceful shutdown moving from working code in Spark 1.3.0 to 1.5.0. Having InterruptedExceptions, KafkaDirectStream couldn't initialize, some exceptions regarding WAL even I'm using direct stream. Meanwhile

Re: how to submit the spark job outside the cluster

2015-09-25 Thread Zhiliang Zhu
Hi Steve, Thanks a lot for your reply. That is, some commands could work on the remote server with the gateway installed, but some other commands will not work. As expected, the remote machine is not in the same local area network as the cluster, and the cluster's port is forbidden. While I make the remote machi

Re: how to submit the spark job outside the cluster

2015-09-25 Thread Steve Loughran
On 25 Sep 2015, at 05:25, Zhiliang Zhu wrote: However, I could just use "hadoop fs -ls/-mkdir/-rm XXX" commands to operate at the remote machine with the gateway, which means the namenode is reachable; all those commands only need to interact with it. But c

Re: Exception on save s3n file (1.4.1, hadoop 2.6)

2015-09-25 Thread Steve Loughran
On 25 Sep 2015, at 03:35, Zhang, Jingyu wrote: I got the following exception when I run JavaPairRDD.values().saveAsTextFile("s3n://bucket"); Can anyone help me out? thanks 15/09/25 12:24:32 INFO SparkContext: Successfully stopped SparkContext Exception in threa

Re: Using Spark for portfolio manager app

2015-09-25 Thread Adrian Tanase
Re: DB I strongly encourage you to look at Cassandra – it’s almost as powerful as Hbase, a lot easier to setup and manage. Well suited for this type of usecase, with a combination of K/V store and time series data. For the second question, I’ve used this pattern all the time for “flash messages
