Join two dataframe - Timeout after 5 minutes

2015-09-24 Thread Eyad Sibai
I am trying to join two tables using DataFrames with Python 3.4 and I am getting the following error. I ran it on my localhost machine with 2 workers, Spark 1.5. I always get a timeout if the job takes more than 5 minutes. at

GroupBy Java objects in Java Spark

2015-09-24 Thread Ramkumar V
Hi, I want to know whether grouping by Java class objects is possible or not in Java Spark. I have Tuple2<JavaObject, JavaObject>. I want to groupByKey and then do some operations on the values after grouping. *Thanks*,

Re: JdbcRDD Constructor

2015-09-24 Thread satish chandra j
Hi Deenar, Please find the SQL query below: var SQL_RDD= new JdbcRDD( sc, ()=> DriverManager.getConnection(url,user,pass),"select col1, col2, col3..col 37 from schema.Table LIMIT ? OFFSET ?",100,0,*1*,(r: ResultSet) => (r.getInt("col1"),r.getInt("col2")...r.getInt("col37"))) When I

Re: Custom Hadoop InputSplit, Spark partitions, spark executors/task and Yarn containers

2015-09-24 Thread Adrian Tanase
RE: # because I already have a bunch of InputSplits, do I still need to specify the number of executors to get processing parallelized? I would say it’s best practice to have as many executors as data nodes and as many cores as you can get from the cluster – if YARN has enough resources it

Re: Scala Limitation - Case Class definition with more than 22 arguments

2015-09-24 Thread satish chandra j
Hi All, As it is for SQL purposes I understand I need to go ahead with the custom case class approach. Could anybody share sample code for creating a custom case class to refer to? That would be really helpful. Regards, Satish Chandra On Thu, Sep 24, 2015 at 2:51 PM, Adrian Tanase wrote:
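
As a rough illustration of the nested-case-class approach (the field names below are hypothetical, not from the original query), the wide row can be split into smaller case classes that each stay under the 22-field limit and are then composed:

    // Sketch only: group the ~37 columns into sub-classes of fewer than 22 fields each.
    case class KeyCols(col1: Int, col2: Int, col3: Int)        // ... more key columns
    case class MeasureCols(col20: Int, col21: Int, col22: Int) // ... more measure columns

    case class WideRow(keys: KeyCols, measures: MeasureCols)

    // Example construction from a flat source such as a JDBC ResultSet:
    val row = WideRow(KeyCols(1, 2, 3), MeasureCols(20, 21, 22))

When such nested case classes are converted to a DataFrame, the sub-classes show up as struct columns, which is usually fine for SQL purposes.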

Re: Custom Hadoop InputSplit, Spark partitions, spark executors/task and Yarn containers

2015-09-24 Thread Sabarish Sasidharan
A little caution is needed, as one executor per node may not always be ideal, especially when your nodes have lots of RAM. But yes, using fewer executors has benefits like more efficient broadcasts. Regards Sab On 24-Sep-2015 2:57 pm, "Adrian Tanase" wrote: > RE: # because

Exception during SaveAstextFile Stage

2015-09-24 Thread Chirag Dewan
Hi, I have 2 stages in my job map and save as text file. During the save text file stage I am getting an exception : 15/09/24 15:38:16 WARN AkkaUtils: Error sending message in 1 attempts java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] at

Re: Scala Limitation - Case Class definition with more than 22 arguments

2015-09-24 Thread satish chandra j
Hi All, In addition to the case class limitation in Scala, I am finding a Tuple limitation too; please find the explanation below //Query to pull data from Source Table var SQL_RDD= new JdbcRDD( sc, ()=> DriverManager.getConnection(url,user,pass),"select col1, col2, col3..col 37 from schema.Table

Re: GroupBy Java objects in Java Spark

2015-09-24 Thread Sabarish Sasidharan
If by Java class objects you mean your custom Java objects, then yes, of course that will work. Regards Sab On 24-Sep-2015 3:36 pm, "Ramkumar V" wrote: > Hi, > > I want to know whether grouping by java class objects is possible or not > in java Spark. > > I have Tuple2<
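
The main requirement is that the key class has consistent equals/hashCode (and is serializable). A minimal Scala sketch with a hypothetical key type and an existing SparkContext sc (in Java you would implement equals/hashCode on the key class yourself):

    // Case classes get equals/hashCode for free, so they work directly as grouping keys.
    case class UserKey(id: Long, country: String)

    val pairs = sc.parallelize(Seq(
      (UserKey(1L, "US"), 10), (UserKey(1L, "US"), 5), (UserKey(2L, "DE"), 7)))

    val grouped = pairs.groupByKey()  // UserKey(1,US) -> [10, 5], UserKey(2,DE) -> [7]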

Legacy Python code

2015-09-24 Thread Joshua Fox
I managed to put together a first Spark application on top of my existing codebase. But I am still puzzled by the best way to deploy legacy Python code. Can't I just put my codebase in some directory on the slave machines? Existing solutions: 1. Rewrite everything in terms of Spark

Re: Re: How to fix some WARN when submit job on spark 1.5 YARN

2015-09-24 Thread r7raul1...@163.com
Thank you r7raul1...@163.com From: Sean Owen Date: 2015-09-24 16:18 To: r7raul1...@163.com CC: user Subject: Re: How to fix some WARN when submit job on spark 1.5 YARN You can ignore all of these. Various libraries can take advantage of native acceleration if libs are available but it's no

Not fetching all records from Cassandra DB

2015-09-24 Thread satish chandra j
Hi All, Not sure why all records are not retrieved from Cassandra even though there are no conditions applied in the SQL query executed on the Cassandra SQL Context in Spark 1.2.2. Note: It's a simple lookup table which has only 10 to 15 records. Please let me know if any inputs on the

Re: No space left on device when running graphx job

2015-09-24 Thread Ted Yu
Andy: Can you show the complete stack trace? Have you checked there are enough free inodes on the .129 machine? Cheers > On Sep 23, 2015, at 11:43 PM, Andy Huang wrote: > > Hi Jack, > > Are you writing out to disk? Or it sounds like Spark is spilling to disk (RAM >

Re: spark + parquet + schema name and metadata

2015-09-24 Thread Borisa Zivkovic
Hi, your suggestion works nicely. I was able to attach metadata to columns and read that metadata from Spark by using ParquetFileReader. It would be nice if we had a way to manipulate Parquet metadata directly from DataFrames though. regards On Wed, 23 Sep 2015 at 09:25 Borisa Zivkovic

Re: Spark on YARN / aws - executor lost on node restart

2015-09-24 Thread Adrian Tanase
Closing the loop, I’ve submitted this issue – TD, cc-ing you since it’s spark streaming, not sure who oversees the Yarn module. https://issues.apache.org/jira/browse/SPARK-10792 -adrian From: Adrian Tanase Date: Friday, September 18, 2015 at 6:18 PM To:

Re: Querying on multiple Hive stores using Apache Spark

2015-09-24 Thread Karthik
Any ideas or suggestions? Thanks, Karthik.

No space left on device when running graphx job

2015-09-24 Thread Jack Yang
Hi folk, I have an issue with GraphX. (spark: 1.4.0 + 4 machines + 4G memory + 4 CPU cores) Basically, I load data using the GraphLoader.edgeListFile method and then count the number of nodes using the graph.vertices.count() method. The problem is: Lost task 11972.0 in stage 6.0 (TID 54585, 192.168.70.129):

Re: Join two dataframe - Timeout after 5 minutes

2015-09-24 Thread Shixiong Zhu
You can change "spark.sql.broadcastTimeout" to increase the timeout. The default value is 300 seconds. Best Regards, Shixiong Zhu 2015-09-24 15:16 GMT+08:00 Eyad Sibai : > I am trying to join two tables using dataframes using python 3.4 and I am > getting the following

Re: Hbase Spark streaming issue.

2015-09-24 Thread Shixiong Zhu
Looks like you have an incompatible hbase-default.xml in some place. You can use the following code to find the location of "hbase-default.xml" println(Thread.currentThread().getContextClassLoader().getResource("hbase-default.xml")) Best Regards, Shixiong Zhu 2015-09-21 15:46 GMT+08:00 Siva

Re: How to make Group By/reduceByKey more efficient?

2015-09-24 Thread Adrian Tanase
All the *ByKey aggregations perform an efficient shuffle and preserve partitioning on the output. If all you need is to call reduceByKey, then don’t bother with groupBy. You should use groupBy if you really need all the datapoints from a key for a very custom operation. From the docs: Note:
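
A quick sketch of the distinction, assuming a hypothetical pairs RDD of (String, Int):

    // reduceByKey combines values map-side before the shuffle -- prefer it for aggregates.
    val sums = pairs.reduceByKey(_ + _)

    // groupByKey ships every datapoint for a key across the network;
    // use it only when a custom operation really needs all the values together.
    val custom = pairs.groupByKey().mapValues(values => values.toList.sorted.take(10))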

Re: Spark 1.5.0 on YARN dynamicAllocation - Initial job has not accepted any resources

2015-09-24 Thread Jonathan Kelly
I cut https://issues.apache.org/jira/browse/SPARK-10790 for this issue. On Wed, Sep 23, 2015 at 8:38 PM, Jonathan Kelly wrote: > AHA! I figured it out, but it required some tedious remote debugging of > the Spark ApplicationMaster. (But now I understand the Spark

Re: How to fix some WARN when submit job on spark 1.5 YARN

2015-09-24 Thread Sean Owen
You can ignore all of these. Various libraries can take advantage of native acceleration if libs are available but it's no problem if they don't. On Thu, Sep 24, 2015 at 3:25 AM, r7raul1...@163.com wrote: > 1 WARN netlib.BLAS: Failed to load implementation from: >

Fwd: Spark streaming DStream state on worker

2015-09-24 Thread Shixiong Zhu
+user, -dev It's not clear about `compute` in your question. There are two `compute` here. 1. DStream.compute: it always runs in the driver, and all RDDs are created in the driver. E.g., DStream.foreachRDD(rdd => rdd.count()) "rdd.count()" is called in the driver. 2. RDD.compute: this will
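
A small sketch of the distinction, using a hypothetical dstream and process function:

    dstream.foreachRDD { rdd =>
      val n = rdd.count()            // this line runs on the driver and triggers a distributed job
      println(s"batch size = $n")    // driver
      rdd.foreachPartition { iter => // the body here runs on the executors (the RDD compute path)
        iter.foreach(record => process(record))
      }
    }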

Re: No space left on device when running graphx job

2015-09-24 Thread Andy Huang
Hi Jack, Are you writing out to disk? Or it sounds like Spark is spilling to disk (RAM filled up) and it's running out of disk space. Cheers Andy On Thu, Sep 24, 2015 at 4:29 PM, Jack Yang wrote: > Hi folk, > > > > I have an issue of graphx. (spark: 1.4.0 + 4 machines + 4G

Re: Creating BlockMatrix with java API

2015-09-24 Thread Sabarish Sasidharan
What I meant is that something like this would work. Yes, it's less than elegant but it works. List<Tuple2<Tuple2<Object, Object>, Matrix>> blocks = new ArrayList<Tuple2<Tuple2<Object, Object>, Matrix>>(); blocks.add( new Tuple2<Tuple2<Object, Object>, Matrix>( new Tuple2<Object, Object>(0, 0),

Re: JobScheduler: Error generating jobs for time for custom InputDStream

2015-09-24 Thread Shixiong Zhu
Looks like you return a "Some(null)" in "compute". If you don't want to create an RDD, it should return None. If you want to return an empty RDD, it should return "Some(sc.emptyRDD)". Best Regards, Shixiong Zhu 2015-09-15 2:51 GMT+08:00 Juan Rodríguez Hortalá < juan.rodriguez.hort...@gmail.com>:
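
A sketch of what a well-behaved compute might look like for a custom InputDStream; fetchBatch and the String element type are placeholders, not from the original code:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.{StreamingContext, Time}
    import org.apache.spark.streaming.dstream.InputDStream

    class MyInputDStream(_ssc: StreamingContext) extends InputDStream[String](_ssc) {
      override def start(): Unit = {}
      override def stop(): Unit = {}

      override def compute(validTime: Time): Option[RDD[String]] = {
        val records = fetchBatch(validTime)                  // hypothetical fetch
        if (records == null) None                            // no RDD for this interval
        else if (records.isEmpty) Some(_ssc.sparkContext.emptyRDD[String])
        else Some(_ssc.sparkContext.parallelize(records))
      }

      private def fetchBatch(time: Time): Seq[String] = Seq.empty  // placeholder
    }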

Re: Scala Limitation - Case Class definition with more than 22 arguments

2015-09-24 Thread Adrian Tanase
+1 on grouping the case classes and creating a hierarchy – as long as you use the data programmatically. For DataFrames / SQL the other ideas probably scale better… From: Ted Yu Date: Wednesday, September 23, 2015 at 7:07 AM To: satish chandra j Cc: user Subject: Re: Scala Limitation - Case

kafka direct streaming with checkpointing

2015-09-24 Thread Radu Brumariu
Hi, in my application I use Kafka direct streaming and I have also enabled checkpointing. This seems to work fine if the application is restarted. However if I change the code and resubmit the application, it cannot start because of the checkpointed data being of different class versions. Is there

Re: Java Heap Space Error

2015-09-24 Thread Yusuf Can Gürkan
@Jingyu Yes, it works without regex and concatenation, as in the query below. So, what can we understand from this? When I do it like that, shuffle read sizes are equally distributed between partitions. val usersInputDF = sqlContext.sql( s""" | select userid from landing where

RE: Java Heap Space Error

2015-09-24 Thread java8964
This is interesting. So you mean that query as "select userid from landing where dt='2015-9' and userid != '' and userid is not null and userid is not NULL and pagetype = 'productDetail' group by userid" works in your cluster? In this case, do you also see this one task with way more data than

Re: reduceByKeyAndWindow confusion

2015-09-24 Thread Adrian Tanase
Let me take a stab at your questions – can you clarify some of the points below? I’m wondering if you’re using the streaming concepts as they were intended… 1. Windowed operations First, I just want to confirm that it is your intention to split the original kafka stream into multiple Dstreams

Re: kafka direct streaming with checkpointing

2015-09-24 Thread Cody Koeninger
No, you can't use checkpointing across code changes. Either store offsets yourself, or start up your new app code and let it catch up before killing the old one. On Thu, Sep 24, 2015 at 8:40 AM, Radu Brumariu wrote: > Hi, > in my application I use Kafka direct streaming and I
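
A sketch of the "store offsets yourself" option, with directStream and saveOffsets as hypothetical placeholders:

    import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

    directStream.foreachRDD { rdd =>
      // Offsets for this batch, exposed on the RDDs produced by the direct stream.
      val offsets: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

      // ... process rdd ...

      saveOffsets(offsets)  // e.g. write topic/partition/untilOffset to a database
    }

On restart, the saved offsets can be passed to the createDirectStream overload that accepts a fromOffsets map, so the new code picks up where the old one left off.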

Re: Spark ClosureCleaner or java serializer OOM when trying to grow

2015-09-24 Thread Ted Yu
Please decrease spark.serializer.objectStreamReset for your queries. The default value is 100. I logged SPARK-10787 for improvement. Cheers On Wed, Sep 23, 2015 at 6:59 PM, jluan wrote: > I have been stuck on this problem for the last few days: > > I am attempting to run
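
For reference, a minimal sketch of how the setting can be applied (the value is illustrative):

    import org.apache.spark.SparkConf

    // Make the Java serializer reset its object cache more often than the default of 100.
    val conf = new SparkConf().set("spark.serializer.objectStreamReset", "10")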

Networking issues with Spark on EC2

2015-09-24 Thread SURAJ SHETH
Hi, I am using Spark 1.2 and facing network related issues while performing simple computations. This is a custom cluster set up using EC2 machines and the Spark prebuilt binary from the Apache site. The problem occurs only when we have workers on other machines (networking involved). Having a single node

Re: Long running Spark Streaming Job increasing executing time per batch

2015-09-24 Thread Jeremy Smith
I found a similar issue happens when there is a memory leak in the spark application (or, in my case, one of the libraries that's used in the spark application). Gradually, unclaimed objects make their way into old or permanent generation space, reducing the available heap. It causes GC overhead

Reading avro data using KafkaUtils.createDirectStream

2015-09-24 Thread Daniel Haviv
Hi, I'm trying to use KafkaUtils.createDirectStream to read avro messages from Kafka but something is off with my type arguments: val avroStream = KafkaUtils.createDirectStream[AvroKey[GenericRecord], GenericRecord, NullWritable, AvroInputFormat[GenericRecord]](ssc, kafkaParams, topicSet) I'm

Re: Reading avro data using KafkaUtils.createDirectStream

2015-09-24 Thread Cody Koeninger
Your third and fourth type parameters need to be subclasses of kafka.serializer.Decoder On Thu, Sep 24, 2015 at 10:30 AM, Daniel Haviv < daniel.ha...@veracity-group.com> wrote: > Hi, > I'm trying to use KafkaUtils.createDirectStream to read avro messages from > Kafka but something is off with my
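
A sketch of one way this is often wired up, reading raw bytes with Kafka's DefaultDecoder and decoding Avro afterwards; decodeAvro, ssc, kafkaParams and topicSet are placeholders:

    import kafka.serializer.DefaultDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    // The 3rd/4th type parameters must be kafka.serializer.Decoder subclasses,
    // not Hadoop input formats.
    val byteStream = KafkaUtils.createDirectStream[
      Array[Byte], Array[Byte], DefaultDecoder, DefaultDecoder](ssc, kafkaParams, topicSet)

    val records = byteStream.map { case (_, bytes) => decodeAvro(bytes) }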

Re: Scala api end points

2015-09-24 Thread Ndjido Ardo BAR
Hi Masoom Alam, I successfully experimented with the following project on GitHub: https://github.com/erisa85/WikiSparkJobServer . I do recommend it to you. cheers, Ardo. On Thu, Sep 24, 2015 at 5:20 PM, masoom alam wrote: > Hi everyone > > I am new to Scala. I have a

Re: Java Heap Space Error

2015-09-24 Thread Yusuf Can Gürkan
Yes right, the query you wrote worked in the same cluster. In this case, partitions were equally distributed, but when I used regex and concatenations they were not, as I said before. The query with concatenation is below: val usersInputDF = sqlContext.sql( s""" | select userid,concat_ws('

Re: Remove duplicate keys by always choosing first in file.

2015-09-24 Thread Philip Weaver
Oops, I didn't catch the suggestion to just use RDD.zipWithIndex, which I forgot existed (and I've discovered I actually used it in another project!). I will use that instead of the mapPartitionsWithIndex/zipWithIndex solution that I posted originally. On Tue, Sep 22, 2015 at 9:07 AM, Philip Weaver
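
A sketch of that approach, assuming a hypothetical records RDD of (key, value) pairs read in file order:

    val firstPerKey = records
      .zipWithIndex()                                    // ((key, value), position)
      .map { case ((k, v), idx) => (k, (idx, v)) }
      .reduceByKey((a, b) => if (a._1 < b._1) a else b)  // keep the earliest position per key
      .mapValues(_._2)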

Scala api end points

2015-09-24 Thread masoom alam
Hi everyone, I am new to Scala. I have written an application using Scala in Spark. Now we want to interface it through REST API endpoints. What is the best choice? Please share your experiences. Thanks

Re: spark + parquet + schema name and metadata

2015-09-24 Thread Cheng Lian
Thanks for the feedback, just filed https://issues.apache.org/jira/browse/SPARK-10803 to track this issue. Cheng On 9/24/15 4:25 AM, Borisa Zivkovic wrote: Hi, your suggestion works nicely.. I was able to attach metadata to columns and read that metadata from spark and by using

NegativeArraySizeException on Spark SQL window function

2015-09-24 Thread Bae, Jae Hyeon
Hi Spark users Can somebody explain about the following WARN with exception? I am running Spark 1.5.0 and the job was successful but I am wondering whether it's totally OK to keep using Spark SQL window function 15/09/24 06:31:49 WARN TaskSetManager: Lost task 17.0 in stage 4.0 (TID 18907,

Re: SparkContext declared as object variable

2015-09-24 Thread Priya Ch
object StreamJob { val conf = new SparkConf val sc = new SparkContext(conf) def main(args:Array[String]) { val baseRDD = sc.parallelize(Array("hi","hai","hi","bye","bye","hi","hai","hi","bye","bye")) val words = baseRDD.flatMap(line => line.split(",")) val wordPairs =

reduceByKey inside updateStateByKey in Spark Streaming???

2015-09-24 Thread swetha
Hi, How do I use reduceByKey inside updateStateByKey? Suppose I have a bunch of keys for which I need to do a sum and average inside updateStateByKey by joining with the old state. How do I accomplish that? Thanks, Swetha

Re: JdbcRDD Constructor

2015-09-24 Thread Deenar Toraskar
On 24 September 2015 at 17:48, Deenar Toraskar < deenar.toras...@thinkreactive.co.uk> wrote: > you are interpreting the JDBCRDD API incorrectly. If you want to use > partitions, then the column used to partition and present in the where > clause must be numeric and the lower bound and upper bound
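
A sketch of that usage, with placeholder connection details, table and bounds; the partition column (here id) must be numeric and appear in the WHERE clause with the two ? placeholders:

    import java.sql.{DriverManager, ResultSet}
    import org.apache.spark.rdd.JdbcRDD

    val rows = new JdbcRDD(
      sc,
      () => DriverManager.getConnection(url, user, pass),
      "select col1, col2 from schema.Table where id >= ? and id <= ?",
      1,        // lower bound for id
      100000,   // upper bound for id
      10,       // number of partitions
      (r: ResultSet) => (r.getInt("col1"), r.getInt("col2")))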

why more than one job in a batch in spark streaming ?

2015-09-24 Thread Shenghua(Daniel) Wan
Hi, I noticed that in my streaming application reading from Kafka using multiple receivers, there are 3 jobs in one batch (via the web UI). According to the DAG there are two stages; job 0 executes both stages, but job 1 and job 2 only execute stage 2. There is a disconnect between my understanding

Re: Java Heap Space Error

2015-09-24 Thread Yusuf Can Gürkan
Thank you very much. This makes sense. I will write back after trying your solution. > On 24 Sep 2015, at 22:43, java8964 wrote: > > I can understand why your first query will finish without OOM, but the new > one will fail with OOM. > > In the new query, you are asking a

Re: Querying on multiple Hive stores using Apache Spark

2015-09-24 Thread Michael Armbrust
This is not supported yet, though, we laid a lot of the ground work for doing this in Spark 1.4. On Wed, Sep 23, 2015 at 11:17 PM, Karthik wrote: > Any ideas or suggestions? > > Thanks, > Karthik. > > > > -- > View this message in context: >

Re: why more than one job in a batch in spark streaming ?

2015-09-24 Thread Tathagata Das
Are you using DStream.print()? Or something that boils down to RDD.take()? That can lead to an unpredictable number of jobs. There are other cases as well, but this one is common. On Thu, Sep 24, 2015 at 12:04 PM, Shenghua(Daniel) Wan < wansheng...@gmail.com> wrote: > Hi, > I noticed that in my

Re: kafka direct streaming with checkpointing

2015-09-24 Thread Radu Brumariu
It seems to me that this scenario that I'm facing, is quite common for spark jobs using Kafka. Is there a ticket to add this sort of semantics to checkpointing ? Does it even make sense to add it there ? Thanks, Radu On Thursday, September 24, 2015, Cody Koeninger wrote: >

Re: Potential racing condition in DAGScheduler when Spark 1.5 caching

2015-09-24 Thread Mark Hamstra
Where do you see a race in the DAGScheduler? On a quick look at your stack trace, this just looks to me like a Job where a Stage failed and then the DAGScheduler aborted the failed Job. On Thu, Sep 24, 2015 at 12:00 PM, robin_up wrote: > Hi > > After upgrade to 1.5, we

Re: kafka direct streaming with checkpointing

2015-09-24 Thread Cody Koeninger
This has been discussed numerous times, TD's response has consistently been that it's unlikely to be possible On Thu, Sep 24, 2015 at 12:26 PM, Radu Brumariu wrote: > It seems to me that this scenario that I'm facing, is quite common for > spark jobs using Kafka. > Is there a

Large number of conf broadcasts

2015-09-24 Thread Anders Arpteg
Hi, Running Spark 1.5.0 in yarn-client mode, and I am curious why there are so many broadcasts being done when loading datasets with a large number of partitions/files. I have datasets with thousands of partitions, i.e. HDFS files in the Avro folder, and sometimes loading hundreds of these large

Re: Potential racing condition in DAGScheduler when Spark 1.5 caching

2015-09-24 Thread Josh Rosen
I believe that this is an instance of https://issues.apache.org/jira/browse/SPARK-10422, which should be fixed in upcoming 1.5.1 release. On Thu, Sep 24, 2015 at 12:52 PM, Mark Hamstra wrote: > Where do you see a race in the DAGScheduler? On a quick look at your >

Re: reduceByKey inside updateStateByKey in Spark Streaming???

2015-09-24 Thread Adrian Tanase
The two operations can't be used inside one another. If you need something like an all-time average then you need to keep a tuple (sum, count) to which you add all the new values that come in every batch. The average is then just a map on the state DStream. Does that make sense? Have I guessed your use
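
A sketch of the (sum, count) state idea, assuming a hypothetical pairs DStream of (String, Double):

    val state = pairs.updateStateByKey[(Double, Long)] {
      (newValues: Seq[Double], prev: Option[(Double, Long)]) =>
        val (sum, count) = prev.getOrElse((0.0, 0L))
        Some((sum + newValues.sum, count + newValues.size))
    }

    // The all-time average is then just a map over the state.
    val averages = state.mapValues { case (sum, count) => sum / count }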

RE: Join over many small files

2015-09-24 Thread Tracewski, Lukasz
Thanks for the answer! Why sequence files though, why not work directly on RDDs? My input files are CSVs and often contain some garbage both at the beginning and end of a file. Mind that I am working in Python; I am not sure if it will be as efficient as intended. Any examples in PySpark will

Re: ERROR BoundedByteBufferReceive: OOME with size 352518400

2015-09-24 Thread Sourabh Chandak
Here is the code snippet, starting line 365 in KafkaCluster.scala: type Err = ArrayBuffer[Throwable] /** If the result is right, return it, otherwise throw SparkException */ def checkErrors[T](result: Either[Err, T]): T = { result.fold( errs => throw new

Re: Reading Hive Tables using SQLContext

2015-09-24 Thread Michael Armbrust
No, you have to use a HiveContext. On Thu, Sep 24, 2015 at 2:47 PM, Sathish Kumaran Vairavelu < vsathishkuma...@gmail.com> wrote: > Hello, > > Is it possible to access Hive tables directly from SQLContext instead of > HiveContext? I am facing with errors while doing it. > > Please let me know >

Re: Networking issues with Spark on EC2

2015-09-24 Thread Ankur Srivastava
Hi Suraj, Spark uses a lot of ports to communicate between nodes. Probably your security group is restrictive and does not allow instances to communicate on all networks. The easiest way to resolve it is to add a Rule to allow all Inbound traffic on all ports (0-65535) to instances in same

Reasonable performance numbers?

2015-09-24 Thread Young, Matthew T
Hello, I am doing performance testing with Spark Streaming. I want to know if the throughput numbers I am encountering are reasonable for the power of my cluster and Spark's performance characteristics. My job has the following processing steps: 1. Read 600 Byte JSON strings from a 7

Re: Potential racing condition in DAGScheduler when Spark 1.5 caching

2015-09-24 Thread Robin Li
Josh, I looked closer and I think you are correct, it is not a race condition. This only shows up when persisting strings; other data formats look fine. Also, when we reverted to 1.4 the issue was gone. Thanks On Thursday, 24 September 2015, Josh Rosen wrote: > I believe that this is an

Reading Hive Tables using SQLContext

2015-09-24 Thread Sathish Kumaran Vairavelu
Hello, Is it possible to access Hive tables directly from SQLContext instead of HiveContext? I am facing with errors while doing it. Please let me know Thanks Sathish

Re: kafka direct streaming with checkpointing

2015-09-24 Thread Radu Brumariu
Would changing the direct stream API to support committing the offsets to Kafka's ZK (like a regular consumer) as a fallback mechanism, in case recovering from checkpoint fails, be an accepted solution? On Thursday, September 24, 2015, Cody Koeninger wrote: > This has been

executor-cores setting does not work under Yarn

2015-09-24 Thread Gavin Yue
Running Spark app over Yarn 2.7 Here is my sparksubmit setting: --master yarn-cluster \ --num-executors 100 \ --executor-cores 3 \ --executor-memory 20g \ --driver-memory 20g \ --driver-cores 2 \ But the executor cores setting is not working. It always assigns only one vcore to one

Re: ERROR BoundedByteBufferReceive: OOME with size 352518400

2015-09-24 Thread Cody Koeninger
That looks like the OOM is in the driver, when getting partition metadata to create the direct stream. In that case, executor memory allocation doesn't matter. Allocate more driver memory, or put a profiler on it to see what's taking up heap. On Thu, Sep 24, 2015 at 3:51 PM, Sourabh Chandak

Re: ERROR BoundedByteBufferReceive: OOME with size 352518400

2015-09-24 Thread Sourabh Chandak
I was able to get past this issue. I was pointing to the SSL port whereas SimpleConsumer should point to the PLAINTEXT port. But after fixing that I am getting the following error: Exception in thread "main" org.apache.spark.SparkException: java.nio.BufferUnderflowException at

Why Checkpoint is throwing "actor.OneForOneStrategy: NullPointerException"

2015-09-24 Thread Uthayan Suthakar
Hello all, My streaming job is throwing the exception below at every interval. It first deletes the checkpoint file and then tries to checkpoint; is this normal behaviour? I'm using Spark 1.3.0. Do you know what may cause this issue? 15/09/24 16:35:55 INFO scheduler.TaskSetManager:

Re: Why Checkpoint is throwing "actor.OneForOneStrategy: NullPointerException"

2015-09-24 Thread Tathagata Das
Are you by any chance setting DStream.remember() with null? On Thu, Sep 24, 2015 at 5:02 PM, Uthayan Suthakar < uthayan.sutha...@gmail.com> wrote: > Hello all, > > My Stream job is throwing below exception at every interval. It is first > deleting the the checkpoint file and then it's trying to

Re: Unable to start spark-shell on YARN

2015-09-24 Thread Doug Balog
The error is because the shell is trying to resolve hdp.version and can’t. To fix this, you need to put a file called java-opts in your conf directory that has something like this. -Dhdp.version=2.x.x.x Where 2.x.x.x is the version of HDP that you are using. Cheers, Doug > On Sep 24, 2015,

Stop a Dstream computation

2015-09-24 Thread Samya
Hi Team, I have a code piece as follows. try{ someDstream.someaction(...) //Step1 }catch{ case ex:Exception =>{ someDstream.someaction(...) //Step2 } } When I get an exception for current batch, Step2 executes as

RE: No space left on device when running graphx job

2015-09-24 Thread Jack Yang
Hi all, I resolved the problems. Thanks folk. Jack From: Jack Yang [mailto:j...@uow.edu.au] Sent: Friday, 25 September 2015 9:57 AM To: Ted Yu; Andy Huang Cc: user@spark.apache.org Subject: RE: No space left on device when running graphx job Also, please see the screenshot below from spark web

Re: Unable to start spark-shell on YARN

2015-09-24 Thread ๏̯͡๏
This is resolved now On Thu, Sep 24, 2015 at 7:47 PM Doug Balog wrote: > The error is because the shell is trying to resolve hdp.version and can’t. > To fix this, you need to put a file called java-opts in your conf > directory that has something like this. > >

Re: Spark ClosureCleaner or java serializer OOM when trying to grow

2015-09-24 Thread jluan
With spark.serializer.objectStreamReset set to 1, I ran a sample scala test code which still seems to be crashing at the same place. If someone could verify this independently, I would greatly appreciate it. Scala Code: -- import

Re: spark.mesos.coarse impacts memory performance on mesos

2015-09-24 Thread Utkarsh Sengar
Bumping this one up, any suggestions on the stacktrace? spark.mesos.coarse=true is not working and the driver crashed with the error. On Wed, Sep 23, 2015 at 3:29 PM, Utkarsh Sengar wrote: > Missed to do a reply-all. > > Tim, > > spark.mesos.coarse = true doesn't work and

Using Map and Basic Operators yield java.lang.ClassCastException (Parquet + Hive + Spark SQL 1.5.0 + Thrift)

2015-09-24 Thread Dominic Ricard
Hi, I stumbled on the following today. We have Parquet files that expose a column in a Map format. This is very convenient as we have data parts that can vary in time. Not knowing what the data will be, we simply split it in tuples and insert it as a map inside 1 column. Retrieving the data is

RE: Java Heap Space Error

2015-09-24 Thread java8964
I can understand why your first query will finish without OOM, but the new one will fail with OOM. In the new query, you are asking for a groupByKey/cogroup operation, which will force all the productName + prodcutionCatagory per user id to be sent to the same reducer. This could easily blow out

Re: Reading Hive Tables using SQLContext

2015-09-24 Thread Sathish Kumaran Vairavelu
Thanks Michael. Just want to check if there is a roadmap to include Hive tables from SQLContext. -Sathish On Thu, Sep 24, 2015 at 7:46 PM Michael Armbrust wrote: > No, you have to use a HiveContext. > > On Thu, Sep 24, 2015 at 2:47 PM, Sathish Kumaran Vairavelu < >

Exception on save s3n file (1.4.1, hadoop 2.6)

2015-09-24 Thread Zhang, Jingyu
I got following exception when I run JavPairRDD.values().saveAsTextFile("s3n://bucket); Can anyone help me out? thanks 15/09/25 12:24:32 INFO SparkContext: Successfully stopped SparkContext Exception in thread "main" java.lang.NoClassDefFoundError: org/jets3t/service/ServiceException at

RE: Using Map and Basic Operators yield java.lang.ClassCastException (Parquet + Hive + Spark SQL 1.5.0 + Thrift)

2015-09-24 Thread Dominic Ricard
No, those were just examples of how maps can look. In my case, the key-value is either there or not, in the form of the latter: {"key1":{"key2":"value"}} If key1 is present, then it will contain a tuple of key2:value, value being an 'int'. I guess, after some testing, that my problem is on

Re: how to submit the spark job outside the cluster

2015-09-24 Thread Zhiliang Zhu
Hi Zhan, I have done that with your kind help. However, I could only use "hadoop fs -ls/-mkdir/-rm XXX" commands to operate at the remote machine with gateway, but the commands "hadoop fs -cat/-put XXX    YYY" would not work, with the error message as below: put: File

ERROR BoundedByteBufferReceive: OOME with size 352518400

2015-09-24 Thread Sourabh Chandak
Hi, I have ported receiver-less Spark Streaming for Kafka to Spark 1.2 and am trying to run a Spark Streaming job to consume data from my broker, but I am getting the following error: 15/09/24 20:17:45 ERROR BoundedByteBufferReceive: OOME with size 352518400 java.lang.OutOfMemoryError: Java heap

Re: ERROR BoundedByteBufferReceive: OOME with size 352518400

2015-09-24 Thread Sourabh Chandak
Adding Cody and Sriharsha On Thu, Sep 24, 2015 at 1:25 PM, Sourabh Chandak wrote: > Hi, > > I have ported receiver less spark streaming for kafka to Spark 1.2 and am > trying to run a spark streaming job to consume data form my broker, but I > am getting the following

Re: Why Checkpoint is throwing "actor.OneForOneStrategy: NullPointerException"

2015-09-24 Thread Terry Hoo
I met this before: in my program, some DStreams are not initialized since they are not in the path of output. You can check if you are in the same case. Thanks! - Terry On Fri, Sep 25, 2015 at 10:22 AM, Tathagata Das wrote: > Are you by any chance setting