Re: SparkContext and JavaSparkContext

2015-06-29 Thread Hao Ren
It seems that JavaSparkContext is just a wrapper around the Scala SparkContext; in JavaSparkContext, the Scala one is used to do all the work. If I pass the same Scala SparkContext to initialize a JavaSparkContext, I am still manipulating the same SparkContext. Sorry for spamming. Hao On Mon, Jun 29, 2015
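
That observation matches the public API: a JavaSparkContext can be constructed directly from an existing Scala SparkContext, so both handles drive the same underlying context. A minimal sketch, assuming an existing SQLContext (the helper name toJavaContext is made up for illustration):

    import org.apache.spark.SparkContext
    import org.apache.spark.api.java.JavaSparkContext
    import org.apache.spark.sql.SQLContext

    // Wrap the Scala SparkContext held by a SQLContext; nothing is copied,
    // both objects refer to the same running context.
    def toJavaContext(sqlContext: SQLContext): JavaSparkContext = {
      val sc: SparkContext = sqlContext.sparkContext
      new JavaSparkContext(sc)
    }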

RE: Accessing Kerberos Secured HDFS Resources from Spark on Mesos

2015-06-29 Thread Dave Ariens
I'd like to toss out another idea that doesn't involve a complete end-to-end Kerberos implementation. Essentially, have the driver authenticate to Kerberos, instantiate a Hadoop file system, and serialize/cache it for the executors to use instead of them having to instantiate their own. -
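
For reference, the driver-side login step described above usually boils down to a Kerberos keytab login before any HDFS access. A minimal sketch (principal, keytab path and test path are placeholders; this covers only the driver, not the executor-distribution idea from the thread):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.hadoop.security.UserGroupInformation

    // Authenticate the driver JVM to Kerberos, then open the secured file system.
    val hadoopConf = new Configuration()
    hadoopConf.set("hadoop.security.authentication", "kerberos")
    UserGroupInformation.setConfiguration(hadoopConf)
    UserGroupInformation.loginUserFromKeytab("user@EXAMPLE.COM", "/etc/security/keytabs/user.keytab")

    val fs = FileSystem.get(hadoopConf)
    fs.exists(new Path("/user/someuser"))   // simple sanity check against secured HDFS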

Re: Programming with java on spark

2015-06-29 Thread 付雅丹
Hi, Akhil. Thank you for your reply. I tried what you suggested, but I get the following error. The source code is: JavaPairRDD<LongWritable, Text> distFile = sc.hadoopFile("hdfs://cMaster:9000/wcinput/data.txt", DataInputFormat.class, LongWritable.class, Text.class); while the DataInputFormat class is

Schema for type is not supported

2015-06-29 Thread Sander van Dijk
Hey all, I try to make a DataFrame by inspection (using Spark 1.4.0), but run into a parameter of my case class not being supported. Minimal example: val sqlContext = new org.apache.spark.sql.SQLContext(sc) import sqlContext.implicits._ import com.vividsolutions.jts.geom.Coordinate case class
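
The usual cause here is that Spark 1.4's reflection-based schema inference only understands a fixed set of field types (primitives, String, Seq, Map, nested case classes, and so on), so a third-party type such as a JTS Coordinate is rejected. A common workaround, sketched below with hypothetical case classes, is to flatten the unsupported field into supported ones before calling toDF():

    import com.vividsolutions.jts.geom.Coordinate

    case class Reading(id: Long, location: Coordinate)      // unsupported field type
    case class ReadingRow(id: Long, x: Double, y: Double)   // flattened, supported shape

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._

    val readings = sc.parallelize(Seq(Reading(1L, new Coordinate(4.89, 52.37))))

    // Convert to supported types first, then infer the schema.
    val df = readings.map(r => ReadingRow(r.id, r.location.x, r.location.y)).toDF()
    df.printSchema()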

Re: load Java properties file in Spark

2015-06-29 Thread ayan guha
You may want to store the property file either locally or, if you intend to launch in yarn mode, in HDFS (as you do not know which node will become your AM). On Mon, Jun 29, 2015 at 10:51 PM, diplomatic Guru diplomaticg...@gmail.com wrote: I want to store the Spark application arguments such

RE: What does Spark is not just MapReduce mean? Isn't every Spark job a form of MapReduce?

2015-06-29 Thread prajod.vettiyattil
Hi, Any {fan-out - process in parallel - fan-in - aggregate} pattern of data flow can conceptually be Map-Reduce (MR, as it is done in Hadoop). Apart from the bigger list of map, reduce, sort, filter, pipe, join, combine, ... functions, which are many times more efficient and productive for

Re: How to recover in case user errors in streaming

2015-06-29 Thread Amit Assudani
Thanks TD, this helps. Looking forward to some fix where framework handles the batch failures by some callback methods. This will help not having to write try/catch in every transformation / action. Regards, Amit From: Tathagata Das t...@databricks.commailto:t...@databricks.com Date:

Running Spark 1.4.1 without Hadoop

2015-06-29 Thread Sourav Mazumder
Hi, I'm trying to run Spark without Hadoop, where the data would be read from and written to local disk. For this I have a few questions: 1. Which download do I need to use? In the download options I don't see any binary download which does not need Hadoop. Is the only way to do this to download the

Re: Dependency Injection with Spark Java

2015-06-29 Thread Michal Čizmazia
Currently, I am considering using Guava Suppliers for delayed initialization in workers: Supplier<T> supplier = (Serializable & Supplier<T>) () -> new T(); Supplier<T> singleton = Suppliers.memoize(supplier); On 26 June 2015 at 13:17, Igor Berman igor.ber...@gmail.com wrote: asked myself same
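
In Scala a similar per-executor, initialize-once effect is often obtained without Guava at all, by putting the resource behind a lazy val in an object: the object is referenced from the closure rather than serialized, and each executor JVM initializes it once on first use. A minimal sketch (the resource below is a stand-in; this is an alternative to, not a restatement of, the Suppliers.memoize approach above):

    // Hypothetical expensive, non-serializable resource initialized lazily per JVM.
    object HeavyResource {
      lazy val formatter: java.text.SimpleDateFormat =
        new java.text.SimpleDateFormat("yyyy-MM-dd")
    }

    // Usage inside a transformation: only the data is shipped, not the resource, e.g.
    // rdd.map(ts => HeavyResource.formatter.format(new java.util.Date(ts)))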

Re: spark streaming with kafka reset offset

2015-06-29 Thread Cody Koeninger
3. You need to use your own method, because you need to set up your job. Read the checkpoint documentation. 4. Yes, if you want to checkpoint, you need to specify a url to store the checkpoint at (s3 or hdfs). Yes, for the direct stream checkpoint it's just offsets, not all the messages. On
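
The pattern Cody points at is usually expressed with StreamingContext.getOrCreate: build the whole job inside a creation function, checkpoint to a durable URL, and let a restarted driver recover from it. A minimal sketch against the 1.x Kafka direct API (broker, topic and checkpoint path are placeholders):

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val checkpointDir = "hdfs:///tmp/streaming-checkpoint"

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("kafka-direct-demo")
      val ssc = new StreamingContext(conf, Seconds(10))
      ssc.checkpoint(checkpointDir)                       // offsets + DAG, not the messages
      val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
      val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
        ssc, kafkaParams, Set("my-topic"))
      stream.map(_._2).count().print()
      ssc
    }

    // On a clean start this calls createContext(); after a failure it rebuilds
    // the context (including offsets) from the checkpoint directory.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()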

Re: spark streaming with kafka reset offset

2015-06-29 Thread ayan guha
Hi, Let me take a shot at your questions. (I am sure people like Cody and TD will correct me if I am wrong.) 0. This is an exact copy from the similar question in a mail thread from Akhil D: Since you set local[4] you will have 4 threads for your computation, and since you are having 2 receivers, you are

Directory creation failed leads to job fail (should it?)

2015-06-29 Thread maxdml
Hi there, I have some traces from my master and some workers where for some reason, the ./work directory of an application can not be created on the workers. There is also an issue with the master's temp directory creation. master logs: http://pastebin.com/v3NCzm0u worker's logs:

load Java properties file in Spark

2015-06-29 Thread diplomatic Guru
I want to store the Spark application arguments such as input file and output file in a Java properties file and pass that file to the Spark driver. I'm using spark-submit for submitting the job but couldn't find a parameter to pass the properties file. Have you got any suggestions?
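
One simple approach (an assumption on my part, not something settled in the thread) is to pass the properties file path as an ordinary application argument and read it with java.util.Properties on the driver; in yarn-cluster mode the file itself would need to live somewhere every node can reach, as ayan notes above. Sketch, with hypothetical property keys:

    import java.io.FileInputStream
    import java.util.Properties
    import org.apache.spark.{SparkConf, SparkContext}

    object PropsDemo {
      def main(args: Array[String]): Unit = {
        // args(0) is the path to the properties file.
        val props = new Properties()
        val in = new FileInputStream(args(0))
        try props.load(in) finally in.close()

        val inputPath  = props.getProperty("input.path")    // hypothetical keys
        val outputPath = props.getProperty("output.path")

        val sc = new SparkContext(new SparkConf().setAppName("props-demo"))
        sc.textFile(inputPath).saveAsTextFile(outputPath)
        sc.stop()
      }
    }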

Re: Directory creation failed leads to job fail (should it?)

2015-06-29 Thread ayan guha
No, spark can not do that as it does not replicate partitions (so no retry on different worker). It seems your cluster is not provisioned with correct permissions. I would suggest to automate node provisioning. On Mon, Jun 29, 2015 at 11:04 PM, maxdml maxdemou...@gmail.com wrote: Hi there, I

Re: Fine control with sc.sequenceFile

2015-06-29 Thread Koert Kuipers
see also: https://github.com/apache/spark/pull/6848 On Mon, Jun 29, 2015 at 12:48 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: sc.hadoopConfiguration.set(mapreduce.input.fileinputformat.split.maxsize, 67108864) sc.sequenceFile(getMostRecentDirectory(tablePath, _.startsWith(_)).get + /*,

Re: Time series data

2015-06-29 Thread tog
Hi, Have you tested the Cloudera project https://github.com/cloudera/spark-timeseries ? Let me know how you progress on that route, as I am also interested in that topic. Cheers On 26 June 2015 at 14:07, Caio Cesar Trucolo truc...@gmail.com wrote: Hi everyone! I am working with

Re: Union of many RDDs taking a long time

2015-06-29 Thread Tomasz Fruboes
Hi Matt, is there a reason you need to call coalesce on every loop iteration? Most likely it forces Spark to do lots of unnecessary shuffles. Also, for a really large number of inputs this approach can lead to problems due to too many nested RDD.union calls. A safer approach is to call union from

Re: Scala problem when using g.vertices.map not a member of type parameter

2015-06-29 Thread Robineast
I can't see an obvious problem. Could you post the full minimal code that reproduces the problem? Also, which versions of Spark and Scala are you using? -- View this message in context:

got java.lang.reflect.UndeclaredThrowableException when running multiply APPs in spark

2015-06-29 Thread luohui20001
Hi there I am running 30 APPs in my spark cluster, and some of the APPs got exception like below:[root@slave3 0]# cat stderr 15/06/29 17:20:08 INFO executor.CoarseGrainedExecutorBackend: Registered signal handlers for [TERM, HUP, INT] 15/06/29 17:20:09 WARN util.NativeCodeLoader: Unable

Spark Streaming-Receiver drops data

2015-06-29 Thread summerdaway
Hi, I'm using spark streaming to process data. I do a simple flatMap on each record as follows package bb; import java.io.*; import java.net.*; import java.io.BufferedReader; import java.io.FileReader; import java.io.IOException; import java.net.URI; import java.util.List; import

RE: spark-submit in deployment mode with the --jars option

2015-06-29 Thread Hisham Mohamed
Hi Akhil, Thanks for your reply. Here it is Launch Command: /usr/lib/jvm/java-7-oracle/jre/bin/java -cp /etc/spark/:/opt/spark/lib/spark-assembly-1.4.0-hadoop2.3.0.jar:/opt/spark-1.4.0-bin-hadoop2.3/lib/datanucleus-rdbms-3.2.9.jar:/opt/spark-1.4.0-bin

Serialization Exception

2015-06-29 Thread Spark Enthusiast
For prototyping purposes, I created a test program injecting dependencies using Spring. Nothing fancy. This is just a re-write of KafkaDirectWordCount. When I run this, I get the following exception: Exception in thread "main" org.apache.spark.SparkException: Task not serializable     at

RE: Accessing Kerberos Secured HDFS Resources from Spark on Mesos

2015-06-29 Thread Dave Ariens
Thanks, Steve--I should have tested out this theory before spamming the list. I haven't been able to get anything working after testing this theory out. I'll hit up the Spark dev mailing list and try to garner enough interest to get some Jira's cut. I really appreciate everyone's feedback,

SPARK REMOTE DEBUG

2015-06-29 Thread Pietro Gentile
Hi all, What is the best way to remotely debug, with breakpoints, spark apps? Thanks in advance, Best regards! Pietro
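
One common way to do this (my suggestion, not something from the thread) is to start the JVMs with a JDWP agent and attach an IDE debugger. Executor-side options can be set through SparkConf before the context is created; driver-side options are normally passed via spark-submit --conf or spark-defaults.conf, since the driver JVM is already running by the time user code executes. Sketch for the executor side:

    import org.apache.spark.{SparkConf, SparkContext}

    // With suspend=y every executor blocks at startup until a debugger attaches
    // on port 5005, so use this only while actively debugging.
    val conf = new SparkConf()
      .setAppName("remote-debug-demo")
      .set("spark.executor.extraJavaOptions",
           "-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005")

    val sc = new SparkContext(conf)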

Re: Running Spark 1.4.1 without Hadoop

2015-06-29 Thread Ted Yu
Sourav: Please see https://spark.apache.org/docs/latest/spark-standalone.html Cheers On Mon, Jun 29, 2015 at 7:33 AM, ayan guha guha.a...@gmail.com wrote: Hi, You really do not need a Hadoop installation. You can download a pre-built version for any Hadoop, unzip it, and you are good to go.

Re: spark streaming with kafka reset offset

2015-06-29 Thread Shushant Arora
1. Here you are basically creating 2 receivers and asking each of them to consume 3 Kafka partitions each. - In 1.2 we have high-level consumers, so how can we restrict the number of Kafka partitions to consume from? Say I have 300 Kafka partitions in the Kafka topic and, as above, I gave 2 receivers and 3

Re: Running Spark 1.4.1 without Hadoop

2015-06-29 Thread ayan guha
Hi, You really do not need a Hadoop installation. You can download a pre-built version for any Hadoop, unzip it, and you are good to go. Yes, it may complain while launching the master and workers; safely ignore that. The only problem is while writing to a directory. Of course you will not be able to

Re: How to recover in case user errors in streaming

2015-06-29 Thread Amit Assudani
Also, how do you suggest catching exceptions while using with connector API like, saveAsNewAPIHadoopFiles ? From: amit assudani aassud...@impetus.commailto:aassud...@impetus.com Date: Monday, June 29, 2015 at 9:55 AM To: Tathagata Das t...@databricks.commailto:t...@databricks.com Cc: Cody

Re: Directory creation failed leads to job fail (should it?)

2015-06-29 Thread Max Demoulin
The underlying issue is a filesystem corruption on the workers. In the case where I use hdfs, with a sufficient amount of replica, would Spark try to launch a task on another node where the block replica is present? Thanks :-) -- Henri Maxime Demoulin 2015-06-29 9:10 GMT-04:00 ayan guha

Re: Accessing Kerberos Secured HDFS Resources from Spark on Mesos

2015-06-29 Thread Steve Loughran
On 29 Jun 2015, at 14:18, Dave Ariens dari...@blackberry.commailto:dari...@blackberry.com wrote: I'd like to toss out another idea that doesn't involve a complete end-to-end Kerberos implementation. Essentially, have the driver authenticate to Kerberos, instantiate a Hadoop file system, and

Spark shell crumbles after memory is full

2015-06-29 Thread hbogert
I'm running a query from the BigDataBenchmark, query 1B to be precise. When running this with Spark (1.3.1) + Mesos (0.21) in coarse-grained mode with 5 Mesos slaves, through a Spark shell, all is well. However, rerunning the query a few times: scala> sqlContext.sql("SELECT pageURL, pageRank FROM

Re: Spark shell crumbles after memory is full

2015-06-29 Thread Mark Hamstra
No. He is collecting the results of the SQL query, not the whole dataset. The REPL does retain references to prior results, so it's not really the best tool to be using when you want no-longer-needed results to be automatically garbage collected. On Mon, Jun 29, 2015 at 9:13 AM, ayan guha

Re: Directory creation failed leads to job fail (should it?)

2015-06-29 Thread ayan guha
It's a scheduler question. Spark will retry the task on the same worker. From Spark's standpoint data is not replicated, because Spark provides fault tolerance through lineage, not by replication. On 30 Jun 2015 01:50, Max Demoulin maxdemou...@gmail.com wrote: The underlying issue is a filesystem

Re: Directory creation failed leads to job fail (should it?)

2015-06-29 Thread Max Demoulin
I see. Thank you for your help! -- Henri Maxime Demoulin 2015-06-29 11:57 GMT-04:00 ayan guha guha.a...@gmail.com: It's a scheduler question. Spark will retry the task on the same worker. From spark standpoint data is not replicated because spark provides fault tolerance but lineage not by

Re: Running Spark 1.4.1 without Hadoop

2015-06-29 Thread Jey Kottalam
Actually, Hadoop InputFormats can still be used to read and write from file://, s3n://, and similar schemes. You just won't be able to read/write to HDFS without installing Hadoop and setting up an HDFS cluster. To summarize: Sourav, you can use any of the prebuilt packages (i.e. anything other

Re: SPARK REMOTE DEBUG

2015-06-29 Thread Ted Yu
Please read: http://search-hadoop.com/m/q3RTtig4WhHTdMa1 FYI On Mon, Jun 29, 2015 at 8:37 AM, Pietro Gentile pietro.gentile89.develo...@gmail.com wrote: Hi all, What is the best way to remotely debug, with breakpoints, spark apps? Thanks in advance, Best regards! Pietro

[SparkR] Missing Spark APIs in R

2015-06-29 Thread Pradeep Bashyal
Hello, I noticed that some of the spark-core APIs are not available with version 1.4.0 release of SparkR. For example textFile(), flatMap() etc. The code seems to be there but is not exported in NAMESPACE. They were all available as part of the AmpLab Extras previously. I wasn't able to find any

Re: problem for submitting job

2015-06-29 Thread Akhil Das
Cool. On 29 Jun 2015 21:10, 郭谦 buptguoq...@gmail.com wrote: Akhil Das, you gave me a new idea to solve the problem. Vova provided me a way to solve the problem just before: Vova Shelgunov vvs...@gmail.com Sample code for submitting a job from any other Java app, e.g. a servlet:

Re: Spark shell crumbles after memory is full

2015-06-29 Thread ayan guha
When you call collect, you are bringing whole dataset back to driver memory. On 30 Jun 2015 01:43, hbogert hansbog...@gmail.com wrote: I'm running a query from the BigDataBenchmark, query 1B to be precise. When running this with Spark (1.3.1)+ mesos(0.21) in coarse grained mode with 5 mesos

Load Multiple DB Table - Spark SQL

2015-06-29 Thread Ashish Soni
Hi All, What is the best possible way to load multiple database tables using Spark SQL? Map<String, String> options = new HashMap<String, String>(); options.put("driver", MYSQLDR); options.put("url", MYSQL_CN_URL); options.put("dbtable", "(select * from courses)"); *Can I add multiple tables to the options map

Re: Spark shell crumbles after memory is full

2015-06-29 Thread Hans van den Bogert
Would there be a way to force the 'old' data out? Because at this point I'll have to restart the shell every couple of queries to get meaningful timings which are comparable to spark-submit. On Jun 29, 2015 6:20 PM, Mark Hamstra m...@clearstorydata.com wrote: No. He is collecting the results

Need clarification on spark on cluster set up instruction

2015-06-29 Thread manish ranjan
Hi All, here goes my first question. Here is my use case: I have 1TB of data I want to process on EC2 using Spark. I have uploaded the data to an EBS volume. The instructions on the Amazon EC2 setup explain: *If your application needs to access large datasets, the fastest way to do that is to load them from

Re: kmeans broadcast

2015-06-29 Thread Himanshu Mehra
Hi Haviv, have you tried sc.broadcast(model)? The broadcast method is a member of the SparkContext class. Thanks, Himanshu -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/kmeans-broadcast-tp23511p23526.html Sent from the Apache Spark User List mailing list
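
For completeness, a minimal sketch of that suggestion: train a KMeans model on the driver, broadcast it once, and reference the broadcast value inside the transformation instead of capturing the model directly (the data and parameters below are made up):

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    val data = sc.parallelize(Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(9.0, 9.0), Vectors.dense(0.2, 0.1)))

    val model   = KMeans.train(data, 2, 10)     // k = 2, 10 iterations
    val bcModel = sc.broadcast(model)

    // Each executor receives the model once via the broadcast, not once per task closure.
    val assignments = data.map(v => bcModel.value.predict(v))
    assignments.collect()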

Re: s3 bucket access/read file

2015-06-29 Thread spark user
Pls check your ACL properties. On Monday, June 29, 2015 11:29 AM, didi did...@gmail.com wrote: Hi *Cant read text file from s3 to create RDD * after setting the configuration val hadoopConf=sparkContext.hadoopConfiguration; hadoopConf.set(fs.s3.impl,

SparkSQL built in functions

2015-06-29 Thread Bob Corsaro
I'm having trouble using select pow(col) from table. It seems the function is not registered for Spark SQL. Is this on purpose or an oversight? I'm using pyspark.

s3 bucket access/read file

2015-06-29 Thread didi
Hi *Can't read a text file from S3 to create an RDD* after setting the configuration: val hadoopConf = sparkContext.hadoopConfiguration; hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem") hadoopConf.set("fs.s3.awsAccessKeyId", yourAccessKey)

Re: Load Multiple DB Table - Spark SQL

2015-06-29 Thread Ted Yu
To my knowledge this is not supported. On Mon, Jun 29, 2015 at 10:47 AM, Ashish Soni asoni.le...@gmail.com wrote: Hi All , What is the best possible way to load multiple data tables using spark sql MapString, String options = new HashMap(); options.put(driver, MYSQLDR); options.put(url,
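
A common workaround (my sketch, not something Ted prescribes) is simply to call the JDBC data source once per table and join or register the resulting DataFrames afterwards. Connection details and table names below are placeholders:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)

    def loadTable(table: String) = sqlContext.read.format("jdbc").options(Map(
      "url"     -> "jdbc:mysql://dbhost:3306/school",
      "driver"  -> "com.mysql.jdbc.Driver",
      "dbtable" -> table)).load()

    val courses  = loadTable("courses")
    val students = loadTable("students")

    // Register both and query them together through SQL if needed.
    courses.registerTempTable("courses")
    students.registerTempTable("students")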

Re: [SparkR] Missing Spark APIs in R

2015-06-29 Thread Shivaram Venkataraman
The RDD API is pretty complex and we are not yet sure we want to export all those methods in the SparkR API. We are working towards exposing a more limited API in upcoming versions. You can find some more details in the recent Spark Summit talk at

is there any significant performance issue converting between rdd and dataframes in pyspark?

2015-06-29 Thread Axel Dahl
In pyspark, when I convert from rdds to dataframes it looks like the rdd is being materialized/collected/repartitioned before it's converted to a dataframe. Just wondering if there's any guidelines for doing this conversion and whether it's best to do it early to get the performance benefits of

Re: SparkSQL built in functions

2015-06-29 Thread Salih Oztop
Hi Bob, I tested your scenario with Spark 1.3 and I assumed you did not miss the second parameter of pow(x, y). from pyspark.sql import SQLContext; sqlContext = SQLContext(sc); df = sqlContext.jsonFile("/vagrant/people.json") # Displays the content of the DataFrame to stdout df.show() # These are all

Job failed but there is no proper reason

2015-06-29 Thread ๏̯͡๏
I have a join + leftOuterJoin + reduceByKey. The join operation worked but the left outer join failed. There are millions of log lines, and when I did this: ~ dvasthimal$ cat ~/Desktop/errors | grep ERROR | grep -v client.TransportResponseHandler | grep -v shuffle.RetryingBlockFetcher | grep -v

Checkpoint FS failure or connectivity issue

2015-06-29 Thread Amit Assudani
Hi All, While using Checkpoints ( using HDFS ), if connectivity to hadoop cluster is lost for a while and gets restored in some time, what happens to the running streaming job. Is it always assumed that connection to checkpoint FS ( this case HDFS ) would ALWAYS be HA and would never fail for

Re: SparkSQL built in functions

2015-06-29 Thread Bob Corsaro
1.4 and I did set the second parameter. The DSL works fine but trying out with SQL doesn't. On Mon, Jun 29, 2015, 4:32 PM Salih Oztop soz...@yahoo.com wrote: Hi Bob, I tested your scenario with Spark 1.3 and I assumed you did not miss the second parameter of pow(x,y) from pyspark.sql import

Re: SparkSQL built in functions

2015-06-29 Thread Krishna Sankar
Interesting. Looking at the definitions, sql.functions.pow is defined only for (col,col). Just as an experiment, create a column with value 2 and see if that works. Cheers k/ On Mon, Jun 29, 2015 at 1:34 PM, Bob Corsaro rcors...@gmail.com wrote: 1.4 and I did set the second parameter. The DSL

Re: How to recover in case user errors in streaming

2015-06-29 Thread Tathagata Das
I recommend writing using dstream.foreachRDD, and then rdd.saveAsNewAPIHadoopFile inside try catch. See the implementation of dstream.saveAsNewAPIHadoopFiles https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/PairDStreamFunctions.scala#L716 On
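
Spelled out, TD's suggestion looks roughly like the sketch below: do the save per batch inside foreachRDD so a user error only affects that batch. The key/value types, output format and path prefix are placeholders:

    import org.apache.hadoop.io.Text
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
    import org.apache.spark.streaming.dstream.DStream

    def saveBatches(stream: DStream[(Text, Text)], outputPrefix: String): Unit = {
      stream.foreachRDD { (rdd, time) =>
        try {
          rdd.saveAsNewAPIHadoopFile(
            s"$outputPrefix-${time.milliseconds}",
            classOf[Text], classOf[Text],
            classOf[TextOutputFormat[Text, Text]])
        } catch {
          case e: Exception =>
            // Log and continue with the next batch instead of failing the whole job.
            println(s"Failed to write batch at $time: ${e.getMessage}")
        }
      }
    }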

Re: Applying functions over certain count of tuples .

2015-06-29 Thread Richard Marscher
Hi, not sure what the context is, but I think you can do something similar with mapPartitions: rdd.mapPartitions { iterator => iterator.grouped(5).map { tupleGroup => emitOneRddForGroup(tupleGroup) } } The edge case is when the final grouping doesn't have exactly 5 items, if that matters. On
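
A self-contained variant of that pattern, with the per-group logic reduced to a sum just for illustration (emitOneRddForGroup above is the poster's placeholder):

    val rdd = sc.parallelize(1 to 23, numSlices = 2)

    // Process every 5 consecutive elements of a partition as one group;
    // the last group in each partition may hold fewer than 5 items.
    val perGroup = rdd.mapPartitions { iterator =>
      iterator.grouped(5).map(group => group.sum)
    }

    perGroup.collect()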

Re: Spark 1.4 RDD to DF fails with toDF()

2015-06-29 Thread Srikanth
My error was related to Scala version. Upon further reading, I realized that it takes some effort to get Spark working with Scala 2.11. I've reverted to using 2.10 and moved past that error. Now I hit the issue you mentioned. Waiting for 1.4.1. Srikanth On Fri, Jun 26, 2015 at 9:10 AM, Roberto

Re: SparkSQL built in functions

2015-06-29 Thread Salih Oztop
Hi, tested with Spark 1.4. We need to import pow, otherwise it uses the Python version of pow, I guess. from pyspark.sql.functions import pow; df.select(pow(df.age, df.age)).show() 15/06/29 22:36:05 INFO ... | POWER(age, age) | | null |
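
For readers following along in Scala rather than pyspark, the equivalent DataFrame-level fix uses org.apache.spark.sql.functions.pow, which is defined for column/column and column/literal arguments (a df with a numeric age column is assumed):

    import org.apache.spark.sql.functions.{lit, pow}

    // Column-based pow instead of relying on a SQL-registered function.
    df.select(pow(df("age"), lit(2))).show()
    df.select(pow(df("age"), df("age"))).show()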

Checkpoint support?

2015-06-29 Thread ๏̯͡๏
My job has multiple stages, each time a stage fails i have to restart the entire app. I understand Spark restarts failed tasks. However, Is there a way to restart a Spark app from failed stage ? -- Deepak

Applying functions over certain count of tuples .

2015-06-29 Thread anshu shukla
I want to apply some logic on the basis of a fixed count of tuples in each RDD. *Suppose we emit one RDD for every 5 tuples of the previous RDD.* -- Thanks Regards, Anshu Shukla

Re: Running Spark 1.4.1 without Hadoop

2015-06-29 Thread Sourav Mazumder
Hi Jey, Not much luck. If I use the class com.databricks:spark-csv_2.11:1.1.0 or com.databricks.spark.csv_2.11.1.1.0 I get a class not found error. With com.databricks.spark.csv I don't get the class not found error, but I still get the previous error even after using file:/// in the URI.

Re: breeze.linalg.DenseMatrix not found

2015-06-29 Thread AlexG
I get the same error even when I define covOperator not to use a matrix at all: def covOperator(v : BDV[Double]) :BDV[Double] = { v } -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/breeze-linalg-DenseMatrix-not-found-tp23537p23538.html Sent from the

Re: Running Spark 1.4.1 without Hadoop

2015-06-29 Thread Jey Kottalam
Hi Sourav, The error seems to be caused by the fact that your URL starts with file:// instead of file:///. Also, I believe the current version of the package for Spark 1.4 with Scala 2.11 should be com.databricks:spark-csv_2.11:1.1.0. -Jey On Mon, Jun 29, 2015 at 12:23 PM, Sourav Mazumder

breeze.linalg.DenseMatrix not found

2015-06-29 Thread AlexG
I'm trying to compute the eigendecomposition of a matrix in a portion of my code, using mllib.linalg.EigenValueDecomposition (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/EigenValueDecomposition.scala ) as follows: val tol = 1e-10 val maxIter

Re: spark streaming - checkpoint

2015-06-29 Thread ram kumar
on using yarn-cluster, it works good On Mon, Jun 29, 2015 at 12:07 PM, ram kumar ramkumarro...@gmail.com wrote: SPARK_CLASSPATH=$CLASSPATH:/usr/hdp/2.2.0.0-2041/hadoop-mapreduce/* in spark-env.sh I think i am facing the same issue https://issues.apache.org/jira/browse/SPARK-6203 On Mon,

Re: spilling in-memory map of 5.1 MB to disk (272 times so far)

2015-06-29 Thread Akhil Das
Here's a bunch of configuration for that https://spark.apache.org/docs/latest/configuration.html#shuffle-behavior Thanks Best Regards On Fri, Jun 26, 2015 at 10:37 PM, igor.berman igor.ber...@gmail.com wrote: Hi, wanted to get some advice regarding tunning spark application I see for some of

Re: Master dies after program finishes normally

2015-06-29 Thread Akhil Das
Which version of spark are you using? You can try changing the heap size manually by *export _JAVA_OPTIONS=-Xmx5g * Thanks Best Regards On Fri, Jun 26, 2015 at 7:52 PM, Yifan LI iamyifa...@gmail.com wrote: Hi, I just encountered the same problem, when I run a PageRank program which has lots

Re: problem for submitting job

2015-06-29 Thread Akhil Das
You can create a SparkContext in your program and run it as a standalone application without using spark-submit. Here's something that will get you started: //Create SparkContext val sconf = new SparkConf() .setMaster("spark://spark-ak-master:7077") .setAppName("Test")

SparkContext and JavaSparkContext

2015-06-29 Thread Hao Ren
Hi, I am working on a legacy project using Spark Java code. I have a function which takes a SQLContext as an argument; however, I need a JavaSparkContext in that function. It seems that sqlContext.sparkContext() returns a Scala SparkContext. I did not find any API for casting a Scala SparkContext

Spark SQL parallel query submission via single HiveContext

2015-06-29 Thread V Dineshkumar
Hi, As per my use case I need to submit multiple queries to Spark SQL in parallel but due to HiveContext being thread safe the jobs are getting submitted sequentially. I could see many threads are waiting for HiveContext. on-spray-can-akka.actor.default-dispatcher-26 - Thread t@149

Re: spark streaming - checkpoint

2015-06-29 Thread ram kumar
SPARK_CLASSPATH=$CLASSPATH:/usr/hdp/2.2.0.0-2041/hadoop-mapreduce/* in spark-env.sh I think i am facing the same issue https://issues.apache.org/jira/browse/SPARK-6203 On Mon, Jun 29, 2015 at 11:38 AM, ram kumar ramkumarro...@gmail.com wrote: I am using Spark 1.2.0.2.2.0.0-82 (git revision

Shuffle files lifecycle

2015-06-29 Thread Thomas Gerber
Hello, It is my understanding that shuffles are written to disk and that they act as checkpoints. I wonder if this is true only within a job, or across jobs. Please note that I use the words job and stage carefully here. 1. Can a shuffle created during JobN be used to skip many stages from

Re: Shuffle files lifecycle

2015-06-29 Thread Thomas Gerber
Ah, for #3, maybe this is what *rdd.checkpoint *does! https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD Thomas On Mon, Jun 29, 2015 at 7:12 PM, Thomas Gerber thomas.ger...@radius.com wrote: Hello, It is my understanding that shuffle are written on disk and

Re: Shuffle files lifecycle

2015-06-29 Thread Thomas Gerber
Thanks Silvio. On Mon, Jun 29, 2015 at 7:41 PM, Silvio Fiorito silvio.fior...@granturing.com wrote: Regarding 1 and 2, yes shuffle output is stored on the worker local disks and will be reused across jobs as long as they’re available. You can identify when they’re used by seeing skipped

Re: Checkpoint FS failure or connectivity issue

2015-06-29 Thread Tathagata Das
Yes, the observation is correct. That connectivity is assumed to be HA. On Mon, Jun 29, 2015 at 2:34 PM, Amit Assudani aassud...@impetus.com wrote: Hi All, While using Checkpoints ( using HDFS ), if connectivity to hadoop cluster is lost for a while and gets restored in some time, what

Re: Running Spark 1.4.1 without Hadoop

2015-06-29 Thread Sourav Mazumder
Hi Jey, This solves the class not found problem. Thanks. But still the inputs format is not yet resolved. Looks like it is still trying to create a HadoopRDD I don't know why. The error message goes like - java.lang.RuntimeException: Error in configuring object at

Re: Running Spark 1.4.1 without Hadoop

2015-06-29 Thread Jey Kottalam
The format is still com.databricks.spark.csv, but the parameter passed to spark-shell is --packages com.databricks:spark-csv_2.11:1.1.0. On Mon, Jun 29, 2015 at 2:59 PM, Sourav Mazumder sourav.mazumde...@gmail.com wrote: HI Jey, Not much of luck. If I use the class
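
Putting Jey's two points together, a typical session would look like the sketch below: the package coordinates go on the spark-shell/spark-submit command line (as --packages com.databricks:spark-csv_2.11:1.1.0), while the code refers to the data source by its short name. The path and header option are placeholders:

    // Inside spark-shell started with the --packages coordinates above;
    // sqlContext is provided by the shell.
    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .load("file:///home/user/data.csv")

    df.printSchema()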

Re: GraphX - ConnectedComponents (Pregel) - longer and longer interval between jobs

2015-06-29 Thread Thomas Gerber
It seems the root cause of the delay was the sheer size of the DAG for those jobs, which are towards the end of a long series of jobs. To reduce it, you can probably try to checkpoint (rdd.checkpoint) some previous RDDs. That will: 1. save the RDD on disk 2. remove all references to the parents
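
A minimal sketch of that checkpointing suggestion (the directory and RDD contents are made up): set a checkpoint directory, mark the RDD, and run an action so the checkpoint is actually written and the lineage truncated.

    sc.setCheckpointDir("hdfs:///tmp/checkpoints")   // placeholder path

    val ranks = sc.parallelize(1 to 1000000).map(i => (i % 100, 1.0)).cache()
    ranks.checkpoint()
    ranks.count()   // first action materializes the RDD and writes the checkpoint

    // Jobs that reuse `ranks` afterwards read the checkpoint files instead of
    // replaying the full DAG that produced it.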

Re: Shuffle files lifecycle

2015-06-29 Thread Silvio Fiorito
Regarding 1 and 2, yes shuffle output is stored on the worker local disks and will be reused across jobs as long as they’re available. You can identify when they’re used by seeing skipped stages in the job UI. They are periodically cleaned up based on available space of the configured

spark streaming HDFS file issue

2015-06-29 Thread ravi tella
I am running a Spark Streaming example from the Learning Spark book with one change. The change I made was for streaming a file from HDFS. val lines = ssc.textFileStream("hdfs:/user/hadoop/spark/streaming/input") I ran the application a number of times, and every time I dropped a new file in the input

Re: Job failed but there is no proper reason

2015-06-29 Thread Matthew Jones
Order the tasks by status and see if there are any with status failed. On Mon, 29 Jun 2015 at 2:26 pm ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: Attached pic shows the error that is displayed in Spark UI. On Mon, Jun 29, 2015 at 2:22 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: I have a join

Re: Running Spark 1.4.1 without Hadoop

2015-06-29 Thread Jey Kottalam
All InputFormats will use HadoopRDD or NewHadoopRDD. Do you use file:/// instead of file://? On Mon, Jun 29, 2015 at 8:40 PM, Sourav Mazumder sourav.mazumde...@gmail.com wrote: Hi Jey, This solves the class not found problem. Thanks. But still the inputs format is not yet resolved. Looks

Spark 1.4.0: read.df() causes excessive IO

2015-06-29 Thread Exie
Hi Folks, I just stepped up from 1.3.1 to 1.4.0; the most notable difference for me so far is the DataFrame reader/writer. Previously: val myData = hiveContext.load("s3n://someBucket/somePath/", "parquet") Now: val myData = hiveContext.read.parquet("s3n://someBucket/somePath") Using the original

Re: Job failed but there is no proper reason

2015-06-29 Thread ๏̯͡๏
There are many with failed. Attached pic shows exception traces. On Mon, Jun 29, 2015 at 9:07 PM, Matthew Jones mle...@gmail.com wrote: Order the tasks by status and see if there are any with status failed. On Mon, 29 Jun 2015 at 2:26 pm ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: Attached pic

Re: Job failed but there is no proper reason

2015-06-29 Thread Matthew Jones
I can't see any failed tasks in the attached pic. On Mon, Jun 29, 2015 at 9:31 PM ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: There are many with failed. Attached pic shows exception traces. On Mon, Jun 29, 2015 at 9:07 PM, Matthew Jones mle...@gmail.com wrote: Order the tasks by status and

Error while installing spark

2015-06-29 Thread Chintan Bhatt
Facing the following error message while performing sbt/sbt assembly: Error occurred during initialization of VM Could not reserve enough space for object heap Error: Could not create the Java Virtual Machine. Error: A fatal exception has occurred. Program will exit. Please give me a solution for it. --

Can Dependencies Be Resolved on Spark Cluster?

2015-06-29 Thread SLiZn Liu
Hey Spark Users, I'm writing a demo with Spark and HBase. What I've done is packaging a **fat jar**: place dependencies in `build.sbt`, and use `sbt assembly` to package **all dependencies** into one big jar. The rest of the work is copying the fat jar to the Spark master node and then launching it by

Re: Running Spark 1.4.1 without Hadoop

2015-06-29 Thread Sourav Mazumder
Hi Jey, Thanks for your inputs. Probably I'm getting the error because I'm trying to read a CSV file from the local filesystem using the com.databricks.spark.csv package. Perhaps this package has a hard-coded dependency on Hadoop, as it is trying to read the input format via HadoopRDD. Can you please confirm? Here is