Re: NoClassDefFoundError with ZonedDateTime

2016-07-21 Thread Ilya Ganelin
...help provide some clue. > > On Thu, Jul 21, 2016 at 8:26 PM, Ilya Ganelin <ilgan...@gmail.com> wrote: > >> Hello - I'm trying to deploy the Spark TimeSeries library in a new >> environment. I'm running Spark 1.6.1 submitted through YARN in a cluster >> with Java 8

NoClassDefFoundError with ZonedDateTime

2016-07-21 Thread Ilya Ganelin
Hello - I'm trying to deploy the Spark TimeSeries library in a new environment. I'm running Spark 1.6.1 submitted through YARN in a cluster with Java 8 installed on all nodes but I'm getting the NoClassDef at runtime when trying to create a new TimeSeriesRDD. Since ZonedDateTime is part of Java 8

Re: Access to broadcasted variable

2016-02-20 Thread Ilya Ganelin
It gets serialized once per physical container, instead of being serialized once per task of every stage that uses it. On Sat, Feb 20, 2016 at 8:15 AM jeff saremi wrote: > Is the broadcasted variable distributed to every executor or every worker? > Now i'm more confused >

Re: How to efficiently Scan (not filter nor lookup) part of Paird RDD or Ordered RDD

2016-01-24 Thread Ilya Ganelin
The solution I normally use is to zipWithIndex() and then use the filter operation. Filter is an O(m) operation where m is the size of your partition, not an O(N) operation. -Ilya Ganelin On Sat, Jan 23, 2016 at 5:48 AM, Nirav Patel <npa...@xactlycorp.com> wrote: > Problem is I
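
A minimal sketch of that pattern, assuming a sorted RDD named rdd and hypothetical bounds start and end:

    // Pair every element with its position, then keep only the requested index range.
    // The filter touches each partition, but the work per partition is linear in its size.
    val indexed = rdd.zipWithIndex()                         // RDD[(T, Long)]
    val slice = indexed
      .filter { case (_, i) => i >= start && i < end }
      .map { case (v, _) => v }                              // drop the index again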

Spark LDA

2016-01-22 Thread Ilya Ganelin
r.extraJavaOptions=-Xss256k -XX:MaxPermSize=128m -XX:PermSize=96m -XX:MaxTenuringThreshold=2 -XX:SurvivorRatio=6 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+AggressiveOpts -XX:+UseCompressedOops" --master yarn-client -Ilya Ganelin

Re: Why does a 3.8 T dataset take up 11.59 Tb on HDFS

2015-11-25 Thread Ilya Ganelin
Turning off replication sacrifices durability of your data, so if a node goes down the data is lost - in case that's not obvious. On Wed, Nov 25, 2015 at 8:43 AM Alex Gittens wrote: > Thanks, the issue was indeed the dfs replication factor. To fix it without > entirely

Re: Spark Job is getting killed after certain hours

2015-11-16 Thread Ilya Ganelin
Your Kerberos cert is likely expiring. Check your expiration settings. -Ilya Ganelin On Mon, Nov 16, 2015 at 9:20 PM, Vipul Rai <vipulrai8...@gmail.com> wrote: > Hi Nikhil, > It seems you have Kerberos enabled cluster and it is unable to > authenticate using the ticket.

Re: Matrix Multiplication and mllib.recommendation

2015-06-28 Thread Ilya Ganelin
...model = ALS.train(ratings, rank, numIterations) On Jun 28, 2015, at 8:24 AM, Ilya Ganelin ilgan...@gmail.com wrote: You can also select pieces of your RDD by first doing a zipWithIndex and then doing a filter operation on the second element of the RDD. For example, to select the first 100

Re: Matrix Multiplication and mllib.recommendation

2015-06-28 Thread Ilya Ganelin
Oops - code should be: val a = rdd.zipWithIndex().filter(s => 1 < s._2 && s._2 < 100) On Sun, Jun 28, 2015 at 8:24 AM Ilya Ganelin ilgan...@gmail.com wrote: You can also select pieces of your RDD by first doing a zipWithIndex and then doing a filter operation on the second element of the RDD
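
For reference, a self-contained version of the corrected snippet as it might look in spark-shell (input values are illustrative):

    val rdd = sc.parallelize(1 to 1000)
    // zipWithIndex pairs each element with its position; filtering on the second
    // element of the pair keeps only positions strictly between 1 and 100.
    val a = rdd.zipWithIndex().filter(s => 1 < s._2 && s._2 < 100)
    a.map(_._1).collect()    // back to the original values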

Re: Matrix Multiplication and mllib.recommendation

2015-06-28 Thread Ilya Ganelin
results you may want to look into locality sensitive hashing to limit your search space and definitely look into spinning up multiple threads to process your product features in parallel to increase resource utilization on the cluster. Thank you, Ilya Ganelin -Original Message

Re: SaveAsTextFile brings down data nodes with IO Exceptions

2015-05-16 Thread Ilya Ganelin
All - this issue showed up when I was tearing down a spark context and creating a new one. Often, I was unable to then write to HDFS due to this error. I subsequently switched to a different implementation where, instead of tearing down and re-initializing the spark context, I'd submit a

Re: Broadcast variables can be rebroadcast?

2015-05-15 Thread Ilya Ganelin
The broadcast variable is like a pointer. If the underlying data changes then the changes will be visible throughout the cluster. On Fri, May 15, 2015 at 5:18 PM NB nb.nos...@gmail.com wrote: Hello, Once a broadcast variable is created using sparkContext.broadcast(), can it ever be updated

Re: Broadcast variables can be rebroadcast?

2015-05-15 Thread Ilya Ganelin
, Ilya Ganelin ilgan...@gmail.com wrote: The broadcast variable is like a pointer. If the underlying data changes then the changes will be visible throughout the cluster. On Fri, May 15, 2015 at 5:18 PM NB nb.nos...@gmail.com wrote: Hello, Once a broadcast variable is created using

Re: Why Spark is much faster than Hadoop MapReduce even on disk

2015-04-27 Thread Ilya Ganelin
I believe the typical answer is that Spark is actually a bit slower. On Mon, Apr 27, 2015 at 7:34 PM bit1...@163.com bit1...@163.com wrote: Hi, I am frequently asked why spark is also much faster than Hadoop MapReduce on disk (without the use of memory cache). I have no convincing answer for

Re: loads of memory still GC overhead limit exceeded

2015-02-20 Thread Ilya Ganelin
of the input RDD was low as well so the chunks were really too big. Increased parallelism and repartitioning seems to be helping... Thanks! Antony. On Thursday, 19 February 2015, 16:45, Ilya Ganelin ilgan...@gmail.com wrote: Hi Anthony - you are seeing a problem that I ran

Re: storing MatrixFactorizationModel (pyspark)

2015-02-19 Thread Ilya Ganelin
Yep. The matrix model has two RDDs of feature vectors representing the decomposed matrix. You can save these to disk and reuse them. On Thu, Feb 19, 2015 at 2:19 AM Antony Mayi antonym...@yahoo.com.invalid wrote: Hi, when getting the model out of ALS.train it would be beneficial to store it (to disk) so

Re: loads of memory still GC overhead limit exceeded

2015-02-19 Thread Ilya Ganelin
Hi Anthony - you are seeing a problem that I ran into. The underlying issue is your default parallelism setting. What's happening is that within ALS certain RDD operations end up changing the number of partitions of your data. For example, if you start with an RDD of 300 partitions, unless
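
A hedged sketch of the kind of settings being described (values are illustrative, not recommendations):

    import org.apache.spark.{SparkConf, SparkContext}

    // Give shuffles inside ALS a sensible default so intermediate RDDs don't
    // collapse to a handful of oversized partitions.
    val conf = new SparkConf()
      .setAppName("als-parallelism-example")
      .set("spark.default.parallelism", "300")
    val sc = new SparkContext(conf)

    // Alternatively, repartition the input explicitly before calling ALS.train.
    val ratings = rawRatings.repartition(300)    // rawRatings is a hypothetical RDD of Rating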

Re: Why is RDD lookup slow?

2015-02-19 Thread Ilya Ganelin
Hi Shahab - if your data structures are small enough a broadcasted Map is going to provide faster lookup. Lookup within an RDD is an O(m) operation where m is the size of the partition. For RDDs with multiple partitions, executors can operate on it in parallel so you get some improvement for
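
A minimal sketch of the broadcast-Map alternative, assuming the lookup table is small enough to fit on the driver (names are illustrative):

    // Broadcast the small Map once; every executor then does a local, constant-time
    // lookup instead of an O(partition size) RDD lookup.
    val table: Map[String, Int] = Map("a" -> 1, "b" -> 2)
    val tableBc = sc.broadcast(table)

    val resolved = keysRdd.map(k => (k, tableBc.value.get(k)))   // keysRdd: a hypothetical RDD[String]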

Re: RDD Partition number

2015-02-19 Thread Ilya Ganelin
By default you will have (file size in MB / 64 MB) partitions. You can also set the number of partitions when you read in a file with sc.textFile as an optional second parameter. On Thu, Feb 19, 2015 at 8:07 AM Alessandro Lulli lu...@di.unipi.it wrote: Hi All, Could you please help me
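
For illustration, the optional second parameter mentioned above (it is a minimum partition count, not an exact one):

    // Ask for at least 128 partitions instead of the default of roughly
    // (file size / 64 MB block size).
    val lines = sc.textFile("hdfs:///data/input.txt", 128)   // path is illustrative
    println(lines.partitions.length)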

Re: Spark job fails on cluster but works fine on a single machine

2015-02-19 Thread Ilya Ganelin
The stupid question is: are you deleting the file from HDFS on the right node? On Thu, Feb 19, 2015 at 11:31 AM Pavel Velikhov pavel.velik...@gmail.com wrote: Yeah, I do manually delete the files, but it still fails with this error. On Feb 19, 2015, at 8:16 PM, Ganelin, Ilya

Re: spark ignoring all memory settings and defaulting to 512MB?

2014-12-31 Thread Ilya Ganelin
Welcome to Spark. What's more fun is that this setting controls memory on the executors, but if you want to set a memory limit on the driver you need to configure it as a parameter of the spark-submit script. You also set num-executors and executor-cores on the spark-submit call. See both the Spark

Re: Long-running job cleanup

2014-12-28 Thread Ilya Ganelin
Hi Patrick - is that cleanup present in 1.1? The overhead I am talking about is with regards to what I believe is shuffle related metadata. If I watch the execution log I see small broadcast variables created for every stage of execution, a few KB at a time, and a certain number of MB remaining

Re: Problem with StreamingContext - getting SPARK-2243

2014-12-27 Thread Ilya Ganelin
Are you trying to do this in the shell? The shell is instantiated with a SparkContext named sc. -Ilya Ganelin On Sat, Dec 27, 2014 at 5:24 PM, tfrisk tfris...@gmail.com wrote: Hi, Doing: val ssc = new StreamingContext(conf, Seconds(1)) and getting: Only one SparkContext may
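
A sketch of the fix in spark-shell: reuse the shell's existing SparkContext instead of building a new one from a conf.

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // The shell already provides `sc`; passing it in avoids creating a second SparkContext.
    val ssc = new StreamingContext(sc, Seconds(1))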

Re: Long-running job cleanup

2014-12-25 Thread Ilya Ganelin
Hello all - can anyone please offer any advice on this issue? -Ilya Ganelin On Mon, Dec 22, 2014 at 5:36 PM, Ganelin, Ilya ilya.gane...@capitalone.com wrote: Hi all, I have a long running job iterating over a huge dataset. Parts of this operation are cached. Since the job runs for so long

Re: what is the best way to implement mini batches?

2014-12-11 Thread Ilya Ganelin
Hi all. I've been working on a similar problem. One solution that is straightforward (if suboptimal) is to do the following: A.zipWithIndex().filter(x => x._2 >= range_start && x._2 < range_end). Lastly, just put that in a for loop. I've found that this approach scales very well. As Matei said another
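
A minimal sketch of that loop (batch size and names are illustrative):

    val batchSize = 1000L
    val indexed = A.zipWithIndex().cache()   // A is the input RDD from this thread
    val total = indexed.count()

    for (rangeStart <- 0L until total by batchSize) {
      val rangeEnd = rangeStart + batchSize
      val batch = indexed
        .filter { case (_, i) => i >= rangeStart && i < rangeEnd }
        .map { case (v, _) => v }
      // ... process this mini batch ...
    }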

Re: GC Issues with randomSplit on large dataset

2014-10-30 Thread Ilya Ganelin
The split is something like 30 million into 2 million partitions. The reason that it becomes tractable is that after I perform the Cartesian on the split data and operate on it I don't keep the full results - I actually only keep a tiny fraction of that generated dataset - making the overall

Re: Questions about serialization and SparkConf

2014-10-29 Thread Ilya Ganelin
Hello Steve. 1) When you call new SparkConf you should get an object with the default config values. You can reference the Spark configuration and tuning pages for details on what those are. 2) Yes. Properties set in this configuration will be pushed down to worker nodes actually executing the
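
A small sketch of point 2 above, with illustrative property names:

    import org.apache.spark.SparkConf

    // Starts from the built-in defaults plus anything in spark-defaults.conf or
    // system properties; explicit set() calls override those and are shipped with
    // the job to the worker nodes.
    val conf = new SparkConf()
      .setAppName("config-example")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")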

Re: MLLib ALS ArrayIndexOutOfBoundsException with Scala Spark 1.1.0

2014-10-28 Thread Ilya Ganelin
...-Ilya Ganelin On Mon, Oct 27, 2014 at 6:12 PM, Xiangrui Meng men...@gmail.com wrote: Could you save the data before ALS and try to reproduce the problem? You might try reducing

Re: How can number of partitions be set in spark-env.sh?

2014-10-28 Thread Ilya Ganelin
In Spark, certain functions have an optional parameter to determine the number of partitions (distinct, textFile, etc.). You can also use the coalesce() or repartition() functions to change the number of partitions for your RDD. Thanks. On Oct 28, 2014 9:58 AM, shahab shahab.mok...@gmail.com
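
For illustration, the two mechanisms mentioned above on a hypothetical RDD named rdd:

    val fewer = rdd.coalesce(10)       // shrink to 10 partitions, avoiding a full shuffle
    val more  = rdd.repartition(200)   // grow or rebalance to 200 partitions (does shuffle)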

MLLib ALS ArrayIndexOutOfBoundsException with Scala Spark 1.1.0

2014-10-27 Thread Ilya Ganelin
isn't obvious yet. -Ilya Ganelin

Re: MLLib ALS ArrayIndexOutOfBoundsException with Scala Spark 1.1.0

2014-10-27 Thread Ilya Ganelin
- Original Message - From: Ilya Ganelin ilgan...@gmail.com To: user user@spark.apache.org Sent: Monday, October 27, 2014 11:36:46 AM Subject: MLLib ALS ArrayIndexOutOfBoundsException with Scala Spark 1.1.0 Hello all - I am attempting to run MLLib's ALS algorithm on a substantial

Num-executors and executor-cores overwritten by defaults

2014-10-21 Thread Ilya Ganelin
Hi all. Just upgraded our cluster to CDH 5.2 (with Spark 1.1) but now I can no longer set the number of executors or executor-cores. No matter what values I pass on the command line to spark they are overwritten by the defaults. Does anyone have any idea what could have happened here? Running on

Re: How do you write a JavaRDD into a single file

2014-10-20 Thread Ilya Ganelin
Hey Steve - the way to do this is to use the coalesce() function to coalesce your RDD into a single partition. Then you can do a saveAsTextFile and you'll wind up with outputDir/part-00000 containing all the data. -Ilya Ganelin On Mon, Oct 20, 2014 at 11:01 PM, jay vyas jayunit100.apa
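
A sketch of that pattern (output path is illustrative); note that a single partition means a single task writes all of the data, so this only makes sense for output that fits comfortably on one node:

    rdd.coalesce(1)                              // collapse to one partition...
       .saveAsTextFile("hdfs:///tmp/outputDir")  // ...so a single part-00000 file holds everything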

Re: What's wrong with my spark filter? I get org.apache.spark.SparkException: Task not serializable

2014-10-19 Thread Ilya Ganelin
Check for any variables you've declared in your class. Even if you're not calling them from the function they are passed to the worker nodes as part of the context. Consequently, if you have something without a default serializer (like an imported class) it will also get passed. To fix this you
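
The concrete fix is cut off above; one common approach consistent with it is to copy the needed field into a local val so the closure captures only that value, not the enclosing class. A hedged sketch with made-up names:

    import org.apache.spark.rdd.RDD

    class NonSerializableHelper            // stands in for an imported, non-serializable class

    class Job {
      val threshold = 10
      val helper = new NonSerializableHelper

      def run(rdd: RDD[Int]): RDD[Int] = {
        val t = threshold    // local copy: the closure now captures only `t`,
        rdd.filter(_ > t)    // not `this` (which would drag `helper` along and fail to serialize)
      }
    }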

Re: input split size

2014-10-18 Thread Ilya Ganelin
Also - if you're doing a text file read you can pass the number of resulting partitions as the second argument. On Oct 17, 2014 9:05 PM, Larry Liu larryli...@gmail.com wrote: Thanks, Andrew. What about reading out of local? On Fri, Oct 17, 2014 at 5:38 PM, Andrew Ash and...@andrewash.com

Re: IOException and appcache FileNotFoundException in Spark 1.02

2014-10-14 Thread Ilya Ganelin
Hello all. Does anyone else have any suggestions? Even understanding what this error is from would help a lot. On Oct 11, 2014 12:56 AM, Ilya Ganelin ilgan...@gmail.com wrote: Hi Akhil - I tried your suggestions and tried varying my partition sizes. Reducing the number of partitions led

Re: Breaking the previous large-scale sort record with Spark

2014-10-13 Thread Ilya Ganelin
to each reducer was a problem, so Reynold has a patch that avoids that if the number of tasks is large. Matei On Oct 10, 2014, at 10:09 PM, Ilya Ganelin ilgan...@gmail.com wrote: Hi Matei - I read your post with great interest. Could you possibly comment in more depth on some

Re: How To Implement More Than One Subquery in Scala/Spark

2014-10-11 Thread Ilya Ganelin
Because of how closures work in Scala, there is no support for nested map/RDD-based operations. Specifically, if you have Context a { Context b { } } then operations within context b, when distributed across nodes, will no longer have visibility of variables specific to context a because
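
A hedged sketch of the usual workaround: rather than referencing one RDD inside another's closure, collect the small one to the driver and broadcast it (or join the two RDDs); names below are illustrative:

    // Nesting small.lookup(...) inside big.map { ... } is not supported.
    // Instead, materialize the small RDD on the driver and broadcast it.
    val smallMap = sc.broadcast(small.collectAsMap())   // small: RDD[(K, V)] that fits in memory
    val joined   = big.map { case (k, v) => (k, v, smallMap.value.get(k)) }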

Re: IOException and appcache FileNotFoundException in Spark 1.02

2014-10-10 Thread Ilya Ganelin
to 200. Thanks Best Regards On Fri, Oct 10, 2014 at 5:58 AM, Ilya Ganelin ilgan...@gmail.com wrote: Hi all – I could use some help figuring out a couple of exceptions I’ve been getting regularly. I have been running on a fairly large dataset (150 gigs). With smaller datasets I don't have

Re: IOException and appcache FileNotFoundException in Spark 1.02

2014-10-10 Thread Ilya Ganelin
at 5:58 AM, Ilya Ganelin ilgan...@gmail.com wrote: Hi all – I could use some help figuring out a couple of exceptions I’ve been getting regularly. I have been running on a fairly large dataset (150 gigs). With smaller datasets I don't have any issues. My sequence of operations is as follows

Re: Breaking the previous large-scale sort record with Spark

2014-10-10 Thread Ilya Ganelin
Hi Matei - I read your post with great interest. Could you possibly comment in more depth on some of the issues you guys saw when scaling up spark and how you resolved them? I am interested specifically in spark-related problems. I'm working on scaling up spark to very large datasets and have been

Re: Debug Spark in Cluster Mode

2014-10-10 Thread Ilya Ganelin
I would also be interested in knowing more about this. I have used the cloudera manager and the spark resource interface (clientnode:4040) but would love to know if there are other tools out there - either for post processing or better observation during execution. On Oct 9, 2014 4:50 PM, Rohit

IOException and appcache FileNotFoundException in Spark 1.02

2014-10-09 Thread Ilya Ganelin
On Oct 9, 2014 10:18 AM, Ilya Ganelin ilgan...@gmail.com wrote: Hi all – I could use some help figuring out a couple of exceptions I’ve been getting regularly. I have been running on a fairly large dataset (150 gigs). With smaller datasets I don't have any issues. My sequence of operations

IOException and appcache FileNotFoundException in Spark 1.02

2014-10-09 Thread Ilya Ganelin
Hi all – I could use some help figuring out a couple of exceptions I’ve been getting regularly. I have been running on a fairly large dataset (150 gigs). With smaller datasets I don't have any issues. My sequence of operations is as follows – unless otherwise specified, I am not caching: Map a