elp provide some clue.
>
> On Thu, Jul 21, 2016 at 8:26 PM, Ilya Ganelin <ilgan...@gmail.com> wrote:
>
>> Hello - I'm trying to deploy the Spark TimeSeries library in a new
>> environment. I'm running Spark 1.6.1 submitted through YARN in a cluster
>> with Java 8
Hello - I'm trying to deploy the Spark TimeSeries library in a new
environment. I'm running Spark 1.6.1 submitted through YARN in a cluster
with Java 8 installed on all nodes but I'm getting the NoClassDef at
runtime when trying to create a new TimeSeriesRDD. Since ZonedDateTime is
part of Java 8
It gets serialized once per physical container, instead of being serialized
once per task of every stage that uses it.
On Sat, Feb 20, 2016 at 8:15 AM jeff saremi wrote:
> Is the broadcasted variable distributed to every executor or every worker?
> Now I'm more confused
>
The solution I normally use is to zipWithIndex() and then use the filter
operation. Filter is an O(m) operation where m is the size of your
partition, not an O(N) operation.
-Ilya Ganelin
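The zipWithIndex-then-filter pattern described above can be sketched with a plain Python list standing in for the RDD (hypothetical data; on a real RDD the calls would be rdd.zipWithIndex() and rdd.filter()):

```python
# Plain-Python sketch of the zipWithIndex() + filter() pattern; a list
# stands in for the RDD, so no Spark is required to run the sketch.
data = ["row-%d" % i for i in range(1000)]  # hypothetical dataset

# rdd.zipWithIndex() pairs each element with its position as (element, index)
with_index = [(elem, i) for i, elem in enumerate(data)]

# Selecting a range is then a filter on the index: O(m) work per partition,
# since each partition only scans its own elements.
first_100 = [elem for elem, i in with_index if i < 100]
```

On a real RDD the filter runs in parallel across partitions, which is why it scales where a driver-side take-and-slice would not.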
On Sat, Jan 23, 2016 at 5:48 AM, Nirav Patel <npa...@xactlycorp.com> wrote:
> Problem is I
r.extraJavaOptions=-Xss256k -XX:MaxPermSize=128m
-XX:PermSize=96m -XX:MaxTenuringThreshold=2 -XX:SurvivorRatio=6
-XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled
-XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly
-XX:+AggressiveOpts -XX:+UseCompressedOops" --master yarn-client
-Ilya Ganelin
Turning off replication sacrifices durability of your data, so if a node
goes down the data is lost - in case that's not obvious.
On Wed, Nov 25, 2015 at 8:43 AM Alex Gittens wrote:
> Thanks, the issue was indeed the dfs replication factor. To fix it without
> entirely
Your Kerberos cert is likely expiring. Check your expiration settings.
-Ilya Ganelin
On Mon, Nov 16, 2015 at 9:20 PM, Vipul Rai <vipulrai8...@gmail.com> wrote:
> Hi Nikhil,
> It seems you have Kerberos enabled cluster and it is unable to
> authenticate using the ticket.
model = ALS.train(ratings, rank, numIterations)
On Jun 28, 2015, at 8:24 AM, Ilya Ganelin ilgan...@gmail.com wrote:
You can also select pieces of your RDD by first doing a zipWithIndex and
then doing a filter operation on the second element of the RDD.
For example to select the first 100
Oops - code should be :
val a = rdd.zipWithIndex().filter(s => s._2 >= 1 && s._2 <= 100)
On Sun, Jun 28, 2015 at 8:24 AM Ilya Ganelin ilgan...@gmail.com wrote:
You can also select pieces of your RDD by first doing a zipWithIndex and
then doing a filter operation on the second element of the RDD
results you may want to look into locality
sensitive hashing to limit your search space and definitely look into
spinning up multiple threads to process your product features in parallel
to increase resource utilization on the cluster.
Thank you,
Ilya Ganelin
-----Original Message-----
All - this issue showed up when I was tearing down a spark context and
creating a new one. Often, I was unable to then write to HDFS due to this
error. I subsequently switched to a different implementation where instead
of tearing down and re initializing the spark context I'd instead submit a
The broadcast variable is like a pointer. If the underlying data changes
then the changes will be visible throughout the cluster.
On Fri, May 15, 2015 at 5:18 PM NB nb.nos...@gmail.com wrote:
Hello,
Once a broadcast variable is created using sparkContext.broadcast(), can it
ever be updated
I believe the typical answer is that Spark is actually a bit slower.
On Mon, Apr 27, 2015 at 7:34 PM bit1...@163.com bit1...@163.com wrote:
Hi,
I am frequently asked why Spark is so much faster than Hadoop MapReduce
on disk (without the use of memory cache). I have no convincing answer for
of the input RDD was low as well so the chunks were really too
big. Increased parallelism and repartitioning seems to be helping...
Thanks!
Antony.
On Thursday, 19 February 2015, 16:45, Ilya Ganelin ilgan...@gmail.com
wrote:
Hi Anthony - you are seeing a problem that I ran
Yep. the matrix model had two RDD vectors representing the decomposed
matrix. You can save these to disk and re use them.
On Thu, Feb 19, 2015 at 2:19 AM Antony Mayi antonym...@yahoo.com.invalid
wrote:
Hi,
when getting the model out of ALS.train it would be beneficial to store it
(to disk) so
Hi Anthony - you are seeing a problem that I ran into. The underlying issue
is your default parallelism setting. What's happening is that within ALS
certain RDD operations end up changing the number of partitions you have of
your data. For example if you start with an RDD of 300 partitions, unless
Hi Shahab - if your data structures are small enough a broadcasted Map is
going to provide faster lookup. Lookup within an RDD is an O(m) operation
where m is the size of the partition. For RDDs with multiple partitions,
executors can operate on it in parallel so you get some improvement for
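The lookup comparison above can be sketched with plain Python collections standing in for an RDD partition and a broadcast Map (hypothetical data; sc.broadcast(map) is what would ship the dict once per executor):

```python
# Sketch contrasting the two lookup strategies: scanning an RDD partition
# is O(m) per lookup, a broadcasted map is O(1). Plain Python collections
# stand in for the Spark structures; no cluster needed for the sketch.
partition = [("key%d" % i, i * i) for i in range(10_000)]  # (key, value) pairs
broadcast_map = dict(partition)  # what sc.broadcast(lookup_map) would ship once

def rdd_lookup(key):
    # RDD-style lookup: linear scan of the partition, O(m) per lookup
    return next(v for k, v in partition if k == key)

def broadcast_lookup(key):
    # Broadcast-map lookup: hash lookup, O(1) per lookup
    return broadcast_map[key]
```

The broadcast approach only works while the map fits comfortably in executor memory, which is the condition the message above sets.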
By default you will have (file size in MB / 64) partitions. You can also set
the number of partitions when you read in a file with sc.textFile as an
optional second parameter.
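The default count described above works out as follows (assuming the 64 MB HDFS block size the message cites; newer deployments often use 128 MB):

```python
import math

def default_partitions(file_size_mb, block_size_mb=64):
    """Approximate default input partition count: one partition per HDFS
    block, assuming the 64 MB block size cited in the message above."""
    return max(1, math.ceil(file_size_mb / block_size_mb))

# A 300 MB file arrives as 5 partitions by default; to override, pass the
# minimum partition count as the second argument: sc.textFile(path, n).
```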
On Thu, Feb 19, 2015 at 8:07 AM Alessandro Lulli lu...@di.unipi.it wrote:
Hi All,
Could you please help me
The stupid question is whether you're deleting the file from hdfs on the
right node?
On Thu, Feb 19, 2015 at 11:31 AM Pavel Velikhov pavel.velik...@gmail.com
wrote:
Yeah, I do manually delete the files, but it still fails with this error.
On Feb 19, 2015, at 8:16 PM, Ganelin, Ilya
Welcome to Spark. What's more fun is that setting controls memory on the
executors but if you want to set memory limit on the driver you need to
configure it as a parameter of the spark-submit script. You also set
num-executors and executor-cores on the spark submit call.
See both the Spark
Hi Patrick - is that cleanup present in 1.1?
The overhead I am talking about is with regards to what I believe is
shuffle related metadata. If I watch the execution log I see small
broadcast variables created for every stage of execution, a few KB at a
time, and a certain number of MB remaining
Are you trying to do this in the shell? Shell is instantiated with a spark
context named sc.
-Ilya Ganelin
On Sat, Dec 27, 2014 at 5:24 PM, tfrisk tfris...@gmail.com wrote:
Hi,
Doing:
val ssc = new StreamingContext(conf, Seconds(1))
and getting:
Only one SparkContext may
Hello all - can anyone please offer any advice on this issue?
-Ilya Ganelin
On Mon, Dec 22, 2014 at 5:36 PM, Ganelin, Ilya ilya.gane...@capitalone.com
wrote:
Hi all, I have a long running job iterating over a huge dataset. Parts of
this operation are cached. Since the job runs for so long
Hi all. I've been working on a similar problem. One solution that is
straightforward (if suboptimal) is to do the following.
A.zipWithIndex().filter(s => s._2 >= range_start && s._2 < range_end). Lastly
just put that in a for loop. I've found that this approach scales very
well.
As Matei said another
The split is something like 30 million into 2 million partitions. The reason
that it becomes tractable is that after I perform the Cartesian on the
split data and operate on it I don't keep the full results - I actually
only keep a tiny fraction of that generated dataset - making the overall
Hello Steve.
1) When you call new SparkConf you should get an object with the default
config values. You can reference the spark configuration and tuning pages
for details on what those are.
2) Yes. Properties set in this configuration will be pushed down to worker
nodes actually executing the
-Ilya Ganelin
On Mon, Oct 27, 2014 at 6:12 PM, Xiangrui Meng men...@gmail.com wrote:
Could you save the data before ALS and try to reproduce the problem?
You might try reducing
In Spark, certain functions have an optional parameter to determine the
number of partitions (distinct, textFile, etc.). You can also use the
coalesce() or repartition() functions to change the number of partitions
for your RDD. Thanks.
On Oct 28, 2014 9:58 AM, shahab shahab.mok...@gmail.com
isn't obvious yet.
-Ilya Ganelin
- Original Message -
From: Ilya Ganelin ilgan...@gmail.com
To: user user@spark.apache.org
Sent: Monday, October 27, 2014 11:36:46 AM
Subject: MLLib ALS ArrayIndexOutOfBoundsException with Scala Spark 1.1.0
Hello all - I am attempting to run MLLib's ALS algorithm on a substantial
Hi all. Just upgraded our cluster to CDH 5.2 (with Spark 1.1) but now I can
no longer set the number of executors or executor-cores. No matter what
values I pass on the command line to spark they are overwritten by the
defaults. Does anyone have any idea what could have happened here? Running
on
Hey Steve - the way to do this is to use the coalesce() function to
coalesce your RDD into a single partition. Then you can do a saveAsTextFile
and you'll wind up with outputDir/part-00000 containing all the data.
-Ilya Ganelin
On Mon, Oct 20, 2014 at 11:01 PM, jay vyas jayunit100.apa
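What the coalesce(1) + saveAsTextFile recipe above does can be sketched with nested lists standing in for an RDD's partitions (hypothetical data):

```python
# Sketch of coalesce(1) followed by saveAsTextFile: all partitions are
# merged into one, so a single part file is written. Nested lists stand
# in for the RDD's partitions; no Spark needed for the sketch.
partitions = [["a", "b"], ["c"], ["d", "e"]]   # a 3-partition RDD

# coalesce(1): merge everything into a single partition
coalesced = [[line for part in partitions for line in part]]

# saveAsTextFile would now emit one part-00000 file with all the lines
single_file_lines = coalesced[0]
```

Note that coalesce(1) funnels all the data through a single task, so this recipe is only sensible when the final output is small.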
Check for any variables you've declared in your class. Even if you're not
calling them from the function they are passed to the worker nodes as part
of the context. Consequently, if you have something without a default
serializer (like an imported class) it will also get passed.
To fix this you
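The capture behavior described above can be sketched with Python's pickle standing in for Spark's closure serializer (class and field names here are hypothetical):

```python
import pickle
from functools import partial

class Model:
    """Driver-side class with a large member field (hypothetical example)."""
    def __init__(self):
        self.big_table = list(range(100_000))  # large; the task never uses it
        self.offset = 3                        # the only value the task needs

def add_via_model(model, x):
    # Referencing the instance drags the whole object along, big_table included
    return x + model.offset

def add_via_local(offset, x):
    # Copying the needed field into a local ships only the small value
    return x + offset

m = Model()
bad_task = partial(add_via_model, m)
good_task = partial(add_via_local, m.offset)

# The "bad" serialized task is far larger, just as a Spark closure that
# references an instance field pulls the enclosing object onto the workers.
bad_size = len(pickle.dumps(bad_task))
good_size = len(pickle.dumps(good_task))
```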
Also - if you're doing a text file read you can pass the number of
resulting partitions as the second argument.
On Oct 17, 2014 9:05 PM, Larry Liu larryli...@gmail.com wrote:
Thanks, Andrew. What about reading out of local?
On Fri, Oct 17, 2014 at 5:38 PM, Andrew Ash and...@andrewash.com
Hello all. Does anyone else have any suggestions? Even understanding what
this error is from would help a lot.
On Oct 11, 2014 12:56 AM, Ilya Ganelin ilgan...@gmail.com wrote:
Hi Akhil - I tried your suggestions and tried varying my partition sizes.
Reducing the number of partitions led
to each reducer was a problem, so Reynold has a patch that avoids that if
the number of tasks is large.
Matei
On Oct 10, 2014, at 10:09 PM, Ilya Ganelin ilgan...@gmail.com wrote:
Hi Matei - I read your post with great interest. Could you possibly
comment in more depth on some
Because of how closures work in Scala, there is no support for nested
map/rdd-based operations. Specifically, if you have
Context a {
Context b {
}
}
Operations within context b, when distributed across nodes, will no longer
have visibility of variables specific to context a because
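The usual workaround for the nesting restriction above can be sketched with plain Python lists standing in for RDDs (hypothetical data; collectAsMap() is the real Spark call that brings the inner data back to the driver):

```python
# Nested RDD operations like big_rdd.map(lambda kv: small_rdd.lookup(kv[0]))
# are not allowed: the inner closure would need the outer RDD on the workers.
# The usual fix is to materialize the small dataset first (collectAsMap()
# on a real RDD) and reference the plain collection inside the closure.
big = [("a", 1), ("b", 2), ("c", 3)]        # stands in for big_rdd
lookup = {"a": 10, "b": 20, "c": 30}        # small_rdd.collectAsMap() result

# The map closure now captures only an ordinary dict, which serializes fine
joined = [(k, v * lookup[k]) for k, v in big]
```

For a small dataset looked up from many tasks, wrapping the dict in sc.broadcast() avoids reshipping it with every task.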
to 200.
Thanks
Best Regards
On Fri, Oct 10, 2014 at 5:58 AM, Ilya Ganelin ilgan...@gmail.com wrote:
Hi all – I could use some help figuring out a couple of exceptions I’ve
been getting regularly.
I have been running on a fairly large dataset (150 gigs). With smaller
datasets I don't have
Hi Matei - I read your post with great interest. Could you possibly comment
in more depth on some of the issues you guys saw when scaling up spark and
how you resolved them? I am interested specifically in spark-related
problems. I'm working on scaling up spark to very large datasets and have
been
I would also be interested in knowing more about this. I have used the
cloudera manager and the spark resource interface (clientnode:4040) but
would love to know if there are other tools out there - either for post
processing or better observation during execution.
On Oct 9, 2014 4:50 PM, Rohit
On Oct 9, 2014 10:18 AM, Ilya Ganelin ilgan...@gmail.com wrote:
Hi all – I could use some help figuring out a couple of exceptions I’ve
been getting regularly.
I have been running on a fairly large dataset (150 gigs). With smaller
datasets I don't have any issues.
My sequence of operations
Hi all – I could use some help figuring out a couple of exceptions I’ve
been getting regularly.
I have been running on a fairly large dataset (150 gigs). With smaller
datasets I don't have any issues.
My sequence of operations is as follows – unless otherwise specified, I am
not caching:
Map a