Re: secondary sort

2014-09-22 Thread Koert Kuipers
https://issues.apache.org/jira/browse/SPARK-3655 On Mon, Sep 22, 2014 at 3:11 PM, Daniil Osipov wrote: > Adding an issue in JIRA would help keep track of the feature request: > > https://issues.apache.org/jira/browse/SPARK > > On Sat, Sep 20, 2014 at 7:39 AM, Koert Kuipers wrote

Re: Found both spark.driver.extraClassPath and SPARK_CLASSPATH

2014-09-21 Thread Koert Kuipers
. On Mon, Sep 15, 2014 at 11:16 AM, Koert Kuipers wrote: > in spark 1.1.0 i get this error: > > 2014-09-14 23:17:01 ERROR actor.OneForOneStrategy: Found both > spark.driver.extraClassPath and SPARK_CLASSPATH. Use only the former. > > i checked my applica

secondary sort

2014-09-20 Thread Koert Kuipers
now that spark has a sort based shuffle, can we expect a secondary sort soon? there are some use cases where getting a sorted iterator of values per key is helpful.
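
One way to get a sorted iterator of values per key is the composite-key trick; a minimal sketch, assuming Spark 1.2+ where repartitionAndSortWithinPartitions is available (key types and partition count are illustrative):

```scala
import org.apache.spark.{Partitioner, SparkContext}
import org.apache.spark.SparkContext._

// Partition only on the natural key so all values for a key land in the same
// partition, while the full (key, value) composite key drives the sort.
class NaturalKeyPartitioner(partitions: Int) extends Partitioner {
  def numPartitions: Int = partitions
  def getPartition(key: Any): Int = key match {
    case (k: String, _) => (k.hashCode % partitions + partitions) % partitions
  }
}

def sortedValuesPerKey(sc: SparkContext): Unit = {
  val rdd = sc.parallelize(Seq(("a", 3), ("a", 1), ("b", 2), ("a", 2)))
  rdd.map { case (k, v) => ((k, v), ()) }                             // composite key
    .repartitionAndSortWithinPartitions(new NaturalKeyPartitioner(2)) // shuffle + sort
    .map { case ((k, v), _) => (k, v) }                               // values now arrive sorted per key
    .foreachPartition(_.foreach(println))
}
```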

Re: Adjacency List representation in Spark

2014-09-18 Thread Koert Kuipers
we build our own adjacency lists as well. the main motivation for us was that graphx has some assumptions about everything fitting in memory (it has .cache statements all over the place). however if my understanding is wrong and graphx can handle graphs that do not fit in memory i would be interested t
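
For context, a minimal sketch of the kind of hand-rolled adjacency list referred to above, built from a plain edge RDD without GraphX (names and types are illustrative; groupByKey still needs each vertex's neighbor list to fit in memory on one executor):

```scala
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

// Collect the out-neighbors of every vertex into an adjacency list.
def adjacencyList(edges: RDD[(Long, Long)]): RDD[(Long, Seq[Long])] =
  edges.groupByKey().mapValues(_.toSeq)
```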

Re: SPARK_MASTER_IP

2014-09-15 Thread Koert Kuipers
hey mark, you think that this is on purpose, or is it an omission? thanks, koert On Mon, Sep 15, 2014 at 8:32 PM, Mark Grover wrote: > Hi Koert, > I work on Bigtop and CDH packaging and you are right, based on my quick > glance, it doesn't seem to be used. > > Mark >

Found both spark.driver.extraClassPath and SPARK_CLASSPATH

2014-09-15 Thread Koert Kuipers
in spark 1.1.0 i get this error: 2014-09-14 23:17:01 ERROR actor.OneForOneStrategy: Found both spark.driver.extraClassPath and SPARK_CLASSPATH. Use only the former. i checked my application. i do not set spark.driver.extraClassPath or SPARK_CLASSPATH. SPARK_CLASSPATH is set in spark-env.sh since

Re: spark 1.1.0 unit tests fail

2014-09-14 Thread Koert Kuipers
s > the root cause is port collisions from the many SparkContexts we create > over the course of the entire test. There is a patch that fixes this but > not back ported into branch-1.1 yet. I will do that shortly. > > -Andrew > > 2014-09-13 17:27 GMT-07:00 Koert Kuipers : > >

spark 1.1.0 unit tests fail

2014-09-13 Thread Koert Kuipers
on ubuntu 12.04 with 2 cores and 8G of RAM i see errors when i run the tests for spark 1.1.0. not sure how significant this is, since i used to see errors for spark 1.0.0 too $ java -version java version "1.6.0_43" Java(TM) SE Runtime Environment (build 1.6.0_43-b01) Java HotSpot(TM) 64-Bit Se

SPARK_MASTER_IP

2014-09-12 Thread Koert Kuipers
a grep for SPARK_MASTER_IP shows that sbin/start-master.sh and sbin/start-slaves.sh are the only ones that use it. yet for example in CDH5 the spark-master is started from /etc/init.d/spark-master by running bin/spark-class. does that mean SPARK_MASTER_IP is simply ignored? it looks like that to

Re: Mapping Hadoop Reduce to Spark

2014-08-31 Thread Koert Kuipers
matei, it is good to hear that the restriction that keys need to fit in memory no longer applies to combineByKey. however join requiring keys to fit in memory is still a big deal to me. does it apply to both sides of the join, or only one (while the other side is streaming)? On Sat, Aug 30, 201

SchemaRDD

2014-08-27 Thread Koert Kuipers
i feel like SchemaRDD has usage beyond just sql. perhaps it belongs in core?

mllib style

2014-08-11 Thread Koert Kuipers
i was just looking at ALS (mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala). is there any need for all the variables to be vars and to have all these setters around? it just leads to so much clutter. if you really want them to be vars it is safe in scala to make them public (scala
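
A tiny illustration of the point (a hypothetical class, not the actual ALS code): a public var in Scala already compiles down to a private field plus generated accessor methods, so hand-written setter boilerplate adds little.

```scala
// Hypothetical parameter holder, only to show what the compiler generates.
class AlsParams {
  var rank: Int = 10        // the compiler emits rank and rank_= accessors
  var iterations: Int = 20  // callers can simply write params.iterations = 50
}

val params = new AlsParams
params.rank = 50            // goes through the generated rank_= setter
```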

spark-submit symlink

2014-08-05 Thread Koert Kuipers
spark-submit doesnt handle being a symlink currently: $ spark-submit /usr/local/bin/spark-submit: line 44: /usr/local/bin/spark-class: No such file or directory /usr/local/bin/spark-submit: line 44: exec: /usr/local/bin/spark-class: cannot execute: No such file or directory to fix i changed the lin

Re: how to publish spark inhouse?

2014-08-01 Thread Koert Kuipers
netty makes it way into the dependencies which causes the fun exception when you run a spark job: *org*.*jboss*.*netty*.*channel*.*ChannelException*: *Failed to bind to ...* On Tue, Jul 29, 2014 at 4:17 PM, Koert Kuipers wrote: > SBT does actually have a runtime scrope, but things might indeed

Re: how to publish spark inhouse?

2014-07-29 Thread Koert Kuipers
takes some work, but eventually you > save more time and errors using them than doing it manually. > > distributionManagement is inherited from the standard Apache parent. > > On Tue, Jul 29, 2014 at 8:58 PM, Koert Kuipers wrote: > > all i want to do is 1) change the version number

Re: how to publish spark inhouse?

2014-07-29 Thread Koert Kuipers
t; > distributionManagement is inherited from the standard Apache parent. > > On Tue, Jul 29, 2014 at 8:58 PM, Koert Kuipers wrote: > > all i want to do is 1) change the version number and 2) publish spark to > our > > internal maven repo. > > > > i just

Re: how to publish spark inhouse?

2014-07-29 Thread Koert Kuipers
eat i can try adding that myself. but how come it isnt there by default? i mean, spark does end up on maven somehow? On Mon, Jul 28, 2014 at 3:39 PM, Koert Kuipers wrote: > ah ok thanks. guess i am gonna read up about maven-release-plugin then! > > > On Mon, Jul 28, 2014 at 3:37 PM, Se

Re: how to publish spark inhouse?

2014-07-29 Thread Koert Kuipers
/apache/spark/blob/master/dev/create-release/create-release.sh#L65 > > On Mon, Jul 28, 2014 at 12:39 PM, Koert Kuipers wrote: > > ah ok thanks. guess i am gonna read up about maven-release-plugin then! > > > > > > On Mon, Jul 28, 2014 at 3:37 PM, Sean Owen wrote: > &

Re: how to publish spark inhouse?

2014-07-28 Thread Koert Kuipers
or you by this plugin. > > Maven requires artifacts to set a version and it can't inherit one. I > feel like I understood the reason this is necessary at one point. > > On Mon, Jul 28, 2014 at 8:33 PM, Koert Kuipers wrote: > > and if i want to change the version, it seems

Re: how to publish spark inhouse?

2014-07-28 Thread Koert Kuipers
and if i want to change the version, it seems i have to change it in all 23 pom files? mhhh. is it mandatory for these sub-project pom files to repeat that version info? useful? spark$ grep 1.1.0-SNAPSHOT * -r | wc -l 23 On Mon, Jul 28, 2014 at 3:05 PM, Koert Kuipers wrote: > hey we used

how to publish spark inhouse?

2014-07-28 Thread Koert Kuipers
hey we used to publish spark inhouse by simply overriding the publishTo setting. but now that we are integrated in SBT with maven i cannot find it anymore. i tried looking into the pom file, but after reading 1144 lines of xml i 1) havent found anything that looks like publishing 2) i feel somewha
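
For reference, a minimal sketch of the sbt approach being described: overriding publishTo to point at an internal repository (the URL and credentials path are placeholders):

```scala
// build.sbt fragment: publish artifacts to an in-house repository.
publishTo := Some("inhouse-releases" at "https://repo.example.com/repositories/releases")

// Keep credentials out of the build file itself.
credentials += Credentials(Path.userHome / ".ivy2" / ".credentials")
```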

Re: Spark as a application library vs infra

2014-07-27 Thread Koert Kuipers
i used to do 1) but couldnt get it to work on yarn and the trend seemed towards 2) using spark-submit so i gave in. the main promise of 2) is that you can provide an application that can run on multiple hadoop and spark versions. however for that to become true spark needs to address the issue of us

Re: graphx cached partitions wont go away

2014-07-26 Thread Koert Kuipers
never mind I think its just the GC taking its time while I got many gigabytes of unused cached rdds that I cannot get rid of easily On Jul 26, 2014 4:44 PM, "Koert Kuipers" wrote: > i have graphx queries running inside a service where i collect the results > to the driver and

graphx cached partitions wont go away

2014-07-26 Thread Koert Kuipers
i have graphx queries running inside a service where i collect the results to the driver and do not hold any references to the rdds involved in the queries. my assumption was that with the references gone spark would go and remove the cached rdds from memory (note, i did not cache them, graphx did)
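
A minimal sketch of how a long-lived driver can evict such cached RDDs explicitly instead of waiting for GC-driven cleanup, assuming sc is the live SparkContext and the RDDs are truly no longer needed:

```scala
// Walk everything the context still considers persisted and unpersist it.
sc.getPersistentRDDs.foreach { case (id, rdd) =>
  println(s"unpersisting rdd $id")
  rdd.unpersist(blocking = false)
}
```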

using shapeless in spark to optimize data layout in memory

2014-07-23 Thread Koert Kuipers
hello all, in case anyone is interested, i just wrote a short blog about using shapeless in spark to optimize data layout in memory. blog is here: http://tresata.com/tresata-open-sources-spark-columnar code is here: https://github.com/tresata/spark-columnar

Re: replacement for SPARK_LIBRARY_PATH ?

2014-07-17 Thread Koert Kuipers
but be aware that spark-defaults.conf is only used if you use spark-submit On Jul 17, 2014 4:29 PM, "Zongheng Yang" wrote: > One way is to set this in your conf/spark-defaults.conf: > > spark.executor.extraLibraryPath /path/to/native/lib > > The key is documented here: > http://spark.apache.org/d
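
For applications launched without spark-submit, the same key can be set programmatically; a minimal sketch (the library path is a placeholder):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("native-libs-example")
  .set("spark.executor.extraLibraryPath", "/path/to/native/lib") // same key as in spark-defaults.conf
val sc = new SparkContext(conf)
```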

Re: spark ui on yarn

2014-07-13 Thread Koert Kuipers
in both, but one possibility is that your > executors were given less memory on YARN. Can you check that? Or otherwise, > how do you know that some RDDs were cached? > > Matei > > On Jul 12, 2014, at 4:12 PM, Koert Kuipers wrote: > > hey shuo, > so far all stage links wo

Re: spark ui on yarn

2014-07-12 Thread Koert Kuipers
of > executors. > > Best, > > > > On Fri, Jul 11, 2014 at 4:42 PM, Koert Kuipers wrote: > >> I just tested a long lived application (that we normally run in >> standalone mode) on yarn in client mode. >> >> it looks to me like cached rdds are missing in th

spark ui on yarn

2014-07-11 Thread Koert Kuipers
I just tested a long lived application (that we normally run in standalone mode) on yarn in client mode. it looks to me like cached rdds are missing in the storage tab of the ui. accessing the rdd storage information via the spark context shows rdds as fully cached but they are missing on storage

sparkStaging

2014-07-10 Thread Koert Kuipers
in spark 1.0.0 using yarn-client mode i am seeing that the sparkStaging directories do not get cleaned up. for example i run: $ spark-submit --class org.apache.spark.examples.SparkPi spark-examples-1.0.0-hadoop2.3.0-cdh5.0.2.jar 10 after which i have this directory left behind with one file in it

Re: Purpose of spark-submit?

2014-07-10 Thread Koert Kuipers
on having parity within >>> SparkConf/SparkContext >>> > where possible. In my use case, we launch our jobs programmatically. In >>> > theory, we could shell out to spark-submit but it's not the best >>> option for >>> > us. >>>

Re: Purpose of spark-submit?

2014-07-09 Thread Koert Kuipers
f/SparkContext if you are willing to set >>>> your >>>> > own config? >>>> > >>>> > If there are any gaps, +1 on having parity within >>>> SparkConf/SparkContext >>>> > where possible. In my use case, we launch

Re: RDD Cleanup

2014-07-09 Thread Koert Kuipers
we simply hold on to the reference to the rdd after it has been cached. so we have a single Map[String, RDD[X]] for cached RDDs for the application On Wed, Jul 9, 2014 at 11:00 AM, premdass wrote: > Hi, > > Yes . I am caching the RDD's by calling cache method.. > > > May i ask, how you are sha
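
A minimal sketch of that pattern: an explicit registry of cached RDDs, keyed by name, shared across jobs within one SparkContext (names are illustrative):

```scala
import scala.collection.concurrent.TrieMap
import org.apache.spark.rdd.RDD

object RddCache {
  private val cached = TrieMap.empty[String, RDD[_]]

  // Build and cache the RDD the first time a name is requested, reuse it afterwards.
  def getOrCache[T](name: String)(build: => RDD[T]): RDD[T] =
    cached.getOrElseUpdate(name, build.cache()).asInstanceOf[RDD[T]]

  // Drop the reference and release the cached blocks.
  def evict(name: String): Unit =
    cached.remove(name).foreach(_.unpersist(blocking = false))
}
```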

Re: RDD Cleanup

2014-07-09 Thread Koert Kuipers
did you explicitly cache the rdd? we cache rdds and share them between jobs just fine within one context in spark 1.0.x. but we do not use the ooyala job server... On Wed, Jul 9, 2014 at 10:03 AM, premdass wrote: > Hi, > > I using spark 1.0.0 , using Ooyala Job Server, for a low latency query

Re: Purpose of spark-submit?

2014-07-09 Thread Koert Kuipers
not sure I understand why unifying how you submit app for different platforms and dynamic configuration cannot be part of SparkConf and SparkContext? for classpath a simple script similar to "hadoop classpath" that shows what needs to be added should be sufficient. on spark standalone I can launc

Re: Disabling SparkContext WebUI on port 4040, accessing information programatically?

2014-07-08 Thread Koert Kuipers
do you control your cluster and spark deployment? if so, you can try to rebuild with jetty 9.x On Tue, Jul 8, 2014 at 9:39 AM, Martin Gammelsæter < martingammelsae...@gmail.com> wrote: > Digging a bit more I see that there is yet another jetty instance that > is causing the problem, namely the B

acl for spark ui

2014-07-07 Thread Koert Kuipers
i was testing using the acl for spark ui in secure mode on yarn in client mode. it works great. my spark 1.0.0 configuration has: spark.authenticate = true spark.ui.acls.enable = true spark.ui.view.acls = koert spark.ui.filters = org.apache.hadoop.security.authentication.server.AuthenticationFilte

Re: spark-assembly libraries conflict with needed libraries

2014-07-07 Thread Koert Kuipers
spark has a setting to put user jars in front of classpath, which should do the trick. however i had no luck with this. see here: https://issues.apache.org/jira/browse/SPARK-1863 On Mon, Jul 7, 2014 at 1:31 PM, Robert James wrote: > spark-submit includes a spark-assembly uber jar, which has o

tiers of caching

2014-07-07 Thread Koert Kuipers
i noticed that some algorithms such as graphx liberally cache RDDs for efficiency, which makes sense. however it can also leave a long trail of unused yet cached RDDs, that might push other RDDs out of memory. in a long-lived spark context i would like to decide which RDDs stick around. would it m

Re: graphx Joining two VertexPartitions with different indexes is slow.

2014-07-07 Thread Koert Kuipers
you could only do the deep check if the hashcodes are the same and design hashcodes that do not take all elements into account. the alternative seems to be putting cache statements all over graphx, as is currently the case, which is trouble for any long lived application where caching is carefully

Re: graphx Joining two VertexPartitions with different indexes is slow.

2014-07-06 Thread Koert Kuipers
probably a dumb question, but why is reference equality used for the indexes? On Sun, Jul 6, 2014 at 12:43 AM, Ankur Dave wrote: > When joining two VertexRDDs with identical indexes, GraphX can use a fast > code path (a zip join without any hash lookups). However, the check for > identical inde

Re: graphx Joining two VertexPartitions with different indexes is slow.

2014-07-05 Thread Koert Kuipers
thanks for replying. why is joining two vertexrdds without caching slow? what is recomputed unnecessarily? i am not sure what is different here from joining 2 regular RDDs (where nobody seems to recommend to cache before joining i think...) On Thu, Jul 3, 2014 at 10:52 PM, Ankur Dave wrote: > O

Re: taking top k values of rdd

2014-07-05 Thread Koert Kuipers
that the p priority queues do not get send to the driver (which is not on cluster) On Sat, Jul 5, 2014 at 1:20 PM, Koert Kuipers wrote: > i guess i could create a single priorityque per partition, then shuffle to > a new rdd with 1 partition, and then reduce? > > > On Sat, Jul 5,

Re: taking top k values of rdd

2014-07-05 Thread Koert Kuipers
ust top k the combined top k from each partition > (assuming you have (object, count) for each top k list). > > — > Sent from Mailbox <https://www.dropbox.com/mailbox> > > > On Sat, Jul 5, 2014 at 10:17 AM, Koert Kuipers wrote: > >> my initial approach to taking

Re: taking top k values of rdd

2014-07-05 Thread Koert Kuipers
i guess i could create a single priorityque per partition, then shuffle to a new rdd with 1 partition, and then reduce? On Sat, Jul 5, 2014 at 1:16 PM, Koert Kuipers wrote: > my initial approach to taking top k values of a rdd was using a > priority-queue monoid. along these

taking top k values of rdd

2014-07-05 Thread Koert Kuipers
my initial approach to taking top k values of a rdd was using a priority-queue monoid. along these lines: rdd.mapPartitions({ items => Iterator.single(new PriorityQueue(...)) }, false).reduce(monoid.plus) this works fine, but looking at the code for reduce it first reduces within a partition (whi
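
A minimal sketch of the per-partition idea discussed in this thread, without a monoid library: each partition keeps only its own top k, and the small intermediate results are merged in reduce (the built-in rdd.top(k) implements essentially the same thing):

```scala
import org.apache.spark.rdd.RDD

// Keep at most k elements per partition, then merge the per-partition results.
def topK(rdd: RDD[Int], k: Int): Seq[Int] = {
  val perPartition = rdd.mapPartitions { items =>
    val top = items.foldLeft(Vector.empty[Int]) { (acc, x) =>
      (acc :+ x).sortBy(-_).take(k)
    }
    Iterator.single(top)
  }
  perPartition.reduce { (a, b) => (a ++ b).sortBy(-_).take(k) }
}
```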

Re: MLLib : Math on Vector and Matrix

2014-07-02 Thread Koert Kuipers
i did the second option: re-implemented .toBreeze as .breeze using pimp classes On Wed, Jul 2, 2014 at 5:00 PM, Thunder Stumpges wrote: > I am upgrading from Spark 0.9.0 to 1.0 and I had a pretty good amount of > code working with internals of MLLib. One of the big changes was the move > from t
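
A sketch of that pimp-class approach: re-creating the private toBreeze conversion in user code with an implicit wrapper (breeze constructor signatures assumed from the 0.x line):

```scala
import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, Vector => BV}
import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vector}

// User-side stand-in for the private mllib toBreeze conversion.
implicit class BreezeVectorOps(v: Vector) {
  def breeze: BV[Double] = v match {
    case d: DenseVector  => new BDV(d.values)
    case s: SparseVector => new BSV(s.indices, s.values, s.size)
  }
}
```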

why is toBreeze private everywhere in mllib?

2014-07-01 Thread Koert Kuipers
its kind of handy to be able to convert stuff to breeze... is there some other way i am supposed to access that functionality?

graphx Joining two VertexPartitions with different indexes is slow.

2014-06-25 Thread Koert Kuipers
lately i am seeing a lot of this warning in graphx: org.apache.spark.graphx.impl.ShippableVertexPartitionOps: Joining two VertexPartitions with different indexes is slow. i am using Graph.outerJoinVertices to join in data from a regular RDD (that is co-partitioned). i would like this operation to

Re: Spark's Hadooop Dependency

2014-06-25 Thread Koert Kuipers
libraryDependencies ++= Seq( "org.apache.spark" %% "spark-core" % versionSpark % "provided" exclude("org.apache.hadoop", "hadoop-client"), "org.apache.hadoop" % "hadoop-client" % versionHadoop % "provided" ) On Wed, Jun 25, 2014 at 11:26 AM, Robert James wrote: > To add Spark to a SBT projec

Re: Using Spark as web app backend

2014-06-24 Thread Koert Kuipers
run your spark app in client mode together with a spray rest service, that the front end can talk to On Tue, Jun 24, 2014 at 3:12 AM, Jaonary Rabarisoa wrote: > Hi all, > > So far, I run my spark jobs with spark-shell or spark-submit command. I'd > like to go further and I wonder how to use spa

Re: little confused about SPARK_JAVA_OPTS alternatives

2014-06-20 Thread Koert Kuipers
means everything must be submitted through spark-submit, which is fairly new and i am not sure how much we will use that yet. i will look into that some more. On Thu, Jun 19, 2014 at 6:56 PM, Koert Kuipers wrote: > for a jvm application its not very appealing to me to use spark submit &

Re: Running Spark alongside Hadoop

2014-06-20 Thread Koert Kuipers
for development/testing i think its fine to run them side by side as you suggested, using spark standalone. just be realistic about what size data you can load with limited RAM. On Fri, Jun 20, 2014 at 3:43 PM, Mayur Rustagi wrote: > The ideal way to do that is to use a cluster manager like Yar

Re: spark on yarn is trying to use file:// instead of hdfs://

2014-06-20 Thread Koert Kuipers
ok solved it. as it happened in spark/conf i also had a file called core-site.xml (with some tachyon related stuff in it) so thats why it ignored /etc/hadoop/conf/core-site.xml On Fri, Jun 20, 2014 at 3:24 PM, Koert Kuipers wrote: > i put some logging statements in yarn.Client and t

Re: spark on yarn is trying to use file:// instead of hdfs://

2014-06-20 Thread Koert Kuipers
from /etc/hadoop/conf/yarn-site.xml strange! On Fri, Jun 20, 2014 at 1:26 PM, Koert Kuipers wrote: > in /etc/hadoop/conf/core-site.xml: > > fs.defaultFS > hdfs://cdh5-yarn.tresata.com:8020 > > > > also hdfs seems the default: > [koert@cdh5-yarn ~]$ hado

Re: spark on yarn is trying to use file:// instead of hdfs://

2014-06-20 Thread Koert Kuipers
PM, bc Wong wrote: > Koert, is there any chance that your fs.defaultFS isn't setup right? > > > On Fri, Jun 20, 2014 at 9:57 AM, Koert Kuipers wrote: > >> yeah sure see below. i strongly suspect its something i misconfigured >> causing yarn to try to

Re: spark on yarn is trying to use file:// instead of hdfs://

2014-06-20 Thread Koert Kuipers
tion_1403201750110_0060 appUser: koert On Fri, Jun 20, 2014 at 12:42 PM, Marcelo Vanzin wrote: > Hi Koert, > > Could you provide more details? Job arguments, log messages, errors, etc. > > On Fri, Jun 20, 2014 at 9:40 AM, Koert Kuipers wrote: > > i noticed that when

spark on yarn is trying to use file:// instead of hdfs://

2014-06-20 Thread Koert Kuipers
i noticed that when i submit a job to yarn it mistakenly tries to upload files to local filesystem instead of hdfs. what could cause this? in spark-env.sh i have HADOOP_CONF_DIR set correctly (and spark-submit does find yarn), and my core-site.xml has a fs.defaultFS that is hdfs, not local filesys

Re: trying to understand yarn-client mode

2014-06-20 Thread Koert Kuipers
that should solve the > problem. > > On Thu, Jun 19, 2014 at 10:22 AM, Koert Kuipers wrote: > > i am trying to understand how yarn-client mode works. i am not using > > Application application_1403117970283_0014 failed 2 times due to AM > > Container for appattempt_1403

Re: little confused about SPARK_JAVA_OPTS alternatives

2014-06-19 Thread Koert Kuipers
s to launching Spark applications > will be done through spark-submit, so you may miss out on relevant new > features or bug fixes. > > Andrew > > > > 2014-06-19 7:41 GMT-07:00 Koert Kuipers : > > still struggling with SPARK_JAVA_OPTS being deprecated. i am using spark >>

Re: trying to understand yarn-client mode

2014-06-19 Thread Koert Kuipers
--- > My Blog: https://www.dbtsai.com > LinkedIn: https://www.linkedin.com/in/dbtsai > > > On Thu, Jun 19, 2014 at 12:08 PM, Koert Kuipers wrote: > > db tsai, > > if in yarn-cluster mode the driver runs inside yarn, how can you do a > > rdd.collect and bring the res

Re: trying to understand yarn-client mode

2014-06-19 Thread Koert Kuipers
the > > driver's logic. Actions involve communication between local driver and > > remote cluster executors. So, there is some additional network overhead, > > especially if the driver is not co-located on the cluster. In > yarn-cluster > > mode -- in contrast, the

trying to understand yarn-client mode

2014-06-19 Thread Koert Kuipers
i am trying to understand how yarn-client mode works. i am not using spark-submit, but instead launching a spark job from within my own application. i can see my application contacting yarn successfully, but then in yarn i get an immediate error: Application application_1403117970283_0014 failed

Re: little confused about SPARK_JAVA_OPTS alternatives

2014-06-19 Thread Koert Kuipers
still struggling with SPARK_JAVA_OPTS being deprecated. i am using spark standalone. for example if i have a akka timeout setting that i would like to be applied to every piece of the spark framework (so spark master, spark workers, spark executor sub-processes, spark-shell, etc.). i used to do th

Re: mismatched hdfs protocol

2014-06-04 Thread Koert Kuipers
you have to build spark against the version of hadoop you are using On Wed, Jun 4, 2014 at 10:25 PM, bluejoe2008 wrote: > hi, all > when my spark program accessed hdfs files > an error happened: > > Exception in thread "main" org.apache.hadoop.ipc.RemoteException: Server IPC > version 9 cann

Re: life if an executor

2014-05-20 Thread Koert Kuipers
) On Tue, May 20, 2014 at 1:06 PM, Aaron Davidson wrote: > One issue is that new jars can be added during the lifetime of a > SparkContext, which can mean after executors are already started. Off-heap > storage is always serialized, correct. > > > On Tue, May 20, 2014 at 6:48

Re: life if an executor

2014-05-20 Thread Koert Kuipers
May 19, 2014, at 8:44 PM, Koert Kuipers wrote: > > from looking at the source code i see executors run in their own jvm > subprocesses. > > how long to they live for? as long as the worker/slave? or are they tied > to the sparkcontext and life/die with it? > > thx > > >

Re: life if an executor

2014-05-20 Thread Koert Kuipers
>> >> >> On Mon, May 19, 2014 at 10:06 PM, Matei Zaharia >> wrote: >> >>> They’re tied to the SparkContext (application) that launched them. >>> >>> Matei >>> >>> On May 19, 2014, at 8:44 PM, Koert Kuipers wrote: >>> >

life if an executor

2014-05-19 Thread Koert Kuipers
from looking at the source code i see executors run in their own jvm subprocesses. how long do they live for? as long as the worker/slave? or are they tied to the sparkcontext and live/die with it? thx

Re: File present but file not found exception

2014-05-19 Thread Koert Kuipers
why does it need to be a local file? why not do some filter ops on the hdfs file and save to hdfs, from where you can create an rdd? you can read a small file in on the driver program and use sc.parallelize to turn it into an RDD On May 16, 2014 7:01 PM, "Sai Prasanna" wrote: > I found that if a file is present
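
A sketch of that suggestion: read the small file on the driver and hand it to the cluster with sc.parallelize (the path is a placeholder, sc is the live SparkContext):

```scala
import scala.io.Source

// The file only needs to exist on the driver machine; parallelize ships the
// data out to the executors as part of the job.
val lines = Source.fromFile("/local/path/on/driver/small.txt").getLines().toSeq
val rdd = sc.parallelize(lines)
```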

Re: java serialization errors with spark.files.userClassPathFirst=true

2014-05-16 Thread Koert Kuipers
child and somehow this means the companion objects are reset or something like that because i get NPEs. On Fri, May 16, 2014 at 3:54 PM, Koert Kuipers wrote: > ok i think the issue is visibility: a classloader can see all classes > loaded by its parent classloader. but userClassLoader do

Re: java serialization errors with spark.files.userClassPathFirst=true

2014-05-16 Thread Koert Kuipers
(JavaSerializer.scala:60) On Fri, May 16, 2014 at 1:46 PM, Koert Kuipers wrote: > after removing all class paramater of class Path from my code, i tried > again. different but related eror when i set > spark.files.userClassPathFirst=true > > now i dont even use FileInputFormat dire

java serialization errors with spark.files.userClassPathFirst=true

2014-05-16 Thread Koert Kuipers
when i set spark.files.userClassPathFirst=true, i get java serialization errors in my tasks, see below. when i set userClassPathFirst back to its default of false, the serialization errors are gone. my spark.serializer is KryoSerializer. the class org.apache.hadoop.fs.Path is in the spark assembly

Re: cant get tests to pass anymore on master master

2014-05-16 Thread Koert Kuipers
yeah sure. it is ubuntu 12.04 with jdk1.7.0_40 what else is relevant that i can provide? On Thu, May 15, 2014 at 12:17 PM, Sean Owen wrote: > FWIW I see no failures. Maybe you can say more about your environment, etc. > > On Wed, May 7, 2014 at 10:01 PM, Koert Kuipers wrote: > &g

writing my own RDD

2014-05-16 Thread Koert Kuipers
in writing my own RDD i ran into a few issues with respect to stuff being private in spark. in compute i would like to return an iterator that respects task killing (as HadoopRDD does), but the mechanics for that are inside the private InterruptibleIterator. also the exception i am supposed to thr

Re: java serialization errors with spark.files.userClassPathFirst=true

2014-05-16 Thread Koert Kuipers
nputFormat which probably somewhere statically references FileInputFormat, which is invisible to userClassLoader. On Fri, May 16, 2014 at 3:32 PM, Koert Kuipers wrote: > ok i put lots of logging statements in the ChildExecutorURLClassLoader. > this is what i see: > > * the urls for userC

Re: java serialization errors with spark.files.userClassPathFirst=true

2014-05-16 Thread Koert Kuipers
). i currently catch this NoClassDefFoundError and call parentClassLoader.loadClass but thats clearly not a solution since it loads the wrong version. On Fri, May 16, 2014 at 2:25 PM, Koert Kuipers wrote: > well, i modified ChildExecutorURLClassLoader to also delegate to > parentClassloa

Re: java serialization errors with spark.files.userClassPathFirst=true

2014-05-16 Thread Koert Kuipers
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348) On Thu, May 15, 2014 at 3:03 PM, Koert Kuipers wrote: > when i set spark.files.userClassPathFirst=true, i get java serialization > err

Re: cant get tests to pass anymore on master master

2014-05-16 Thread Koert Kuipers
: > Since the error concerns a timeout -- is the machine slowish? > > What about blowing away everything in your local maven repo, do a > clean, etc. to rule out environment issues? > > I'm on OS X here FWIW. > > On Thu, May 15, 2014 at 5:24 PM, Koert Kuipers wrote: >

cant get tests to pass anymore on master master

2014-05-16 Thread Koert Kuipers
i used to be able to get all tests to pass. with java 6 and sbt i get PermGen errors (no matter how high i make the PermGen). so i have given up on that. with java 7 i see 1 error in a bagel test and a few in streaming tests. any ideas? see the error in BagelSuite below. [info] - large number of

Re: How to use Mahout VectorWritable in Spark.

2014-05-15 Thread Koert Kuipers
VectorWritable is not in mahout-math jar but in mahout-core jar, so you need to include both On Wed, May 14, 2014 at 3:43 AM, Stuti Awasthi wrote: > Hi Xiangrui, > Thanks for the response .. I tried few ways to include mahout-math jar > while launching Spark shell.. but no success.. Can you ple
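
In sbt terms the advice amounts to something like the following (the version number is an assumption; use whatever matches your cluster):

```scala
libraryDependencies ++= Seq(
  "org.apache.mahout" % "mahout-core" % "0.9", // contains VectorWritable
  "org.apache.mahout" % "mahout-math" % "0.9"  // needed by mahout-core at runtime
)
```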

Re: cant get tests to pass anymore on master master

2014-05-15 Thread Koert Kuipers
i did not save it. next time i try to run it i will also send those. it was also a timeout. On Mon, May 12, 2014 at 4:59 PM, Tathagata Das wrote: > Can you also send us the error you are seeing in the streaming suites? > > TD > > > On Sun, May 11, 2014 at 11:50 AM, Ko

Re: little confused about SPARK_JAVA_OPTS alternatives

2014-05-15 Thread Koert Kuipers
spark.akka.frameSize", > "1").setAppName(...).setMaster(...) > val sc = new SparkContext(conf) > > - Patrick > > On Wed, May 14, 2014 at 9:09 AM, Koert Kuipers wrote: > > i have some settings that i think are relevant for my application. they > are

little confused about SPARK_JAVA_OPTS alternatives

2014-05-14 Thread Koert Kuipers
i have some settings that i think are relevant for my application. they are spark.akka settings so i assume they are relevant for both executors and my driver program. i used to do: SPARK_JAVA_OPTS="-Dspark.akka.frameSize=1" now this is deprecated. the alternatives mentioned are: * some spark
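
The replacement the replies in this thread converge on is to set the same properties on SparkConf (or in conf/spark-defaults.conf when using spark-submit); a minimal sketch with placeholder app name and master:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Per-application settings replace the old global SPARK_JAVA_OPTS.
val conf = new SparkConf()
  .setAppName("my-app")               // placeholder
  .setMaster("spark://master:7077")   // placeholder standalone master
  .set("spark.akka.frameSize", "1")   // the value used earlier in the thread
val sc = new SparkContext(conf)
```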

Re: File present but file not found exception

2014-05-11 Thread Koert Kuipers
are you running spark on a cluster? if so, the executors will not be able to find a file on your local computer. On Thu, May 8, 2014 at 2:48 PM, Sai Prasanna wrote: > Hi Everyone, > > I think all are pretty busy, the response time in this group has slightly > increased. > > But anyways, this is

Re: writing my own RDD

2014-05-11 Thread Koert Kuipers
will do On May 11, 2014 6:44 PM, "Aaron Davidson" wrote: > You got a good point there, those APIs should probably be marked as > @DeveloperAPI. Would you mind filing a JIRA for that ( > https://issues.apache.org/jira/browse/SPARK)? > > > On Sun, May 11, 2014 at 11

Re: cant get tests to pass anymore on master master

2014-05-11 Thread Koert Kuipers
resending because the list didnt seem to like my email before On Wed, May 7, 2014 at 5:01 PM, Koert Kuipers wrote: > i used to be able to get all tests to pass. > > with java 6 and sbt i get PermGen errors (no matter how high i make the > PermGen). so i have given up on that. >

Re: writing my own RDD

2014-05-11 Thread Koert Kuipers
resending... my email somehow never made it to the user list. On Fri, May 9, 2014 at 2:11 PM, Koert Kuipers wrote: > in writing my own RDD i ran into a few issues with respect to stuff being > private in spark. > > in compute i would like to return an iterator that respects task k

Re: os buffer cache does not cache shuffle output file

2014-05-10 Thread Koert Kuipers
yes it seems broken. i got only a few emails in last few days On Fri, May 9, 2014 at 7:24 AM, wxhsdp wrote: > is there something wrong with the mailing list? very few people see my > thread > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/os-buffe

Re: performance improvement on second operation...without caching?

2014-05-03 Thread Koert Kuipers
Hey Matei, Not sure i understand that. These are 2 separate jobs. So the second job takes advantage of the fact that there is map output left somewhere on disk from the first job, and re-uses that? On Sat, May 3, 2014 at 8:29 PM, Matei Zaharia wrote: > Hi Diana, > > Apart from these reasons, in

Re: Spark: issues with running a sbt fat jar due to akka dependencies

2014-05-02 Thread Koert Kuipers
gt; That did not work. I specified it in my email already. But I figured a way > around it by excluding akka dependencies > > Shivani > > > On Tue, Apr 29, 2014 at 12:37 PM, Koert Kuipers wrote: > >> you need to merge reference.conf files and its no longer an issue.

Re: Spark: issues with running a sbt fat jar due to akka dependencies

2014-04-29 Thread Koert Kuipers
you need to merge reference.conf files and its no longer an issue. see the Build for for spark itself: case "reference.conf" => MergeStrategy.concat On Tue, Apr 29, 2014 at 3:32 PM, Shivani Rao wrote: > Hello folks, > > I was going to post this question to spark user group as well. If you ha
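
A sketch of that sbt-assembly setting in the newer assemblyMergeStrategy style (older plugin versions spell it mergeStrategy in assembly, as in Spark's own build):

```scala
// build.sbt fragment: concatenate every reference.conf so the akka defaults
// from all jars survive in the fat jar.
assemblyMergeStrategy in assembly := {
  case "reference.conf" => MergeStrategy.concat
  case other =>
    val defaultStrategy = (assemblyMergeStrategy in assembly).value
    defaultStrategy(other)
}
```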

Re: Storage information about an RDD from the API

2014-04-29 Thread Koert Kuipers
SparkContext.getRDDStorageInfo On Tue, Apr 29, 2014 at 12:34 PM, Andras Nemeth < andras.nem...@lynxanalytics.com> wrote: > Hi, > > Is it possible to know from code about an RDD if it is cached, and more > precisely, how many of its partitions are cached in memory and how many are > cached on dis
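
A sketch of what that call returns, using field names from the public RDDInfo class:

```scala
// Print the cache status of every RDD the context knows about.
sc.getRDDStorageInfo.foreach { info =>
  println(s"rdd ${info.id} '${info.name}': " +
    s"${info.numCachedPartitions}/${info.numPartitions} partitions cached, " +
    s"${info.memSize} bytes in memory, ${info.diskSize} bytes on disk")
}
```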

Re: ui broken in latest 1.0.0

2014-04-19 Thread Koert Kuipers
s here: https://issues.apache.org/jira/browse/SPARK-1538. > > Thanks again for reporting this. I will push out a fix shortly. > Andrew > > > On Tue, Apr 8, 2014 at 1:30 PM, Koert Kuipers wrote: > >> our one cached RDD in this run has id 3 >> >> >> >> &

Re: Anyone using value classes in RDDs?

2014-04-18 Thread Koert Kuipers
aren't value classes for primitives (AnyVal) only? that doesn't apply to string, which is an object (AnyRef) On Fri, Apr 18, 2014 at 2:51 PM, kamatsuoka wrote: > I'm wondering if anyone has tried using value classes in RDDs? My use case > is that I have a number of RDDs containing strings, e.g.

Re: ui broken in latest 1.0.0

2014-04-08 Thread Koert Kuipers
d_3_0 -> BlockStatus(StorageLevel(false, true, false, true, 1),19944,0,0 *** onStageCompleted ********** _rddInfoMap: Map() On Tue, Apr 8, 2014 at 4:20 PM, Koert Kuipers wrote: > 1) at the end of the callback > > 2) yes we simply expose sc.getRDDStorageInfo to the user via REST > > 3

Re: ui broken in latest 1.0.0

2014-04-08 Thread Koert Kuipers
into this on my side, but do let me know > once you have any updates. > > Andrew > > > On Tue, Apr 8, 2014 at 11:26 AM, Koert Kuipers wrote: > >> yet at same time i can see via our own api: >> >> "storageInfo": { >> "diskSize"

Re: ui broken in latest 1.0.0

2014-04-08 Thread Koert Kuipers
yet at same time i can see via our own api: "storageInfo": { "diskSize": 0, "memSize": 19944, "numCachedPartitions": 1, "numPartitions": 1 } On Tue, Apr 8, 2014 at 2:25 PM, Koert Kuipers wrote: >

Re: ui broken in latest 1.0.0

2014-04-08 Thread Koert Kuipers
chyonSize: 0.0 B; DiskSize: 0.0 B) *** onStageCompleted ** Map() The storagelevels you see here are never the ones of my RDDs. and apparently updateRDDInfo never gets called (i had println in there too). On Tue, Apr 8, 2014 at 2:13 PM, Koert Kuipers wrote: > yes

Re: ui broken in latest 1.0.0

2014-04-08 Thread Koert Kuipers
to run ./make-distribution.sh to > re-compile Spark first. -Xiangrui > > > On Tue, Apr 8, 2014 at 9:57 AM, Koert Kuipers wrote: > >> sorry, i meant to say: note that for a cached rdd in the spark shell it >> all works fine. but something is going wrong with the SPARK-APPLICATION-U
