Re: Understanding shuffle file name conflicts

2015-03-25 Thread Saisai Shao
Yes, as Josh said, when an application is started, Spark creates a unique application-wide folder for related temporary files. Jobs in this application will have unique shuffle ids with unique file names, so shuffle stages within the app will not run into name conflicts. Also shuffle files between

Re: Understanding shuffle file name conflicts

2015-03-25 Thread Saisai Shao
DiskBlockManager doesn't need to know the app id; all it needs to do is create a folder with a unique name (UUID-based) and then put all the shuffle files into it. You can see the code in DiskBlockManager below: it creates a bunch of unique folders when initialized, these folders are app
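The mechanism described above can be sketched in plain Python (the real DiskBlockManager is Scala inside Spark core; all names here are illustrative, not Spark's API): each instance creates its own UUID-named local directory at startup, and file names embed the shuffle id, so neither applications nor stages collide.

```python
import os
import tempfile
import uuid

class DiskBlockManagerSketch:
    """Toy model of per-application unique local dirs (names are illustrative)."""

    def __init__(self, root=None):
        # Each instance gets its own UUID-named folder, so two applications
        # (or two runs of the same application) never share a directory.
        root = root or tempfile.gettempdir()
        self.local_dir = os.path.join(root, "spark-local-" + uuid.uuid4().hex)
        os.makedirs(self.local_dir)

    def shuffle_file(self, shuffle_id, map_id, reduce_id):
        # File names embed the unique shuffle id, so stages within one
        # application do not conflict either.
        name = f"shuffle_{shuffle_id}_{map_id}_{reduce_id}.data"
        return os.path.join(self.local_dir, name)

a = DiskBlockManagerSketch()
b = DiskBlockManagerSketch()
print(a.local_dir != b.local_dir)  # two "applications" get distinct folders
```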

Can't assembly YARN project with SBT

2015-03-25 Thread Zoltán Zvara
Hi! I'm using the latest IntelliJ and I can't compile the yarn project into the Spark assembly fat JAR. That is why I'm getting a SparkException with the message "Unable to load YARN support". The yarn project is also missing from the SBT tasks and I can't add it. How can I force SBT to include it? Thanks!

RE: Understanding shuffle file name conflicts

2015-03-25 Thread Shao, Saisai
Hi Cheng, I think your scenario is acceptable for Spark's shuffle mechanism and will not cause shuffle file name conflicts. From my understanding, the code snippet you mentioned is the same RDD graph, just run twice; these two jobs will generate 3 stages, a map stage and collect

Re: Understanding shuffle file name conflicts

2015-03-25 Thread Cheng Lian
Ah, I see where I'm wrong here. What is reused here are the shuffle map output files themselves, rather than the file paths. No new shuffle map output files are generated for the 2nd job. Thanks! I really need to walk through the Spark core code again :) Cheng On 3/25/15 9:31 PM, Shao, Saisai

functools.partial as UserDefinedFunction

2015-03-25 Thread Karlson
Hi all, passing a functools.partial function as a UserDefinedFunction to DataFrame.select raises an AttributeError, because functools.partial does not have the attribute __name__. Is there any alternative to relying on __name__ in pyspark/sql/functions.py:126?
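The problem is easy to reproduce in plain Python, and a common workaround on the caller's side (short of changing pyspark itself) is to attach a `__name__` to the partial object before passing it along; partial objects accept arbitrary attribute assignment:

```python
import functools

def add(a, b):
    return a + b

inc = functools.partial(add, 1)

# functools.partial objects carry func/args/keywords, but no __name__:
print(hasattr(inc, "__name__"))  # False

# Workaround: set __name__ explicitly before handing the partial to code
# (like pyspark's UserDefinedFunction) that expects the attribute.
inc.__name__ = "inc"
print(inc.__name__, inc(41))  # inc 42
```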

Re: hadoop input/output format advanced control

2015-03-25 Thread Koert Kuipers
my personal preference would be something like a Map[String, String] that only reflects the changes you want to make to the Configuration for the given input/output format (so system-wide defaults continue to come from sc.hadoopConfiguration), similar to what cascading/scalding did, but am
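The layering Koert describes can be sketched in Python terms (function and key names here are invented for illustration, not a real Spark or Hadoop API): per-format overrides are a small map applied on top of the system-wide defaults only when the input/output format is instantiated.

```python
def effective_conf(system_defaults, per_format_overrides):
    """Layer per-input/output-format overrides over system-wide defaults.

    system_defaults stands in for sc.hadoopConfiguration; the overrides map
    only records the keys the caller wants to change for this one format.
    """
    conf = dict(system_defaults)       # defaults still come from the system
    conf.update(per_format_overrides)  # the small map wins where it is set
    return conf

defaults = {"fs.defaultFS": "hdfs://nn:8020", "io.compression.codecs": "gzip"}
overrides = {"io.compression.codecs": "snappy"}

conf = effective_conf(defaults, overrides)
print(conf["io.compression.codecs"])  # snappy
print(conf["fs.defaultFS"])           # hdfs://nn:8020 (untouched default)
```

Note the system-wide defaults dict itself is never mutated, which is the point of the proposal.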

Re: mllib.recommendation Design

2015-03-25 Thread Debasish Das
Hi Xiangrui, I am facing some minor issues in implementing Alternating Nonlinear Minimization as documented in this JIRA, due to the ALS code being in the ml package: https://issues.apache.org/jira/browse/SPARK-6323 I need to use Vectors.fromBreeze / Vectors.toBreeze, but they are package private on

LogisticGradient Design

2015-03-25 Thread Debasish Das
Hi, Right now LogisticGradient implements both binary and multi-class in the same class using an if-else statement, which is a bit convoluted. For generalized matrix factorization, if the data has distinct ratings I want to use LeastSquareGradient (regression has given the best results to date) but

jenkins upgraded to 1.606....

2015-03-25 Thread shane knapp
...due to some big security fixes: https://wiki.jenkins-ci.org/display/SECURITY/Jenkins+Security+Advisory+2015-03-23 :) shane

Re: hadoop input/output format advanced control

2015-03-25 Thread Patrick Wendell
Yeah, I agree that might have been nicer, but I think for consistency with the input APIs maybe we should do the same thing. We can also give an example of how to clone sc.hadoopConfiguration and then set some new values: val conf = sc.hadoopConfiguration.clone().set(k1, v1).set(k2, v2)

Re: Jira Issues

2015-03-25 Thread Reynold Xin
Igor, Welcome -- everything is open here: https://issues.apache.org/jira/browse/SPARK You should be able to see them even if you are not an ASF member. On Wed, Mar 25, 2015 at 1:51 PM, Igor Costa igorco...@apache.org wrote: Hi there Guys. I want to be more collaborative to Spark, but I

Re: Jira Issues

2015-03-25 Thread Sean Owen
It's just the standard Apache JIRA, nothing separate. I'd say JIRA is used to track issues, bugs, features, but Github is where the concrete changes to implement those things are discussed and merged. So for a non-trivial issue, you'd want to describe the issue in general in JIRA, and then open a

Re: Jira Issues

2015-03-25 Thread Igor Costa
Thank you guys for the info. It was actually a problem with my id on Apache, rather than needing to be logged in to view issues. I'm browsing some issues now. Best Regards Igor Costa www.igorcosta.com www.igorcosta.org On Wed, Mar 25, 2015 at 5:58 PM, Sean Owen

Re: Jira Issues

2015-03-25 Thread Ted Yu
Issues are tracked on Apache JIRA: https://issues.apache.org/jira/browse/SPARK/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel Cheers On Wed, Mar 25, 2015 at 1:51 PM, Igor Costa igorco...@apache.org wrote: Hi there Guys. I want to be more collaborative to Spark, but I have

Jira Issues

2015-03-25 Thread Igor Costa
Hi there guys. I want to be more collaborative with Spark, but I have two questions. Are issues tracked in GitHub or in JIRA? If on JIRA, is there a way I can get in to see the issues? I've tried to log in but with no success. I'm a PMC member of another Apache project, flex.apache.org Best Regards

RE: Using CUDA within Spark / boosting linear algebra

2015-03-25 Thread Ulanov, Alexander
Hi again, I finally managed to use nvblas within Spark+netlib-java. It has exceptional performance for big matrices with Double, faster than BIDMat-cuda with Float. But for smaller matrices, if you copy them to/from the GPU, OpenBlas or MKL might be a better choice. This correlates with

Re: Using CUDA within Spark / boosting linear algebra

2015-03-25 Thread Sam Halliday
That would be a difficult task that would only benefit users of netlib-java. MultiBLAS is easily implemented (although a lot of boilerplate) and benefits all BLAS users on the system. If anyone knows of a funding route for it, I'd love to hear from them, because it's too much work for me to take

Re: LogisticGradient Design

2015-03-25 Thread DB Tsai
I did the benchmark when I used the if-else statement to switch between the binary and multinomial logistic loss and gradient, and there is no performance hit at all. However, I'm refactoring the LogisticGradient code so that addBias and scaling can be done in LogisticGradient instead of the input dataset to
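For concreteness, the binary branch of the if-else being discussed computes the gradient (sigmoid(w·x) − y)·x for one example. A minimal stdlib-only sketch (a toy illustration, not Spark's MLlib implementation):

```python
import math

def binary_logistic_gradient(weights, x, y):
    """Gradient of the binary logistic loss at one example (toy sketch).

    weights, x: lists of floats; y: label in {0.0, 1.0}.
    Computes (sigmoid(w . x) - y) * x, i.e. the binary branch of the
    kind of binary/multinomial dispatch discussed in the thread.
    """
    margin = sum(w * xi for w, xi in zip(weights, x))
    prob = 1.0 / (1.0 + math.exp(-margin))
    return [(prob - y) * xi for xi in x]

# With zero weights the predicted probability is 0.5, so for y = 1
# the gradient is -0.5 * x:
print(binary_logistic_gradient([0.0, 0.0], [1.0, 2.0], 1.0))  # [-0.5, -1.0]
```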

Re: Using CUDA within Spark / boosting linear algebra

2015-03-25 Thread Sam Halliday
If you write it up I'll add it to the netlib-java wiki :-) BTW, does it automatically flip between CPU/GPU? I've a project called MultiBLAS which was going to do this; it should be easy (but boring to write) On 25 Mar 2015 22:00, Evan R. Sparks evan.spa...@gmail.com wrote: Alex - great stuff,

RE: Using CUDA within Spark / boosting linear algebra

2015-03-25 Thread Ulanov, Alexander
Sure, I will write a how-to after I re-check the results. -Original Message- From: Sam Halliday [mailto:sam.halli...@gmail.com] Sent: Wednesday, March 25, 2015 3:04 PM To: Evan R. Sparks; dev@spark.apache.org Subject: Re: Using CUDA within Spark / boosting linear algebra If you write it

Re: LogisticGradient Design

2015-03-25 Thread Debasish Das
Cool... Thanks... It will be great if they move into two code paths, just for the sake of code clean-up. On Wed, Mar 25, 2015 at 2:37 PM, DB Tsai dbt...@dbtsai.com wrote: I did the benchmark when I used the if-else statement to switch the binary multinomial logistic loss and gradient, and there is

RE: Using CUDA within Spark / boosting linear algebra

2015-03-25 Thread Ulanov, Alexander
Netlib knows nothing about the GPU (or CPU); it just uses cblas symbols from the provided libblas.so.3 library at runtime. So, you can switch at runtime by providing another library. Sam, please suggest if there is another way. From: Dmitriy Lyubimov [mailto:dlie...@gmail.com] Sent:

Re: hadoop input/output format advanced control

2015-03-25 Thread Sandy Ryza
Regarding Patrick's question, you can just do new Configuration(oldConf) to get a cloned Configuration object and add any new properties to it. -Sandy On Wed, Mar 25, 2015 at 4:42 PM, Imran Rashid iras...@cloudera.com wrote: Hi Nick, I don't remember the exact details of these scenarios, but

Re: hadoop input/output format advanced control

2015-03-25 Thread Patrick Wendell
Great - that's even easier. Maybe we could have a simple example in the doc. On Wed, Mar 25, 2015 at 7:06 PM, Sandy Ryza sandy.r...@cloudera.com wrote: Regarding Patrick's question, you can just do new Configuration(oldConf) to get a cloned Configuration object and add any new properties to it.

RE: Using CUDA within Spark / boosting linear algebra

2015-03-25 Thread Ulanov, Alexander
As everyone suggested, the results were too good to be true, so I double-checked them. It turns out that nvblas did not do the multiplication, due to the parameter NVBLAS_TILE_DIM from nvblas.conf, and returned a zero matrix. My previously posted results with nvblas reflect matrix copying only. The default
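For reference, the knob in question lives in nvblas.conf. A minimal hedged example (the library path is a placeholder; NVBLAS_TILE_DIM is the parameter Alexander mentions, and the other keys are the usual nvblas settings):

```
# nvblas.conf (illustrative values)
NVBLAS_LOGFILE nvblas.log
# CPU BLAS that nvblas falls back to:
NVBLAS_CPU_BLAS_LIB /path/to/libopenblas.so
# Which GPUs to use:
NVBLAS_GPU_LIST ALL
# Tile size used when splitting GEMM across the GPU; a bad value here
# was what produced the zero-matrix results described above.
NVBLAS_TILE_DIM 2048
```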

Re: Can't assembly YARN project with SBT

2015-03-25 Thread Zoltán Zvara
Hi! It seems that the problem of being unable to load YARN support is present only when I run my job from code and not via the spark-submit script. IMO this is related to SPARK-5144 https://issues.apache.org/jira/browse/SPARK-5144. I'm running the QueueStream example with a single change:

Re: Using CUDA within Spark / boosting linear algebra

2015-03-25 Thread Evan R. Sparks
Alex - great stuff, and the nvblas numbers are pretty remarkable (almost too good... did you check the results for correctness? - also, is it possible that the unified memory model of nvblas is somehow hiding pci transfer time?) this last bit (getting nvblas + netlib-java to play together) sounds

Re: Using CUDA within Spark / boosting linear algebra

2015-03-25 Thread jfcanny
Alex, I think you should recheck your numbers. Both BIDMat and nvblas are wrappers for cublas. The speeds are identical, except on machines that have multiple GPUs, which nvblas exploits and cublas doesn't. It would be a good idea to add a column with Gflop throughput. Your numbers for BIDMat

Re: hadoop input/output format advanced control

2015-03-25 Thread Aaron Davidson
Should we mention that you should synchronize on HadoopRDD.CONFIGURATION_INSTANTIATION_LOCK to avoid a possible race condition in cloning Hadoop Configuration objects prior to Hadoop 2.7.0? :) On Wed, Mar 25, 2015 at 7:16 PM, Patrick Wendell pwend...@gmail.com wrote: Great - that's even easier.
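Aaron's point generalizes to any constructor that isn't thread-safe: serialize it through one shared lock. A hedged Python sketch (the real HadoopRDD.CONFIGURATION_INSTANTIATION_LOCK is a Scala object inside Spark; the class and names below are toy stand-ins):

```python
import threading

# Stand-in for HadoopRDD.CONFIGURATION_INSTANTIATION_LOCK: one process-wide
# lock serializing construction/cloning of the non-thread-safe object.
CONFIGURATION_INSTANTIATION_LOCK = threading.Lock()

class FakeConfiguration:
    """Toy stand-in for Hadoop's Configuration (not safe to copy concurrently)."""
    def __init__(self, other=None):
        self.props = dict(other.props) if other is not None else {}

def clone_configuration(conf):
    # Prior to Hadoop 2.7.0, copying a Configuration concurrently could race;
    # funnelling every copy through the shared lock avoids that.
    with CONFIGURATION_INSTANTIATION_LOCK:
        return FakeConfiguration(conf)

base = FakeConfiguration()
base.props["k1"] = "v1"

clones = []
threads = [threading.Thread(target=lambda: clones.append(clone_configuration(base)))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(clones), clones[0].props["k1"])  # 4 v1
```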