Yes, as Josh said, when an application is started, Spark will create a unique
application-wide folder for its temporary files. Jobs in this application
will each have a unique shuffle id with unique file names, so shuffle stages
within the app will not run into name conflicts.
Also shuffle files between
DiskBlockManager doesn't need to know the app id; all it needs to do is
create a folder with a unique (UUID-based) name and then put all the
shuffle files into it.
You can see the code in DiskBlockManager below: it creates a bunch of
unique folders when initialized, and these folders are app
Hi!
I'm using the latest IntelliJ and I can't compile the yarn project into the
Spark assembly fat JAR. That is why I'm getting a SparkException with the
message "Unable to load YARN support". The yarn project is also missing
from the SBT tasks and I can't add it. How can I force sbt to include it?
Thanks!
Hi Cheng,
I think your scenario is fine for Spark's shuffle mechanism and will not
cause shuffle file name conflicts.
From my understanding, the code snippet you mentioned is the same RDD
graph just run twice; these two jobs will generate 3 stages: a map stage and
collect
Ah, I see where I'm wrong here. What are reused here are the shuffle map
output files themselves, rather than the file paths. No new shuffle map
output files are generated for the 2nd job. Thanks! Really need to walk
through Spark core code again :)
Cheng
On 3/25/15 9:31 PM, Shao, Saisai
Hi all,
passing a functools.partial function as a UserDefinedFunction to
DataFrame.select raises an AttributeError, because functools.partial
does not have the attribute __name__. Is there any alternative to
relying on __name__ in pyspark/sql/functions.py:126?
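One workaround (a sketch, not PySpark API; the names below are made up): functools.partial objects support attribute assignment, so you can give the partial a __name__ yourself, or wrap it in a plain function, which always has one:

```python
import functools

def add(x, y):
    return x + y

add_ten = functools.partial(add, 10)

# A partial has no __name__, which udf()/UserDefinedFunction relies on.
# partial objects do support attribute assignment, so set one explicitly:
add_ten.__name__ = "add_ten"

# Alternatively, wrap the partial in an ordinary function, which always
# carries a __name__ of its own:
def add_ten_fn(y):
    return add_ten(y)
```

Either object can then be passed where a named callable is expected.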
my personal preference would be something like a Map[String, String] that
only reflects the changes you want to make to the Configuration for the given
input/output format (so system-wide defaults continue to come from
sc.hadoopConfiguration), similarly to what cascading/scalding did, but am
Hi Xiangrui,
I am facing some minor issues in implementing Alternating Nonlinear
Minimization as documented in this JIRA due to the ALS code being in ml
package: https://issues.apache.org/jira/browse/SPARK-6323
I need to use Vectors.fromBreeze / Vectors.toBreeze but they are package
private on
Hi,
Right now LogisticGradient implements both binary and multi-class in the
same class using an if-else statement, which is a bit convoluted.
For generalized matrix factorization, if the data has distinct ratings I
want to use LeastSquareGradient (regression has given the best results to date)
but
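Splitting the two losses into separate code paths is straightforward to sketch (illustrative Python, not the actual MLlib Scala classes; the function names are invented):

```python
import math

def logistic_gradient(w, x, y):
    """Binary logistic loss gradient: (sigmoid(w . x) - y) * x, with y in {0, 1}."""
    margin = sum(wi * xi for wi, xi in zip(w, x))
    p = 1.0 / (1.0 + math.exp(-margin))
    return [(p - y) * xi for xi in x]

def least_squares_gradient(w, x, y):
    """Squared loss gradient: (w . x - y) * x."""
    diff = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [diff * xi for xi in x]

def pick_gradient(ratings_are_distinct):
    """Choose the loss from the nature of the data: least squares for
    distinct numeric ratings (regression), logistic otherwise."""
    return least_squares_gradient if ratings_are_distinct else logistic_gradient
```

Keeping each gradient in its own function (or class) lets callers swap the loss without an if-else inside the gradient computation itself.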
...due to some big security fixes:
https://wiki.jenkins-ci.org/display/SECURITY/Jenkins+Security+Advisory+2015-03-23
:)
shane
Yeah, I agree that might have been nicer, but I think for consistency
with the input APIs maybe we should do the same thing. We can also
give an example of how to clone sc.hadoopConfiguration and then set
some new values:
val conf = sc.hadoopConfiguration.clone()
conf.set(k1, v1)
conf.set(k2, v2)
Igor,
Welcome -- everything is open here:
https://issues.apache.org/jira/browse/SPARK
You should be able to see them even if you are not an ASF member.
On Wed, Mar 25, 2015 at 1:51 PM, Igor Costa igorco...@apache.org wrote:
Hi there Guys.
I want to be more collaborative to Spark, but I
It's just the standard Apache JIRA, nothing separate.
I'd say JIRA is used to track issues, bugs, features, but Github is
where the concrete changes to implement those things are discussed and
merged. So for a non-trivial issue, you'd want to describe the issue
in general in JIRA, and then open a
Thank you guys for the info.
Actually it was a problem with my Apache id, rather than needing to be logged
in to view issues.
I'm browsing some issues now.
Best Regards
Igor Costa
www.igorcosta.com
www.igorcosta.org
On Wed, Mar 25, 2015 at 5:58 PM, Sean Owen
Issues are tracked on Apache JIRA:
https://issues.apache.org/jira/browse/SPARK/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel
Cheers
On Wed, Mar 25, 2015 at 1:51 PM, Igor Costa igorco...@apache.org wrote:
Hi there Guys.
I want to be more collaborative to Spark, but I have
Hi there Guys.
I want to be more collaborative with Spark, but I have two questions.
Are issues tracked in GitHub or in JIRA?
If in JIRA, is there a way I can get in to see the issues?
I've tried to log in but with no success.
I'm a PMC member of another Apache project, flex.apache.org.
Best Regards
Hi again,
I finally managed to use nvblas with Spark + netlib-java. It has exceptional
performance for big matrices of Doubles, faster than BIDMat-cuda with Floats.
But for smaller matrices, given that you have to copy them to/from the GPU,
OpenBLAS or MKL might be a better choice. This correlates with
That would be a difficult task that would only benefit users of
netlib-java. MultiBLAS is easily implemented (although a lot of
boilerplate) and benefits all BLAS users on the system.
If anyone knows of a funding route for it, I'd love to hear from them,
because it's too much work for me to take
I did the benchmark when I used the if-else statement to switch between the
binary and multinomial logistic loss and gradient, and there is no
performance hit at all. However, I'm refactoring the LogisticGradient
code so the addBias and scaling can be done in LogisticGradient
instead of in the input dataset to
If you write it up I'll add it to the netlib-java wiki :-)
BTW, does it automatically flip between CPU/GPU? I have a project called
MultiBLAS which was going to do this; it should be easy (but boring to
write)
On 25 Mar 2015 22:00, Evan R. Sparks evan.spa...@gmail.com wrote:
Alex - great stuff,
Sure, I will write a how-to after I re-check the results.
-Original Message-
From: Sam Halliday [mailto:sam.halli...@gmail.com]
Sent: Wednesday, March 25, 2015 3:04 PM
To: Evan R. Sparks; dev@spark.apache.org
Subject: Re: Using CUDA within Spark / boosting linear algebra
If you write it
Cool... thanks... It would be great if they were moved into two code paths,
just for the sake of code clean-up.
On Wed, Mar 25, 2015 at 2:37 PM, DB Tsai dbt...@dbtsai.com wrote:
I did the benchmark when I used the if-else statement to switch the
binary multinomial logistic loss and gradient, and there is
Netlib knows nothing about the GPU (or CPU); it just uses cblas symbols from the
provided libblas.so.3 library at runtime. So you can switch at runtime
by providing another library. Sam, please suggest if there is another way.
From: Dmitriy Lyubimov [mailto:dlie...@gmail.com]
Sent:
Regarding Patrick's question, you can just do new Configuration(oldConf)
to get a cloned Configuration object and add any new properties to it.
-Sandy
On Wed, Mar 25, 2015 at 4:42 PM, Imran Rashid iras...@cloudera.com wrote:
Hi Nick,
I don't remember the exact details of these scenarios, but
Great - that's even easier. Maybe we could have a simple example in the doc.
On Wed, Mar 25, 2015 at 7:06 PM, Sandy Ryza sandy.r...@cloudera.com wrote:
Regarding Patrick's question, you can just do new Configuration(oldConf)
to get a cloned Configuration object and add any new properties to it.
As everyone suggested, the results were too good to be true, so I
double-checked them. It turns out that nvblas did not do the multiplication,
due to the NVBLAS_TILE_DIM parameter in nvblas.conf, and returned a zero
matrix. My previously posted results with nvblas reflect matrix copying only.
The default
Hi!
It seems that the "Unable to load YARN support" problem is present only
when I run my job from code and not via the spark-submit script. IMO
this is related to SPARK-5144
(https://issues.apache.org/jira/browse/SPARK-5144). I'm running the
QueueStream example with a single change:
Alex - great stuff, and the nvblas numbers are pretty remarkable (almost
too good... did you check the results for correctness? - also, is it
possible that the unified memory model of nvblas is somehow hiding PCI
transfer time?)
this last bit (getting nvblas + netlib-java to play together) sounds
Alex,
I think you should recheck your numbers. Both BIDMat and nvblas are
wrappers for cublas. The speeds are identical, except on machines that
have multiple GPUs, which nvblas exploits and cublas doesn't.
It would be a good idea to add a column with Gflop throughput. Your
numbers for BIDMat
Should we mention that you should synchronize
on HadoopRDD.CONFIGURATION_INSTANTIATION_LOCK to avoid a possible race
condition in cloning Hadoop Configuration objects prior to Hadoop 2.7.0? :)
On Wed, Mar 25, 2015 at 7:16 PM, Patrick Wendell pwend...@gmail.com wrote:
Great - that's even easier.