Re: Understanding shuffle file name conflicts

2015-03-26 Thread Kannan Rajah
Thanks folks. I understood the workflow. I noticed there is some code in
Worker.scala that creates an app-specific local dir.
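
For readers following along, the sketch below shows the idea behind those
app-specific local dirs: a per-application random root directory plus
hash-based sub-directory selection. It is a hypothetical illustration with
made-up names and constants, not the actual Worker/DiskBlockManager code.

import java.io.File
import java.util.UUID

// Illustrative only: every application/run gets its own randomly named root,
// so identical block names (e.g. shuffle_0_0_0.data) from two runs can never
// resolve to the same path, even though the hash is based on the file name.
class LocalDirResolver(rootDirs: Seq[String], subDirsPerDir: Int = 64) {
  private val appDirs: Seq[File] =
    rootDirs.map(r => new File(r, s"spark-local-${UUID.randomUUID()}"))

  def getFile(blockName: String): File = {
    val hash   = blockName.hashCode & Int.MaxValue
    val dir    = appDirs(hash % appDirs.length)
    val subDir = new File(dir, "%02x".format((hash / appDirs.length) % subDirsPerDir))
    subDir.mkdirs()
    new File(subDir, blockName)
  }
}

// Run 1 might resolve shuffle_0_0_0.data under /tmp/spark-local-1f2a.../0d/,
// run 2 under /tmp/spark-local-9c4b.../0d/, so the paths never collide.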


--
Kannan

On Wed, Mar 25, 2015 at 7:33 AM, Cheng Lian  wrote:

> Ah, I see where I'm wrong here. What are reused here are the shuffle map
> output files themselves, rather than the file paths. No new shuffle map
> output files are generated for the 2nd job. Thanks! Really need to walk
> through Spark core code again :)
>
> Cheng
>
>
> On 3/25/15 9:31 PM, Shao, Saisai wrote:
>
>> Hi Cheng,
>>
>> I think your scenario is fine for Spark's shuffle mechanism and will not
>> cause shuffle file name conflicts.
>>
>>  From my understanding, the code snippet you mentioned is the same RDD graph
>> run twice. These two jobs will generate 3 stages: a map stage and a collect
>> stage for the first job, and only a collect stage for the second job (its map
>> stage is the same as in the previous job). So the two jobs generate only one
>> copy of the shuffle files (in the first job), and each job fetches that shuffle
>> data, so it is read twice in total. Name conflicts will not occur, since both
>> jobs rely on the same ShuffledRDD.
>>
>> Only the shuffle write path, which generates the shuffle files, has a chance of
>> hitting name conflicts; reading the same shuffle data multiple times is fine, as
>> the code snippet shows.
>>
>> Thanks
>> Jerry
>>
>>
>>
>> -Original Message-
>> From: Cheng Lian [mailto:lian.cs@gmail.com]
>> Sent: Wednesday, March 25, 2015 7:40 PM
>> To: Saisai Shao; Kannan Rajah
>> Cc: dev@spark.apache.org
>> Subject: Re: Understanding shuffle file name conflicts
>>
>> Hi Jerry & Josh
>>
>> It has been a while since the last time I looked into Spark core shuffle
>> code, maybe I’m wrong here. But the shuffle ID is created along with
>> ShuffleDependency, which is part of the RDD DAG. So if we submit multiple
>> jobs over the same RDD DAG, I think the shuffle IDs in these jobs should
>> duplicate. For example:
>>
>> val dag = sc.parallelize(Array(1, 2, 3)).map(i => i -> i).reduceByKey(_ + _)
>> dag.collect()
>> dag.collect()
>>
>>   From the debug log output, I did see duplicated shuffle IDs in both
>> jobs. Something like this:
>>
>> # Job 1
>> 15/03/25 19:26:34 DEBUG BlockStoreShuffleFetcher: Fetching outputs for
>> shuffle 0, reduce 2
>>
>> # Job 2
>> 15/03/25 19:26:36 DEBUG BlockStoreShuffleFetcher: Fetching outputs for
>> shuffle 0, reduce 5
>>
>> So it’s also possible that some shuffle output files get reused in
>> different jobs. But Kannan, did you submit separate jobs over the same RDD
>> DAG as I did above? If not, I’d agree with Jerry and Josh.
>>
>> (Did I miss something here?)
>>
>> Cheng
>>
>> On 3/25/15 10:35 AM, Saisai Shao wrote:
>>
>>  Hi Kannan,
>>>
>>> As far as I know, the shuffle id in ShuffleDependency is incremented, so
>>> even if you run the same job twice, the shuffle dependency, and therefore the
>>> shuffle id, will be different. The shuffle file name, which is composed of
>>> (shuffleId + mapId + reduceId), will therefore change, so there is no name
>>> conflict even in the same directory, as far as I know.
>>>
>>> Thanks
>>> Jerry
>>>
>>>
>>> 2015-03-25 1:56 GMT+08:00 Kannan Rajah :
>>>
>>>  I am working on SPARK-1529. I ran into an issue with my change, where
 the same shuffle file was being reused across 2 jobs. Please note
 this only happens when I use a hard-coded location for shuffle files, say
 "/tmp". It does not happen with the normal code path that uses
 DiskBlockManager to pick different directories for each run. So I
 want to understand how DiskBlockManager guarantees that such a conflict
 will never happen.

 Let's say the shuffle block id has a value of shuffle_0_0_0. So the
 data file name is shuffle_0_0_0.data and index file name is
 shuffle_0_0_0.index.
 If I run a spark job twice, one after another, these files get
 created under different directories because of the hashing logic in
 DiskBlockManager. But the hash is based off the file name, so how are
 we sure that there won't be a conflict ever?

 --
 Kannan

>>
>
>


Re: Building spark 1.2 from source requires more dependencies

2015-03-26 Thread Pala M Muthaia
+spark-dev

Yes, the dependencies are there. I guess my question is how come the build
is succeeding in the mainline then, without adding these dependencies?

On Thu, Mar 26, 2015 at 3:44 PM, Ted Yu  wrote:

> Looking at output from dependency:tree, servlet-api is brought in by the
> following:
>
> [INFO] +- org.apache.cassandra:cassandra-all:jar:1.2.6:compile
> [INFO] |  +- org.antlr:antlr:jar:3.2:compile
> [INFO] |  +- com.googlecode.json-simple:json-simple:jar:1.1:compile
> [INFO] |  +- org.yaml:snakeyaml:jar:1.6:compile
> [INFO] |  +- edu.stanford.ppl:snaptree:jar:0.1:compile
> [INFO] |  +- org.mindrot:jbcrypt:jar:0.3m:compile
> [INFO] |  +- org.apache.thrift:libthrift:jar:0.7.0:compile
> [INFO] |  |  \- javax.servlet:servlet-api:jar:2.5:compile
>
> FYI
>
> On Thu, Mar 26, 2015 at 3:36 PM, Pala M Muthaia <
> mchett...@rocketfuelinc.com> wrote:
>
>> Hi,
>>
>> We are trying to build spark 1.2 from source (tip of the branch-1.2 at
>> the moment). I tried to build spark using the following command:
>>
>> mvn -U -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive
>> -Phive-thriftserver -DskipTests clean package
>>
>> I encountered various missing class definition exceptions (e.g., class
>> javax.servlet.ServletException not found).
>>
>> I eventually got the build to succeed after adding the following set of
>> dependencies to the spark-core's pom.xml:
>>
>> <dependency>
>>   <groupId>javax.servlet</groupId>
>>   <artifactId>servlet-api</artifactId>
>>   <version>3.0</version>
>> </dependency>
>>
>> <dependency>
>>   <groupId>org.eclipse.jetty</groupId>
>>   <artifactId>jetty-io</artifactId>
>> </dependency>
>>
>> <dependency>
>>   <groupId>org.eclipse.jetty</groupId>
>>   <artifactId>jetty-http</artifactId>
>> </dependency>
>>
>> <dependency>
>>   <groupId>org.eclipse.jetty</groupId>
>>   <artifactId>jetty-servlet</artifactId>
>> </dependency>
>>
>> Pretty much all of the missing class definition errors came up while
>> building HttpServer.scala, and went away after the above dependencies were
>> included.
>>
>> My guess is that the official build for Spark 1.2 is already working. My
>> question is: what is wrong with my environment or setup that requires me to add
>> dependencies to pom.xml in this manner to get the build to succeed?
>>
>> Also, I am not sure if this build will work at runtime for us; I am
>> still testing that.
>>
>>
>> Thanks,
>> pala
>>
>
>


RE: Storing large data for MLlib machine learning

2015-03-26 Thread Ulanov, Alexander
Thanks, Jeremy! I also work with time series data right now, so your 
suggestions are really relevant. However, we want to handle not the raw data 
but data that has already been processed and prepared for machine learning.

Initially, we also wanted to have our own simple binary format, but we could 
not agree on how to handle little/big endianness: whether to stick to a specific 
byte order or to ship this information in a metadata file. And a metadata file 
sounds like yet another piece of data format engineering (i.e., reinventing the 
wheel). Does this make sense to you?
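
For what it is worth, one way to sidestep the metadata question is to pin the
byte order in the reader itself. Below is a minimal sketch in Scala, assuming
Spark 1.2+'s sc.binaryRecords and records laid out as N little-endian doubles;
the record layout is an assumption for illustration, not your actual format.

import java.nio.{ByteBuffer, ByteOrder}
import org.apache.spark.SparkContext

// Each record is `valuesPerRecord` doubles written little-endian, back to back.
def loadDenseVectors(sc: SparkContext, path: String, valuesPerRecord: Int) = {
  val recordSize = valuesPerRecord * 8  // 8 bytes per double
  sc.binaryRecords(path, recordSize).map { bytes =>
    val buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN)
    Array.fill(valuesPerRecord)(buf.getDouble())
  }
}

The byte order then lives in one line of reader code rather than in a separate
metadata file.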

From: Jeremy Freeman [mailto:freeman.jer...@gmail.com]
Sent: Thursday, March 26, 2015 3:01 PM
To: Ulanov, Alexander
Cc: Stephen Boesch; dev@spark.apache.org
Subject: Re: Storing large data for MLlib machine learning

Hi Ulanov, great question, we've encountered it frequently with scientific 
data (e.g. time series). Agreed, text is inefficient for dense arrays, and we 
also found HDF5+Spark to be a pain.

Our strategy has been flat binary files with fixed length records. Loading 
these is now supported in Spark via the binaryRecords method, which wraps a 
custom Hadoop InputFormat we wrote.

An example (in python):

# write data from an array
from numpy import random
dat = random.randn(100,5)
f = open('test.bin', 'wb')
f.write(dat)
f.close()

# load the data back in
from numpy import frombuffer
nrecords = 5
bytesize = 8
recordsize = nrecords * bytesize
data = sc.binaryRecords('test.bin', recordsize)
parsed = data.map(lambda v: frombuffer(buffer(v, 0, recordsize), 'float'))

# these should be equal
parsed.first()
dat[0,:]

Compared to something like Parquet, this is a little lighter-weight, and plays 
nicer with non-distributed data science tools (e.g. numpy). It also scales 
great (we use it routinely to process TBs of time series). And handles single 
files or directories. But it's extremely simple!

-
jeremyfreeman.net
@thefreemanlab

On Mar 26, 2015, at 2:33 PM, Ulanov, Alexander 
mailto:alexander.ula...@hp.com>> wrote:


Thanks for suggestion, but libsvm is a format for sparse data storing in text 
file and I have dense vectors. In my opinion, text format is not appropriate 
for storing large dense vectors due to overhead related to parsing from string 
to digits and also storing digits as strings is not efficient.

From: Stephen Boesch [mailto:java...@gmail.com]
Sent: Thursday, March 26, 2015 2:27 PM
To: Ulanov, Alexander
Cc: dev@spark.apache.org
Subject: Re: Storing large data for MLlib machine learning

There are some convenience methods you might consider including:

  MLUtils.loadLibSVMFile

and   MLUtils.loadLabeledPoint

2015-03-26 14:16 GMT-07:00 Ulanov, Alexander 
mailto:alexander.ula...@hp.com>>:
Hi,

Could you suggest what would be the reasonable file format to store feature 
vector data for machine learning in Spark MLlib? Are there any best practices 
for Spark?

My data is dense feature vectors with labels. Some of the requirements are that 
the format should be easy loaded/serialized, randomly accessible, with a small 
footprint (binary). I am considering Parquet, hdf5, protocol buffer (protobuf), 
but I have little to no experience with them, so any suggestions would be 
really appreciated.

Best regards, Alexander



Re: RDD.map does not allowed to preservesPartitioning?

2015-03-26 Thread Zhan Zhang
Thanks all for the quick response.

Thanks.

Zhan Zhang

On Mar 26, 2015, at 3:14 PM, Patrick Wendell  wrote:

> I think we have a version of mapPartitions that allows you to tell
> Spark the partitioning is preserved:
> 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L639
> 
> We could also add a map function that does the same. Or you can just write
> your map using an iterator.
> 
> - Patrick
> 
> On Thu, Mar 26, 2015 at 3:07 PM, Jonathan Coveney  wrote:
>> This is just a deficiency of the api, imo. I agree: mapValues could
>> definitely be a function (K, V)=>V1. The option isn't set by the function,
>> it's on the RDD. So you could look at the code and do this.
>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala
>> 
>> def mapValues[U](f: V => U): RDD[(K, U)] = {
>>val cleanF = self.context.clean(f)
>>new MapPartitionsRDD[(K, U), (K, V)](self,
>>  (context, pid, iter) => iter.map { case (k, v) => (k, cleanF(v)) },
>>  preservesPartitioning = true)
>>  }
>> 
>> What you want:
>> 
>> def mapValues[U](f: (K, V) => U): RDD[(K, U)] = {
>>val cleanF = self.context.clean(f)
>>new MapPartitionsRDD[(K, U), (K, V)](self,
>>  (context, pid, iter) => iter.map { case t@(k, _) => (k, cleanF(t)) },
>>  preservesPartitioning = true)
>>  }
>> 
>> One of the nice things about spark is that making such new operators is very
>> easy :)
>> 
>> 2015-03-26 17:54 GMT-04:00 Zhan Zhang :
>> 
>>> Thanks Jonathan. You are right about rewriting the example.
>>> 
>>> I meant providing such an option to developers so that it is controllable. The
>>> example may seem silly, and I don't know the use cases.
>>> 
>>> But, for example, if I also want to operate on both the key and the value to
>>> generate some new value while keeping the key untouched, then mapValues may
>>> not be able to do this.
>>> 
>>> Changing the code to allow this is trivial, but I don't know whether there
>>> is some special reason behind this.
>>> 
>>> Thanks.
>>> 
>>> Zhan Zhang
>>> 
>>> 
>>> 
>>> 
>>> On Mar 26, 2015, at 2:49 PM, Jonathan Coveney  wrote:
>>> 
>>> I believe if you do the following:
>>> 
>>> 
>>> sc.parallelize(List(1,2,3,4,5,5,6,6,7,8,9,10,2,4)).map((_,1)).reduceByKey(_+_).mapValues(_+1).reduceByKey(_+_).toDebugString
>>> 
>>> (8) MapPartitionsRDD[34] at reduceByKey at <console>:23 []
>>> |  MapPartitionsRDD[33] at mapValues at <console>:23 []
>>> |  ShuffledRDD[32] at reduceByKey at <console>:23 []
>>> +-(8) MapPartitionsRDD[31] at map at <console>:23 []
>>>|  ParallelCollectionRDD[30] at parallelize at <console>:23 []
>>> 
>>> The difference is that Spark has no way to know that your map closure
>>> doesn't change the key. If you only use mapValues, it does. Pretty cool that
>>> they optimized that :)
>>> 
>>> 2015-03-26 17:44 GMT-04:00 Zhan Zhang :
 
 Hi Folks,
 
 Does anybody know what is the reason not allowing preserverPartitioning
 in RDD.map? Do I miss something here?
 
 Following example involves two shuffles. I think if preservePartitioning
 is allowed, we can avoid the second one, right?
 
 val r1 = sc.parallelize(List(1,2,3,4,5,5,6,6,7,8,9,10,2,4))
 val r2 = r1.map((_, 1))
 val r3 = r2.reduceByKey(_+_)
 val r4 = r3.map(x=>(x._1, x._2 + 1))
 val r5 = r4.reduceByKey(_+_)
 r5.collect.foreach(println)
 
 scala> r5.toDebugString
 res2: String =
 (8) ShuffledRDD[4] at reduceByKey at <console>:29 []
 +-(8) MapPartitionsRDD[3] at map at <console>:27 []
    |  ShuffledRDD[2] at reduceByKey at <console>:25 []
    +-(8) MapPartitionsRDD[1] at map at <console>:23 []
       |  ParallelCollectionRDD[0] at parallelize at <console>:21 []
 
 Thanks.
 
 Zhan Zhang
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 
>>> 
>>> 
>> 


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: RDD.map does not allowed to preservesPartitioning?

2015-03-26 Thread Patrick Wendell
I think we have a version of mapPartitions that allows you to tell
Spark the partitioning is preserved:

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L639

We could also add a map function that does the same. Or you can just write
your map using an iterator.

- Patrick
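
To make that concrete, here is a sketch of the r4 step from the example quoted
below, rewritten with mapPartitions and preservesPartitioning = true so that
the final reduceByKey can reuse r3's partitioner (this assumes, as in the
example, that the closure does not change the key):

// Same transformation as r3.map(x => (x._1, x._2 + 1)), but expressed over the
// partition iterator so we can assert that the partitioning is preserved.
val r4 = r3.mapPartitions(
  iter => iter.map { case (k, v) => (k, v + 1) },
  preservesPartitioning = true)
val r5 = r4.reduceByKey(_ + _)  // no second shuffle: r4 keeps r3's partitioner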

On Thu, Mar 26, 2015 at 3:07 PM, Jonathan Coveney  wrote:
> This is just a deficiency of the api, imo. I agree: mapValues could
> definitely be a function (K, V)=>V1. The option isn't set by the function,
> it's on the RDD. So you could look at the code and do this.
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala
>
>  def mapValues[U](f: V => U): RDD[(K, U)] = {
> val cleanF = self.context.clean(f)
> new MapPartitionsRDD[(K, U), (K, V)](self,
>   (context, pid, iter) => iter.map { case (k, v) => (k, cleanF(v)) },
>   preservesPartitioning = true)
>   }
>
> What you want:
>
>  def mapValues[U](f: (K, V) => U): RDD[(K, U)] = {
> val cleanF = self.context.clean(f)
> new MapPartitionsRDD[(K, U), (K, V)](self,
>   (context, pid, iter) => iter.map { case t@(k, _) => (k, cleanF(t)) },
>   preservesPartitioning = true)
>   }
>
> One of the nice things about spark is that making such new operators is very
> easy :)
>
> 2015-03-26 17:54 GMT-04:00 Zhan Zhang :
>
>> Thanks Jonathan. You are right about rewriting the example.
>>
>> I meant providing such an option to developers so that it is controllable. The
>> example may seem silly, and I don't know the use cases.
>>
>> But, for example, if I also want to operate on both the key and the value to
>> generate some new value while keeping the key untouched, then mapValues may
>> not be able to do this.
>>
>> Changing the code to allow this is trivial, but I don't know whether there
>> is some special reason behind this.
>>
>> Thanks.
>>
>> Zhan Zhang
>>
>>
>>
>>
>> On Mar 26, 2015, at 2:49 PM, Jonathan Coveney  wrote:
>>
>> I believe if you do the following:
>>
>>
>> sc.parallelize(List(1,2,3,4,5,5,6,6,7,8,9,10,2,4)).map((_,1)).reduceByKey(_+_).mapValues(_+1).reduceByKey(_+_).toDebugString
>>
>> (8) MapPartitionsRDD[34] at reduceByKey at <console>:23 []
>>  |  MapPartitionsRDD[33] at mapValues at <console>:23 []
>>  |  ShuffledRDD[32] at reduceByKey at <console>:23 []
>>  +-(8) MapPartitionsRDD[31] at map at <console>:23 []
>> |  ParallelCollectionRDD[30] at parallelize at <console>:23 []
>>
>> The difference is that Spark has no way to know that your map closure
>> doesn't change the key. If you only use mapValues, it does. Pretty cool that
>> they optimized that :)
>>
>> 2015-03-26 17:44 GMT-04:00 Zhan Zhang :
>>>
>>> Hi Folks,
>>>
>>> Does anybody know what is the reason not allowing preserverPartitioning
>>> in RDD.map? Do I miss something here?
>>>
>>> Following example involves two shuffles. I think if preservePartitioning
>>> is allowed, we can avoid the second one, right?
>>>
>>>  val r1 = sc.parallelize(List(1,2,3,4,5,5,6,6,7,8,9,10,2,4))
>>>  val r2 = r1.map((_, 1))
>>>  val r3 = r2.reduceByKey(_+_)
>>>  val r4 = r3.map(x=>(x._1, x._2 + 1))
>>>  val r5 = r4.reduceByKey(_+_)
>>>  r5.collect.foreach(println)
>>>
>>> scala> r5.toDebugString
>>> res2: String =
>>> (8) ShuffledRDD[4] at reduceByKey at <console>:29 []
>>>  +-(8) MapPartitionsRDD[3] at map at <console>:27 []
>>> |  ShuffledRDD[2] at reduceByKey at <console>:25 []
>>> +-(8) MapPartitionsRDD[1] at map at <console>:23 []
>>>|  ParallelCollectionRDD[0] at parallelize at <console>:21 []
>>>
>>> Thanks.
>>>
>>> Zhan Zhang
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>
>>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: RDD.map does not allowed to preservesPartitioning?

2015-03-26 Thread Jonathan Coveney
This is just a deficiency of the api, imo. I agree: mapValues could
definitely be a function (K, V)=>V1. The option isn't set by the function,
it's on the RDD. So you could look at the code and do this.
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala

 def mapValues[U](f: V => U): RDD[(K, U)] = {
val cleanF = self.context.clean(f)
new MapPartitionsRDD[(K, U), (K, V)](self,
  (context, pid, iter) => iter.map { case (k, v) => (k, cleanF(v)) },
  preservesPartitioning = true)
  }

What you want:

 def mapValues[U](f: (K, V) => U): RDD[(K, U)] = {
val cleanF = self.context.clean(f)
new MapPartitionsRDD[(K, U), (K, V)](self,
  (context, pid, iter) => iter.map { case t@(k, _) => (k, cleanF(t)) },
  preservesPartitioning = true)
  }

One of the nice things about spark is that making such new operators is
very easy :)

2015-03-26 17:54 GMT-04:00 Zhan Zhang :

>  Thanks Jonathan. You are right about rewriting the example.
>
>  I meant providing such an option to developers so that it is controllable.
> The example may seem silly, and I don’t know the use cases.
>
> But, for example, if I also want to operate on both the key and the value to
> generate some new value while keeping the key untouched, then mapValues may
> not be able to do this.
>
>  Changing the code to allow this is trivial, but I don’t know whether
> there is some special reason behind this.
>
>  Thanks.
>
>  Zhan Zhang
>
>
>
>
>  On Mar 26, 2015, at 2:49 PM, Jonathan Coveney  wrote:
>
>  I believe if you do the following:
>
>
> sc.parallelize(List(1,2,3,4,5,5,6,6,7,8,9,10,2,4)).map((_,1)).reduceByKey(_+_).mapValues(_+1).reduceByKey(_+_).toDebugString
>
>  (8) MapPartitionsRDD[34] at reduceByKey at <console>:23 []
>  |  MapPartitionsRDD[33] at mapValues at <console>:23 []
>  |  ShuffledRDD[32] at reduceByKey at <console>:23 []
>  +-(8) MapPartitionsRDD[31] at map at <console>:23 []
> |  ParallelCollectionRDD[30] at parallelize at <console>:23 []
>
>  The difference is that Spark has no way to know that your map closure
> doesn't change the key. If you only use mapValues, it does. Pretty cool
> that they optimized that :)
>
> 2015-03-26 17:44 GMT-04:00 Zhan Zhang :
>
>> Hi Folks,
>>
>> Does anybody know what is the reason not allowing preserverPartitioning
>> in RDD.map? Do I miss something here?
>>
>> Following example involves two shuffles. I think if preservePartitioning
>> is allowed, we can avoid the second one, right?
>>
>>  val r1 = sc.parallelize(List(1,2,3,4,5,5,6,6,7,8,9,10,2,4))
>>  val r2 = r1.map((_, 1))
>>  val r3 = r2.reduceByKey(_+_)
>>  val r4 = r3.map(x=>(x._1, x._2 + 1))
>>  val r5 = r4.reduceByKey(_+_)
>>  r5.collect.foreach(println)
>>
>> scala> r5.toDebugString
>> res2: String =
>> (8) ShuffledRDD[4] at reduceByKey at <console>:29 []
>>  +-(8) MapPartitionsRDD[3] at map at <console>:27 []
>> |  ShuffledRDD[2] at reduceByKey at <console>:25 []
>> +-(8) MapPartitionsRDD[1] at map at <console>:23 []
>>|  ParallelCollectionRDD[0] at parallelize at <console>:21 []
>>
>> Thanks.
>>
>> Zhan Zhang
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>
>


Re: Storing large data for MLlib machine learning

2015-03-26 Thread Evan R. Sparks
Protobufs are great for serializing individual records - but parquet is
good for efficiently storing a whole bunch of these objects.

Matt Massie has a good (slightly dated) blog post on using
Spark+Parquet+Avro (and you can pretty much s/Avro/Protobuf/) describing
how they all work together here:
http://zenfractal.com/2013/08/21/a-powerful-big-data-trio/

Your use case (storing dense features, presumably as a single column) is
pretty straightforward and the extra layers of indirection are maybe
overkill.

Lastly - you might consider using some of SparkSQL/DataFrame's built-in
features for persistence, which support lots of storage backends.
https://spark.apache.org/docs/1.3.0/sql-programming-guide.html#data-sources
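
As a rough illustration of that last option, here is a sketch against the Spark
1.3 DataFrame API; the case class, column names, paths and the input `rdd` are
hypothetical, not anything defined in this thread:

import org.apache.spark.sql.SQLContext

case class LabeledVector(label: Double, features: Array[Double])

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// rdd: RDD[(Double, Array[Double])] of (label, dense feature vector)
val df = rdd.map { case (label, values) => LabeledVector(label, values) }.toDF()
df.saveAsParquetFile("hdfs:///data/features.parquet")                 // write
val loaded = sqlContext.parquetFile("hdfs:///data/features.parquet")  // read back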

On Thu, Mar 26, 2015 at 2:51 PM, Ulanov, Alexander 
wrote:

>  Thanks, Evan. What do you think about Protobuf? Twitter has a library to
> manage protobuf files in hdfs https://github.com/twitter/elephant-bird
>
>
>
>
>
> *From:* Evan R. Sparks [mailto:evan.spa...@gmail.com]
> *Sent:* Thursday, March 26, 2015 2:34 PM
> *To:* Stephen Boesch
> *Cc:* Ulanov, Alexander; dev@spark.apache.org
> *Subject:* Re: Storing large data for MLlib machine learning
>
>
>
> On binary file formats - I looked at HDF5+Spark a couple of years ago and
> found it barely JVM-friendly and very Hadoop-unfriendly (e.g. the APIs
> needed filenames as input, you couldn't pass it anything like an
> InputStream). I don't know if it has gotten any better.
>
>
>
> Parquet plays much more nicely and there are lots of spark-related
> projects using it already. Keep in mind that it's column-oriented which
> might impact performance - but basically you're going to want your features
> in a byte array and deser should be pretty straightforward.
>
>
>
> On Thu, Mar 26, 2015 at 2:26 PM, Stephen Boesch  wrote:
>
> There are some convenience methods you might consider including:
>
>MLUtils.loadLibSVMFile
>
> and   MLUtils.loadLabeledPoint
>
> 2015-03-26 14:16 GMT-07:00 Ulanov, Alexander :
>
>
> > Hi,
> >
> > Could you suggest what would be the reasonable file format to store
> > feature vector data for machine learning in Spark MLlib? Are there any
> best
> > practices for Spark?
> >
> > My data is dense feature vectors with labels. Some of the requirements
> are
> > that the format should be easy loaded/serialized, randomly accessible,
> with
> > a small footprint (binary). I am considering Parquet, hdf5, protocol
> buffer
> > (protobuf), but I have little to no experience with them, so any
> > suggestions would be really appreciated.
> >
> > Best regards, Alexander
> >
>
>
>


Re: Storing large data for MLlib machine learning

2015-03-26 Thread Jeremy Freeman
Hi Ulanov, great question, we’ve encountered it frequently with scientific 
data (e.g. time series). Agreed, text is inefficient for dense arrays, and we 
also found HDF5+Spark to be a pain.
 
Our strategy has been flat binary files with fixed length records. Loading 
these is now supported in Spark via the binaryRecords method, which wraps a 
custom Hadoop InputFormat we wrote.

An example (in python):

> # write data from an array
> from numpy import random
> dat = random.randn(100,5)
> f = open('test.bin', 'wb')
> f.write(dat)
> f.close()

> # load the data back in
> from numpy import frombuffer
> nrecords = 5
> bytesize = 8
> recordsize = nrecords * bytesize
> data = sc.binaryRecords('test.bin', recordsize)
> parsed = data.map(lambda v: frombuffer(buffer(v, 0, recordsize), 'float'))

> # these should be equal
> parsed.first()
> dat[0,:]

Compared to something like Parquet, this is a little lighter-weight, and plays 
nicer with non-distributed data science tools (e.g. numpy). It also scales 
great (we use it routinely to process TBs of time series). And handles single 
files or directories. But it's extremely simple!

-
jeremyfreeman.net
@thefreemanlab

On Mar 26, 2015, at 2:33 PM, Ulanov, Alexander  wrote:

> Thanks for suggestion, but libsvm is a format for sparse data storing in text 
> file and I have dense vectors. In my opinion, text format is not appropriate 
> for storing large dense vectors due to overhead related to parsing from 
> string to digits and also storing digits as strings is not efficient.
> 
> From: Stephen Boesch [mailto:java...@gmail.com]
> Sent: Thursday, March 26, 2015 2:27 PM
> To: Ulanov, Alexander
> Cc: dev@spark.apache.org
> Subject: Re: Storing large data for MLlib machine learning
> 
> There are some convenience methods you might consider including:
> 
>   MLUtils.loadLibSVMFile
> 
> and   MLUtils.loadLabeledPoint
> 
> 2015-03-26 14:16 GMT-07:00 Ulanov, Alexander 
> mailto:alexander.ula...@hp.com>>:
> Hi,
> 
> Could you suggest what would be the reasonable file format to store feature 
> vector data for machine learning in Spark MLlib? Are there any best practices 
> for Spark?
> 
> My data is dense feature vectors with labels. Some of the requirements are 
> that the format should be easy loaded/serialized, randomly accessible, with a 
> small footprint (binary). I am considering Parquet, hdf5, protocol buffer 
> (protobuf), but I have little to no experience with them, so any suggestions 
> would be really appreciated.
> 
> Best regards, Alexander
> 



Re: RDD.map does not allowed to preservesPartitioning?

2015-03-26 Thread Zhan Zhang
Thanks Jonathan. You are right about rewriting the example.

I meant providing such an option to developers so that it is controllable. The 
example may seem silly, and I don’t know the use cases.

But, for example, if I also want to operate on both the key and the value to 
generate some new value while keeping the key untouched, then mapValues may not 
be able to do this.

Changing the code to allow this is trivial, but I don’t know whether there is 
some special reason behind this.

Thanks.

Zhan Zhang



On Mar 26, 2015, at 2:49 PM, Jonathan Coveney 
mailto:jcove...@gmail.com>> wrote:

I believe if you do the following:

sc.parallelize(List(1,2,3,4,5,5,6,6,7,8,9,10,2,4)).map((_,1)).reduceByKey(_+_).mapValues(_+1).reduceByKey(_+_).toDebugString

(8) MapPartitionsRDD[34] at reduceByKey at <console>:23 []
 |  MapPartitionsRDD[33] at mapValues at <console>:23 []
 |  ShuffledRDD[32] at reduceByKey at <console>:23 []
 +-(8) MapPartitionsRDD[31] at map at <console>:23 []
|  ParallelCollectionRDD[30] at parallelize at <console>:23 []

The difference is that Spark has no way to know that your map closure doesn't 
change the key. If you only use mapValues, it does. Pretty cool that they 
optimized that :)

2015-03-26 17:44 GMT-04:00 Zhan Zhang 
mailto:zzh...@hortonworks.com>>:
Hi Folks,

Does anybody know what is the reason not allowing preserverPartitioning in 
RDD.map? Do I miss something here?

Following example involves two shuffles. I think if preservePartitioning is 
allowed, we can avoid the second one, right?

 val r1 = sc.parallelize(List(1,2,3,4,5,5,6,6,7,8,9,10,2,4))
 val r2 = r1.map((_, 1))
 val r3 = r2.reduceByKey(_+_)
 val r4 = r3.map(x=>(x._1, x._2 + 1))
 val r5 = r4.reduceByKey(_+_)
 r5.collect.foreach(println)

scala> r5.toDebugString
res2: String =
(8) ShuffledRDD[4] at reduceByKey at <console>:29 []
 +-(8) MapPartitionsRDD[3] at map at <console>:27 []
    |  ShuffledRDD[2] at reduceByKey at <console>:25 []
    +-(8) MapPartitionsRDD[1] at map at <console>:23 []
       |  ParallelCollectionRDD[0] at parallelize at <console>:21 []

Thanks.

Zhan Zhang

-
To unsubscribe, e-mail: 
user-unsubscr...@spark.apache.org
For additional commands, e-mail: 
user-h...@spark.apache.org





RE: Storing large data for MLlib machine learning

2015-03-26 Thread Ulanov, Alexander
Thanks, Evan. What do you think about Protobuf? Twitter has a library to manage 
protobuf files in hdfs https://github.com/twitter/elephant-bird


From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
Sent: Thursday, March 26, 2015 2:34 PM
To: Stephen Boesch
Cc: Ulanov, Alexander; dev@spark.apache.org
Subject: Re: Storing large data for MLlib machine learning

On binary file formats - I looked at HDF5+Spark a couple of years ago and found 
it barely JVM-friendly and very Hadoop-unfriendly (e.g. the APIs needed 
filenames as input, you couldn't pass it anything like an InputStream). I don't 
know if it has gotten any better.

Parquet plays much more nicely and there are lots of spark-related projects 
using it already. Keep in mind that it's column-oriented which might impact 
performance - but basically you're going to want your features in a byte array 
and deser should be pretty straightforward.

On Thu, Mar 26, 2015 at 2:26 PM, Stephen Boesch 
mailto:java...@gmail.com>> wrote:
There are some convenience methods you might consider including:

   MLUtils.loadLibSVMFile

and   MLUtils.loadLabeledPoint

2015-03-26 14:16 GMT-07:00 Ulanov, Alexander 
mailto:alexander.ula...@hp.com>>:

> Hi,
>
> Could you suggest what would be the reasonable file format to store
> feature vector data for machine learning in Spark MLlib? Are there any best
> practices for Spark?
>
> My data is dense feature vectors with labels. Some of the requirements are
> that the format should be easy loaded/serialized, randomly accessible, with
> a small footprint (binary). I am considering Parquet, hdf5, protocol buffer
> (protobuf), but I have little to no experience with them, so any
> suggestions would be really appreciated.
>
> Best regards, Alexander
>



Re: RDD.map does not allowed to preservesPartitioning?

2015-03-26 Thread Jonathan Coveney
I believe if you do the following:

sc.parallelize(List(1,2,3,4,5,5,6,6,7,8,9,10,2,4)).map((_,1)).reduceByKey(_+_).mapValues(_+1).reduceByKey(_+_).toDebugString

(8) MapPartitionsRDD[34] at reduceByKey at <console>:23 []
 |  MapPartitionsRDD[33] at mapValues at <console>:23 []
 |  ShuffledRDD[32] at reduceByKey at <console>:23 []
 +-(8) MapPartitionsRDD[31] at map at <console>:23 []
|  ParallelCollectionRDD[30] at parallelize at <console>:23 []

The difference is that Spark has no way to know that your map closure
doesn't change the key. If you only use mapValues, it does. Pretty cool
that they optimized that :)

2015-03-26 17:44 GMT-04:00 Zhan Zhang :

> Hi Folks,
>
> Does anybody know what is the reason not allowing preserverPartitioning in
> RDD.map? Do I miss something here?
>
> Following example involves two shuffles. I think if preservePartitioning
> is allowed, we can avoid the second one, right?
>
>  val r1 = sc.parallelize(List(1,2,3,4,5,5,6,6,7,8,9,10,2,4))
>  val r2 = r1.map((_, 1))
>  val r3 = r2.reduceByKey(_+_)
>  val r4 = r3.map(x=>(x._1, x._2 + 1))
>  val r5 = r4.reduceByKey(_+_)
>  r5.collect.foreach(println)
>
> scala> r5.toDebugString
> res2: String =
> (8) ShuffledRDD[4] at reduceByKey at <console>:29 []
>  +-(8) MapPartitionsRDD[3] at map at <console>:27 []
> |  ShuffledRDD[2] at reduceByKey at <console>:25 []
> +-(8) MapPartitionsRDD[1] at map at <console>:23 []
>|  ParallelCollectionRDD[0] at parallelize at <console>:21 []
>
> Thanks.
>
> Zhan Zhang
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


RDD.map does not allowed to preservesPartitioning?

2015-03-26 Thread Zhan Zhang
Hi Folks,

Does anybody know what is the reason not allowing preserverPartitioning in 
RDD.map? Do I miss something here?

Following example involves two shuffles. I think if preservePartitioning is 
allowed, we can avoid the second one, right?

 val r1 = sc.parallelize(List(1,2,3,4,5,5,6,6,7,8,9,10,2,4))
 val r2 = r1.map((_, 1))
 val r3 = r2.reduceByKey(_+_)
 val r4 = r3.map(x=>(x._1, x._2 + 1))
 val r5 = r4.reduceByKey(_+_)
 r5.collect.foreach(println)

scala> r5.toDebugString
res2: String =
(8) ShuffledRDD[4] at reduceByKey at <console>:29 []
 +-(8) MapPartitionsRDD[3] at map at <console>:27 []
    |  ShuffledRDD[2] at reduceByKey at <console>:25 []
    +-(8) MapPartitionsRDD[1] at map at <console>:23 []
       |  ParallelCollectionRDD[0] at parallelize at <console>:21 []

Thanks.

Zhan Zhang

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Storing large data for MLlib machine learning

2015-03-26 Thread Evan R. Sparks
On binary file formats - I looked at HDF5+Spark a couple of years ago and
found it barely JVM-friendly and very Hadoop-unfriendly (e.g. the APIs
needed filenames as input, you couldn't pass it anything like an
InputStream). I don't know if it has gotten any better.

Parquet plays much more nicely and there are lots of spark-related projects
using it already. Keep in mind that it's column-oriented which might impact
performance - but basically you're going to want your features in a byte
array and deser should be pretty straightforward.

On Thu, Mar 26, 2015 at 2:26 PM, Stephen Boesch  wrote:

> There are some convenience methods you might consider including:
>
>MLUtils.loadLibSVMFile
>
> and   MLUtils.loadLabeledPoint
>
> 2015-03-26 14:16 GMT-07:00 Ulanov, Alexander :
>
> > Hi,
> >
> > Could you suggest what would be the reasonable file format to store
> > feature vector data for machine learning in Spark MLlib? Are there any
> best
> > practices for Spark?
> >
> > My data is dense feature vectors with labels. Some of the requirements
> are
> > that the format should be easy loaded/serialized, randomly accessible,
> with
> > a small footprint (binary). I am considering Parquet, hdf5, protocol
> buffer
> > (protobuf), but I have little to no experience with them, so any
> > suggestions would be really appreciated.
> >
> > Best regards, Alexander
> >
>


RE: Storing large data for MLlib machine learning

2015-03-26 Thread Ulanov, Alexander
Thanks for suggestion, but libsvm is a format for sparse data storing in text 
file and I have dense vectors. In my opinion, text format is not appropriate 
for storing large dense vectors due to overhead related to parsing from string 
to digits and also storing digits as strings is not efficient.

From: Stephen Boesch [mailto:java...@gmail.com]
Sent: Thursday, March 26, 2015 2:27 PM
To: Ulanov, Alexander
Cc: dev@spark.apache.org
Subject: Re: Storing large data for MLlib machine learning

There are some convenience methods you might consider including:

   MLUtils.loadLibSVMFile

and   MLUtils.loadLabeledPoint

2015-03-26 14:16 GMT-07:00 Ulanov, Alexander 
mailto:alexander.ula...@hp.com>>:
Hi,

Could you suggest what would be the reasonable file format to store feature 
vector data for machine learning in Spark MLlib? Are there any best practices 
for Spark?

My data is dense feature vectors with labels. Some of the requirements are that 
the format should be easy loaded/serialized, randomly accessible, with a small 
footprint (binary). I am considering Parquet, hdf5, protocol buffer (protobuf), 
but I have little to no experience with them, so any suggestions would be 
really appreciated.

Best regards, Alexander



Re: Storing large data for MLlib machine learning

2015-03-26 Thread Stephen Boesch
There are some convenience methods you might consider including:

   MLUtils.loadLibSVMFile

and   MLUtils.loadLabeledPoint

2015-03-26 14:16 GMT-07:00 Ulanov, Alexander :

> Hi,
>
> Could you suggest what would be the reasonable file format to store
> feature vector data for machine learning in Spark MLlib? Are there any best
> practices for Spark?
>
> My data is dense feature vectors with labels. Some of the requirements are
> that the format should be easy loaded/serialized, randomly accessible, with
> a small footprint (binary). I am considering Parquet, hdf5, protocol buffer
> (protobuf), but I have little to no experience with them, so any
> suggestions would be really appreciated.
>
> Best regards, Alexander
>


Storing large data for MLlib machine learning

2015-03-26 Thread Ulanov, Alexander
Hi,

Could you suggest what would be the reasonable file format to store feature 
vector data for machine learning in Spark MLlib? Are there any best practices 
for Spark?

My data is dense feature vectors with labels. Some of the requirements are that 
the format should be easily loaded/serialized, randomly accessible, with a small 
footprint (binary). I am considering Parquet, hdf5, protocol buffer (protobuf), 
but I have little to no experience with them, so any suggestions would be 
really appreciated.

Best regards, Alexander


Re: Using CUDA within Spark / boosting linear algebra

2015-03-26 Thread Sean Owen
The license issue is with libgfortran, rather than OpenBLAS.

(FWIW I am going through the motions to get OpenBLAS set up by default
on CDH in the near future, and the hard part is just handling
libgfortran.)

On Thu, Mar 26, 2015 at 4:07 PM, Evan R. Sparks  wrote:
> Alright Sam - you are the expert here. If the GPL issues are unavoidable,
> that's fine - what is the exact bit of code that is GPL?
>
> The suggestion to use OpenBLAS is not to say it's the best option, but that
> it's a *free, reasonable default* for many users - keep in mind the most
> common deployment for Spark/MLlib is on 64-bit linux on EC2[1].
> Additionally, for many of the problems we're targeting, this reasonable
> default can provide a 1-2 orders of magnitude improvement in performance
> over the f2jblas implementation that netlib-java falls back on.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Using CUDA within Spark / boosting linear algebra

2015-03-26 Thread Sam Halliday
John, I have to disagree with you there. Dense matrices come up a lot in
industry,  although your personal experience may be different.
On 26 Mar 2015 16:20, "John Canny"  wrote:

>  I mentioned this earlier in the thread, but I'll put it out again. Dense
> BLAS are not very important for most machine learning workloads: at least
> for non-image workloads in industry (and for image processing you would
> probably want a deep learning/SGD solution with convolution kernels). For
> example, dense BLAS was only relevant for 1/7 of our recent benchmarks, which
> should be a reasonable sample. What really matters is sparse BLAS performance.
> BIDMat is still an order of magnitude faster there. Those kernels are only in
> BIDMat, since NVIDIA's sparse BLAS don't perform well on power-law data.
>
> It's also the case that the overall performance of an algorithm is
> determined by the slowest kernel, not the fastest. If the goal is to get
> closer to BIDMach's performance on typical problems, you need to make sure
> that every kernel goes at comparable speed. So the real question is how
> much faster MLlib routines run on a complete problem with/without GPU
> acceleration. For BIDMach, it's close to a factor of 10. But that required
> running entirely on the GPU, and making sure every kernel is close to its
> limit.
>
> -John
>
> If you think nvblas would be helpful, you should try it in some end-to-end
> benchmarks.
> On 3/25/15, 6:23 PM, Evan R. Sparks wrote:
>
> Yeah, much more reasonable - nice to know that we can get full GPU
> performance from breeze/netlib-java - meaning there's no compelling
> performance reason to switch out our current linear algebra library (at
> least as far as this benchmark is concerned).
>
>  Instead, it looks like a user guide for configuring Spark/MLlib to use
> the right BLAS library will get us most of the way there. Or, would it make
> sense to finally ship openblas compiled for some common platforms (64-bit
> linux, windows, mac) directly with Spark - hopefully eliminating the jblas
> warnings once and for all for most users? (Licensing is BSD) Or am I
> missing something?
>
> On Wed, Mar 25, 2015 at 6:03 PM, Ulanov, Alexander <
> alexander.ula...@hp.com> wrote:
>
>> As everyone suggested, the results were too good to be true, so I
>> double-checked them. It turns out that nvblas did not do multiplication due to
>> parameter NVBLAS_TILE_DIM from "nvblas.conf" and returned zero matrix. My
>> previously posted results with nvblas are matrices copying only. The
>> default NVBLAS_TILE_DIM==2048 is too big for my graphic card/matrix size. I
>> handpicked other values that worked. As a result, netlib+nvblas is on par
>> with BIDMat-cuda. As promised, I am going to post a how-to for nvblas
>> configuration.
>>
>>
>> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
>>
>>
>>
>> -Original Message-
>> From: Ulanov, Alexander
>> Sent: Wednesday, March 25, 2015 2:31 PM
>> To: Sam Halliday
>>  Cc: dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R.
>> Sparks; jfcanny
>> Subject: RE: Using CUDA within Spark / boosting linear algebra
>>
>> Hi again,
>>
>> I finally managed to use nvblas within Spark+netlib-java. It has
>> exceptional performance for big matrices with Double, faster than
>> BIDMat-cuda with Float. But for smaller matrices, if you will copy them
>> to/from GPU, OpenBlas or MKL might be a better choice. This correlates with
>> original nvblas presentation on GPU conf 2013 (slide 21):
>> http://on-demand.gputechconf.com/supercomputing/2013/presentation/SC3108-New-Features-CUDA%206%20-GPU-Acceleration.pdf
>>
>> My results:
>>
>> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
>>
>> Just in case, these tests are not for generalization of performance of
>> different libraries. I just want to pick a library that does at best dense
>> matrices multiplication for my task.
>>
>> P.S. My previous issue with nvblas was the following: it has Fortran blas
>> functions, at the same time netlib-java uses C cblas functions. So, one
>> needs cblas shared library to use nvblas through netlib-java. Fedora does
>> not have cblas (but Debian and Ubuntu have), so I needed to compile it. I
>> could not use cblas from Atlas or Openblas because they link to their
>> implementation and not to Fortran blas.
>>
>> Best regards, Alexander
>>
>> -Original Message-
>> From: Ulanov, Alexander
>> Sent: Tuesday, March 24, 2015 6:57 PM
>> To: Sam Halliday
>> Cc: dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R. Sparks
>> Subject: RE: Using CUDA within Spark / boosting linear algebra
>>
>> Hi,
>>
>> I am trying to use nvblas with netlib-java from Spark. nvblas functions
>> should replace current blas functions calls after executing LD_PRELOAD as
>> suggested in http://docs.nvidia.com/cuda/nvblas/#Usage without any
>> changes to netlib-java. It seems to work for simple Java example, but I
>> cannot ma

Re: Using CUDA within Spark / boosting linear algebra

2015-03-26 Thread John Canny
I mentioned this earlier in the thread, but I'll put it out again. Dense 
BLAS are not very important for most machine learning workloads: at 
least for non-image workloads in industry (and for image processing you 
would probably want a deep learning/SGD solution with convolution 
kernels). For example, dense BLAS was only relevant for 1/7 of our recent 
benchmarks, which should be a reasonable sample. What really matters is sparse 
BLAS performance. BIDMat is still an order of magnitude faster there. Those 
kernels are only in BIDMat, since NVIDIA's sparse BLAS don't perform well 
on power-law data.


It's also the case that the overall performance of an algorithm is 
determined by the slowest kernel, not the fastest. If the goal is to get 
closer to BIDMach's performance on typical problems, you need to make 
sure that every kernel goes at comparable speed. So the real question is 
how much faster MLlib routines run on a complete problem with/without GPU 
acceleration. For BIDMach, it's close to a factor of 10. But that 
required running entirely on the GPU, and making sure every kernel is 
close to its limit.


-John

If you think nvblas would be helpful, you should try it in some 
end-to-end benchmarks.

On 3/25/15, 6:23 PM, Evan R. Sparks wrote:
Yeah, much more reasonable - nice to know that we can get full GPU 
performance from breeze/netlib-java - meaning there's no compelling 
performance reason to switch out our current linear algebra library 
(at least as far as this benchmark is concerned).


Instead, it looks like a user guide for configuring Spark/MLlib to use 
the right BLAS library will get us most of the way there. Or, would it 
make sense to finally ship openblas compiled for some common platforms 
(64-bit linux, windows, mac) directly with Spark - hopefully 
eliminating the jblas warnings once and for all for most users? 
(Licensing is BSD) Or am I missing something?


On Wed, Mar 25, 2015 at 6:03 PM, Ulanov, Alexander 
mailto:alexander.ula...@hp.com>> wrote:


As everyone suggested, the results were too good to be true, so I
double-checked them. It turns out that nvblas did not do
multiplication due to parameter NVBLAS_TILE_DIM from "nvblas.conf"
and returned zero matrix. My previously posted results with nvblas
are matrices copying only. The default NVBLAS_TILE_DIM==2048 is
too big for my graphic card/matrix size. I handpicked other values
that worked. As a result, netlib+nvblas is on par with
BIDMat-cuda. As promised, I am going to post a how-to for nvblas
configuration.


https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing



-Original Message-
From: Ulanov, Alexander
Sent: Wednesday, March 25, 2015 2:31 PM
To: Sam Halliday
Cc: dev@spark.apache.org ; Xiangrui
Meng; Joseph Bradley; Evan R. Sparks; jfcanny
Subject: RE: Using CUDA within Spark / boosting linear algebra

Hi again,

I finally managed to use nvblas within Spark+netlib-java. It has
exceptional performance for big matrices with Double, faster than
BIDMat-cuda with Float. But for smaller matrices, if you will copy
them to/from GPU, OpenBlas or MKL might be a better choice. This
correlates with original nvblas presentation on GPU conf 2013
(slide 21):

http://on-demand.gputechconf.com/supercomputing/2013/presentation/SC3108-New-Features-CUDA%206%20-GPU-Acceleration.pdf

My results:

https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing

Just in case, these tests are not for generalization of
performance of different libraries. I just want to pick a library
that does at best dense matrices multiplication for my task.

P.S. My previous issue with nvblas was the following: it has
Fortran blas functions, at the same time netlib-java uses C cblas
functions. So, one needs cblas shared library to use nvblas
through netlib-java. Fedora does not have cblas (but Debian and
Ubuntu have), so I needed to compile it. I could not use cblas
from Atlas or Openblas because they link to their implementation
and not to Fortran blas.

Best regards, Alexander

-Original Message-
From: Ulanov, Alexander
Sent: Tuesday, March 24, 2015 6:57 PM
To: Sam Halliday
Cc: dev@spark.apache.org ; Xiangrui
Meng; Joseph Bradley; Evan R. Sparks
Subject: RE: Using CUDA within Spark / boosting linear algebra

Hi,

I am trying to use nvblas with netlib-java from Spark. nvblas
functions should replace current blas functions calls after
executing LD_PRELOAD as suggested in
http://docs.nvidia.com/cuda/nvblas/#Usage without any changes to
netlib-java. It seems to work for simple Java example, but I
cannot make it work with Spark. I run the following:
export LD_LIBRARY_PATH=/usr/local/cuda-6.5/lib64
env LD_PRELOAD=/usr

Re: Using CUDA within Spark / boosting linear algebra

2015-03-26 Thread Evan R. Sparks
Alright Sam - you are the expert here. If the GPL issues are unavoidable,
that's fine - what is the exact bit of code that is GPL?

The suggestion to use OpenBLAS is not to say it's the best option, but that
it's a *free, reasonable default* for many users - keep in mind the most
common deployment for Spark/MLlib is on 64-bit linux on EC2[1].
Additionally, for many of the problems we're targeting, this reasonable
default can provide a 1-2 orders of magnitude improvement in performance
over the f2jblas implementation that netlib-java falls back on.

The JVM issues are trickier, I agree - so it sounds like a good user guide
explaining the tradeoffs and configuration procedures as they relate to
spark is a reasonable way forward.

[1] -
https://gigaom.com/2015/01/27/a-few-interesting-numbers-about-apache-spark/
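
One small thing such a guide could include is a quick way to confirm which BLAS
backend netlib-java actually loaded. This is only a sketch; the class names
below are, as far as I know, the ones netlib-java uses, and the nvblas remark
applies only if it was LD_PRELOADed as discussed earlier in the thread:

// Run in spark-shell after setting up the native libraries:
import com.github.fommil.netlib.BLAS
println(BLAS.getInstance().getClass.getName)
// ...NativeSystemBLAS -> system libblas (possibly nvblas via LD_PRELOAD)
// ...NativeRefBLAS    -> bundled reference implementation
// ...F2jBLAS          -> pure-JVM fallback (no natives found)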

On Thu, Mar 26, 2015 at 12:54 AM, Sam Halliday 
wrote:

> Btw, OpenBLAS requires GPL runtime binaries which are typically considered
> "system libraries" (and these fall under something similar to the Java
> classpath exception rule)... so it's basically impossible to distribute
> OpenBLAS the way you're suggesting, sorry. Indeed, there is work ongoing in
> Spark right now to clear up something of this nature.
>
> On a more technical level, I'd recommend watching my talk at ScalaX which
> explains in detail why high performance only comes from machine optimised
> binaries, which requires DevOps buy-in (and, I'd recommend using MKL anyway
> on the CPU, not OpenBLAS).
>
> On an even deeper level, using natives has consequences to JIT and GC
> which isn't suitable for everybody and we'd really like people to go into
> that with their eyes wide open.
> On 26 Mar 2015 07:43, "Sam Halliday"  wrote:
>
>> I'm not at all surprised ;-) I fully expect the GPU performance to get
>> better automatically as the hardware improves.
>>
>> Netlib natives still need to be shipped separately. I'd also oppose any
>> move to make Open BLAS the default - is not always better and I think
>> natives really need DevOps buy-in. It's not the right solution for
>> everybody.
>> On 26 Mar 2015 01:23, "Evan R. Sparks"  wrote:
>>
>>> Yeah, much more reasonable - nice to know that we can get full GPU
>>> performance from breeze/netlib-java - meaning there's no compelling
>>> performance reason to switch out our current linear algebra library (at
>>> least as far as this benchmark is concerned).
>>>
>>> Instead, it looks like a user guide for configuring Spark/MLlib to use
>>> the right BLAS library will get us most of the way there. Or, would it make
>>> sense to finally ship openblas compiled for some common platforms (64-bit
>>> linux, windows, mac) directly with Spark - hopefully eliminating the jblas
>>> warnings once and for all for most users? (Licensing is BSD) Or am I
>>> missing something?
>>>
>>> On Wed, Mar 25, 2015 at 6:03 PM, Ulanov, Alexander <
>>> alexander.ula...@hp.com> wrote:
>>>
 As everyone suggested, the results were too good to be true, so I
 double-checked them. It turns out that nvblas did not do multiplication due to
 parameter NVBLAS_TILE_DIM from "nvblas.conf" and returned zero matrix. My
 previously posted results with nvblas are matrices copying only. The
 default NVBLAS_TILE_DIM==2048 is too big for my graphic card/matrix size. I
 handpicked other values that worked. As a result, netlib+nvblas is on par
 with BIDMat-cuda. As promised, I am going to post a how-to for nvblas
 configuration.


 https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing



 -Original Message-
 From: Ulanov, Alexander
 Sent: Wednesday, March 25, 2015 2:31 PM
 To: Sam Halliday
 Cc: dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R.
 Sparks; jfcanny
 Subject: RE: Using CUDA within Spark / boosting linear algebra

 Hi again,

 I finally managed to use nvblas within Spark+netlib-java. It has
 exceptional performance for big matrices with Double, faster than
 BIDMat-cuda with Float. But for smaller matrices, if you will copy them
 to/from GPU, OpenBlas or MKL might be a better choice. This correlates with
 original nvblas presentation on GPU conf 2013 (slide 21):
 http://on-demand.gputechconf.com/supercomputing/2013/presentation/SC3108-New-Features-CUDA%206%20-GPU-Acceleration.pdf

 My results:

 https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing

 Just in case, these tests are not for generalization of performance of
 different libraries. I just want to pick a library that does at best dense
 matrices multiplication for my task.

 P.S. My previous issue with nvblas was the following: it has Fortran
 blas functions, at the same time netlib-java uses C cblas functions. So,
 one needs cblas shared library to use nvblas through netlib-java. Fedora
 does not ha

Re: functools.partial as UserDefinedFunction

2015-03-26 Thread Karlson

Hi,

I've filed a JIRA (https://issues.apache.org/jira/browse/SPARK-6553) and 
suggested a fix (https://github.com/apache/spark/pull/5206).



On 2015-03-25 19:49, Davies Liu wrote:

It’s good to support functools.partial, could you file a JIRA for it?


On Wednesday, March 25, 2015 at 5:42 AM, Karlson wrote:



Hi all,

passing a functools.partial-function as a UserDefinedFunction to
DataFrame.select raises an AttributeError, because 
functools.partial

does not have the attribute __name__. Is there any alternative to
relying on __name__ in pyspark/sql/functions.py:126 ?


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org 
(mailto:dev-unsubscr...@spark.apache.org)
For additional commands, e-mail: dev-h...@spark.apache.org 
(mailto:dev-h...@spark.apache.org)





-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Using CUDA within Spark / boosting linear algebra

2015-03-26 Thread Sam Halliday
Btw, OpenBLAS requires GPL runtime binaries which are typically considered
"system libraries" (and these fall under something similar to the Java
classpath exception rule)... so it's basically impossible to distribute
OpenBLAS the way you're suggesting, sorry. Indeed, there is work ongoing in
Spark right now to clear up something of this nature.

On a more technical level, I'd recommend watching my talk at ScalaX which
explains in detail why high performance only comes from machine optimised
binaries, which requires DevOps buy-in (and, I'd recommend using MKL anyway
on the CPU, not OpenBLAS).

On an even deeper level, using natives has consequences to JIT and GC which
isn't suitable for everybody and we'd really like people to go into that
with their eyes wide open.
On 26 Mar 2015 07:43, "Sam Halliday"  wrote:

> I'm not at all surprised ;-) I fully expect the GPU performance to get
> better automatically as the hardware improves.
>
> Netlib natives still need to be shipped separately. I'd also oppose any
> move to make Open BLAS the default - is not always better and I think
> natives really need DevOps buy-in. It's not the right solution for
> everybody.
> On 26 Mar 2015 01:23, "Evan R. Sparks"  wrote:
>
>> Yeah, much more reasonable - nice to know that we can get full GPU
>> performance from breeze/netlib-java - meaning there's no compelling
>> performance reason to switch out our current linear algebra library (at
>> least as far as this benchmark is concerned).
>>
>> Instead, it looks like a user guide for configuring Spark/MLlib to use
>> the right BLAS library will get us most of the way there. Or, would it make
>> sense to finally ship openblas compiled for some common platforms (64-bit
>> linux, windows, mac) directly with Spark - hopefully eliminating the jblas
>> warnings once and for all for most users? (Licensing is BSD) Or am I
>> missing something?
>>
>> On Wed, Mar 25, 2015 at 6:03 PM, Ulanov, Alexander <
>> alexander.ula...@hp.com> wrote:
>>
>>> As everyone suggested, the results were too good to be true, so I
>>> double-checked them. It turns out that nvblas did not do multiplication due to
>>> parameter NVBLAS_TILE_DIM from "nvblas.conf" and returned zero matrix. My
>>> previously posted results with nvblas are matrices copying only. The
>>> default NVBLAS_TILE_DIM==2048 is too big for my graphic card/matrix size. I
>>> handpicked other values that worked. As a result, netlib+nvblas is on par
>>> with BIDMat-cuda. As promised, I am going to post a how-to for nvblas
>>> configuration.
>>>
>>>
>>> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
>>>
>>>
>>>
>>> -Original Message-
>>> From: Ulanov, Alexander
>>> Sent: Wednesday, March 25, 2015 2:31 PM
>>> To: Sam Halliday
>>> Cc: dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R.
>>> Sparks; jfcanny
>>> Subject: RE: Using CUDA within Spark / boosting linear algebra
>>>
>>> Hi again,
>>>
>>> I finally managed to use nvblas within Spark+netlib-java. It has
>>> exceptional performance for big matrices with Double, faster than
>>> BIDMat-cuda with Float. But for smaller matrices, if you will copy them
>>> to/from GPU, OpenBlas or MKL might be a better choice. This correlates with
>>> original nvblas presentation on GPU conf 2013 (slide 21):
>>> http://on-demand.gputechconf.com/supercomputing/2013/presentation/SC3108-New-Features-CUDA%206%20-GPU-Acceleration.pdf
>>>
>>> My results:
>>>
>>> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
>>>
>>> Just in case, these tests are not for generalization of performance of
>>> different libraries. I just want to pick a library that does at best dense
>>> matrices multiplication for my task.
>>>
>>> P.S. My previous issue with nvblas was the following: it has Fortran
>>> blas functions, at the same time netlib-java uses C cblas functions. So,
>>> one needs cblas shared library to use nvblas through netlib-java. Fedora
>>> does not have cblas (but Debian and Ubuntu have), so I needed to compile
>>> it. I could not use cblas from Atlas or Openblas because they link to their
>>> implementation and not to Fortran blas.
>>>
>>> Best regards, Alexander
>>>
>>> -Original Message-
>>> From: Ulanov, Alexander
>>> Sent: Tuesday, March 24, 2015 6:57 PM
>>> To: Sam Halliday
>>> Cc: dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R. Sparks
>>> Subject: RE: Using CUDA within Spark / boosting linear algebra
>>>
>>> Hi,
>>>
>>> I am trying to use nvblas with netlib-java from Spark. nvblas functions
>>> should replace current blas functions calls after executing LD_PRELOAD as
>>> suggested in http://docs.nvidia.com/cuda/nvblas/#Usage without any
>>> changes to netlib-java. It seems to work for simple Java example, but I
>>> cannot make it work with Spark. I run the following:
>>> export LD_LIBRARY_PATH=/usr/local/cuda-6.5/lib64
>>> env LD_PRELOAD=/usr/local/cuda-6.5/lib64/libnvblas.so ./

Re: Using CUDA within Spark / boosting linear algebra

2015-03-26 Thread Sam Halliday
I'm not at all surprised ;-) I fully expect the GPU performance to get
better automatically as the hardware improves.

Netlib natives still need to be shipped separately. I'd also oppose any
move to make Open BLAS the default - is not always better and I think
natives really need DevOps buy-in. It's not the right solution for
everybody.
On 26 Mar 2015 01:23, "Evan R. Sparks"  wrote:

> Yeah, much more reasonable - nice to know that we can get full GPU
> performance from breeze/netlib-java - meaning there's no compelling
> performance reason to switch out our current linear algebra library (at
> least as far as this benchmark is concerned).
>
> Instead, it looks like a user guide for configuring Spark/MLlib to use the
> right BLAS library will get us most of the way there. Or, would it make
> sense to finally ship openblas compiled for some common platforms (64-bit
> linux, windows, mac) directly with Spark - hopefully eliminating the jblas
> warnings once and for all for most users? (Licensing is BSD) Or am I
> missing something?
>
> On Wed, Mar 25, 2015 at 6:03 PM, Ulanov, Alexander <
> alexander.ula...@hp.com> wrote:
>
>> As everyone suggested, the results were too good to be true, so I
>> double-checked them. It turns out that nvblas did not do multiplication due to
>> parameter NVBLAS_TILE_DIM from "nvblas.conf" and returned zero matrix. My
>> previously posted results with nvblas are matrices copying only. The
>> default NVBLAS_TILE_DIM==2048 is too big for my graphic card/matrix size. I
>> handpicked other values that worked. As a result, netlib+nvblas is on par
>> with BIDMat-cuda. As promised, I am going to post a how-to for nvblas
>> configuration.
>>
>>
>> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
>>
>>
>>
>> -Original Message-
>> From: Ulanov, Alexander
>> Sent: Wednesday, March 25, 2015 2:31 PM
>> To: Sam Halliday
>> Cc: dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R. Sparks;
>> jfcanny
>> Subject: RE: Using CUDA within Spark / boosting linear algebra
>>
>> Hi again,
>>
>> I finally managed to use nvblas within Spark+netlib-java. It has
>> exceptional performance for big matrices with Double, faster than
>> BIDMat-cuda with Float. But for smaller matrices, if you will copy them
>> to/from GPU, OpenBlas or MKL might be a better choice. This correlates with
>> original nvblas presentation on GPU conf 2013 (slide 21):
>> http://on-demand.gputechconf.com/supercomputing/2013/presentation/SC3108-New-Features-CUDA%206%20-GPU-Acceleration.pdf
>>
>> My results:
>>
>> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
>>
>> Just in case, these tests are not for generalization of performance of
>> different libraries. I just want to pick a library that does at best dense
>> matrices multiplication for my task.
>>
>> P.S. My previous issue with nvblas was the following: it has Fortran blas
>> functions, at the same time netlib-java uses C cblas functions. So, one
>> needs cblas shared library to use nvblas through netlib-java. Fedora does
>> not have cblas (but Debian and Ubuntu have), so I needed to compile it. I
>> could not use cblas from Atlas or Openblas because they link to their
>> implementation and not to Fortran blas.
>>
>> Best regards, Alexander
>>
>> -Original Message-
>> From: Ulanov, Alexander
>> Sent: Tuesday, March 24, 2015 6:57 PM
>> To: Sam Halliday
>> Cc: dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R. Sparks
>> Subject: RE: Using CUDA within Spark / boosting linear algebra
>>
>> Hi,
>>
>> I am trying to use nvblas with netlib-java from Spark. nvblas functions
>> should replace current blas functions calls after executing LD_PRELOAD as
>> suggested in http://docs.nvidia.com/cuda/nvblas/#Usage without any
>> changes to netlib-java. It seems to work for simple Java example, but I
>> cannot make it work with Spark. I run the following:
>> export LD_LIBRARY_PATH=/usr/local/cuda-6.5/lib64
>> env LD_PRELOAD=/usr/local/cuda-6.5/lib64/libnvblas.so ./spark-shell
>> --driver-memory 4G In nvidia-smi I observe that Java is to use GPU:
>>
>> +-----------------------------------------------------------------------------+
>> | Processes:                                                       GPU Memory |
>> |  GPU   PID   Type   Process name                                 Usage      |
>> |=============================================================================|
>> |    0   8873   C     bash                                         39MiB      |
>> |    0   8910   C     /usr/lib/jvm/java-1.7.0/bin/java             39MiB      |
>> +-----------------------------------------------------------------------------+
>>
>> In Spark shell I do matrix multiplication and see the following:
>> 15/03/25 06:48:01 INFO JniLoader: successfully loaded
>> /tmp/jniloader8192964377009965483netlib-native_system-linux-x86_64.so
>> So I am sure that netlib-native is loaded and cblas supposedly used.
>> However, matrix 

Can I call aggregate UDF in DataFrame?

2015-03-26 Thread Haopu Wang
Specifically there are only 5 aggregate functions in class
org.apache.spark.sql.GroupedData: sum/max/min/mean/count.

Can I plug in a function to calculate stddev?

Thank you!
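
One workaround while there is no built-in stddev: derive it from the aggregates
that do exist. The sketch below uses the Spark 1.3 DataFrame API; df, the
grouping column "k" and the numeric column "x" are hypothetical names, not
anything defined in this message.

import org.apache.spark.sql.functions._

// Population variance = E[x^2] - E[x]^2, computed from count, sum and sum of squares.
val stats = df.groupBy("k").agg(
  count(df("x")).as("n"),
  sum(df("x")).as("sx"),
  sum(df("x") * df("x")).as("sxx"))

val withVariance = stats.withColumn(
  "variance",
  col("sxx") / col("n") - (col("sx") / col("n")) * (col("sx") / col("n")))
// stddev is then the square root of the "variance" column (e.g. via a simple UDF).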


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org