Re: Unit test logs in Jenkins?

2015-04-01 Thread Patrick Wendell
Hey Marcelo,

Great question. Right now, some of the more active developers have an
account that allows them to log into this cluster to inspect logs (we
copy the logs from each run to a node on that cluster). The
infrastructure is maintained by the AMPLab.

I will put you in touch with someone there who can get you an account.

This is a short term solution. The longer term solution is to have
these scp'd regularly to an S3 bucket or somewhere people can get
access to them, but that's not ready yet.

- Patrick



On Wed, Apr 1, 2015 at 1:01 PM, Marcelo Vanzin  wrote:
> Hey all,
>
> Is there a way to access unit test logs in jenkins builds? e.g.,
> core/target/unit-tests.log
>
> That would be really helpful to debug build failures. The scalatest
> output isn't all that helpful.
>
> If that's currently not available, would it be possible to add those
> logs as build artifacts?
>
> --
> Marcelo
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



RE: Can I call aggregate UDF in DataFrame?

2015-04-01 Thread Haopu Wang
Great! Thank you!

 



From: Reynold Xin [mailto:r...@databricks.com] 
Sent: Thursday, April 02, 2015 8:11 AM
To: Haopu Wang
Cc: user; dev@spark.apache.org
Subject: Re: Can I call aggregate UDF in DataFrame?

 

You totally can.

 

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala#L792

 

There is also an attempt at adding stddev here already:
https://github.com/apache/spark/pull/5228

 

 

 

On Thu, Mar 26, 2015 at 12:37 AM, Haopu Wang 
wrote:

Specifically there are only 5 aggregate functions in class
org.apache.spark.sql.GroupedData: sum/max/min/mean/count.

Can I plug in a function to calculate stddev?

Thank you!


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

 



Re: Can I call aggregate UDF in DataFrame?

2015-04-01 Thread Reynold Xin
You totally can.

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala#L792

There is also an attempt at adding stddev here already:
https://github.com/apache/spark/pull/5228
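
For illustration, a minimal spark-shell sketch of the above: built-in
aggregates passed to agg(), plus a population stddev expressed with the
existing aggregates until that PR lands. The table and column names are
invented for this example.

    // spark-shell sketch; sc and sqlContext are provided by the shell.
    import org.apache.spark.sql.functions._
    import sqlContext.implicits._

    val df = sc.parallelize(Seq(("a", 1.0), ("a", 3.0), ("b", 2.0))).toDF("k", "v")

    // Built-in aggregates can be combined freely in agg(...):
    df.groupBy("k").agg(avg($"v"), max($"v"), count($"v")).show()

    // Population stddev from existing aggregates: sqrt(E[v^2] - E[v]^2)
    df.groupBy("k")
      .agg(sqrt(avg($"v" * $"v") - avg($"v") * avg($"v")).as("stddev_pop"))
      .show()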



On Thu, Mar 26, 2015 at 12:37 AM, Haopu Wang  wrote:

> Specifically there are only 5 aggregate functions in class
> org.apache.spark.sql.GroupedData: sum/max/min/mean/count.
>
> Can I plug in a function to calculate stddev?
>
> Thank you!
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Volunteers for Spark MOOCs

2015-04-01 Thread Ameet Talwalkar
Dear Spark Devs,

Anthony Joseph and I are teaching two large MOOCs this summer on
Apache Spark and we are looking for participants from the community who
would like to help us administer the course.


Anthony is a Professor in Computer Science at UC Berkeley in the AMPLab,
and his course will be an introduction to big data analysis using Spark.  I am
an Assistant Professor in Computer Science at UCLA (and a former AMPLab
post-doc), and my course will be about using Spark for Machine Learning
pipelines.  The courses will
be taught in Python, and will be freely available via edX, a non-profit
MOOC provider that partners with many top universities across the world.

We're looking for volunteer Teaching Assistants (TAs) with at least two of
the following skills: ability to deal with Spark or Python setup issues,
basic Spark programming and debugging experience, basic ML knowledge, and
basic operations skills (writing and using scripts, helping with
username/password issues, etc.).  TAing is a great opportunity to interact
with a wide audience of Spark enthusiasts, and TAs will be formally listed
as part of the course staff on the edX website.

We are looking for a time commitment of roughly 10 hours per week in May and
at least 20 hours a week in June and July, though we are quite flexible
about specific working hours and location (since we will have students
around the world). We can also offer a stipend.

Please contact Anthony and me directly if you are interested.

Thanks,

Ameet and Anthony


Re: Spark 2.0: Rearchitecting Spark for Mobile, Local, Social

2015-04-01 Thread Burak Yavuz
This is awesome! I can write the apps for it, to make the Web UI more
functional!

On Wed, Apr 1, 2015 at 12:37 AM, Tathagata Das 
wrote:

> This is a significant effort that Reynold has undertaken, and I am super
> glad to see that it's finally taking a concrete form. Would love to see
> what the community thinks about the idea.
>
> TD
>
> On Wed, Apr 1, 2015 at 3:11 AM, Reynold Xin  wrote:
>
> > Hi Spark devs,
> >
> > I've spent the last few months investigating the feasibility of
> > re-architecting Spark for mobile platforms, considering the growing
> > population of Android/iOS users. I'm happy to share with you my findings
> at
> > https://issues.apache.org/jira/browse/SPARK-6646
> >
> > The tl;dr is that we should support running Spark on Android/iOS, and the
> > best way to do this at the moment is to use Scala.js to compile Spark
> code
> > into JavaScript, and then run it in Safari or Chrome (and even node.js
> > potentially for servers).
> >
> > If you are on your phones right now and prefer reading a blog post rather
> > than a PDF file, you can read more about the design doc at
> >
> >
> https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html
> >
> >
> > This is done in collaboration with TD, Xiangrui, Patrick. Look forward to
> > your feedback!
> >
>


RE: Storing large data for MLlib machine learning

2015-04-01 Thread Ulanov, Alexander
Jeremy, thanks for the explanation!
What if you used the Parquet file format instead? You could still write a number
of small files as you do now, but you wouldn't have to implement a writer/reader,
because they are already available for Parquet in various languages.
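
A rough sketch of that Parquet route (the schema, paths, and 1.3-era API names
below are illustrative assumptions, not something from this thread):

    // spark-shell sketch; sc and sqlContext are provided by the shell.
    import sqlContext.implicits._

    case class Example(label: Double, features: Seq[Double])

    val data = sc.parallelize(Seq(
      Example(1.0, Seq(0.1, 0.2, 0.3)),
      Example(0.0, Seq(0.4, 0.5, 0.6))))

    // One Parquet file per partition; no hand-written reader/writer needed.
    data.toDF().saveAsParquetFile("hdfs:///tmp/features.parquet")

    // Read back; Parquet readers also exist for Python, Java, C++, etc.
    val df = sqlContext.parquetFile("hdfs:///tmp/features.parquet")
    df.printSchema()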

From: Jeremy Freeman [mailto:freeman.jer...@gmail.com]
Sent: Wednesday, April 01, 2015 1:37 PM
To: Hector Yee
Cc: Ulanov, Alexander; Evan R. Sparks; Stephen Boesch; dev@spark.apache.org
Subject: Re: Storing large data for MLlib machine learning

@Alexander, re: using flat binary and metadata, you raise excellent points! At 
least in our case, we decided on a specific endianness, but do end up storing 
some extremely minimal specification in a JSON file, and have written importers 
and exporters within our library to parse it. While it does feel a little like 
reinvention, it's fast, direct, and scalable, and seems pretty sensible if you 
know your data will be dense arrays of numerical features.

-
jeremyfreeman.net
@thefreemanlab

On Apr 1, 2015, at 3:52 PM, Hector Yee  wrote:


Just using sc.textFile then a .map(decode)
Yes by default it is multiple files .. our training data is 1TB gzipped
into 5000 shards.

On Wed, Apr 1, 2015 at 12:32 PM, Ulanov, Alexander  wrote:


Thanks, sounds interesting! How do you load files to Spark? Did you
consider having multiple files instead of file lines?



*From:* Hector Yee [mailto:hector@gmail.com]
*Sent:* Wednesday, April 01, 2015 11:36 AM
*To:* Ulanov, Alexander
*Cc:* Evan R. Sparks; Stephen Boesch; dev@spark.apache.org

*Subject:* Re: Storing large data for MLlib machine learning



I use Thrift and then base64 encode the binary and save it as text file
lines that are snappy or gzip encoded.



It makes it very easy to copy small chunks locally and play with subsets
of the data and not have dependencies on HDFS / hadoop for server stuff for
example.





On Thu, Mar 26, 2015 at 2:51 PM, Ulanov, Alexander <
alexander.ula...@hp.com> wrote:

Thanks, Evan. What do you think about Protobuf? Twitter has a library to
manage protobuf files in hdfs https://github.com/twitter/elephant-bird


From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
Sent: Thursday, March 26, 2015 2:34 PM
To: Stephen Boesch
Cc: Ulanov, Alexander; dev@spark.apache.org
Subject: Re: Storing large data for MLlib machine learning

On binary file formats - I looked at HDF5+Spark a couple of years ago and
found it barely JVM-friendly and very Hadoop-unfriendly (e.g. the APIs
needed filenames as input, you couldn't pass it anything like an
InputStream). I don't know if it has gotten any better.

Parquet plays much more nicely and there are lots of spark-related
projects using it already. Keep in mind that it's column-oriented which
might impact performance - but basically you're going to want your features
in a byte array and deser should be pretty straightforward.

On Thu, Mar 26, 2015 at 2:26 PM, Stephen Boesch  wrote:
There are some convenience methods you might consider including:

  MLUtils.loadLibSVMFile

and   MLUtils.loadLabeledPoint

2015-03-26 14:16 GMT-07:00 Ulanov, Alexander :



Hi,

Could you suggest what would be the reasonable file format to store
feature vector data for machine learning in Spark MLlib? Are there any
best

practices for Spark?

My data is dense feature vectors with labels. Some of the requirements
are

that the format should be easy loaded/serialized, randomly accessible,
with

a small footprint (binary). I am considering Parquet, hdf5, protocol
buffer

(protobuf), but I have little to no experience with them, so any
suggestions would be really appreciated.

Best regards, Alexander





--

Yee Yang Li Hector
google.com/+HectorYee



--
Yee Yang Li Hector
google.com/+HectorYee



Re: Storing large data for MLlib machine learning

2015-04-01 Thread Jeremy Freeman
@Alexander, re: using flat binary and metadata, you raise excellent points! At 
least in our case, we decided on a specific endianness, but do end up storing 
some extremely minimal specification in a JSON file, and have written importers 
and exporters within our library to parse it. While it does feel a little like 
reinvention, it’s fast, direct, and scalable, and seems pretty sensible if you 
know your data will be dense arrays of numerical features.
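
A rough sketch of what this flat-binary route can look like with
sc.binaryRecords (the record layout, endianness, and path below are
assumptions for illustration, not the actual format described here):

    // spark-shell sketch; sc is provided by the shell.
    import java.nio.{ByteBuffer, ByteOrder}

    // Values that would come from the small JSON spec file:
    val numFeatures = 100                    // doubles per record
    val recordBytes = 8 * (numFeatures + 1)  // label + features, 8 bytes each

    // Fixed-length records map directly onto an RDD of byte arrays.
    val records = sc.binaryRecords("hdfs:///data/features.bin", recordBytes)

    val parsed = records.map { bytes =>
      val buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN)  // agreed endianness
      val label = buf.getDouble()
      val features = Array.fill(numFeatures)(buf.getDouble())
      (label, features)
    }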

-
jeremyfreeman.net
@thefreemanlab

On Apr 1, 2015, at 3:52 PM, Hector Yee  wrote:

> Just using sc.textFile then a .map(decode)
> Yes by default it is multiple files .. our training data is 1TB gzipped
> into 5000 shards.
> 
> On Wed, Apr 1, 2015 at 12:32 PM, Ulanov, Alexander 
> wrote:
> 
>> Thanks, sounds interesting! How do you load files to Spark? Did you
>> consider having multiple files instead of file lines?
>> 
>> 
>> 
>> *From:* Hector Yee [mailto:hector@gmail.com]
>> *Sent:* Wednesday, April 01, 2015 11:36 AM
>> *To:* Ulanov, Alexander
>> *Cc:* Evan R. Sparks; Stephen Boesch; dev@spark.apache.org
>> 
>> *Subject:* Re: Storing large data for MLlib machine learning
>> 
>> 
>> 
>> I use Thrift and then base64 encode the binary and save it as text file
>> lines that are snappy or gzip encoded.
>> 
>> 
>> 
>> It makes it very easy to copy small chunks locally and play with subsets
>> of the data and not have dependencies on HDFS / hadoop for server stuff for
>> example.
>> 
>> 
>> 
>> 
>> 
>> On Thu, Mar 26, 2015 at 2:51 PM, Ulanov, Alexander <
>> alexander.ula...@hp.com> wrote:
>> 
>> Thanks, Evan. What do you think about Protobuf? Twitter has a library to
>> manage protobuf files in hdfs https://github.com/twitter/elephant-bird
>> 
>> 
>> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
>> Sent: Thursday, March 26, 2015 2:34 PM
>> To: Stephen Boesch
>> Cc: Ulanov, Alexander; dev@spark.apache.org
>> Subject: Re: Storing large data for MLlib machine learning
>> 
>> On binary file formats - I looked at HDF5+Spark a couple of years ago and
>> found it barely JVM-friendly and very Hadoop-unfriendly (e.g. the APIs
>> needed filenames as input, you couldn't pass it anything like an
>> InputStream). I don't know if it has gotten any better.
>> 
>> Parquet plays much more nicely and there are lots of spark-related
>> projects using it already. Keep in mind that it's column-oriented which
>> might impact performance - but basically you're going to want your features
>> in a byte array and deser should be pretty straightforward.
>> 
>> On Thu, Mar 26, 2015 at 2:26 PM, Stephen Boesch  wrote:
>> There are some convenience methods you might consider including:
>> 
>>   MLUtils.loadLibSVMFile
>> 
>> and   MLUtils.loadLabeledPoint
>> 
>> 2015-03-26 14:16 GMT-07:00 Ulanov, Alexander :
>> 
>> 
>>> Hi,
>>> 
>>> Could you suggest what would be the reasonable file format to store
>>> feature vector data for machine learning in Spark MLlib? Are there any
>> best
>>> practices for Spark?
>>> 
>>> My data is dense feature vectors with labels. Some of the requirements
>> are
>>> that the format should be easy loaded/serialized, randomly accessible,
>> with
>>> a small footprint (binary). I am considering Parquet, hdf5, protocol
>> buffer
>>> (protobuf), but I have little to no experience with them, so any
>>> suggestions would be really appreciated.
>>> 
>>> Best regards, Alexander
>>> 
>> 
>> 
>> 
>> 
>> 
>> --
>> 
>> Yee Yang Li Hector 
>> 
>> *google.com/+HectorYee *
>> 
> 
> 
> 
> -- 
> Yee Yang Li Hector 
> *google.com/+HectorYee *



Unit test logs in Jenkins?

2015-04-01 Thread Marcelo Vanzin
Hey all,

Is there a way to access unit test logs in jenkins builds? e.g.,
core/target/unit-tests.log

That would be really helpful to debug build failures. The scalatest
output isn't all that helpful.

If that's currently not available, would it be possible to add those
logs as build artifacts?

-- 
Marcelo

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Storing large data for MLlib machine learning

2015-04-01 Thread Hector Yee
Just using sc.textFile then a .map(decode)
Yes by default it is multiple files .. our training data is 1TB gzipped
into 5000 shards.
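
A rough sketch of that pattern (the Thrift-generated class name and path are
placeholders, not the actual schema used here):

    // spark-shell sketch; assumes libthrift and the generated Thrift class are
    // on the classpath. TrainingExample is a placeholder for that class.
    import org.apache.commons.codec.binary.Base64
    import org.apache.thrift.TDeserializer
    import org.apache.thrift.protocol.TBinaryProtocol

    val lines = sc.textFile("hdfs:///data/examples/*.gz")  // gzip handled by the codec

    val examples = lines.mapPartitions { it =>
      // TDeserializer is not thread-safe, so create one per partition.
      val deser = new TDeserializer(new TBinaryProtocol.Factory())
      it.map { line =>
        val ex = new TrainingExample()
        deser.deserialize(ex, Base64.decodeBase64(line))
        ex
      }
    }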

On Wed, Apr 1, 2015 at 12:32 PM, Ulanov, Alexander 
wrote:

>  Thanks, sounds interesting! How do you load files to Spark? Did you
> consider having multiple files instead of file lines?
>
>
>
> *From:* Hector Yee [mailto:hector@gmail.com]
> *Sent:* Wednesday, April 01, 2015 11:36 AM
> *To:* Ulanov, Alexander
> *Cc:* Evan R. Sparks; Stephen Boesch; dev@spark.apache.org
>
> *Subject:* Re: Storing large data for MLlib machine learning
>
>
>
> I use Thrift and then base64 encode the binary and save it as text file
> lines that are snappy or gzip encoded.
>
>
>
> It makes it very easy to copy small chunks locally and play with subsets
> of the data and not have dependencies on HDFS / hadoop for server stuff for
> example.
>
>
>
>
>
> On Thu, Mar 26, 2015 at 2:51 PM, Ulanov, Alexander <
> alexander.ula...@hp.com> wrote:
>
> Thanks, Evan. What do you think about Protobuf? Twitter has a library to
> manage protobuf files in hdfs https://github.com/twitter/elephant-bird
>
>
> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
> Sent: Thursday, March 26, 2015 2:34 PM
> To: Stephen Boesch
> Cc: Ulanov, Alexander; dev@spark.apache.org
> Subject: Re: Storing large data for MLlib machine learning
>
> On binary file formats - I looked at HDF5+Spark a couple of years ago and
> found it barely JVM-friendly and very Hadoop-unfriendly (e.g. the APIs
> needed filenames as input, you couldn't pass it anything like an
> InputStream). I don't know if it has gotten any better.
>
> Parquet plays much more nicely and there are lots of spark-related
> projects using it already. Keep in mind that it's column-oriented which
> might impact performance - but basically you're going to want your features
> in a byte array and deser should be pretty straightforward.
>
> On Thu, Mar 26, 2015 at 2:26 PM, Stephen Boesch  wrote:
> There are some convenience methods you might consider including:
>
>MLUtils.loadLibSVMFile
>
> and   MLUtils.loadLabeledPoint
>
> 2015-03-26 14:16 GMT-07:00 Ulanov, Alexander :
>
>
> > Hi,
> >
> > Could you suggest what would be the reasonable file format to store
> > feature vector data for machine learning in Spark MLlib? Are there any
> best
> > practices for Spark?
> >
> > My data is dense feature vectors with labels. Some of the requirements
> are
> > that the format should be easy loaded/serialized, randomly accessible,
> with
> > a small footprint (binary). I am considering Parquet, hdf5, protocol
> buffer
> > (protobuf), but I have little to no experience with them, so any
> > suggestions would be really appreciated.
> >
> > Best regards, Alexander
> >
>
>
>
>
>
> --
>
> Yee Yang Li Hector 
>
> *google.com/+HectorYee *
>



-- 
Yee Yang Li Hector 
*google.com/+HectorYee *


RE: Using CUDA within Spark / boosting linear algebra

2015-04-01 Thread Ulanov, Alexander
FYI, I've added instructions to the netlib-java wiki, and Sam added a link to
them from the project's readme.md:
https://github.com/fommil/netlib-java/wiki/NVBLAS

Best regards, Alexander
-Original Message-
From: Xiangrui Meng [mailto:men...@gmail.com] 
Sent: Monday, March 30, 2015 2:43 PM
To: Sean Owen
Cc: Evan R. Sparks; Sam Halliday; dev@spark.apache.org; Ulanov, Alexander; 
jfcanny
Subject: Re: Using CUDA within Spark / boosting linear algebra

Hi Alex,

Since it is non-trivial to make nvblas work with netlib-java, it would be great 
if you can send the instructions to netlib-java as part of the README. 
Hopefully we don't need to modify netlib-java code to use nvblas.

Best,
Xiangrui

On Thu, Mar 26, 2015 at 9:54 AM, Sean Owen  wrote:
> The license issue is with libgfortran, rather than OpenBLAS.
>
> (FWIW I am going through the motions to get OpenBLAS set up by default 
> on CDH in the near future, and the hard part is just handling
> libgfortran.)
>
> On Thu, Mar 26, 2015 at 4:07 PM, Evan R. Sparks  wrote:
>> Alright Sam - you are the expert here. If the GPL issues are 
>> unavoidable, that's fine - what is the exact bit of code that is GPL?
>>
>> The suggestion to use OpenBLAS is not to say it's the best option, 
>> but that it's a *free, reasonable default* for many users - keep in 
>> mind the most common deployment for Spark/MLlib is on 64-bit linux on EC2[1].
>> Additionally, for many of the problems we're targeting, this 
>> reasonable default can provide a 1-2 orders of magnitude improvement 
>> in performance over the f2jblas implementation that netlib-java falls back 
>> on.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For 
> additional commands, e-mail: dev-h...@spark.apache.org
>


RE: Storing large data for MLlib machine learning

2015-04-01 Thread Ulanov, Alexander
Thanks, sounds interesting! How do you load files to Spark? Did you consider 
having multiple files instead of file lines?

From: Hector Yee [mailto:hector@gmail.com]
Sent: Wednesday, April 01, 2015 11:36 AM
To: Ulanov, Alexander
Cc: Evan R. Sparks; Stephen Boesch; dev@spark.apache.org
Subject: Re: Storing large data for MLlib machine learning

I use Thrift and then base64 encode the binary and save it as text file lines 
that are snappy or gzip encoded.

It makes it very easy to copy small chunks locally and play with subsets of the 
data and not have dependencies on HDFS / hadoop for server stuff for example.


On Thu, Mar 26, 2015 at 2:51 PM, Ulanov, Alexander  wrote:
Thanks, Evan. What do you think about Protobuf? Twitter has a library to manage 
protobuf files in hdfs https://github.com/twitter/elephant-bird


From: Evan R. Sparks 
[mailto:evan.spa...@gmail.com]
Sent: Thursday, March 26, 2015 2:34 PM
To: Stephen Boesch
Cc: Ulanov, Alexander; dev@spark.apache.org
Subject: Re: Storing large data for MLlib machine learning

On binary file formats - I looked at HDF5+Spark a couple of years ago and found 
it barely JVM-friendly and very Hadoop-unfriendly (e.g. the APIs needed 
filenames as input, you couldn't pass it anything like an InputStream). I don't 
know if it has gotten any better.

Parquet plays much more nicely and there are lots of spark-related projects 
using it already. Keep in mind that it's column-oriented which might impact 
performance - but basically you're going to want your features in a byte array 
and deser should be pretty straightforward.

On Thu, Mar 26, 2015 at 2:26 PM, Stephen Boesch  wrote:
There are some convenience methods you might consider including:

   MLUtils.loadLibSVMFile

and   MLUtils.loadLabeledPoint

2015-03-26 14:16 GMT-07:00 Ulanov, Alexander :

> Hi,
>
> Could you suggest what would be the reasonable file format to store
> feature vector data for machine learning in Spark MLlib? Are there any best
> practices for Spark?
>
> My data is dense feature vectors with labels. Some of the requirements are
> that the format should be easy loaded/serialized, randomly accessible, with
> a small footprint (binary). I am considering Parquet, hdf5, protocol buffer
> (protobuf), but I have little to no experience with them, so any
> suggestions would be really appreciated.
>
> Best regards, Alexander
>



--
Yee Yang Li Hector
google.com/+HectorYee


Re: Storing large data for MLlib machine learning

2015-04-01 Thread Hector Yee
I use Thrift and then base64 encode the binary and save it as text file
lines that are snappy or gzip encoded.

It makes it very easy to copy small chunks locally and play with subsets of
the data and not have dependencies on HDFS / hadoop for server stuff for
example.


On Thu, Mar 26, 2015 at 2:51 PM, Ulanov, Alexander 
wrote:

> Thanks, Evan. What do you think about Protobuf? Twitter has a library to
> manage protobuf files in hdfs https://github.com/twitter/elephant-bird
>
>
> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
> Sent: Thursday, March 26, 2015 2:34 PM
> To: Stephen Boesch
> Cc: Ulanov, Alexander; dev@spark.apache.org
> Subject: Re: Storing large data for MLlib machine learning
>
> On binary file formats - I looked at HDF5+Spark a couple of years ago and
> found it barely JVM-friendly and very Hadoop-unfriendly (e.g. the APIs
> needed filenames as input, you couldn't pass it anything like an
> InputStream). I don't know if it has gotten any better.
>
> Parquet plays much more nicely and there are lots of spark-related
> projects using it already. Keep in mind that it's column-oriented which
> might impact performance - but basically you're going to want your features
> in a byte array and deser should be pretty straightforward.
>
> On Thu, Mar 26, 2015 at 2:26 PM, Stephen Boesch  wrote:
> There are some convenience methods you might consider including:
>
>MLUtils.loadLibSVMFile
>
> and   MLUtils.loadLabeledPoint
>
> 2015-03-26 14:16 GMT-07:00 Ulanov, Alexander :
>
> > Hi,
> >
> > Could you suggest what would be the reasonable file format to store
> > feature vector data for machine learning in Spark MLlib? Are there any
> best
> > practices for Spark?
> >
> > My data is dense feature vectors with labels. Some of the requirements
> are
> > that the format should be easy loaded/serialized, randomly accessible,
> with
> > a small footprint (binary). I am considering Parquet, hdf5, protocol
> buffer
> > (protobuf), but I have little to no experience with them, so any
> > suggestions would be really appreciated.
> >
> > Best regards, Alexander
> >
>
>


-- 
Yee Yang Li Hector 
*google.com/+HectorYee *


RE: Stochastic gradient descent performance

2015-04-01 Thread Ulanov, Alexander
Sorry for bothering you again, but I think this is an important issue for the
applicability of SGD in Spark MLlib. Could the Spark developers please comment
on it?

-Original Message-
From: Ulanov, Alexander 
Sent: Monday, March 30, 2015 5:00 PM
To: dev@spark.apache.org
Subject: Stochastic gradient descent performance

Hi,

It seems to me that there is an overhead in the "runMiniBatchSGD" function of
MLlib's "GradientDescent". In particular, "sample" and "treeAggregate" might
take time that is an order of magnitude greater than the actual gradient
computation. For the mnist dataset of 60K instances with minibatch size = 0.001
(i.e. 60 samples), it takes 0.15 s to sample and 0.3 s to aggregate in local
mode with 1 data partition on a Core i5 processor. The actual gradient
computation takes 0.002 s. I searched through the Spark Jira and found that
there was recently an update for more efficient sampling (SPARK-3250) that is
already included in the Spark codebase. Is there a way to reduce the sampling
time and the local treeReduce by an order of magnitude?
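
(For reference, a crude spark-shell sketch of this kind of measurement on
synthetic data; the sizes, fraction, and timing helper below are placeholders,
not the actual benchmark described above:)

    import org.apache.spark.mllib.linalg.Vectors

    def time[T](label: String)(f: => T): T = {
      val t0 = System.nanoTime()
      val result = f
      println(f"$label: ${(System.nanoTime() - t0) / 1e9}%.3f s")
      result
    }

    val n = 784  // mnist-sized feature vectors
    val data = sc.parallelize(1 to 60000, 1)
      .map(_ => (0.0, Vectors.dense(Array.fill(n)(math.random))))
      .cache()
    data.count()  // materialize the cache before timing

    time("sample only") {
      data.sample(withReplacement = false, fraction = 0.001, seed = 42).count()
    }
    time("sample + treeAggregate") {
      data.sample(withReplacement = false, fraction = 0.001, seed = 42)
        .treeAggregate(0.0)((acc, p) => acc + p._2(0), _ + _)
    }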

Best regards, Alexander

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: One corrupt gzip in a directory of 100s

2015-04-01 Thread Ted Yu
bq. writing the output (to Amazon S3) failed

What's the value of "fs.s3.maxRetries"?
Increasing the value should help.
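
For reference, two ways to set it (the value 10 below is just an example):

    // In the driver, before writing to S3:
    sc.hadoopConfiguration.setInt("fs.s3.maxRetries", 10)

    // Or at submit time, via the spark.hadoop.* passthrough:
    //   spark-submit --conf spark.hadoop.fs.s3.maxRetries=10 ...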

Cheers

On Wed, Apr 1, 2015 at 8:34 AM, Romi Kuntsman  wrote:

> What about communication errors and not corrupted files?
> Both when reading input and when writing output.
> We currently experience a failure of the entire process, if the last stage
> of writing the output (to Amazon S3) failed because of a very temporary DNS
> resolution issue (easily resolved by retrying).
>
> *Romi Kuntsman*, *Big Data Engineer*
>  http://www.totango.com
>
> On Wed, Apr 1, 2015 at 12:58 PM, Gil Vernik  wrote:
>
> > I actually saw the same issue, where we analyzed some container with few
> > hundreds of GBs zip files - one was corrupted and Spark exit with
> > Exception on the entire job.
> > I like SPARK-6593, since it  can cover also additional cases, not just in
> > case of corrupted zip files.
> >
> >
> >
> > From:   Dale Richardson 
> > To: "dev@spark.apache.org" 
> > Date:   29/03/2015 11:48 PM
> > Subject:One corrupt gzip in a directory of 100s
> >
> >
> >
> > Recently had an incident reported to me where somebody was analysing a
> > directory of gzipped log files, and was struggling to load them into
> spark
> > because one of the files was corrupted - calling
> > sc.textFile('hdfs:///logs/*.gz') caused an IOException on the particular
> > executor that was reading that file, which caused the entire job to be
> > cancelled after the retry count was exceeded, without any way of catching
> > and recovering from the error.  While normally I think it is entirely
> > appropriate to stop execution if something is wrong with your input,
> > sometimes it is useful to analyse what you can get (as long as you are
> > aware that input has been skipped), and treat corrupt files as acceptable
> > losses.
> > To cater for this particular case I've added SPARK-6593 (PR at
> > https://github.com/apache/spark/pull/5250). Which adds an option
> > (spark.hadoop.ignoreInputErrors) to log exceptions raised by the hadoop
> > Input format, but to continue on with the next task.
> > Ideally in this case you would want to report the corrupt file paths back
> > to the master so they could be dealt with in a particular way (eg moved
> to
> > a separate directory), but that would require a public API
> > change/addition. I was pondering on an addition to Spark's hadoop API
> that
> > could report processing status back to the master via an optional
> > accumulator that collects filepath/Option(exception message) tuples so
> the
> > user has some idea of what files are being processed, and what files are
> > being skipped.
> > Regards, Dale.
> >
>


Re: One corrupt gzip in a directory of 100s

2015-04-01 Thread Romi Kuntsman
What about communication errors and not corrupted files?
Both when reading input and when writing output.
We currently experience a failure of the entire process if the last stage
of writing the output (to Amazon S3) fails because of a very temporary DNS
resolution issue (easily resolved by retrying).

*Romi Kuntsman*, *Big Data Engineer*
 http://www.totango.com

On Wed, Apr 1, 2015 at 12:58 PM, Gil Vernik  wrote:

> I actually saw the same issue, where we analyzed some container with few
> hundreds of GBs zip files - one was corrupted and Spark exit with
> Exception on the entire job.
> I like SPARK-6593, since it  can cover also additional cases, not just in
> case of corrupted zip files.
>
>
>
> From:   Dale Richardson 
> To: "dev@spark.apache.org" 
> Date:   29/03/2015 11:48 PM
> Subject:One corrupt gzip in a directory of 100s
>
>
>
> Recently had an incident reported to me where somebody was analysing a
> directory of gzipped log files, and was struggling to load them into spark
> because one of the files was corrupted - calling
> sc.textFile('hdfs:///logs/*.gz') caused an IOException on the particular
> executor that was reading that file, which caused the entire job to be
> cancelled after the retry count was exceeded, without any way of catching
> and recovering from the error.  While normally I think it is entirely
> appropriate to stop execution if something is wrong with your input,
> sometimes it is useful to analyse what you can get (as long as you are
> aware that input has been skipped), and treat corrupt files as acceptable
> losses.
> To cater for this particular case I've added SPARK-6593 (PR at
> https://github.com/apache/spark/pull/5250). Which adds an option
> (spark.hadoop.ignoreInputErrors) to log exceptions raised by the hadoop
> Input format, but to continue on with the next task.
> Ideally in this case you would want to report the corrupt file paths back
> to the master so they could be dealt with in a particular way (eg moved to
> a separate directory), but that would require a public API
> change/addition. I was pondering on an addition to Spark's hadoop API that
> could report processing status back to the master via an optional
> accumulator that collects filepath/Option(exception message) tuples so the
> user has some idea of what files are being processed, and what files are
> being skipped.
> Regards, Dale.
>


Re: One corrupt gzip in a directory of 100s

2015-04-01 Thread Gil Vernik
I actually saw the same issue, where we analyzed some container with few 
hundreds of GBs zip files - one was corrupted and Spark exit with 
Exception on the entire job.
I like SPARK-6593, since it  can cover also additional cases, not just in 
case of corrupted zip files.



From:   Dale Richardson 
To: "dev@spark.apache.org" 
Date:   29/03/2015 11:48 PM
Subject:One corrupt gzip in a directory of 100s



Recently I had an incident reported to me where somebody was analysing a
directory of gzipped log files and was struggling to load them into Spark
because one of the files was corrupted - calling
sc.textFile('hdfs:///logs/*.gz') caused an IOException on the particular
executor that was reading that file, which caused the entire job to be
cancelled after the retry count was exceeded, without any way of catching
and recovering from the error.  While normally I think it is entirely
appropriate to stop execution if something is wrong with your input,
sometimes it is useful to analyse what you can get (as long as you are
aware that input has been skipped), and treat corrupt files as acceptable
losses.
To cater for this particular case I've added SPARK-6593 (PR at
https://github.com/apache/spark/pull/5250), which adds an option
(spark.hadoop.ignoreInputErrors) to log exceptions raised by the Hadoop
input format but continue on with the next task.
Ideally in this case you would want to report the corrupt file paths back
to the master so they could be dealt with in a particular way (e.g. moved to
a separate directory), but that would require a public API
change/addition. I was pondering an addition to Spark's Hadoop API that
could report processing status back to the master via an optional
accumulator that collects filepath/Option(exception message) tuples, so the
user has some idea of what files are being processed and what files are
being skipped.
Regards, Dale.
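
A rough user-level workaround sketch until something like SPARK-6593 is
available: read each gzip as its own RDD and drop data from files that fail
to decompress. The glob and paths below are illustrative.

    import scala.util.Try
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Expand the glob ourselves so each file becomes its own RDD.
    val fs = FileSystem.get(sc.hadoopConfiguration)
    val paths = fs.globStatus(new Path("hdfs:///logs/*.gz")).map(_.getPath.toString)

    val logs = sc.union(paths.toSeq.map { p =>
      sc.textFile(p).mapPartitions { it =>
        // A corrupt gzip throws while the iterator is consumed, so materialize
        // the partition inside Try and drop it on failure. Fine for modestly
        // sized, single-partition .gz files; not a general-purpose solution.
        Try(it.toArray).getOrElse(Array.empty[String]).iterator
      }
    })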


Re: Spark 2.0: Rearchitecting Spark for Mobile, Local, Social

2015-04-01 Thread Tathagata Das
This is a significant effort that Reynold has undertaken, and I am super
glad to see that it's finally taking a concrete form. Would love to see
what the community thinks about the idea.

TD

On Wed, Apr 1, 2015 at 3:11 AM, Reynold Xin  wrote:

> Hi Spark devs,
>
> I've spent the last few months investigating the feasibility of
> re-architecting Spark for mobile platforms, considering the growing
> population of Android/iOS users. I'm happy to share with you my findings at
> https://issues.apache.org/jira/browse/SPARK-6646
>
> The tl;dr is that we should support running Spark on Android/iOS, and the
> best way to do this at the moment is to use Scala.js to compile Spark code
> into JavaScript, and then run it in Safari or Chrome (and even node.js
> potentially for servers).
>
> If you are on your phones right now and prefer reading a blog post rather
> than a PDF file, you can read more about the design doc at
>
> https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html
>
>
> This is done in collaboration with TD, Xiangrui, Patrick. Look forward to
> your feedback!
>


Re: Spark 2.0: Rearchitecting Spark for Mobile, Local, Social

2015-04-01 Thread Kushal Datta
Reynold, what's the idea behind using LLVM?

On Wed, Apr 1, 2015 at 12:31 AM, Akhil Das 
wrote:

> Nice try :)
>
> Thanks
> Best Regards
>
> On Wed, Apr 1, 2015 at 12:41 PM, Reynold Xin  wrote:
>
> > Hi Spark devs,
> >
> > I've spent the last few months investigating the feasibility of
> > re-architecting Spark for mobile platforms, considering the growing
> > population of Android/iOS users. I'm happy to share with you my findings
> at
> > https://issues.apache.org/jira/browse/SPARK-6646
> >
> > The tl;dr is that we should support running Spark on Android/iOS, and the
> > best way to do this at the moment is to use Scala.js to compile Spark
> code
> > into JavaScript, and then run it in Safari or Chrome (and even node.js
> > potentially for servers).
> >
> > If you are on your phones right now and prefer reading a blog post rather
> > than a PDF file, you can read more about the design doc at
> >
> >
> https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html
> >
> >
> > This is done in collaboration with TD, Xiangrui, Patrick. Look forward to
> > your feedback!
> >
>


Re: Spark 2.0: Rearchitecting Spark for Mobile, Local, Social

2015-04-01 Thread Akhil Das
Nice try :)

Thanks
Best Regards

On Wed, Apr 1, 2015 at 12:41 PM, Reynold Xin  wrote:

> Hi Spark devs,
>
> I've spent the last few months investigating the feasibility of
> re-architecting Spark for mobile platforms, considering the growing
> population of Android/iOS users. I'm happy to share with you my findings at
> https://issues.apache.org/jira/browse/SPARK-6646
>
> The tl;dr is that we should support running Spark on Android/iOS, and the
> best way to do this at the moment is to use Scala.js to compile Spark code
> into JavaScript, and then run it in Safari or Chrome (and even node.js
> potentially for servers).
>
> If you are on your phones right now and prefer reading a blog post rather
> than a PDF file, you can read more about the design doc at
>
> https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html
>
>
> This is done in collaboration with TD, Xiangrui, Patrick. Look forward to
> your feedback!
>


Spark 2.0: Rearchitecting Spark for Mobile, Local, Social

2015-04-01 Thread Reynold Xin
Hi Spark devs,

I've spent the last few months investigating the feasibility of
re-architecting Spark for mobile platforms, considering the growing
population of Android/iOS users. I'm happy to share with you my findings at
https://issues.apache.org/jira/browse/SPARK-6646

The tl;dr is that we should support running Spark on Android/iOS, and the
best way to do this at the moment is to use Scala.js to compile Spark code
into JavaScript, and then run it in Safari or Chrome (and even node.js
potentially for servers).

If you are on your phones right now and prefer reading a blog post rather
than a PDF file, you can read more about the design doc at
https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html


This is done in collaboration with TD, Xiangrui, Patrick. Look forward to
your feedback!