Re: Get size of intermediate results

2016-10-20 Thread Egor Pahomov
I needed the same thing for debugging, and I just added a "count" action in
debug mode for every step I was interested in. It's very time-consuming, but I
don't debug very often.
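A minimal sketch of that approach, which also reads the cached size back from the driver (`input` and `transform` are placeholders for your own pipeline step, and this assumes a live SparkContext `sc`):

```scala
import org.apache.spark.storage.StorageLevel

// Persist the intermediate result, then force materialization with an action.
val intermediate = input.map(transform).persist(StorageLevel.MEMORY_AND_DISK)
intermediate.count()  // the "count" action in debug mode

// The storage status exposes the cached size (in bytes) per RDD.
sc.getRDDStorageInfo
  .find(_.id == intermediate.id)
  .foreach(info => println(s"memory: ${info.memSize} B, disk: ${info.diskSize} B"))
```

The same numbers are visible on the Storage tab of the web UI, so for one-off debugging the UI may be enough.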

2016-10-20 2:17 GMT-07:00 Andreas Hechenberger :

> Hey awesome Spark-Dev's :)
>
> i am new to spark and i read a lot but now i am stuck :( so please be
> kind, if i ask silly questions.
>
> I want to analyze some algorithms and strategies in spark and for one
> experiment i want to know the size of the intermediate results between
> iterations/jobs. Some of them are written to disk and some are in the
> cache, i guess. I am not afraid of looking into the code (i already did)
> but its complex and have no clue where to start :( It would be nice if
> someone can point me in the right direction or where i can find more
> information about the structure of spark core devel :)
>
> I already setup the devel environment and i can compile spark. It was
> really awesome how smoothly the setup was :) Thx for that.
>
> Servus
> Andy
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


-- 


*Sincerely yours, Egor Pakhomov*


Re: [VOTE] Release Apache Spark 2.0.0 (RC1)

2016-06-27 Thread Egor Pahomov
-1: SPARK-16228 [SQL] - "Percentile" needs an explicit cast to double,
otherwise it throws an error. I cannot move my existing 100500 queries to
2.0 transparently.
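For reference, this is the shape of query affected — the explicit cast is the workaround (table and column names here are made up, not from my actual workload):

```scala
// Ran as-is on an integer column in 1.6; against 2.0.0-RC1 the
// CAST is reportedly required (sqlContext and the table are assumed):
sqlContext.sql(
  "SELECT percentile(CAST(latency_ms AS DOUBLE), 0.95) FROM requests")
```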

2016-06-24 11:52 GMT-07:00 Matt Cheah :

> -1 because of SPARK-16181 which is a correctness regression from 1.6.
> Looks like the patch is ready though:
> https://github.com/apache/spark/pull/13884 – it would be ideal for this
> patch to make it into the release.
>
> -Matt Cheah
>
> From: Nick Pentreath 
> Date: Friday, June 24, 2016 at 4:37 AM
> To: "dev@spark.apache.org" 
> Subject: Re: [VOTE] Release Apache Spark 2.0.0 (RC1)
>
> I'm getting the following when trying to run ./dev/run-tests (not
> happening on master) from the extracted source tar. Anyone else seeing
> this?
>
> error: Could not access 'fc0a1475ef'
> **
> File "./dev/run-tests.py", line 69, in
> __main__.identify_changed_files_from_git_commits
> Failed example:
> [x.name
> 
> for x in determine_modules_for_files(
> identify_changed_files_from_git_commits("fc0a1475ef",
> target_ref="5da21f07"))]
> Exception raised:
> Traceback (most recent call last):
>   File "/Users/nick/miniconda2/lib/python2.7/doctest.py", line 1315,
> in __run
> compileflags, 1) in test.globs
>   File " __main__.identify_changed_files_from_git_commits[0]>", line 1, in 
> [x.name
> 
> for x in determine_modules_for_files(
> identify_changed_files_from_git_commits("fc0a1475ef",
> target_ref="5da21f07"))]
>   File "./dev/run-tests.py", line 86, in
> identify_changed_files_from_git_commits
> universal_newlines=True)
>   File "/Users/nick/miniconda2/lib/python2.7/subprocess.py", line 573,
> in check_output
> raise CalledProcessError(retcode, cmd, output=output)
> CalledProcessError: Command '['git', 'diff', '--name-only',
> 'fc0a1475ef', '5da21f07']' returned non-zero exit status 1
> error: Could not access '50a0496a43'
> **
> File "./dev/run-tests.py", line 71, in
> __main__.identify_changed_files_from_git_commits
> Failed example:
> 'root' in [x.name
> 
> for x in determine_modules_for_files(
>  identify_changed_files_from_git_commits("50a0496a43",
> target_ref="6765ef9"))]
> Exception raised:
> Traceback (most recent call last):
>   File "/Users/nick/miniconda2/lib/python2.7/doctest.py", line 1315,
> in __run
> compileflags, 1) in test.globs
>   File " __main__.identify_changed_files_from_git_commits[1]>", line 1, in 
> 'root' in [x.name
> 
> for x in determine_modules_for_files(
>  identify_changed_files_from_git_commits("50a0496a43",
> target_ref="6765ef9"))]
>   File "./dev/run-tests.py", line 86, in
> identify_changed_files_from_git_commits
> universal_newlines=True)
>   File "/Users/nick/miniconda2/lib/python2.7/subprocess.py", line 573,
> in check_output
> raise CalledProcessError(retcode, cmd, output=output)
> CalledProcessError: Command '['git', 'diff', '--name-only',
> '50a0496a43', '6765ef9']' returned non-zero exit status 1
> **
> 1 items had failures:
>2 of   2 in __main__.identify_changed_files_from_git_commits
> ***Test Failed*** 2 failures.
>
>
>
> On Fri, 24 Jun 2016 at 06:59 Yin Huai  wrote:
>
>> -1 because of https://issues.apache.org/jira/browse/SPARK-16121
>> .
>>
>>
>> This jira was resolved after 2.0.0-RC1 was cut. Without the fix, Spark
>> SQL effectively only uses the driver to list files when loading datasets
>> and the driver-side file 

Re: Spark Assembly jar ?

2016-06-14 Thread Egor Pahomov
It's strange to me that having and supporting a fat jar was never considered an
important thing. Our scenario is this: we have a big application, where Spark is
just another library for data processing. So we cannot create a small jar and
feed it to the spark-submit scripts - we need to call Spark from the
application, and having the fat jar as a Maven dependency is perfect for that.
There is some Spark installed on the cluster (whatever Cloudera put there), but
we often need to patch Spark for our needs, so we have to bring everything with
us. Different departments use different Spark versions, so we cannot easily
share jars on the cluster. Yes, there are some disadvantages, but the
flexibility of changing the Spark build and deploying it outweighs them.

So we will probably patch the poms as usual to create a fat jar.
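For anyone in the same situation, this is roughly the kind of pom patch meant above — a shade-plugin sketch, not Spark's actual build config (the plugin version is illustrative):

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>2.4.3</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <!-- Merge reference.conf files so Akka/Spark config survives shading -->
        <transformers>
          <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
            <resource>reference.conf</resource>
          </transformer>
        </transformers>
      </configuration>
    </execution>
  </executions>
</plugin>
```

Without the `reference.conf` transformer the shaded jar typically fails at startup, because each dependency's config file overwrites the others.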

2016-06-14 12:23 GMT-07:00 Reynold Xin :

> You just need to run normal packaging and all the scripts are now setup to
> run without the assembly jars.
>
>
> On Tuesday, June 14, 2016, Franklyn D'souza 
> wrote:
>
>> Just wondering where the spark-assembly jar has gone in 2.0. i've been
>> reading that its been removed but i'm not sure what the new workflow is .
>>
>


-- 


*Sincerely yours, Egor Pakhomov*


Return binary mode in ThriftServer

2016-06-13 Thread Egor Pahomov
In May, due to SPARK-15095, binary mode was "removed" (the code is there, but
you cannot turn it on) from Spark 2.0. In 1.6.1 binary was the default, and in
2.0.0-preview it is gone. It's really annoying:

   - I cannot use Tableau+Spark anymore.
   - I need to change the connection URL in the SQL client for every analyst in
   my organization, and with Squirrel I'm experiencing problems with that.
   - We have parts of our infrastructure that are connected to the data
   infrastructure through ThriftServer, and of course the format was binary.
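For context, the two transport modes need different JDBC URLs, which is exactly the per-client change described above (host, ports, and httpPath below are common defaults used for illustration, not values from this thread):

```
# binary mode (the 1.6.1 default)
jdbc:hive2://thrift-host:10000/default

# http mode (what 2.0.0-preview leaves you with)
jdbc:hive2://thrift-host:10001/default;transportMode=http;httpPath=cliservice
```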

I've created a ticket to get binary mode back
(https://issues.apache.org/jira/browse/SPARK-15934), but that's not the
point. I experienced this problem a month ago but didn't do anything about it,
because I believed I was just doing something wrong. But the documentation was
released recently, and it contained no information about this change, which
made me start digging.

Most of what I describe is just annoying, but the new Tableau+Spark
incompatibility is, I believe, a big deal. Maybe I'm wrong and there are ways
to make things work; I just wouldn't expect the move to 2.0.0 to be so
time-consuming.

My point: do we have any guidelines for making such radical changes?

-- 


*Sincerely yours, Egor Pakhomov*


Re: [VOTE] Release Apache Spark 1.6.1 (RC1)

2016-03-06 Thread Egor Pahomov
+1

Spark ODBC server is fine, SQL is fine.

2016-03-03 12:09 GMT-08:00 Yin Yang :

> Skipping docker tests, the rest are green:
>
> [INFO] Spark Project External Kafka ... SUCCESS [01:28
> min]
> [INFO] Spark Project Examples . SUCCESS [02:59
> min]
> [INFO] Spark Project External Kafka Assembly .. SUCCESS [
> 11.680 s]
> [INFO]
> 
> [INFO] BUILD SUCCESS
> [INFO]
> 
> [INFO] Total time: 02:16 h
> [INFO] Finished at: 2016-03-03T11:17:07-08:00
> [INFO] Final Memory: 152M/4062M
>
> On Thu, Mar 3, 2016 at 8:55 AM, Yin Yang  wrote:
>
>> When I ran test suite using the following command:
>>
>> build/mvn clean -Phive -Phive-thriftserver -Pyarn -Phadoop-2.6
>> -Dhadoop.version=2.7.0 package
>>
>> I got failure in Spark Project Docker Integration Tests :
>>
>> 16/03/02 17:36:46 INFO RemoteActorRefProvider$RemotingTerminator: Remote
>> daemon shut down; proceeding with flushing remote transports.
>> ^[[31m*** RUN ABORTED ***^[[0m
>> ^[[31m  com.spotify.docker.client.DockerException:
>> java.util.concurrent.ExecutionException:
>> com.spotify.docker.client.shaded.javax.ws.rs.ProcessingException: java.io.
>>IOException: No such file or directory^[[0m
>> ^[[31m  at
>> com.spotify.docker.client.DefaultDockerClient.propagate(DefaultDockerClient.java:1141)^[[0m
>> ^[[31m  at
>> com.spotify.docker.client.DefaultDockerClient.request(DefaultDockerClient.java:1082)^[[0m
>> ^[[31m  at
>> com.spotify.docker.client.DefaultDockerClient.ping(DefaultDockerClient.java:281)^[[0m
>> ^[[31m  at
>> org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:76)^[[0m
>> ^[[31m  at
>> org.scalatest.BeforeAndAfterAll$class.beforeAll(BeforeAndAfterAll.scala:187)^[[0m
>> ^[[31m  at
>> org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:58)^[[0m
>> ^[[31m  at
>> org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:253)^[[0m
>> ^[[31m  at
>> org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.run(DockerJDBCIntegrationSuite.scala:58)^[[0m
>> ^[[31m  at
>> org.scalatest.Suite$class.callExecuteOnSuite$1(Suite.scala:1492)^[[0m
>> ^[[31m  at
>> org.scalatest.Suite$$anonfun$runNestedSuites$1.apply(Suite.scala:1528)^[[0m
>> ^[[31m  ...^[[0m
>> ^[[31m  Cause: java.util.concurrent.ExecutionException:
>> com.spotify.docker.client.shaded.javax.ws.rs.ProcessingException:
>> java.io.IOException: No such file or directory^[[0m
>> ^[[31m  at
>> jersey.repackaged.com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299)^[[0m
>> ^[[31m  at
>> jersey.repackaged.com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286)^[[0m
>> ^[[31m  at
>> jersey.repackaged.com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)^[[0m
>> ^[[31m  at
>> com.spotify.docker.client.DefaultDockerClient.request(DefaultDockerClient.java:1080)^[[0m
>> ^[[31m  at
>> com.spotify.docker.client.DefaultDockerClient.ping(DefaultDockerClient.java:281)^[[0m
>> ^[[31m  at
>> org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:76)^[[0m
>> ^[[31m  at
>> org.scalatest.BeforeAndAfterAll$class.beforeAll(BeforeAndAfterAll.scala:187)^[[0m
>> ^[[31m  at
>> org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:58)^[[0m
>> ^[[31m  at
>> org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:253)^[[0m
>> ^[[31m  at
>> org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.run(DockerJDBCIntegrationSuite.scala:58)^[[0m
>> ^[[31m  ...^[[0m
>> ^[[31m  Cause:
>> com.spotify.docker.client.shaded.javax.ws.rs.ProcessingException:
>> java.io.IOException: No such file or directory^[[0m
>> ^[[31m  at
>> org.glassfish.jersey.apache.connector.ApacheConnector.apply(ApacheConnector.java:481)^[[0m
>> ^[[31m  at
>> org.glassfish.jersey.apache.connector.ApacheConnector$1.run(ApacheConnector.java:491)^[[0m
>> ^[[31m  at
>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)^[[0m
>> ^[[31m  at java.util.concurrent.FutureTask.run(FutureTask.java:262)^[[0m
>> ^[[31m  at
>> jersey.repackaged.com.google.common.util.concurrent.MoreExecutors$DirectExecutorService.execute(MoreExecutors.java:299)^[[0m
>> ^[[31m  at
>> java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:110)^[[0m
>> ^[[31m  at
>> jersey.repackaged.com.google.common.util.concurrent.AbstractListeningExecutorService.submit(AbstractListeningExecutorService.java:50)^[[0m
>> ^[[31m  at
>> jersey.repackaged.com.google.common.util.concurrent.AbstractListeningExecutorService.submit(AbstractListeningExecutorService.java:37)^[[0m
>> ^[[31m  at
>> 

Re: [VOTE] Release Apache Spark 1.5.2 (RC2)

2015-11-04 Thread Egor Pahomov
+1

Things our infrastructure uses and that I checked:

Dynamic allocation
Spark ODBC server
Reading json
Writing parquet
SQL queries (Hive context)
Running on CDH


2015-11-04 9:03 GMT-08:00 Sean Owen :

> As usual the signatures and licenses and so on look fine. I continue
> to get the same test failures on Ubuntu in Java 7/8:
>
> - Unpersisting HttpBroadcast on executors only in distributed mode ***
> FAILED ***
>
> But I continue to assume that's specific to tests and/or Ubuntu and/or
> the build profile, since I don't see any evidence of this in other
> builds on Jenkins. It's not a change from previous behavior, though it
> doesn't always happen either.
>
> On Tue, Nov 3, 2015 at 11:22 PM, Reynold Xin  wrote:
> > Please vote on releasing the following candidate as Apache Spark version
> > 1.5.2. The vote is open until Sat Nov 7, 2015 at 00:00 UTC and passes if
> a
> > majority of at least 3 +1 PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Spark 1.5.2
> > [ ] -1 Do not release this package because ...
> >
> >
> > The release fixes 59 known issues in Spark 1.5.1, listed here:
> > http://s.apache.org/spark-1.5.2
> >
> > The tag to be voted on is v1.5.2-rc2:
> > https://github.com/apache/spark/releases/tag/v1.5.2-rc2
> >
> > The release files, including signatures, digests, etc. can be found at:
> > http://people.apache.org/~pwendell/spark-releases/spark-1.5.2-rc2-bin/
> >
> > Release artifacts are signed with the following key:
> > https://people.apache.org/keys/committer/pwendell.asc
> >
> > The staging repository for this release can be found at:
> > - as version 1.5.2-rc2:
> > https://repository.apache.org/content/repositories/orgapachespark-1153
> > - as version 1.5.2:
> > https://repository.apache.org/content/repositories/orgapachespark-1152
> >
> > The documentation corresponding to this release can be found at:
> > http://people.apache.org/~pwendell/spark-releases/spark-1.5.2-rc2-docs/
> >
> >
> > ===
> > How can I help test this release?
> > ===
> > If you are a Spark user, you can help us test this release by taking an
> > existing Spark workload and running on this release candidate, then
> > reporting any regressions.
> >
> > 
> > What justifies a -1 vote for this release?
> > 
> > -1 vote should occur for regressions from Spark 1.5.1. Bugs already
> present
> > in 1.5.1 will not block this release.
> >
> >
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


-- 

*Sincerely yours, Egor Pakhomov,*

*AnchorFree*


Re: Workflow Scheduler for Spark

2014-09-28 Thread Egor Pahomov
I created JIRA https://issues.apache.org/jira/browse/SPARK-3714 and a design doc
https://docs.google.com/document/d/1q2Q8Ux-6uAkH7wtLJpc3jz-GfrDEjlbWlXtf20hvguk/edit?usp=sharing
on this matter.

2014-09-17 22:28 GMT+04:00 Reynold Xin r...@databricks.com:

 There might've been some misunderstanding. I was referring to the MLlib
 pipeline design doc when I said the design doc was posted, in response to
 the first paragraph of your original email.


 On Wed, Sep 17, 2014 at 2:47 AM, Egor Pahomov pahomov.e...@gmail.com
 wrote:

  It's doc about MLLib pipeline functionality. What about oozie-like
  workflow?
 
  2014-09-17 13:08 GMT+04:00 Mark Hamstra m...@clearstorydata.com:
 
   See https://issues.apache.org/jira/browse/SPARK-3530 and this doc,
   referenced in that JIRA:
  
  
  
 
 https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit?usp=sharing
  
   On Wed, Sep 17, 2014 at 2:00 AM, Egor Pahomov pahomov.e...@gmail.com
   wrote:
  
   I have problems using Oozie. For example it doesn't sustain spark
  context
   like ooyola job server does. Other than GUI interfaces like HUE it's
  hard
   to work with - scoozie stopped in development year ago(I spoke with
   creator) and oozie xml very hard to write.
   Oozie still have all documentation and code in MR model rather than in
   yarn
   model. And based on it's current speed of development I can't expect
   radical changes in nearest future. There is no Databricks for oozie,
   which would have people on salary to develop this kind of radical
  changes.
   It's dinosaur.
  
   Reunold, can you help finding this doc? Do you mean just pipelining
  spark
   code or additional logic of persistence tasks, job server, task retry,
   data
   availability and extra?
  
  
   2014-09-17 11:21 GMT+04:00 Reynold Xin r...@databricks.com:
  
Hi Egor,
   
I think the design doc for the pipeline feature has been posted.
   
For the workflow, I believe Oozie actually works fine with Spark if
  you
want some external workflow system. Do you have any trouble using
  that?
   
   
On Tue, Sep 16, 2014 at 11:45 PM, Egor Pahomov 
  pahomov.e...@gmail.com
wrote:
   
There are two things we(Yandex) miss in Spark: MLlib good
  abstractions
   and
good workflow job scheduler. From threads Adding abstraction in
  MlLib
and
[mllib] State of Multi-Model training I got the idea, that
  databricks
working on it and we should wait until first post doc, which would
  lead
us.
What about workflow scheduler? Is there anyone already working on
 it?
   Does
anyone have a plan on doing it?
   
P.S. We thought that MLlib abstractions about multiple algorithms
 run
   with
same data would need such scheduler, which would rerun algorithm in
   case
of
failure. I understand, that spark provide fault tolerance out of
 the
   box,
but we found some Ooozie-like scheduler more reliable for such
 long
living workflows.
   
--
   
   
   
*Sincerely yoursEgor PakhomovScala Developer, Yandex*
   
   
   
  
  
   --
  
  
  
   *Sincerely yoursEgor PakhomovScala Developer, Yandex*
  
  
  
 
 
  --
 
 
 
  *Sincerely yoursEgor PakhomovScala Developer, Yandex*
 




-- 



*Sincerely yours, Egor Pakhomov, Scala Developer, Yandex*


Re: MLlib enable extension of the LabeledPoint class

2014-09-25 Thread Egor Pahomov
@Yu Ishikawa,

I think the right place for such a discussion is
https://issues.apache.org/jira/browse/SPARK-3573


2014-09-25 18:02 GMT+04:00 Yu Ishikawa yuu.ishikawa+sp...@gmail.com:

 Hi Niklas Wilcke,

 As you said, it is difficult to extend LabeledPoint class in
 mllib.regression.
 Do you want to extend LabeledPoint class in order to use any other type
 exclude Double type?
 If you have your code on Github, could you show us it? I want to know what
 you want to do.

  Community
 By the way, I think LabeledPoint class is very useful exclude
 mllib.regression package.
 Especially, some estimation algorithms should use a type for the labels
 exclude Double type,
 such as String type. The common generics labeled-point class would be
 useful
 in MLlib.
 I'd like to get your thoughts on it.

 For example,
 ```
 abstract class LabeledPoint[T](label: T, features: Vector)
 ```

 thanks






 -
 -- Yu Ishikawa
 --
 View this message in context:
 http://apache-spark-developers-list.1001551.n3.nabble.com/MLlib-enable-extension-of-the-LabeledPoint-class-tp8546p8549.html
 Sent from the Apache Spark Developers List mailing list archive at
 Nabble.com.

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




-- 



*Sincerely yours, Egor Pakhomov, Scala Developer, Yandex*


Re: Workflow Scheduler for Spark

2014-09-17 Thread Egor Pahomov
That doc is about the MLlib pipeline functionality. What about an Oozie-like
workflow?

2014-09-17 13:08 GMT+04:00 Mark Hamstra m...@clearstorydata.com:

 See https://issues.apache.org/jira/browse/SPARK-3530 and this doc,
 referenced in that JIRA:


 https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit?usp=sharing

 On Wed, Sep 17, 2014 at 2:00 AM, Egor Pahomov pahomov.e...@gmail.com
 wrote:

 I have problems using Oozie. For example it doesn't sustain spark context
 like ooyola job server does. Other than GUI interfaces like HUE it's hard
 to work with - scoozie stopped in development year ago(I spoke with
 creator) and oozie xml very hard to write.
 Oozie still have all documentation and code in MR model rather than in
 yarn
 model. And based on it's current speed of development I can't expect
 radical changes in nearest future. There is no Databricks for oozie,
 which would have people on salary to develop this kind of radical changes.
 It's dinosaur.

 Reunold, can you help finding this doc? Do you mean just pipelining spark
 code or additional logic of persistence tasks, job server, task retry,
 data
 availability and extra?


 2014-09-17 11:21 GMT+04:00 Reynold Xin r...@databricks.com:

  Hi Egor,
 
  I think the design doc for the pipeline feature has been posted.
 
  For the workflow, I believe Oozie actually works fine with Spark if you
  want some external workflow system. Do you have any trouble using that?
 
 
  On Tue, Sep 16, 2014 at 11:45 PM, Egor Pahomov pahomov.e...@gmail.com
  wrote:
 
  There are two things we(Yandex) miss in Spark: MLlib good abstractions
 and
  good workflow job scheduler. From threads Adding abstraction in MlLib
  and
  [mllib] State of Multi-Model training I got the idea, that databricks
  working on it and we should wait until first post doc, which would lead
  us.
  What about workflow scheduler? Is there anyone already working on it?
 Does
  anyone have a plan on doing it?
 
  P.S. We thought that MLlib abstractions about multiple algorithms run
 with
  same data would need such scheduler, which would rerun algorithm in
 case
  of
  failure. I understand, that spark provide fault tolerance out of the
 box,
  but we found some Ooozie-like scheduler more reliable for such long
  living workflows.
 
  --
 
 
 
  *Sincerely yoursEgor PakhomovScala Developer, Yandex*
 
 
 


 --



 *Sincerely yoursEgor PakhomovScala Developer, Yandex*





-- 



*Sincerely yours, Egor Pakhomov, Scala Developer, Yandex*


Re: Adding abstraction in MLlib

2014-09-15 Thread Egor Pahomov
It's good that Databricks is working on this issue! However, the current
process is not very clear to an outsider.

   - The last update on this ticket was August 5. If all this time there was
   active development, I have concerns that without feedback from the community
   for such a long time the development can go the wrong way.
   - Even if it will be one great big patch, introducing the new interfaces to
   the community as soon as possible would allow us to start working on our
   pipeline code. It would allow us to write algorithms in the new paradigm
   instead of in the lack of any paradigm, like it was before. It would also
   allow us to help you transfer old code to the new paradigm.

My main point: shorter iterations with more transparency.

I think it would be a good idea to create a pull request with the code you have
so far, even if it doesn't pass tests, so we can comment on it before it is
formulated in a design doc.


2014-09-13 0:00 GMT+04:00 Patrick Wendell pwend...@gmail.com:

 We typically post design docs on JIRA's before major work starts. For
 instance, pretty sure SPARk-1856 will have a design doc posted
 shortly.

 On Fri, Sep 12, 2014 at 12:10 PM, Erik Erlandson e...@redhat.com wrote:
 
  Are interface designs being captured anywhere as documents that the
 community can follow along with as the proposals evolve?
 
  I've worked on other open source projects where design docs were
 published as living documents (e.g. on google docs, or etherpad, but the
 particular mechanism isn't crucial).   FWIW, I found that to be a good way
 to work in a community environment.
 
 
  - Original Message -
  Hi Egor,
 
  Thanks for the feedback! We are aware of some of the issues you
  mentioned and there are JIRAs created for them. Specifically, I'm
  pushing out the design on pipeline features and algorithm/model
  parameters this week. We can move our discussion to
  https://issues.apache.org/jira/browse/SPARK-1856 .
 
  It would be nice to make tests against interfaces. But it definitely
  needs more discussion before making PRs. For example, we discussed the
  learning interfaces in Christoph's PR
  (https://github.com/apache/spark/pull/2137/) but it takes time to
  reach a consensus, especially on interfaces. Hopefully all of us could
  benefit from the discussion. The best practice is to break down the
  proposal into small independent piece and discuss them on the JIRA
  before submitting PRs.
 
  For performance tests, there is a spark-perf package
  (https://github.com/databricks/spark-perf) and we added performance
  tests for MLlib in v1.1. But definitely more work needs to be done.
 
  The dev-list may not be a good place for discussion on the design,
  could you create JIRAs for each of the issues you pointed out, and we
  track the discussion on JIRA? Thanks!
 
  Best,
  Xiangrui
 
  On Fri, Sep 12, 2014 at 10:45 AM, Reynold Xin r...@databricks.com
 wrote:
   Xiangrui can comment more, but I believe Joseph and him are actually
   working on standardize interface and pipeline feature for 1.2 release.
  
   On Fri, Sep 12, 2014 at 8:20 AM, Egor Pahomov pahomov.e...@gmail.com
 
   wrote:
  
   Some architect suggestions on this matter -
   https://github.com/apache/spark/pull/2371
  
   2014-09-12 16:38 GMT+04:00 Egor Pahomov pahomov.e...@gmail.com:
  
Sorry, I misswrote  - I meant learners part of framework - models
already
exists.
   
2014-09-12 15:53 GMT+04:00 Christoph Sawade 
christoph.saw...@googlemail.com:
   
I totally agree, and we discovered also some drawbacks with the
classification models implementation that are based on GLMs:
   
- There is no distinction between predicting scores, classes, and
calibrated scores (probabilities). For these models it is common
 to
have
access to all of them and the prediction function
 ``predict``should be
consistent and stateless. Currently, the score is only available
 after
removing the threshold from the model.
- There is no distinction between multinomial and binomial
classification. For multinomial problems, it is necessary to
 handle
multiple weight vectors and multiple confidences.
- Models are not serialisable, which makes it hard to use them in
practise.
   
I started a pull request [1] some time ago. I would be happy to
continue
the discussion and clarify the interfaces, too!
   
Cheers, Christoph
   
[1] https://github.com/apache/spark/pull/2137/
   
2014-09-12 11:11 GMT+02:00 Egor Pahomov pahomov.e...@gmail.com:
   
Here in Yandex, during implementation of gradient boosting in
 spark
and
creating our ML tool for internal use, we found next serious
 problems
   in
MLLib:
   
   
   - There is no Regression/Classification model abstraction. We
 were
   building abstract data processing pipelines, which should
 work just
with
   some regression - exact algorithm specified outside this code.
   There
is no
   abstraction

Adding abstraction in MLlib

2014-09-12 Thread Egor Pahomov
Here at Yandex, while implementing gradient boosting in Spark and creating our
ML tool for internal use, we found the following serious problems in MLlib:


   - There is no Regression/Classification model abstraction. We were
   building abstract data processing pipelines, which should work with just
   some regression - the exact algorithm is specified outside this code. There
   is no abstraction that would allow me to do that. *(This is the main reason
   for all further problems.)*
   - There is no common practice in MLlib for testing algorithms: every
   model generates its own random test data. There are no easily extractable
   test cases applicable to other algorithms. There are no benchmarks for
   comparing algorithms. After implementing a new algorithm it's very hard to
   understand how it should be tested.
   - Lack of serialization testing: MLlib algorithms don't contain tests
   verifying that a model still works after serialization.
   - During implementation of a new algorithm it's hard to understand what
   API you should create and which interface to implement.

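To illustrate the serialization point, a minimal sketch of the kind of round-trip check proposed here; `model` stands for any trained MLlib model, and the assertion at the end is illustrative:

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream,
  ObjectInputStream, ObjectOutputStream}

// Serialize a model to bytes and read it back, as a production job
// storing models would.
def roundTrip[T <: java.io.Serializable](model: T): T = {
  val buffer = new ByteArrayOutputStream()
  new ObjectOutputStream(buffer).writeObject(model)
  val in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
  in.readObject().asInstanceOf[T]
}

// A test would then assert that predictions are unchanged:
// assert(roundTrip(model).predict(point) == model.predict(point))
```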
The starting point for solving all these problems is a common interface for the
typical algorithms/models - regression, classification, clustering,
collaborative filtering.

All main tests should be written against these interfaces, so that when a new
algorithm is implemented, all it has to do is pass the already written tests.
That lets us keep a manageable level of quality across the whole library.

There should be a couple of benchmarks that give a new Spark user a feeling for
which algorithm to use.

The test set against these abstractions should contain a serialization test: in
production, most of the time there is no use for a model that can't be stored.

As the first step of this roadmap I'd like to create a trait RegressionModel,
*ADD* methods to the current algorithms to implement this trait, and create
some tests against it. I'm planning to do this next week.

The purpose of this letter is to collect objections to this approach at an
early stage: please give any feedback. The second reason is to put a lock on
this activity so we don't do the same thing twice: I'll create a pull request
by the end of next week, and any parallelism in development can start from
there.



-- 



*Sincerely yours, Egor Pakhomov, Scala Developer, Yandex*


Re: Adding abstraction in MLlib

2014-09-12 Thread Egor Pahomov
Some architect suggestions on this matter -
https://github.com/apache/spark/pull/2371

2014-09-12 16:38 GMT+04:00 Egor Pahomov pahomov.e...@gmail.com:

 Sorry, I misswrote  - I meant learners part of framework - models already
 exists.

 2014-09-12 15:53 GMT+04:00 Christoph Sawade 
 christoph.saw...@googlemail.com:

 I totally agree, and we discovered also some drawbacks with the
 classification models implementation that are based on GLMs:

 - There is no distinction between predicting scores, classes, and
 calibrated scores (probabilities). For these models it is common to have
 access to all of them and the prediction function ``predict``should be
 consistent and stateless. Currently, the score is only available after
 removing the threshold from the model.
 - There is no distinction between multinomial and binomial
 classification. For multinomial problems, it is necessary to handle
 multiple weight vectors and multiple confidences.
 - Models are not serialisable, which makes it hard to use them in
 practise.

 I started a pull request [1] some time ago. I would be happy to continue
 the discussion and clarify the interfaces, too!

 Cheers, Christoph

 [1] https://github.com/apache/spark/pull/2137/

 2014-09-12 11:11 GMT+02:00 Egor Pahomov pahomov.e...@gmail.com:

 Here in Yandex, during implementation of gradient boosting in spark and
 creating our ML tool for internal use, we found next serious problems in
 MLLib:


- There is no Regression/Classification model abstraction. We were
building abstract data processing pipelines, which should work just
 with
some regression - exact algorithm specified outside this code. There
 is no
abstraction, which will allow me to do that. *(It's main reason for
 all
further problems) *
- There is no common practice among MLlib for testing algorithms:
 every
model generates it's own random test data. There is no easy
 extractable
test cases applible to another algorithm. There is no benchmarks for
comparing algorithms. After implementing new algorithm it's very hard
 to
understand how it should be tested.
- Lack of serialization testing: MLlib algorithms don't contain tests
which test that model work after serialization.
- During implementation of new algorithm it's hard to understand what
API you should create and which interface to implement.

 Start for solving all these problems must be done in creating common
 interface for typical algorithms/models - regression, classification,
 clustering, collaborative filtering.

 All main tests should be written against these interfaces, so when new
 algorithm implemented - all it should do is passed already written tests.
 It allow us to have managble quality among all lib.

 There should be a couple of benchmarks that give a new Spark user a
 feeling for which algorithm to use.

 The test suite against these abstractions should contain serialization
 tests. In production there is usually no need for a model that can't be
 stored.
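Such a serialization round-trip test could be sketched as follows (a hedged illustration in plain Scala with no Spark dependency; `DummyModel` and `roundTrip` are illustrative names, not MLlib API):

```scala
import java.io._

// Illustrative stand-in for any Serializable model under test.
case class DummyModel(weights: Array[Double]) extends Serializable {
  def predict(x: Array[Double]): Double =
    weights.iterator.zip(x.iterator).map { case (w, v) => w * v }.sum
}

// Serialize an object to bytes and read it back, as a model-persistence check.
def roundTrip[T](obj: T): T = {
  val buf = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(buf)
  out.writeObject(obj)
  out.close()
  val in = new ObjectInputStream(new ByteArrayInputStream(buf.toByteArray))
  in.readObject().asInstanceOf[T]
}
```

A shared test harness could then call `roundTrip` on every model implementation and assert that predictions are unchanged afterwards.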

 As the first step of this roadmap I'd like to create a trait
 RegressionModel, *ADD* methods to the current algorithms to implement
 this trait, and create some tests against it. I'm planning to do this
 next week.
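As a hedged illustration (kept free of Spark dependencies for brevity; the real trait would operate on RDDs of feature Vectors, and these names are mine, not the API that was eventually merged), such a trait and a trivial implementation might look like:

```scala
// Sketch of the proposed shared abstraction for regression models.
trait RegressionModel extends Serializable {
  /** Predict a target value for a single feature vector. */
  def predict(features: Array[Double]): Double

  /** Predict target values for a collection of feature vectors. */
  def predictAll(data: Seq[Array[Double]]): Seq[Double] = data.map(predict)
}

// A trivial linear model implementing the trait. Shared tests could then
// be written once against RegressionModel and reused by every implementation.
class LinearModel(weights: Array[Double], intercept: Double)
    extends RegressionModel {
  def predict(features: Array[Double]): Double =
    weights.iterator.zip(features.iterator)
      .map { case (w, x) => w * x }.sum + intercept
}
```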

 The purpose of this letter is to collect any objections to this approach
 at an early stage: please give any feedback. The second reason is to
 claim this activity so we don't do the same thing twice: I'll create a
 pull request by the end of next week, and we can coordinate any parallel
 development from there.



 --



 *Sincerely yours, Egor Pakhomov, Scala Developer, Yandex*





 --



 *Sincerely yours, Egor Pakhomov, Scala Developer, Yandex*




-- 



*Sincerely yours, Egor Pakhomov, Scala Developer, Yandex*


Re: [VOTE] Release Apache Spark 1.1.0 (RC4)

2014-09-04 Thread Egor Pahomov
+1

Compiled, ran a simple job on yarn-hadoop-2.3.


2014-09-04 22:22 GMT+04:00 Henry Saputra henry.sapu...@gmail.com:

 LICENSE and NOTICE files are good
 Hash files are good
 Signature files are good
 No 3rd parties executables
 Source compiled
 Run local and standalone tests
 Test persist off heap with Tachyon looks good

 +1

 - Henry

 On Wed, Sep 3, 2014 at 12:24 AM, Patrick Wendell pwend...@gmail.com
 wrote:
  Please vote on releasing the following candidate as Apache Spark version
 1.1.0!
 
  The tag to be voted on is v1.1.0-rc4 (commit 2f9b2bd):
 
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=2f9b2bd7844ee8393dc9c319f4fefedf95f5e460
 
  The release files, including signatures, digests, etc. can be found at:
  http://people.apache.org/~pwendell/spark-1.1.0-rc4/
 
  Release artifacts are signed with the following key:
  https://people.apache.org/keys/committer/pwendell.asc
 
  The staging repository for this release can be found at:
  https://repository.apache.org/content/repositories/orgapachespark-1031/
 
  The documentation corresponding to this release can be found at:
  http://people.apache.org/~pwendell/spark-1.1.0-rc4-docs/
 
  Please vote on releasing this package as Apache Spark 1.1.0!
 
  The vote is open until Saturday, September 06, at 08:30 UTC and passes if
  a majority of at least 3 +1 PMC votes are cast.
 
  [ ] +1 Release this package as Apache Spark 1.1.0
  [ ] -1 Do not release this package because ...
 
  To learn more about Apache Spark, please see
  http://spark.apache.org/
 
  == Regressions fixed since RC3 ==
  SPARK-3332 - Issue with tagging in EC2 scripts
  SPARK-3358 - Issue with regression for m3.XX instances
 
  == What justifies a -1 vote for this release? ==
  This vote is happening very late into the QA period compared with
  previous votes, so -1 votes should only occur for significant
  regressions from 1.0.2. Bugs already present in 1.0.X will not block
  this release.
 
  == What default changes should I be aware of? ==
  1. The default value of spark.io.compression.codec is now snappy
  -- Old behavior can be restored by switching to lzf
 
  2. PySpark now performs external spilling during aggregations.
  -- Old behavior can be restored by setting spark.shuffle.spill to
 false.
 
  3. PySpark uses a new heuristic for determining the parallelism of
  shuffle operations.
  -- Old behavior can be restored by setting
  spark.default.parallelism to the number of cores in the cluster.
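The old 1.0.x behavior listed above could also be restored programmatically when building the context; a hedged sketch (the parallelism value is illustrative and should match your cluster's core count):

```scala
import org.apache.spark.SparkConf

// Restore the pre-1.1.0 defaults described in the release notes above.
val conf = new SparkConf()
  .setAppName("legacy-defaults")
  .set("spark.io.compression.codec", "lzf")  // 1.1.0 default is snappy
  .set("spark.shuffle.spill", "false")       // disable PySpark external spilling
  .set("spark.default.parallelism", "16")    // e.g. number of cores in the cluster
```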
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
  For additional commands, e-mail: dev-h...@spark.apache.org
 

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




-- 



*Sincerely yours, Egor Pakhomov, Scala Developer, Yandex*


How pySpark works?

2014-07-11 Thread Egor Pahomov
Hi, I want to use pySpark, but I can't understand how it works. The
documentation doesn't provide enough information.

1) How is Python shipped to the cluster? Should the machines in the
cluster already have Python installed?
2) What happens when I write some Python code in a map function - is it
shipped to the cluster and just executed there? How does it figure out
all the dependencies my code needs and ship them there? If I use Math in
my code in a map, does that mean the Math class gets shipped, or is some
Python math module already on the cluster used?
3) I have compiled C++ code. Can I ship this executable with addPyFile
and just invoke it from Python? Would that work?

-- 



*Sincerely yours, Egor Pakhomov, Scala Developer, Yandex*


Random forest - is it under implementation?

2014-07-11 Thread Egor Pahomov
Hi, I have an intern who wants to implement some ML algorithm for Spark.
Which algorithm would be a good idea to implement (it should not be very
difficult)? I heard someone is already working on random forest, but I
couldn't find proof of that.

I'm aware of the new policy that we should implement stable,
good-quality, popular ML algorithms or not do it at all.

-- 



*Sincerely yours, Egor Pakhomov, Scala Developer, Yandex*


Re: Random forest - is it under implementation?

2014-07-11 Thread Egor Pahomov
Great. Then one question remains:
what would you recommend implementing?



2014-07-11 17:43 GMT+04:00 Chester At Work ches...@alpinenow.com:

 Sung Chung from Alpine Data Labs presented the random forest
 implementation at Spark Summit 2014. The work will be open sourced and
 contributed back to MLlib.

 Stay tuned



 Sent from my iPad

 On Jul 11, 2014, at 6:02 AM, Egor Pahomov pahomov.e...@gmail.com wrote:

  Hi, I have an intern who wants to implement some ML algorithm for Spark.
  Which algorithm would be a good idea to implement (it should not be very
  difficult)? I heard someone is already working on random forest, but I
  couldn't find proof of that.
 
  I'm aware of the new policy that we should implement stable,
  good-quality, popular ML algorithms or not do it at all.
 
  --
 
 
 
  *Sincerely yours, Egor Pakhomov, Scala Developer, Yandex*




-- 



*Sincerely yours, Egor Pakhomov, Scala Developer, Yandex*


Ping on SPARK-1177

2014-03-16 Thread Egor Pahomov
The Spark documentation and code help you run your application from the
shell. In my company that's not convenient - we launch cluster tasks from
code in our web service. It took me a lot of time to move as much
configuration into code as I could, because configuring things at process
start is quite hard in our environment. I'd like to make some patches and
write some documentation that bring our practices to Spark. Please help
me do that by reviewing this first patch: https://github.com/apache/spark/pull/82
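The kind of in-code configuration being described might look like this (a hedged sketch; the master URL, jar path, and settings are illustrative, not values from the patch):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Configure everything from application code instead of shell scripts,
// so a web service can launch Spark jobs directly.
val conf = new SparkConf()
  .setMaster("spark://master-host:7077")       // illustrative master URL
  .setAppName("web-service-job")
  .setJars(Seq("/path/to/job-assembly.jar"))   // illustrative jar path
  .set("spark.executor.memory", "2g")

val sc = new SparkContext(conf)
try {
  val total = sc.parallelize(1 to 100).sum()   // run work from the service
} finally {
  sc.stop()                                    // always release the context
}
```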

-- 



*Sincerely yours, Egor Pakhomov, Scala Developer, Yandex*