Git Achievements

2015-02-22 Thread Nicholas Chammas
For fun:

http://acha-acha.co/#/repo/https://github.com/apache/spark

I just added Spark to this site. Some of these “achievements” are hilarious.

Leo Tolstoy: More than 10 lines in a commit message

Dangerous Game: Commit after 6PM Friday

Nick


Re: Improving metadata in Spark JIRA

2015-02-22 Thread Sean Owen
Open pull request count is down to 254 right now from ~325 several weeks
ago.
Open JIRA count is down slightly to 1262 from a peak over ~1320.
Obviously, in the face of an ever faster and larger stream of contributions.

There's a real positive impact of JIRA being a little more meaningful, a
little less backlog to keep looking at, getting commits in slightly faster,
slightly happier contributors, etc.


The virtuous circle can keep going. It'd be great if every contributor
could take a moment to look at his or her open PRs and JIRAs. Example
searches (replace with your user name / name):

https://github.com/apache/spark/pulls/srowen
https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20reporter%20%3D%20%22Sean%20Owen%22%20or%20assignee%20%3D%20%22Sean%20Owen%22

For PRs:

- if it appears to be waiting on your action or feedback,
  - push more changes and/or reply to comments, or
  - if it isn't work you can pursue in the immediate future, close the PR

- if it appears to be waiting on others,
  - if it's had feedback and it's unclear whether there's support to commit
    as-is,
    - break down or reduce the change to something less controversial, or
    - close the PR as softly rejected
  - if there's no feedback or plainly waiting for action, ping @them

For JIRAs:

- If it's fixed along the way, or obsolete, resolve as Fixed or NotAProblem

- Do a quick search to see if a similar issue has been filed and is
resolved or has more activity; resolve as Duplicate if so

- Check that fields are assigned reasonably:
  - Meaningful title and description
  - Reasonable type and priority. Not everything is a major bug, and few
are blockers
  - 1+ Component
  - 1+ Affects version
  - Avoid setting target version until it looks like there's momentum to
merge a resolution

- If the JIRA has had no activity in a long time (6+ months), but does not
feel obsolete, try to move it to some resolution:
  - Request feedback, from specific people if desired, to feel out if there
is any other support for the change
  - Add more info, like a specific reproduction for bugs
  - Narrow scope of feature requests to something that contains a few
actionable steps, instead of broad open-ended wishes
  - Work on a fix. In an ideal world people are willing to work to resolve
JIRAs they open, and don't fire-and-forget


If everyone did this, not only would it advance the house-cleaning a bit
more, but I'm sure we'd rediscover some important work and issues that need
attention.
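For reference, the "no activity in a long time (6+ months)" check above can be
expressed as a single JQL search (a sketch, assuming standard JQL date
arithmetic; adjust the window to taste):

```
project = SPARK AND resolution = Unresolved AND updated <= -26w ORDER BY updated ASC
```

Here -26w means "last updated at least ~26 weeks ago", so the most stale
issues sort first.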


On Sun, Feb 22, 2015 at 7:54 AM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> As of right now, there are no more open JIRA issues without an assigned
> component! Hurray!
>
> [image: yay]
>
> Thanks to Sean and others for the cleanup!
>
> Nick
>
>


Re: Improving metadata in Spark JIRA

2015-02-22 Thread Nicholas Chammas
Open pull request count is down to 254 right now from ~325 several weeks
ago.

This is great. Ideally, we need to get this down to < 50 and keep it there.
Having so many open pull requests sends a bad signal to contributors. But
it will take some time to get there.


   - 1+ Component

 Sean, do you have permission to edit our JIRA settings? It should be
possible to enforce this in JIRA itself.


   - 1+ Affects version

 I don’t think this field makes sense for improvements, right?

Nick



textFile() ordering and header rows

2015-02-22 Thread Michael Malak
Since RDDs are generally unordered, isn't it the case that textFile().first()
is not guaranteed to return the first row (such as when looking for a header
row)? If so, doesn't that make the example in
http://spark.apache.org/docs/1.2.1/quick-start.html#basics misleading?

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: textFile() ordering and header rows

2015-02-22 Thread Nicholas Chammas
I guess on a technicality the docs just say "first item in this RDD", not
"first line in the source text file". AFAIK there is no way apart from
filtering to remove header lines.

As long as first() always returns the same value for a given RDD, I think
it's fine, no?

Nick
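For what it's worth, the usual idiom for skipping a header is to drop the
first element of the first partition. A minimal sketch below simulates this
with plain Python lists instead of a real RDD (the names are illustrative,
not from this thread; in PySpark the same logic would go through
mapPartitionsWithIndex):

```python
def drop_header(partitions):
    """Drop the first line of partition 0, i.e. the header row."""
    result = []
    for index, part in enumerate(partitions):
        it = iter(part)
        if index == 0:
            next(it, None)  # skip the header line, if any
        result.append(list(it))
    return result

# Two "partitions" of a small CSV-like file
parts = [["name,age", "alice,30"], ["bob,25"]]
print(drop_header(parts))  # [['alice,30'], ['bob,25']]
```

This relies on the first partition of a textFile() actually holding the start
of the file, which is the same assumption the first() example makes.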




Re: Have Friedman's glmnet algo running in Spark

2015-02-22 Thread Joseph Bradley
Hi Mike,

glmnet has definitely been very successful, and it would be great to see
how we can improve optimization in MLlib!  There is some related work
ongoing; here are the JIRAs:

GLMNET implementation in Spark

LinearRegression with L1/L2 (elastic net) using OWLQN in new ML package
The GLMNET JIRA has actually been closed in favor of the latter JIRA.
However, if you're getting good results in your experiments, could you
please post them on the GLMNET JIRA and link them from the other JIRA?  If
it's faster and more scalable, that would be great to find out.

As far as where the code should go and the APIs, that can be discussed on
the JIRA.

I hope this helps, and I'll keep an eye out for updates on the JIRAs!

Joseph


On Thu, Feb 19, 2015 at 10:59 AM,  wrote:

> Dev List,
> A couple of colleagues and I have gotten several versions of glmnet algo
> coded and running on Spark RDD. glmnet algo (
> http://www.jstatsoft.org/v33/i01/paper) is a very fast algorithm for
> generating coefficient paths solving penalized regression with elastic net
> penalties. The algorithm runs fast by taking an approach that generates
> solutions for a wide variety of penalty parameters. We're able to integrate
> into the MLlib class structure in a couple of different ways. The algorithm
> may fit better into the new pipeline structure since it naturally returns a
> multitude of models (corresponding to different values of the penalty
> parameters). That appears to fit better into the pipeline than MLlib linear
> regression (for example).
>
> We've got regression running with the speed optimizations that Friedman
> recommends. We'll start working on the logistic regression version next.
>
> We're eager to make the code available as open source and would like to
> get some feedback about how best to do that. Any thoughts?
> Mike Bowles.
>
>
>
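For readers not familiar with the glmnet paper: its inner loop is cyclic
coordinate descent with a soft-thresholding update, warm-started across a
decreasing grid of penalties. A rough single-machine sketch in plain NumPy
(illustrative only; this is neither Mike's code nor an MLlib API):

```python
import numpy as np

def soft_threshold(z, g):
    """Soft-thresholding operator: the closed-form lasso update."""
    return np.sign(z) * max(abs(z) - g, 0.0)

def elastic_net_cd(X, y, lam, alpha=1.0, n_iter=100, beta0=None):
    """Cyclic coordinate descent for elastic net (features assumed standardized)."""
    n, p = X.shape
    beta = np.zeros(p) if beta0 is None else beta0.copy()
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual: remove feature j's current contribution
            r = y - X @ beta + X[:, j] * beta[j]
            z = X[:, j] @ r / n
            denom = (X[:, j] @ X[:, j]) / n + lam * (1.0 - alpha)
            beta[j] = soft_threshold(z, lam * alpha) / denom
    return beta

def coefficient_path(X, y, lambdas, alpha=1.0):
    """Warm-started path over a decreasing penalty grid, as glmnet does."""
    path, beta = [], None
    for lam in sorted(lambdas, reverse=True):
        beta = elastic_net_cd(X, y, lam, alpha, beta0=beta)
        path.append(beta)
    return path
```

The warm starts are what make the whole path cheap: each solve begins from
the previous penalty's solution.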


Re: Spark SQL - Long running job

2015-02-22 Thread Cheng Lian
How about persisting the computed result table before caching it? That way,
after restarting your service, you only need to re-cache the persisted result
table rather than recompute it. Somewhat like checkpointing.


Cheng
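The persist-then-reload idea can be sketched outside Spark: write the
expensive result out once, and on restart load it instead of recomputing. A
toy stand-in using a local JSON file (in Spark SQL the analogous step would
be writing the table out, e.g. as Parquet, and re-caching it on startup):

```python
import json
import os

def compute_result():
    # Stand-in for the expensive query over the source data
    return [{"id": i, "value": i * i} for i in range(3)]

def load_or_compute(path):
    """Reload the persisted result if present; otherwise compute and persist it."""
    if os.path.exists(path):
        with open(path) as f:       # restart case: cheap reload, no recompute
            return json.load(f)
    result = compute_result()
    with open(path, "w") as f:      # first run: persist the computed table
        json.dump(result, f)
    return result
```

The second call with the same path skips compute_result() entirely, which is
the property nitin wants after a driver restart.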

On 2/22/15 12:55 AM, nitin wrote:

Hi All,

I intend to build a long running Spark application which fetches data/tuples
from Parquet, does some processing (time consuming) and then caches the
processed table (InMemoryColumnarTableScan). My use case is good retrieval
time for SQL queries (benefits of the Spark SQL optimizer) and data
compression (built into in-memory caching). Now the problem is that if my
driver goes down, I will have to fetch the data again for all the tables,
compute it, and cache it, which is time consuming.

Is it possible to persist processed/cached RDDs on disk such that my system
up time is less when restarted after failure/going down?

On a side note, the data processing contains a shuffle step which creates
huge temporary shuffle files on local disk in the temp folder, and as per
current logic, shuffle files don't get deleted for running executors. This is
leading to my local disk filling up quickly and running out of space, as it's
a long running Spark job. (Running Spark in yarn-client mode, btw.)

Thanks
-Nitin



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-SQL-Long-running-job-tp10717.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org








Re: Spark SQL - Long running job

2015-02-22 Thread nitin
I believe calling processedSchemaRdd.persist(DISK) and
processedSchemaRdd.checkpoint() only persists the data; I will lose all the
RDD metadata, so when I restart my driver, that data is kind of useless for
me (correct me if I am wrong).

I thought of doing processedSchemaRdd.saveAsParquetFile (to the HDFS file
system), but I fear that in case my "HDFS block size" > "partition file
size", I will get more partitions when reading than in the original
schemaRdd.
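For a back-of-envelope check of that fear: roughly one input split is created
per HDFS block per file, so the re-read partition count can be estimated as
follows (a sketch, not Spark's exact split planning):

```python
import math

def read_partitions(file_sizes, block_size):
    """Approximate input partitions: one split per HDFS block per file."""
    return sum(max(1, math.ceil(size / block_size)) for size in file_sizes)

MB = 1024 ** 2
# Four 64 MB partition files, 128 MB blocks: still four partitions on re-read
print(read_partitions([64 * MB] * 4, 128 * MB))   # 4
# One 300 MB file splits into three partitions
print(read_partitions([300 * MB], 128 * MB))      # 3
```

Under this approximation, files smaller than a block still come back as one
partition each; the count only grows when a partition file exceeds the block
size.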



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-SQL-Long-running-job-tp10717p10727.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-22 Thread Mark Hamstra
So what are we expecting of Hive 0.12.0 builds with this RC?  I know not
every combination of Hadoop and Hive versions, etc., can be supported, but
even an example build from the "Building Spark" page isn't looking too good
to me.

Working from f97b0d4, the example build command works:

mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -Phive-0.12.0
-Phive-thriftserver -DskipTests clean package

...but then running the tests results in multiple failures in the Hive and
Hive Thrift Server sub-projects.


On Wed, Feb 18, 2015 at 12:12 AM, Patrick Wendell 
wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 1.3.0!
>
> The tag to be voted on is v1.3.0-rc1 (commit f97b0d4a):
>
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=f97b0d4a6b26504916816d7aefcf3132cd1da6c2
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-1.3.0-rc1/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1069/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-1.3.0-rc1-docs/
>
> Please vote on releasing this package as Apache Spark 1.3.0!
>
> The vote is open until Saturday, February 21, at 08:03 UTC and passes
> if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.3.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see
> http://spark.apache.org/
>
> == How can I help test this release? ==
> If you are a Spark user, you can help us test this release by
> taking a Spark 1.2 workload and running on this release candidate,
> then reporting any regressions.
>
> == What justifies a -1 vote for this release? ==
> This vote is happening towards the end of the 1.3 QA period,
> so -1 votes should only occur for significant regressions from 1.2.1.
> Bugs already present in 1.2.X, minor regressions, or bugs related
> to new features will not block this release.
>
> - Patrick
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>