Re: hey spark developers! intro from shane knapp, devops engineer @ AMPLab

2014-09-02 Thread Christopher Nguyen
Welcome, Shane. As a former prof and eng dir at Google, I've been expecting
this to be a first-class engineering college subject. I just didn't expect
it to come through this route :-)

So congrats, and I hope you represent the beginning of a great new trend at
universities.

Sent while mobile. Please excuse typos etc.
On Sep 2, 2014 11:00 AM, Patrick Wendell pwend...@gmail.com wrote:

 Hey Shane,

 Thanks for your work so far and I'm really happy to see investment in
 this infrastructure. This is a key productivity tool for us and
 something we'd love to expand over time to improve the development
 process of Spark.

 - Patrick

 On Tue, Sep 2, 2014 at 10:47 AM, Nicholas Chammas
 nicholas.cham...@gmail.com wrote:
  Hi Shane!
 
  Thank you for doing the Jenkins upgrade last week. It's nice to know that
  infrastructure is gonna get some dedicated TLC going forward.
 
  Welcome aboard!
 
  Nick
 
 
  On Tue, Sep 2, 2014 at 1:35 PM, shane knapp skn...@berkeley.edu wrote:
 
  so, i had a meeting w/the databricks guys on friday and they recommended i
  send an email out to the list to say 'hi' and give you guys a quick intro.  :)

  hi!  i'm shane knapp, the new AMPLab devops engineer, and will be spending
  time getting the jenkins build infrastructure up to production quality.
  much of this will be 'under the covers' work, like better system level
  auth, backups, etc, but some will definitely be user facing:  timely
  jenkins updates, debugging broken build infrastructure and some plugin
  support.

  i've been working in the bay area now since 1997 at many different
  companies, and my last 10 years has been split between google and palantir.
  i'm a huge proponent of OSS, and am really happy to be able to help with
  the work you guys are doing!

  if anyone has any requests/questions/comments, feel free to drop me a line!

  shane
 





Re: CoHadoop Papers

2014-08-26 Thread Christopher Nguyen
Gary, do you mean Spark and HDFS separately, or Spark's use of HDFS?

If the former, Spark does support copartitioning.

If the latter, it's an HDFS concern that is outside Spark's scope. On that
note, Hadoop does also make attempts to collocate data, e.g., rack awareness.
I'm sure the paper makes useful contributions for its set of use cases.
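
To make the former concrete, here is a minimal sketch of copartitioning with
the RDD API (the paths, partition count, and parse function are placeholders,
not anything from the paper):

  import org.apache.spark.{HashPartitioner, SparkContext}
  import org.apache.spark.SparkContext._  // pair-RDD functions in the 1.x API

  val sc = new SparkContext("local[2]", "copartition-sketch")
  val part = new HashPartitioner(16)

  // Illustrative tab-separated (key, payload) records.
  def parse(line: String): (String, String) = {
    val f = line.split("\t"); (f(0), f(1))
  }

  val users  = sc.textFile("hdfs:///path/users").map(parse).partitionBy(part).cache()
  val events = sc.textFile("hdfs:///path/events").map(parse).partitionBy(part).cache()

  // Both RDDs now carry the same partitioner, so matching keys already live in
  // the same partition and this join needs no extra shuffle.
  val joined = users.join(events)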

Sent while mobile. Pls excuse typos etc.
On Aug 26, 2014 5:21 AM, Gary Malouf malouf.g...@gmail.com wrote:

 It appears support for this type of control over block placement is going
 out in the next version of HDFS:
 https://issues.apache.org/jira/browse/HDFS-2576


 On Tue, Aug 26, 2014 at 7:43 AM, Gary Malouf malouf.g...@gmail.com
 wrote:

  One of my colleagues has been questioning me as to why Spark/HDFS makes no
  attempts to try to co-locate related data blocks.  He pointed to this
  paper: http://www.vldb.org/pvldb/vol4/p575-eltabakh.pdf from 2011 on the
  CoHadoop research and the performance improvements it yielded for
  Map/Reduce jobs.

  Would leveraging these ideas for writing data from Spark make sense/be
  worthwhile?
 
 
 



Re: Welcoming two new committers

2014-08-08 Thread Christopher Nguyen
+1 Joey & Andrew :)

--
Christopher T. Nguyen
Co-founder & CEO, Adatao http://adatao.com [ah-'DAY-tao]
linkedin.com/in/ctnguyen



On Thu, Aug 7, 2014 at 10:39 PM, Joseph Gonzalez jegon...@eecs.berkeley.edu
 wrote:

 Hi Everyone,

 Thank you for inviting me to be a committer.  I look forward to working
 with everyone to ensure the continued success of the Spark project.

 Thanks!
 Joey




 On Thu, Aug 7, 2014 at 9:57 PM, Matei Zaharia ma...@databricks.com
 wrote:

  Hi everyone,
 
  The PMC recently voted to add two new committers and PMC members: Joey
  Gonzalez and Andrew Or. Both have been huge contributors in the past year
  -- Joey on much of GraphX as well as quite a bit of the initial work in
  MLlib, and Andrew on Spark Core. Join me in welcoming them as committers!
 
  Matei
 
 
 
 



Re: Dynamic variables in Spark

2014-07-21 Thread Christopher Nguyen
Hi Neil, first off, I'm generally a sympathetic advocate for making changes
to Spark internals to make it easier/better/faster/more awesome.

In this case, I'm (a) not clear about what you're trying to accomplish, and
(b) a bit worried about the proposed solution.

On (a): it is stated that you want to pass some Accumulators around. Yet
the proposed solution is for some shared variable that may be set and
mapped out and possibly reduced back, but without any accompanying
accumulation semantics. And yet it doesn't seem like you want just the
broadcast property. Can you clarify the problem statement with some
before/after client code examples?

On (b): you're right that adding variables to SparkContext should be done
with caution, as it may have unintended consequences beyond just serdes
payload size. For example, there is a stated intention of supporting
multiple SparkContexts in the future, and this proposed solution can make
it a bigger challenge to do so. Indeed, we had a gut-wrenching call to make
a while back on a subject related to this (see
https://github.com/mesos/spark/pull/779). Furthermore, even in a single
SparkContext application, there may be multiple clients (of that
application) whose intent to use the proposed SparkDynamic would not
necessarily be coordinated.

So, considering a ratio of a/b (benefit/cost), it's not clear to me that
the benefits are significant enough to warrant the costs. Do I
misunderstand that the benefit is to save one explicit parameter (the
context) in the signature/closure code?

--
Christopher T. Nguyen
Co-founder & CEO, Adatao http://adatao.com
linkedin.com/in/ctnguyen



On Mon, Jul 21, 2014 at 2:10 PM, Neil Ferguson nfergu...@gmail.com wrote:

 Hi all

 I have been adding some metrics to the ADAM project
 https://github.com/bigdatagenomics/adam, which runs on Spark, and have a
 proposal for an enhancement to Spark that would make this work cleaner and
 easier.

 I need to pass some Accumulators around, which will aggregate metrics
 (timing stats and other metrics) across the cluster. However, it is
 cumbersome to have to explicitly pass some context containing these
 accumulators around everywhere that might need them. I can use Scala
 implicits, which help slightly, but I'd still need to modify every method
 in the call stack to take an implicit parameter.

 So, I'd like to propose that we add the ability to have dynamic variables
 (basically thread-local variables) to Spark. This would avoid having to
 pass the Accumulators around explicitly.

 My proposed approach is to add a method to the SparkContext class as
 follows:

 /**
  * Sets the value of a dynamic variable. This value is made available to jobs
  * without having to be passed around explicitly. During execution of a Spark
  * job this value can be obtained from the [[SparkDynamic]] object.
  */
 def setDynamicVariableValue(value: Any)

 Then, when a job is executing, the SparkDynamic object can be accessed to
 obtain the value of the dynamic variable. The implementation of this object
 is as follows:

 object SparkDynamic {

   private val dynamicVariable = new DynamicVariable[Any]()

   /**
    * Gets the value of the dynamic variable that has been set in the
    * [[SparkContext]]
    */
   def getValue: Option[Any] = {
     Option(dynamicVariable.value)
   }

   private[spark] def withValue[S](threadValue: Option[Any])(thunk: => S): S = {
     dynamicVariable.withValue(threadValue.orNull)(thunk)
   }
 }

 The change involves modifying the Task object to serialize the value of the
 dynamic variable, and modifying the TaskRunner class to deserialize the
 value and make it available in the thread that is running the task (using
 the SparkDynamic.withValue method).
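
 To make the intended usage concrete, here is a rough sketch of client code
 against the prototype (the method names are the proposed ones above; the map
 key and the per-record work are made up for illustration):

 import org.apache.spark.{Accumulator, SparkContext, SparkDynamic}
 import org.apache.spark.SparkContext._  // implicit AccumulatorParam instances

 val sc = new SparkContext("local[2]", "metrics-sketch")
 val timingStats: Accumulator[Long] = sc.accumulator(0L)

 // Stash the accumulator once instead of threading it through every method.
 sc.setDynamicVariableValue(Map("timingStats" -> timingStats))

 sc.textFile("reads.txt").map { line =>
   val start = System.nanoTime()
   val result = line.reverse  // stand-in for the real per-record work
   // Deep inside the task, recover the accumulator from the dynamic variable.
   SparkDynamic.getValue.foreach {
     case m: Map[String, Accumulator[Long]] @unchecked =>
       m("timingStats") += (System.nanoTime() - start)
   }
   result
 }.count()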

 I have done a quick prototype of this in this commit:

 https://github.com/nfergu/spark/commit/8be28d878f43ad6c49f892764011ae7d273dcea6
 and it seems to work fine in my (limited) testing. It needs more testing,
 tidy-up and documentation though.

 One drawback is that the dynamic variable will be serialized for every Task
 whether it needs it or not. For my use case this might not be too much of a
 problem, as serializing and deserializing Accumulators looks fairly
 lightweight -- however we should certainly warn users against setting a
 dynamic variable containing lots of data. I thought about using broadcast
 variables here, but I don't think it's possible to put Accumulators in a
 broadcast variable (as I understand it, they're intended for purely read-only
 data).

 What do people think about this proposal? My use case aside, it seems like
 it would be a generally useful enhancement to be able to pass certain data
 around without having to explicitly pass it everywhere.

 Neil



Re: Optiq for SparkSQL?

2014-06-07 Thread Christopher Nguyen
Yan, it looks like Julian did anticipate exactly this possibility:

https://github.com/julianhyde/optiq/tree/master/spark

Optiq is a cool project, with a vision of hiding various engines behind one
consistent API.

That said, from just the Spark perspective, I don't see a huge value-add in
layering Optiq above SparkSQL -- until and unless Optiq provides a lot more
idioms and/or operational facilities than just making Spark RDDs look like
tables, which SparkSQL already does quite nicely and increasingly well.
Warehousing, perhaps?
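
For reference, the "RDDs as tables" part is already roughly this simple with
the 1.0-era SparkSQL API (the Person schema and input file are just an
example, and an existing SparkContext `sc` is assumed):

  import org.apache.spark.sql.SQLContext

  case class Person(name: String, age: Int)

  val sqlContext = new SQLContext(sc)
  import sqlContext.createSchemaRDD  // implicit conversion RDD[Product] -> SchemaRDD

  val people = sc.textFile("people.txt")
    .map(_.split(","))
    .map(p => Person(p(0), p(1).trim.toInt))

  people.registerAsTable("people")
  val teens = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")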

Here, I can't avoid a mention of DDF, which aims to add more algorithmic and
data manipulation value in addition to the table abstraction (
https://spark-summit.org/2014/talk/distributed-dataframe-ddf-on-apache-spark-simplifying-big-data-for-the-rest-of-us
)
--
Christopher T. Nguyen
Co-founder & CEO, Adatao http://adatao.com
linkedin.com/in/ctnguyen



On Fri, Jun 6, 2014 at 11:26 AM, Yan Zhou.sc yan.zhou...@huawei.com wrote:

 Can anybody share thoughts/comments/interest regarding the applicability of
 the Optiq framework to Spark, and SparkSQL in particular?

 Thanks,



Re: Announcing Spark 1.0.0

2014-05-30 Thread Christopher Nguyen
Awesome work, Pat et al.!

--
Christopher T. Nguyen
Co-founder & CEO, Adatao http://adatao.com
linkedin.com/in/ctnguyen



On Fri, May 30, 2014 at 3:12 AM, Patrick Wendell pwend...@gmail.com wrote:

 I'm thrilled to announce the availability of Spark 1.0.0! Spark 1.0.0
 is a milestone release as the first in the 1.0 line of releases,
 providing API stability for Spark's core interfaces.

 Spark 1.0.0 is Spark's largest release ever, with contributions from
 117 developers. I'd like to thank everyone involved in this release -
 it was truly a community effort with fixes, features, and
 optimizations contributed from dozens of organizations.

 This release expands Spark's standard libraries, introducing a new SQL
 package (SparkSQL) which lets users integrate SQL queries into
 existing Spark workflows. MLlib, Spark's machine learning library, is
 expanded with sparse vector support and several new algorithms. The
 GraphX and Streaming libraries also introduce new features and
 optimizations. Spark's core engine adds support for secured YARN
 clusters, a unified tool for submitting Spark applications, and
 several performance and stability improvements. Finally, Spark adds
 support for Java 8 lambda syntax and improves coverage of the Java and
 Python APIs.

 Those features only scratch the surface - check out the release notes here:
 http://spark.apache.org/releases/spark-release-1-0-0.html

 Note that since release artifacts were posted recently, certain
 mirrors may not have working downloads for a few hours.

 - Patrick



Re: LogisticRegression: Predicting continuous outcomes

2014-05-28 Thread Christopher Nguyen
Bharath (apologies if you're already familiar with the theory): the proposed
approach may or may not be appropriate depending on the overall transfer
function in your data. In general, a single logistic regressor cannot
approximate arbitrary non-linear functions (of linear combinations of the
inputs). You can review works by, e.g., Hornik and Cybenko from the late '80s
to see if you need something more, such as a simple one-hidden-layer neural
network.
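
In symbols (just the standard model form, nothing MLlib-specific): a single
logistic regressor computes

  p(y = 1 | x) = sigma(w'x + b),  with  sigma(z) = 1 / (1 + exp(-z)),

so its output is a monotone function of one linear projection w'x. The
universal-approximation results above apply instead to sums of such units,
sum_k a_k * sigma(w_k'x + b_k), i.e., a network with one hidden layer.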

This is a good summary:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.101.2647&rep=rep1&type=pdf

--
Christopher T. Nguyen
Co-founder & CEO, Adatao http://adatao.com
linkedin.com/in/ctnguyen



On Wed, May 28, 2014 at 11:18 AM, Bharath Ravi Kumar reachb...@gmail.com wrote:

 I'm looking to reuse the LogisticRegression model (with SGD) to predict a
 real-valued outcome variable. (I understand that logistic regression is
 generally applied to predict a binary outcome, but for various reasons, this
 model suits our needs better than LinearRegression.) Related to that, I have
 the following questions:

 1) Can the current LogisticRegression model be used as is to train based on
 binary input (i.e. explanatory) features, or is there an assumption that
 the explanatory features must be continuous?

 2) I intend to reuse the current class to train a model on LabeledPoints
 where the label is a real value (and not 0 / 1). I'd like to know if
 invoking setValidateData(false) would suffice or if one must override the
 validator to achieve this.

 3) I recall seeing an experimental method on the class (

 https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala
 )
 that clears the threshold separating positive & negative predictions. Once
 the model is trained on real-valued labels, would clearing this flag
 suffice to predict an outcome that is continuous in nature? (A rough sketch
 of what I have in mind follows below.)
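
 Concretely, that sketch (untested, and assuming training and test are
 RDD[LabeledPoint], with real-valued labels in training):

 import org.apache.spark.mllib.classification.LogisticRegressionWithSGD

 val lr = new LogisticRegressionWithSGD()
 lr.optimizer.setNumIterations(100)
 lr.setValidateData(false)  // question 2: skip the 0/1 label validation
 val model = lr.run(training)

 model.clearThreshold()  // question 3: return raw scores instead of 0/1 labels
 val scores = test.map(p => model.predict(p.features))  // continuous values in (0, 1)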

 Thanks,
 Bharath

 P.S: I'm writing to dev@ and not user@ assuming that lib changes might be
 necessary. Apologies if the mailing list is incorrect.



Re: can RDD be shared across multiple Spark applications?

2014-05-17 Thread Christopher Nguyen
Qing Yang, Andy is correct in answering your direct question.

At the same time, depending on your context, you may be able to apply a
pattern where you turn the single Spark application into a service; multiple
clients of that service can then share access to the same RDDs.
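
A bare-bones illustration of that pattern (all names, the master URL, and the
choice of a plain object are made up for the example; the RPC/REST layer that
clients would call through is not shown):

  import org.apache.spark.SparkContext
  import org.apache.spark.rdd.RDD
  import scala.collection.concurrent.TrieMap

  // One long-running application owns the SparkContext; clients call into it
  // instead of creating their own contexts, so they all see the same cached RDDs.
  object SharedRddService {
    private val sc = new SparkContext("spark://master:7077", "shared-rdd-service")
    private val registry = TrieMap.empty[String, RDD[String]]

    def load(name: String, path: String): Unit =
      registry.getOrElseUpdate(name, sc.textFile(path).cache())

    def count(name: String): Long = registry(name).count()
  }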

Several groups have built apps based on this pattern, and we will also show
something with this behavior at the upcoming Spark Summit (multiple users
collaborating on named DDFs with the same underlying RDDs).

Sent while mobile. Pls excuse typos etc.
On May 18, 2014 9:40 AM, Andy Konwinski andykonwin...@gmail.com wrote:

 RDDs cannot currently be shared across multiple SparkContexts without using
 something like the Tachyon project (which is a separate project/codebase).

 Andy
 On May 16, 2014 2:14 PM, qingyang li liqingyang1...@gmail.com wrote: