Re: hey spark developers! intro from shane knapp, devops engineer @ AMPLab
Welcome, Shane. As a former prof and eng dir at Google, I've been expecting this to be a first-class engineering college subject. I just didn't expect it to come through this route :-) So congrats, and I hope you represent the beginning of a great new trend at universities.

Sent while mobile. Please excuse typos etc.

On Sep 2, 2014 11:00 AM, Patrick Wendell pwend...@gmail.com wrote:

Hey Shane,

Thanks for your work so far and I'm really happy to see investment in this infrastructure. This is a key productivity tool for us and something we'd love to expand over time to improve the development process of Spark.

- Patrick

On Tue, Sep 2, 2014 at 10:47 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote:

Hi Shane! Thank you for doing the Jenkins upgrade last week. It's nice to know that the infrastructure is going to get some dedicated TLC going forward. Welcome aboard!

Nick

On Tue, Sep 2, 2014 at 1:35 PM, shane knapp skn...@berkeley.edu wrote:

so, i had a meeting w/the databricks guys on friday and they recommended i send an email out to the list to say 'hi' and give you guys a quick intro. :)

hi! i'm shane knapp, the new AMPLab devops engineer, and will be spending time getting the jenkins build infrastructure up to production quality. much of this will be 'under the covers' work, like better system-level auth, backups, etc, but some will definitely be user facing: timely jenkins updates, debugging broken build infrastructure and some plugin support.

i've been working in the bay area since 1997 at many different companies, and my last 10 years have been split between google and palantir. i'm a huge proponent of OSS, and am really happy to be able to help with the work you guys are doing!

if anyone has any requests/questions/comments, feel free to drop me a line!

shane
Re: CoHadoop Papers
Gary, do you mean Spark and HDFS separately, or Spark's use of HDFS? If the former, Spark does support copartitioning. If the latter, it's in HDFS's scope, outside of Spark. On that note, Hadoop does also make attempts to collocate data, e.g., rack awareness. I'm sure the paper makes useful contributions for its set of use cases.

Sent while mobile. Pls excuse typos etc.

On Aug 26, 2014 5:21 AM, Gary Malouf malouf.g...@gmail.com wrote:

It appears support for this type of control over block placement is going out in the next version of HDFS: https://issues.apache.org/jira/browse/HDFS-2576

On Tue, Aug 26, 2014 at 7:43 AM, Gary Malouf malouf.g...@gmail.com wrote:

One of my colleagues has been questioning me as to why Spark/HDFS makes no attempt to co-locate related data blocks. He pointed to this paper: http://www.vldb.org/pvldb/vol4/p575-eltabakh.pdf from 2011 on the CoHadoop research and the performance improvements it yielded for Map/Reduce jobs. Would leveraging these ideas for writing data from Spark make sense/be worthwhile?
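On the Spark side, the copartitioning Christopher mentions works like this: if two pair RDDs are partitioned with the same partitioner, a subsequent join of the two is planned without reshuffling either input. A minimal sketch, assuming an existing SparkContext named `sc` (the dataset contents are invented for illustration):

```scala
import org.apache.spark.HashPartitioner

// Partition both RDDs with the *same* partitioner so records with
// equal keys land in the same partition.
val partitioner = new HashPartitioner(8)

val users = sc.parallelize(Seq((1, "alice"), (2, "bob")))
  .partitionBy(partitioner)
  .cache()

val orders = sc.parallelize(Seq((1, 9.99), (2, 4.50)))
  .partitionBy(partitioner)
  .cache()

// Because both inputs share a partitioner, this join is a narrow
// dependency: no shuffle of either input is required.
val joined = users.join(orders)
```

This is application-level copartitioning; it says nothing about where HDFS physically places the underlying blocks, which is the part CoHadoop addresses.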
Re: Welcoming two new committers
+1 Joey & Andrew :)

--
Christopher T. Nguyen
Co-founder & CEO, Adatao
http://adatao.com [ah-'DAY-tao]
linkedin.com/in/ctnguyen

On Thu, Aug 7, 2014 at 10:39 PM, Joseph Gonzalez jegon...@eecs.berkeley.edu wrote:

Hi Everyone,

Thank you for inviting me to be a committer. I look forward to working with everyone to ensure the continued success of the Spark project.

Thanks!
Joey

On Thu, Aug 7, 2014 at 9:57 PM, Matei Zaharia ma...@databricks.com wrote:

Hi everyone,

The PMC recently voted to add two new committers and PMC members: Joey Gonzalez and Andrew Or. Both have been huge contributors in the past year -- Joey on much of GraphX as well as quite a bit of the initial work in MLlib, and Andrew on Spark Core. Join me in welcoming them as committers!

Matei
Re: Dynamic variables in Spark
Hi Neil, first off, I'm generally a sympathetic advocate for making changes to Spark internals to make it easier/better/faster/more awesome. In this case, I'm (a) not clear about what you're trying to accomplish, and (b) a bit worried about the proposed solution.

On (a): you state that you want to pass some Accumulators around. Yet the proposed solution is for some shared variable that may be set and mapped out and possibly reduced back, but without any accompanying accumulation semantics. And yet it doesn't seem like you want just the broadcast property either. Can you clarify the problem statement with some before/after client code examples?

On (b): you're right that adding variables to SparkContext should be done with caution, as it may have unintended consequences beyond just serdes payload size. For example, there is a stated intention of supporting multiple SparkContexts in the future, and this proposed solution could make that a bigger challenge. Indeed, we had a gut-wrenching call to make a while back on a related subject (see https://github.com/mesos/spark/pull/779). Furthermore, even in a single-SparkContext application, there may be multiple clients (of that application) whose intents to use the proposed SparkDynamic would not necessarily be coordinated.

So, considering a ratio of a/b (benefit/cost), it's not clear to me that the benefits are significant enough to warrant the costs. Do I misunderstand that the benefit is to save one explicit parameter (the context) in the signature/closure code?

--
Christopher T. Nguyen
Co-founder & CEO, Adatao
http://adatao.com
linkedin.com/in/ctnguyen

On Mon, Jul 21, 2014 at 2:10 PM, Neil Ferguson nfergu...@gmail.com wrote:

Hi all

I have been adding some metrics to the ADAM project https://github.com/bigdatagenomics/adam, which runs on Spark, and have a proposal for an enhancement to Spark that would make this work cleaner and easier.
I need to pass some Accumulators around, which will aggregate metrics (timing stats and other metrics) across the cluster. However, it is cumbersome to have to explicitly pass some context containing these accumulators everywhere that might need them. I can use Scala implicits, which help slightly, but I'd still need to modify every method in the call stack to take an implicit variable.

So, I'd like to propose that we add the ability to have dynamic variables (basically thread-local variables) in Spark. This would avoid having to pass the Accumulators around explicitly. My proposed approach is to add a method to the SparkContext class as follows:

    /**
     * Sets the value of a dynamic variable. This value is made available to jobs
     * without having to be passed around explicitly. During execution of a Spark job
     * this value can be obtained from the [[SparkDynamic]] object.
     */
    def setDynamicVariableValue(value: Any)

Then, when a job is executing, the SparkDynamic can be accessed to obtain the value of the dynamic variable. The implementation of this object is as follows:

    object SparkDynamic {

      private val dynamicVariable = new DynamicVariable[Any]()

      /**
       * Gets the value of the dynamic variable that has been set in the [[SparkContext]]
       */
      def getValue: Option[Any] = {
        Option(dynamicVariable.value)
      }

      private[spark] def withValue[S](threadValue: Option[Any])(thunk: => S): S = {
        dynamicVariable.withValue(threadValue.orNull)(thunk)
      }
    }

The change involves modifying the Task object to serialize the value of the dynamic variable, and modifying the TaskRunner class to deserialize the value and make it available in the thread that is running the task (using the SparkDynamic.withValue method). I have done a quick prototype of this in this commit: https://github.com/nfergu/spark/commit/8be28d878f43ad6c49f892764011ae7d273dcea6 and it seems to work fine in my (limited) testing. It needs more testing, tidy-up and documentation though.
One drawback is that the dynamic variable will be serialized for every Task whether it needs it or not. For my use case this might not be too much of a problem, as serializing and deserializing Accumulators looks fairly lightweight -- however, we should certainly warn users against setting a dynamic variable containing lots of data. I thought about using broadcast variables here, but I don't think it's possible to put Accumulators in a broadcast variable (as I understand it, broadcasts are intended for purely read-only data).

What do people think about this proposal? My use case aside, it seems like it would be a generally useful enhancement to be able to pass certain data around without having to explicitly pass it everywhere.

Neil
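For readers who haven't used it, the scala.util.DynamicVariable underpinning this proposal has simple semantics: withValue binds a value for the dynamic extent of a thunk, on the current thread only, and restores the previous binding when the thunk returns. A minimal Spark-free sketch (names here are illustrative, not taken from Neil's patch):

```scala
import scala.util.DynamicVariable

object DynamicDemo {
  // A slot holding an optional, thread-locally bound metrics context.
  val metricsContext = new DynamicVariable[Option[String]](None)

  def currentContext: Option[String] = metricsContext.value

  def main(args: Array[String]): Unit = {
    println(currentContext) // no binding outside withValue
    metricsContext.withValue(Some("job-42")) {
      // Any code called from here, however deep in the call stack,
      // sees the binding without it being passed as a parameter.
      println(currentContext)
    }
    println(currentContext) // previous binding restored
  }
}
```

This thread-locality is exactly why the proposal has to re-establish the binding inside TaskRunner: a value set on a driver thread is invisible on executor threads unless it is shipped with the Task and rebound there.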
Re: Optiq for SparkSQL?
Yan, it looks like Julian did anticipate exactly this possibility: https://github.com/julianhyde/optiq/tree/master/spark

Optiq is a cool project vision in terms of hiding various engines behind one consistent API. That said, from just the Spark perspective, I don't see a huge value-add in layering Optiq above SparkSQL---until and unless Optiq provides a lot more idioms and/or operational facilities than just making Spark RDDs look like tables, which SparkSQL already does quite nicely and increasingly. Warehousing, perhaps?

Here, I can't avoid a mention of DDF, which aims to add more algorithmic and data-manipulation value on top of the table abstraction ( https://spark-summit.org/2014/talk/distributed-dataframe-ddf-on-apache-spark-simplifying-big-data-for-the-rest-of-us )

--
Christopher T. Nguyen
Co-founder & CEO, Adatao
http://adatao.com
linkedin.com/in/ctnguyen

On Fri, Jun 6, 2014 at 11:26 AM, Yan Zhou.sc yan.zhou...@huawei.com wrote:

Can anybody share your thoughts/comments/interests on the applicability of the Optiq framework to Spark, and SparkSQL in particular?

Thanks,
Re: Announcing Spark 1.0.0
Awesome work, Pat et al.!

--
Christopher T. Nguyen
Co-founder & CEO, Adatao
http://adatao.com
linkedin.com/in/ctnguyen

On Fri, May 30, 2014 at 3:12 AM, Patrick Wendell pwend...@gmail.com wrote:

I'm thrilled to announce the availability of Spark 1.0.0!

Spark 1.0.0 is a milestone release as the first in the 1.0 line of releases, providing API stability for Spark's core interfaces. Spark 1.0.0 is Spark's largest release ever, with contributions from 117 developers. I'd like to thank everyone involved in this release - it was truly a community effort with fixes, features, and optimizations contributed from dozens of organizations.

This release expands Spark's standard libraries, introducing a new SQL package (SparkSQL) which lets users integrate SQL queries into existing Spark workflows. MLlib, Spark's machine learning library, is expanded with sparse vector support and several new algorithms. The GraphX and Streaming libraries also introduce new features and optimizations. Spark's core engine adds support for secured YARN clusters, a unified tool for submitting Spark applications, and several performance and stability improvements. Finally, Spark adds support for Java 8 lambda syntax and improves coverage of the Java and Python APIs.

Those features only scratch the surface - check out the release notes here: http://spark.apache.org/releases/spark-release-1-0-0.html

Note that since release artifacts were posted recently, certain mirrors may not have working downloads for a few hours.

- Patrick
Re: LogisticRegression: Predicting continuous outcomes
Bharath, (apologies if you're already familiar with the theory) the proposed approach may or may not be appropriate depending on the overall transfer function in your data. In general, a single logistic regressor cannot approximate arbitrary non-linear functions (of linear combinations of the inputs). You can review works by, e.g., Hornik and Cybenko from the late 80's to see if you need something more, such as a simple, one-hidden-layer neural network. This is a good summary: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.101.2647&rep=rep1&type=pdf

--
Christopher T. Nguyen
Co-founder & CEO, Adatao
http://adatao.com
linkedin.com/in/ctnguyen

On Wed, May 28, 2014 at 11:18 AM, Bharath Ravi Kumar reachb...@gmail.com wrote:

I'm looking to reuse the LogisticRegression model (with SGD) to predict a real-valued outcome variable. (I understand that logistic regression is generally applied to predict binary outcomes, but for various reasons this model suits our needs better than LinearRegression.) Related to that, I have the following questions:

1) Can the current LogisticRegression model be used as-is to train on binary input (i.e. explanatory) features, or is there an assumption that the explanatory features must be continuous?

2) I intend to reuse the current class to train a model on LabeledPoints where the label is a real value (and not 0/1). I'd like to know whether invoking setValidateData(false) would suffice or whether one must override the validator to achieve this.

3) I recall seeing an experimental method on the class ( https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala ) that clears the threshold separating positive & negative predictions. Once the model is trained on real-valued labels, would clearing this flag suffice to predict an outcome that is continuous in nature?

Thanks,
Bharath

P.S.: I'm writing to dev@ and not user@ assuming that lib changes might be necessary.
Apologies if the mailing list is incorrect.
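Regarding question (3), the experimental method Bharath refers to is clearThreshold. A hedged sketch of its effect, against the MLlib API of the 1.0 line and assuming an existing SparkContext `sc` (the toy data is invented). Note that even with the threshold cleared, the logistic link squashes the raw output into (0, 1), which is why Christopher's caveat about the transfer function matters for truly unbounded outcomes:

```scala
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val data = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(0.0, 1.0)),
  LabeledPoint(1.0, Vectors.dense(1.0, 0.0))))

val model = LogisticRegressionWithSGD.train(data, 100)

// With the default threshold set, predict returns 0.0 or 1.0.
// After clearThreshold, predict returns the raw sigmoid score,
// a real value in (0, 1) rather than an arbitrary real number.
model.clearThreshold()
val score = model.predict(Vectors.dense(0.5, 0.5))
```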
Re: can RDD be shared across mutil spark applications?
Qing Yang, Andy is correct in answering your direct question. At the same time, depending on your context, you may be able to apply a pattern where you turn the single Spark application into a service, and multiple clients of that service can indeed share access to the same RDDs. Several groups have built apps based on this pattern, and we will also show something with this behavior at the upcoming Spark Summit (multiple users collaborating on named DDFs with the same underlying RDDs).

Sent while mobile. Pls excuse typos etc.

On May 18, 2014 9:40 AM, Andy Konwinski andykonwin...@gmail.com wrote:

RDDs cannot currently be shared across multiple SparkContexts without using something like the Tachyon project (which is a separate project/codebase).

Andy

On May 16, 2014 2:14 PM, qingyang li liqingyang1...@gmail.com wrote:
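A minimal sketch of the service pattern Christopher describes: one long-lived application owns the single SparkContext and serves many clients from a registry of named, cached RDDs. All names here are invented for illustration; a real service would sit behind an RPC or REST front end:

```scala
import scala.collection.concurrent.TrieMap

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// One JVM, one SparkContext, many client threads. SparkContext is
// safe to use from multiple threads, so concurrent clients can run
// jobs that all hit the same cached RDDs.
class RddService(sc: SparkContext) {
  private val registry = TrieMap.empty[String, RDD[_]]

  def register(name: String, rdd: RDD[_]): Unit =
    registry.put(name, rdd.cache())

  def lookup(name: String): Option[RDD[_]] = registry.get(name)
}
```

What this pattern cannot do is hand an RDD handle to a second SparkContext in another JVM; for that, Andy's pointer to Tachyon (sharing the data, not the RDD object) is the relevant route.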