Re: A proposal for Spark 2.0

2015-11-25 Thread Sandy Ryza
s, I think it's better > to make the other small changes in 2.0 at the same time than to update once > for Dataset and another time for 2.0. > > BTW just refer to Reynold's original post for the other proposed API > changes. > > Matei > > On Nov 24, 2015, at 1

Re: A proposal for Spark 2.0

2015-11-24 Thread Sandy Ryza
I think that Kostas' logic still holds. The majority of Spark users, and likely an even vaster majority of people running Spark jobs, are still on RDDs and on the cusp of upgrading to DataFrames. Users will probably want to upgrade to the stable version of the Dataset / DataFrame API so they don

Re: Dropping support for earlier Hadoop versions in Spark 2.0?

2015-11-20 Thread Sandy Ryza
To answer your fourth question from Cloudera's perspective, we would never support a customer running Spark 2.0 on a Hadoop version < 2.6. -Sandy On Fri, Nov 20, 2015 at 1:39 PM, Reynold Xin wrote: > OK I'm not exactly asking for a vote here :) > > I don't think we should look at it from only m

Re: A proposal for Spark 2.0

2015-11-10 Thread Sandy Ryza
Oh and another question - should Spark 2.0 support Java 7? On Tue, Nov 10, 2015 at 4:53 PM, Sandy Ryza wrote: > Another +1 to Reynold's proposal. > > Maybe this is obvious, but I'd like to advocate against a blanket removal > of deprecated / developer APIs. Many APIs

Re: A proposal for Spark 2.0

2015-11-10 Thread Sandy Ryza
Another +1 to Reynold's proposal. Maybe this is obvious, but I'd like to advocate against a blanket removal of deprecated / developer APIs. Many APIs can likely be removed without material impact (e.g. the SparkContext constructor that takes preferred node location data), while others likely see

Re: Info about Dataset

2015-11-03 Thread Sandy Ryza
Hi Justin, The Dataset API proposal is available here: https://issues.apache.org/jira/browse/SPARK-. -Sandy On Tue, Nov 3, 2015 at 1:41 PM, Justin Uang wrote: > Hi, > > I was looking through some of the PRs slated for 1.6.0 and I noted > something called a Dataset, which looks like a new c

Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

2015-08-30 Thread Sandy Ryza
+1 (non-binding) built from source and ran some jobs against YARN -Sandy On Sat, Aug 29, 2015 at 5:50 AM, vaquar khan wrote: > > +1 (1.5.0 RC2)Compiled on Windows with YARN. > > Regards, > Vaquar khan > +1 (non-binding, of course) > > 1. Compiled OSX 10.10 (Yosemite) OK Total time: 42:36 min >

Re: [VOTE] Release Apache Spark 1.5.0 (RC1)

2015-08-24 Thread Sandy Ryza
I see that there's an 1.5.0-rc2 tag in github now. Is that the official RC2 tag to start trying out? -Sandy On Mon, Aug 24, 2015 at 7:23 AM, Sean Owen wrote: > PS Shixiong Zhu is correct that this one has to be fixed: > https://issues.apache.org/jira/browse/SPARK-10168 > > For example you can

Re: [VOTE] Release Apache Spark 1.5.0 (RC1)

2015-08-24 Thread Sandy Ryza
Cool, thanks! On Mon, Aug 24, 2015 at 2:07 PM, Reynold Xin wrote: > Nope --- I cut that last Friday but had an error. I will remove it and cut > a new one. > > > On Mon, Aug 24, 2015 at 2:06 PM, Sandy Ryza > wrote: > >> I see that there's an 1.5.0-rc2 tag in g

Re: Developer API & plugins for Hive & Hadoop ?

2015-08-13 Thread Sandy Ryza
Hi Tom, Not sure how much this helps, but are you aware that you can build Spark with the -Phadoop-provided profile to avoid packaging Hadoop dependencies in the assembly jar? -Sandy On Fri, Aug 14, 2015 at 6:08 AM, Thomas Dudziak wrote: > Unfortunately it doesn't because our version of Hive h

Re: Compact RDD representation

2015-07-19 Thread Sandy Ryza
rouped > one > > 2015-07-19 21:26 GMT+03:00 Sandy Ryza : > >> The user gets to choose what they want to reside in memory. If they call >> rdd.cache() on the original RDD, it will be in memory. If they call >> rdd.cache() on the compact RDD, it will be in memory.

Re: Compact RDD representation

2015-07-19 Thread Sandy Ryza
09 AM, Сергей Лихоман wrote: > Thanks for answer! Could you please answer for one more question? Will we > have in memory original rdd and grouped rdd in the same time? > > 2015-07-19 21:04 GMT+03:00 Sandy Ryza : > >> Edit: the first line should read: >> >> val gr

Re: Compact RDD representation

2015-07-19 Thread Sandy Ryza
This functionality already basically exists in Spark. To create the "grouped RDD", one can run: val groupedRdd = rdd.reduceByKey(_ + _) To get it back into the original form: groupedRdd.flatMap(x => List.fill(x._1)(x._2)) -Sandy On Sun, Jul 19, 2015 at 10:40 AM, Сергей Лихоман wr

Re: Compact RDD representation

2015-07-19 Thread Sandy Ryza
Edit: the first line should read: val groupedRdd = rdd.map((_, 1)).reduceByKey(_ + _) On Sun, Jul 19, 2015 at 11:02 AM, Sandy Ryza wrote: > This functionality already basically exists in Spark. To create the > "grouped RDD", one can run: > > val groupedRdd = rdd.red
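The two snippets above (with the correction applied, so the grouped RDD holds (value, count) pairs) amount to counting duplicates and expanding them back out. A plain-Python sketch of the same logic, as an illustration rather than Spark API (note the expansion repeats each value `count` times, i.e. the fill count is the pair's second field):

```python
from collections import Counter

# Stand-in for an RDD of repeated values.
values = ["a", "b", "a", "a", "b"]

# Analogue of rdd.map((_, 1)).reduceByKey(_ + _):
# pair each element with 1, then sum the 1s per distinct value.
grouped = list(Counter(values).items())

# Inverse transformation: expand each (value, count) pair back into
# `count` copies of `value`, mirroring the flatMap with List.fill.
restored = [v for v, n in grouped for _ in range(n)]

assert sorted(restored) == sorted(values)
```

The round trip preserves the multiset of values, though not their original order.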

Re: [discuss] Removing individual commit messages from the squash commit message

2015-07-19 Thread Sandy Ryza
+1 On Sat, Jul 18, 2015 at 4:00 PM, Mridul Muralidharan wrote: > Thanks for detailing, definitely sounds better. > +1 > > Regards > Mridul > > On Saturday, July 18, 2015, Reynold Xin wrote: > >> A single commit message consisting of: >> >> 1. Pull request title (which includes JIRA number and c

Re: How to Read Excel file in Spark 1.4

2015-07-13 Thread Sandy Ryza
Hi Su, Spark can't read Excel files directly. Your best bet is probably to export the contents as a CSV and use the "csvFile" API. -Sandy On Mon, Jul 13, 2015 at 9:22 AM, spark user wrote: > Hi > > I need your help to save excel data in hive . > > >1. how to read excel file in spark usin

Re: External Shuffle service over yarn

2015-06-25 Thread Sandy Ryza
Hi Yash, One of the main advantages is that, with dynamic allocation turned on, your application can still read the shuffle data written by executors that have since been discarded. -Sandy On Thu, Jun 25, 2015 at 11:08 PM, yash datta wrote: > Hi devs, > > Can someone point out if there a

Re: Increase partition count (repartition) without shuffle

2015-06-18 Thread Sandy Ryza
Hi Alexander, There is currently no way to create an RDD with more partitions than its parent RDD without causing a shuffle. However, if the files are splittable, you can set the Hadoop configurations that control split size to something smaller so that the HadoopRDD ends up with more partitions.
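As a sketch of the workaround described above: Spark forwards any property prefixed with "spark.hadoop." into the Hadoop Configuration that HadoopRDD uses, so the split size can be capped from spark-defaults.conf. The value below is illustrative, not a recommendation:

```
# Cap input splits at 32 MB so HadoopRDD produces more partitions.
# (Hadoop 2.x property name; the older mapred API used mapred.max.split.size.)
spark.hadoop.mapreduce.input.fileinputformat.split.maxsize   33554432
```

Smaller splits mean more partitions, at the cost of more per-task overhead.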

Re: [SparkScore] Performance portal for Apache Spark

2015-06-17 Thread Sandy Ryza
This looks really awesome. On Tue, Jun 16, 2015 at 10:27 AM, Huang, Jie wrote: > Hi All > > We are happy to announce Performance portal for Apache Spark > http://01org.github.io/sparkscore/ ! > > The Performance Portal for Apache Spark provides performance data on the > Spark upstream to the com

Re: [VOTE] Release Apache Spark 1.4.0 (RC4)

2015-06-05 Thread Sandy Ryza
+1 (non-binding) Built from source and ran some jobs against a pseudo-distributed YARN cluster. -Sandy On Fri, Jun 5, 2015 at 11:05 AM, Ram Sriharsha wrote: > +1 , tested with hadoop 2.6/ yarn on centos 6.5 after building w/ -Pyarn > -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive -Phive-thriftse

Re: [VOTE] Release Apache Spark 1.4.0 (RC3)

2015-05-31 Thread Sandy Ryza
+1 (non-binding) Launched against a pseudo-distributed YARN cluster running Hadoop 2.6.0 and ran some jobs. -Sandy On Sat, May 30, 2015 at 3:44 PM, Krishna Sankar wrote: > +1 (non-binding, of course) > > 1. Compiled OSX 10.10 (Yosemite) OK Total time: 17:07 min > mvn clean package -Pyarn

Re: YARN mode startup takes too long (10+ secs)

2015-05-11 Thread Sandy Ryza
Wow, I hadn't noticed this, but 5 seconds is really long. It's true that it's configurable, but I think we need to provide a decent out-of-the-box experience. For comparison, the MapReduce equivalent is 1 second. I filed https://issues.apache.org/jira/browse/SPARK-7533 for this. -Sandy On Mon,

Re: Regarding KryoSerialization in Spark

2015-04-30 Thread Sandy Ryza
Hi Twinkle, Registering the class makes it so that writeClass only writes out a couple bytes, instead of a full String of the class name. -Sandy On Thu, Apr 30, 2015 at 4:13 AM, twinkle sachdeva < twinkle.sachd...@gmail.com> wrote: > Hi, > > As per the code, KryoSerialization used writeClassAnd

Re: Using memory mapped file for shuffle

2015-04-29 Thread Sandy Ryza
, I need to store it as a byte buffer. I want to make > sure this will not cause OOM when the file size is large. > > > -- > Kannan > > On Tue, Apr 14, 2015 at 9:07 AM, Sandy Ryza > wrote: > >> Hi Kannan, >> >> Both in MapReduce and Spark, the amount

Re: Design docs: consolidation and discoverability

2015-04-27 Thread Sandy Ryza
>>>>> aren't > >>>>> committers want to contribute changes (as opposed to just comments)? > >>>>> > >>>>> On Fri, Apr 24, 2015 at 2:57 PM, Sean Owen > wrote: > >>>>> > >>>>>> On

Re: Design docs: consolidation and discoverability

2015-04-24 Thread Sandy Ryza
I think there are maybe two separate things we're talking about? 1. Design discussions and in-progress design docs. My two cents are that JIRA is the best place for this. It allows tracking the progression of a design across multiple PRs and contributors. A piece of useful feedback that I've go

Re: Should we let everyone set Assignee?

2015-04-22 Thread Sandy Ryza
I think one of the benefits of assignee fields that I've seen in other projects is their potential to coordinate and prevent duplicate work. It's really frustrating to put a lot of work into a patch and then find out that someone has been doing the same. It's helpful for the project etiquette to

Re: Using memory mapped file for shuffle

2015-04-14 Thread Sandy Ryza
Hi Kannan, Both in MapReduce and Spark, the amount of shuffle data a task produces can exceed the task's memory without risk of OOM. -Sandy On Tue, Apr 14, 2015 at 6:47 AM, Imran Rashid wrote: > That limit doesn't have anything to do with the amount of available > memory. Its just a tuning par

Re: [VOTE] Release Apache Spark 1.3.1 (RC2)

2015-04-08 Thread Sandy Ryza
+1 Built against Hadoop 2.6 and ran some jobs against a pseudo-distributed YARN cluster. -Sandy On Wed, Apr 8, 2015 at 12:49 PM, Patrick Wendell wrote: > Oh I see - ah okay I'm guessing it was a transient build error and > I'll get it posted ASAP. > > On Wed, Apr 8, 2015 at 3:41 PM, Denny Lee

Re: RDD.count

2015-03-28 Thread Sandy Ryza
I definitely see the value in this. However, I think at this point it would be an incompatible behavioral change. People often use count in Spark to exercise their DAG. Omitting processing steps that were previously included would likely mislead many users into thinking their pipeline was runnin

Re: hadoop input/output format advanced control

2015-03-25 Thread Sandy Ryza
Regarding Patrick's question, you can just do "new Configuration(oldConf)" to get a cloned Configuration object and add any new properties to it. -Sandy On Wed, Mar 25, 2015 at 4:42 PM, Imran Rashid wrote: > Hi Nick, > > I don't remember the exact details of these scenarios, but I think the use
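The pattern here is to clone the shared configuration before adding job-specific properties, so the original object is never mutated. The Hadoop call itself is Java (`new Configuration(oldConf)`); this plain-Python sketch just illustrates the idiom, with made-up property names:

```python
# Shared, read-only base configuration (illustrative keys).
shared_conf = {"fs.defaultFS": "hdfs://nn:8020"}

# Analogue of `new Configuration(oldConf)`: copy first, then customize
# the copy so other jobs using shared_conf are unaffected.
job_conf = dict(shared_conf)
job_conf["mapreduce.input.fileinputformat.split.maxsize"] = "33554432"

assert "mapreduce.input.fileinputformat.split.maxsize" not in shared_conf
```

The same copy-then-customize pattern applies to any mutable config object handed out to multiple consumers.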

Re: Spark Executor resources

2015-03-24 Thread Sandy Ryza
08 > > address: Hungary, 2475 Kápolnásnyék, Kossuth 6/a > > elte: HSKSJZ (ZVZOAAI.ELTE) > > 2015-03-24 16:30 GMT+01:00 Sandy Ryza : > >> Hi Zoltan, >> >> If running on YARN, the YARN NodeManager starts executors. I don't think >> there's a 100% prec

Re: Spark Executor resources

2015-03-24 Thread Sandy Ryza
Hi Zoltan, If running on YARN, the YARN NodeManager starts executors. I don't think there's a 100% precise way for a Spark executor to know how many resources are allotted to it. It can come close by looking at the Spark configuration options used to request it (spark.executor.memory and s

Re: Directly broadcasting (sort of) RDDs

2015-03-22 Thread Sandy Ryza
Hi Guillaume, I've long thought something like this would be useful - i.e. the ability to broadcast RDDs directly without first pulling data through the driver. If I understand correctly, your requirement to "block" a matrix up and only fetch the needed parts could be implemented on top of this b

Re: [VOTE] Release Apache Spark 1.3.0 (RC3)

2015-03-08 Thread Sandy Ryza
+1 (non-binding, doc and packaging issues aside) Built from source, ran jobs and spark-shell against a pseudo-distributed YARN cluster. On Sun, Mar 8, 2015 at 2:42 PM, Krishna Sankar wrote: > Yep, otherwise this will become an N^2 problem - Scala versions X Hadoop > Distributions X ... > > May

Re: multi-line comment style

2015-02-09 Thread Sandy Ryza
+1 to what Andrew said, I think both make sense in different situations and trusting developer discretion here is reasonable. On Mon, Feb 9, 2015 at 1:48 PM, Andrew Or wrote: > In my experience I find it much more natural to use // for short multi-line > comments (2 or 3 lines), and /* */ for lo

Re: Improving metadata in Spark JIRA

2015-02-06 Thread Sandy Ryza
JIRA updates don't go to this list, they go to iss...@spark.apache.org. I don't think many are signed up for that list, and those that are probably have a flood of emails anyway. So I'd definitely be in favor of any JIRA cleanup that you're up for. -Sandy On Fri, Feb 6, 2015 at 6:45 AM, Sean Ow

Re: renaming SchemaRDD -> DataFrame

2015-01-26 Thread Sandy Ryza
Both SchemaRDD and DataFrame sound fine to me, though I like the former slightly better because it's more descriptive. Even if SchemaRDD needs to rely on Spark SQL under the covers, it would be more clear from a user-facing perspective to at least choose a package name for it that omits "sql".

Re: Issue with repartition and cache

2015-01-21 Thread Sandy Ryza
Hi Dirceu, Does the issue not show up if you run "map(f => f(1).asInstanceOf[Int]).sum" on the "train" RDD? It appears that f(1) is a String, not an Int. If you're looking to parse and convert it, "toInt" should be used instead of "asInstanceOf". -Sandy On Wed, Jan 21, 2015 at 8:43 AM, Dirceu
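The distinction drawn here is between casting (asserting a value already has a type, which fails when it is really a String) and parsing (converting the text into a number). The same distinction in plain Python, as an illustration rather than Spark code, with made-up row data:

```python
# Second field arrives as text, e.g. parsed from a CSV line.
rows = [["u1", "3"], ["u2", "7"]]

# Parsing converts the text to a number -- the analogue of Scala's "3".toInt.
total = sum(int(row[1]) for row in rows)
assert total == 10

# Merely asserting the type (the analogue of asInstanceOf[Int]) fails,
# because the value really is a string.
try:
    bad = sum(row[1] for row in rows)  # 0 + "3" raises TypeError
except TypeError:
    bad = None
```

In Scala the failure surfaces as a ClassCastException at runtime rather than a TypeError, but the cause is the same: a cast never converts the underlying value.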

Re: Semantics of LGTM

2015-01-17 Thread sandy . ryza
Yeah, the ASF +1 has become partly overloaded to mean both "I would like to see this feature" and "this patch should be committed", although, at least in Hadoop, using +1 on JIRA (as opposed to, say, in a release vote) should unambiguously mean the latter unless qualified in some other way. I d

Re: Semantics of LGTM

2015-01-17 Thread sandy . ryza
I think clarifying these semantics is definitely worthwhile. Maybe this complicates the process with additional terminology, but the way I've used these has been: +1 - I think this is safe to merge and, barring objections from others, would merge it immediately. LGTM - I have no concerns about

Re: Spark Dev

2014-12-19 Thread Sandy Ryza
Hi Harikrishna, A good place to start is taking a look at the wiki page on contributing: https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark -Sandy On Fri, Dec 19, 2014 at 2:43 PM, Harikrishna Kamepalli < harikrishna.kamepa...@gmail.com> wrote: > > i am interested to contribu

Re: one hot encoding

2014-12-13 Thread Sandy Ryza
Hi Lochana, We haven't yet added this in 1.2. https://issues.apache.org/jira/browse/SPARK-4081 tracks adding categorical feature indexing, which one-hot encoding can be built on. https://issues.apache.org/jira/browse/SPARK-1216 also tracks a version of this prior to the ML pipelines work. -Sandy
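Until this lands in MLlib, one-hot encoding is simple enough to do by hand: index the distinct categories, then map each value to a vector with a single 1 at its category's index. A minimal Python sketch (illustrative, not the eventual MLlib API):

```python
# Toy categorical feature column.
categories = ["red", "green", "blue", "green"]

# Assign each distinct category a stable index.
index = {c: i for i, c in enumerate(sorted(set(categories)))}

def one_hot(value):
    # Vector of zeros with a single 1 at the category's index.
    vec = [0] * len(index)
    vec[index[value]] = 1
    return vec

encoded = [one_hot(c) for c in categories]
```

With the sorted index (blue=0, green=1, red=2), "red" encodes to [0, 0, 1]. For high-cardinality features a sparse representation is preferable to these dense lists.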

Re: [VOTE] Release Apache Spark 1.2.0 (RC2)

2014-12-11 Thread Sandy Ryza
+1 (non-binding). Tested on Ubuntu against YARN. On Thu, Dec 11, 2014 at 9:38 AM, Reynold Xin wrote: > +1 > > Tested on OS X. > > On Wednesday, December 10, 2014, Patrick Wendell > wrote: > > > Please vote on releasing the following candidate as Apache Spark version > > 1.2.0! > > > > The tag

Re: HA support for Spark

2014-12-10 Thread Sandy Ryza
I think that if we were able to maintain the full set of created RDDs as well as some scheduler and block manager state, it would be enough for most apps to recover. On Wed, Dec 10, 2014 at 5:30 AM, Jun Feng Liu wrote: > Well, it should not be mission impossible thinking there are so many HA > s

Re: [VOTE] Release Apache Spark 1.2.0 (RC1)

2014-12-01 Thread Sandy Ryza
+1 (non-binding) built from source fired up a spark-shell against YARN cluster ran some jobs using parallelize ran some jobs that read files clicked around the web UI On Sun, Nov 30, 2014 at 1:10 AM, GuoQiang Li wrote: > +1 (non-binding‍) > > > > > -- Original -

Re: Too many open files error

2014-11-19 Thread Sandy Ryza
Quizhang, This is a known issue that ExternalAppendOnlyMap can do tons of tiny spills in certain situations. SPARK-4452 aims to deal with this issue, but we haven't finalized a solution yet. Dinesh's solution should help as a workaround, but you'll likely experience suboptimal performance when tr

Re: Spark & Hadoop 2.5.1

2014-11-14 Thread sandy . ryza
You're the second person to request this today. Planning to include this in my PR for SPARK-4338. -Sandy > On Nov 14, 2014, at 8:48 AM, Corey Nolet wrote: > > In the past, I've built it by providing -Dhadoop.version=2.5.1 exactly like > you've mentioned. What prompted me to write this email wa

Re: [NOTICE] [BUILD] Minor changes to Spark's build

2014-11-13 Thread Sandy Ryza
; that has scala-2.x profiles). > > > > On Wed, Nov 12, 2014 at 11:14 PM, Patrick Wendell > wrote: > > I actually do agree with this - let's see if we can find a solution > > that doesn't regress this behavior. Maybe we can simply move the one > > kafka exam

Re: [NOTICE] [BUILD] Minor changes to Spark's build

2014-11-12 Thread Sandy Ryza
Currently there are no mandatory profiles required to build Spark. I.e. "mvn package" just works. It seems sad that we would need to break this. On Wed, Nov 12, 2014 at 10:59 PM, Patrick Wendell wrote: > I think printing an error that says "-Pscala-2.10 must be enabled" is > probably okay. It'

Re: proposal / discuss: multiple Serializers within a SparkContext?

2014-11-08 Thread Sandy Ryza
I've seen Matei suggesting on jira > issues > > (or github) in the past a "storage policy" in which you can specify how > > data should be stored. I think that would be a great API to have in the > > long run. Designing it won't be trivial though. >

proposal / discuss: multiple Serializers within a SparkContext?

2014-11-07 Thread Sandy Ryza
Hey all, Was messing around with Spark and Google FlatBuffers for fun, and it got me thinking about Spark and serialization. I know there's been work / talk about in-memory columnar formats for Spark SQL, so maybe there are ways to provide this flexibility already that I've missed? Either way, my th

Re: [VOTE] Designating maintainers for some Spark components

2014-11-06 Thread Sandy Ryza
It looks like the difference between the proposed Spark model and the CloudStack / SVN model is: * In the former, maintainers / partial committers are a way of centralizing oversight over particular components among committers * In the latter, maintainers / partial committers are a way of giving no

Re: [VOTE] Designating maintainers for some Spark components

2014-11-05 Thread Sandy Ryza
This seems like a good idea. An area that wasn't listed, but that I think could strongly benefit from maintainers, is the build. Having consistent oversight over Maven, SBT, and dependencies would allow us to avoid subtle breakages. Component maintainers have come up several times within the Had

Re: spark_classpath in core/pom.xml and yarn/pom.xml

2014-09-25 Thread Sandy Ryza
Hi Ye, I think git blame shows me because I fixed the formatting in core/pom.xml, but I don't actually know the original reason for setting SPARK_CLASSPATH there. Do the tests run OK if you take it out? -Sandy On Thu, Sep 25, 2014 at 1:59 AM, Ye Xianjin wrote: > hi, Sandy Ryza:

Re: A couple questions about shared variables

2014-09-23 Thread Sandy Ryza
Filed https://issues.apache.org/jira/browse/SPARK-3642 for documenting these nuances. -Sandy On Mon, Sep 22, 2014 at 10:36 AM, Nan Zhu wrote: > I see, thanks for pointing this out > > > -- > Nan Zhu > > On Monday, September 22, 2014 at 12:08 PM, Sandy Ryza wrote: > >

Re: A couple questions about shared variables

2014-09-22 Thread Sandy Ryza
) > > Best, > > -- > Nan Zhu > > On Sunday, September 21, 2014 at 1:10 AM, Matei Zaharia wrote: > > Hey Sandy, > > On September 20, 2014 at 8:50:54 AM, Sandy Ryza (sandy.r...@cloudera.com) > wrote: > > Hey All, > > A couple questions came up about sh

Re: hash vs sort shuffle

2014-09-22 Thread Sandy Ryza
Thanks for the heads up Cody. Any indication of what was going wrong? On Mon, Sep 22, 2014 at 7:16 AM, Cody Koeninger wrote: > Just as a heads up, we deployed 471e6a3a of master (in order to get some > sql fixes), and were seeing jobs fail until we set > > spark.shuffle.manager=HASH > > I'd be

A couple questions about shared variables

2014-09-20 Thread Sandy Ryza
Hey All, A couple questions came up about shared variables recently, and I wanted to confirm my understanding and update the doc to be a little more clear. *Broadcast variables* Now that task data is automatically broadcast, the only occasions where it makes sense to explicitly broadcast are: *

Re: Spark authenticate enablement

2014-09-12 Thread Sandy Ryza
Hi Jun, I believe that's correct that Spark authentication only works against YARN. -Sandy On Thu, Sep 11, 2014 at 2:14 AM, Jun Feng Liu wrote: > Hi, there > > I am trying to enable authentication on Spark in standalone mode. > Seems like only SparkSubmit loads the properties from spark-d

Re: Reporting serialized task size after task broadcast change?

2014-09-11 Thread Sandy Ryza
Hmm, well I can't find it now, must have been hallucinating. Do you know off the top of your head where I'd be able to find the size to log it? On Thu, Sep 11, 2014 at 6:33 PM, Reynold Xin wrote: > I didn't know about that > > On Thu, Sep 11, 2014 at 6:29 PM, Sand

Re: Reporting serialized task size after task broadcast change?

2014-09-11 Thread Sandy Ryza
It used to be available on the UI, no? On Thu, Sep 11, 2014 at 6:26 PM, Reynold Xin wrote: > I don't think so. We should probably add a line to log it. > > > On Thursday, September 11, 2014, Sandy Ryza > wrote: > >> After the change to broadcast all task d

Reporting serialized task size after task broadcast change?

2014-09-11 Thread Sandy Ryza
After the change to broadcast all task data, is there any easy way to discover the serialized size of the data getting sent down for a task? thanks, -Sandy

Re: Lost executor on YARN ALS iterations

2014-09-10 Thread Sandy Ryza
That's right On Tue, Sep 9, 2014 at 2:04 PM, Debasish Das wrote: > Last time it did not show up on environment tab but I will give it another > shot...Expected behavior is that this env variable will show up right ? > > On Tue, Sep 9, 2014 at 12:15 PM, Sandy Ryza > wrote: &

Re: Lost executor on YARN ALS iterations

2014-09-09 Thread Sandy Ryza
ices with say few billion ratings... > > On Tue, Sep 9, 2014 at 10:49 AM, Sandy Ryza > wrote: > >> Hi Deb, >> >> The current state of the art is to increase >> spark.yarn.executor.memoryOverhead until the job stops failing. We do have >> plans to try to automatical

Re: Lost executor on YARN ALS iterations

2014-09-09 Thread Sandy Ryza
Hi Deb, The current state of the art is to increase spark.yarn.executor.memoryOverhead until the job stops failing. We do have plans to try to automatically scale this based on the amount of memory requested, but it will still just be a heuristic. -Sandy On Tue, Sep 9, 2014 at 7:32 AM, Debasish

Re: about spark assembly jar

2014-09-02 Thread Sandy Ryza
This doesn't help for every dependency, but Spark provides an option to build the assembly jar without Hadoop and its dependencies. We make use of this in CDH packaging. -Sandy On Tue, Sep 2, 2014 at 2:12 AM, scwf wrote: > Hi sean owen, > here are some problems when i used assembly jar > 1 i

Re: Lost executor on YARN ALS iterations

2014-08-20 Thread Sandy Ryza
Hi Debasish, The fix is to raise spark.yarn.executor.memoryOverhead until this goes away. This controls the buffer between the JVM heap size and the amount of memory requested from YARN (JVMs can take up memory beyond their heap size). You should also make sure that, in the YARN NodeManager confi
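A sketch of that fix in spark-defaults.conf (Spark 1.x property names; the overhead value is in MB, and 1024 is illustrative rather than a recommendation -- the right number depends on how much off-heap memory your executors actually use):

```
# Executor JVM heap.
spark.executor.memory                2g
# Extra memory requested from YARN beyond the heap, to cover
# off-heap usage; raise this until executors stop being killed.
spark.yarn.executor.memoryOverhead   1024
```

YARN then reserves roughly heap plus overhead per executor container, so the JVM's off-heap allocations no longer push the container past its limit.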

Re: Extra libs for bin/spark-shell - specifically for hbase

2014-08-16 Thread Sandy Ryza
Hi Stephen, Have you tried the --jars option (with jars separated by commas)? It should make the given jars available both to the driver and the executors. I believe one caveat currently is that if you give it a folder it won't pick up all the jars inside. -Sandy On Fri, Aug 15, 2014 at 4:07

Re: spark-shell is broken! (bad option: '--master')

2014-08-08 Thread Sandy Ryza
Hi Chutium, This is currently being addressed in https://github.com/apache/spark/pull/1825 -Sandy On Fri, Aug 8, 2014 at 2:26 PM, chutium wrote: > no one use spark-shell in master branch? > > i created a PR as follow up commit of SPARK-2678 and PR #1801: > > https://github.com/apache/spark/pu

Re: Fine-Grained Scheduler on Yarn

2014-08-08 Thread Sandy Ryza
: 2D barcode - encoded with contact information] *Phone: > *86-10-82452683 > > * E-mail:* *liuj...@cn.ibm.com* > [image: IBM] > > BLD 28,ZGC Software Park > No.8 Rd.Dong Bei Wang West, Dist.Haidian Beijing 100193 > China > > > > > > *Sandy Ryza >* >

Re: Fine-Grained Scheduler on Yarn

2014-08-08 Thread Sandy Ryza
Hi Jun, Spark currently doesn't have that feature, i.e. it aims for a fixed number of executors per application regardless of resource usage, but it's definitely worth considering. We could start more executors when we have a large backlog of tasks and shut some down when we're underutilized. Th

Re: Fraud management system implementation

2014-07-28 Thread Sandy Ryza
+user list bcc: dev list It's definitely possible to implement credit fraud management using Spark. A good start would be using some of the supervised learning algorithms that Spark provides in MLLib (logistic regression or linear SVMs). Spark doesn't have any HMM implementation right now. Sean

Re: setting inputMetrics in HadoopRDD#compute()

2014-07-26 Thread Sandy Ryza
I'm working on a patch that switches this stuff out with the Hadoop FileSystem StatisticsData, which will both give an accurate count and allow us to get metrics while the task is in progress. A hitch is that it relies on https://issues.apache.org/jira/browse/HADOOP-10688, so we still might want a

Re: RFC: Supporting the Scala drop Method for Spark RDDs

2014-07-21 Thread Sandy Ryza
first row for every file, or the header only for > the first file. The former is not really supported out of the box by the > input format I think? > > > On Mon, Jul 21, 2014 at 10:50 PM, Sandy Ryza > wrote: > > > It could make sense to add a skipHeader argument to > Spark

Re: RFC: Supporting the Scala drop Method for Spark RDDs

2014-07-21 Thread Sandy Ryza
It could make sense to add a skipHeader argument to SparkContext.textFile? On Mon, Jul 21, 2014 at 10:37 PM, Reynold Xin wrote: > If the purpose is for dropping csv headers, perhaps we don't really need a > common drop and only one that drops the first line in a file? I'd really > try hard to a

Re: Examples have SparkContext improperly labeled?

2014-07-21 Thread Sandy Ryza
Hi RJ, Spark Shell instantiates a SparkContext for you named "sc". In other apps, the user instantiates it themselves and can give the variable whatever name they want, e.g. "spark". -Sandy On Mon, Jul 21, 2014 at 8:36 AM, RJ Nowling wrote: > Hi all, > > The examples listed here > > https://sp

Re: Possible bug in ClientBase.scala?

2014-07-17 Thread Sandy Ryza
ror] import org.apache.hadoop.yarn.api.records.impl.pb.ProtoUtils > >>> > >>> [error]^ > >>> > >>> [error] > >>> > /Users/chester/projects/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnable.scala:33: >

Re: better compression codecs for shuffle blocks?

2014-07-14 Thread Sandy Ryza
Stephen, Often the shuffle is bound by writes to disk, so even if disks have enough space to store the uncompressed data, the shuffle can complete faster by writing less data. Reynold, This isn't a big help in the short term, but if we switch to a sort-based shuffle, we'll only need a single LZFOu

Re: Changes to sbt build have been merged

2014-07-10 Thread Sandy Ryza
Woot! On Thu, Jul 10, 2014 at 11:15 AM, Patrick Wendell wrote: > Just a heads up, we merged Prashant's work on having the sbt build read all > dependencies from Maven. Please report any issues you find on the dev list > or on JIRA. > > One note here for developers, going forward the sbt build w

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-08 Thread Sandy Ryza
Having a common framework for clustering makes sense to me. While we should be careful about what algorithms we include, having solid implementations of minibatch clustering and hierarchical clustering seems like a worthwhile goal, and we should reuse as much code and APIs as reasonable. On Tue,

Re: Data Locality In Spark

2014-07-08 Thread Sandy Ryza
Hi Anish, Spark, like MapReduce, makes an effort to schedule tasks on the same nodes and racks that the input blocks reside on. -Sandy On Tue, Jul 8, 2014 at 12:27 PM, anishs...@yahoo.co.in < anishs...@yahoo.co.in> wrote: > Hi All > > My apologies for very basic question, do we have full suppo

Re: Contributing to MLlib on GLM

2014-06-17 Thread Sandy Ryza
Hi Xiaokai, I think MLLib is definitely interested in supporting additional GLMs. I'm not aware of anybody working on this at the moment. -Sandy On Tue, Jun 17, 2014 at 5:00 PM, Xiaokai Wei wrote: > Hi, > > I am an intern at PalantirTech and we are building some stuff on top of > MLlib. In P

Re: Please change instruction about "Launching Applications Inside the Cluster"

2014-05-30 Thread Sandy Ryza
They should be - in the sense that the docs now recommend using spark-submit and thus include entirely different invocations. On Fri, May 30, 2014 at 12:46 AM, Reynold Xin wrote: > Can you take a look at the latest Spark 1.0 docs and see if they are fixed? > > https://github.com/apache/spark/tr

Re: [VOTE] Release Apache Spark 1.0.0 (RC11)

2014-05-26 Thread Sandy Ryza
+1 On Mon, May 26, 2014 at 7:38 AM, Tathagata Das wrote: > Please vote on releasing the following candidate as Apache Spark version > 1.0.0! > > This has a few important bug fixes on top of rc10: > SPARK-1900 and SPARK-1918: https://github.com/apache/spark/pull/853 > SPARK-1870

Re: Calling external classes added by sc.addJar needs to be through reflection

2014-05-21 Thread Sandy Ryza
; --- > > My Blog: https://www.dbtsai.com > > LinkedIn: https://www.linkedin.com/in/dbtsai > > > > > > On Wed, May 21, 2014 at 1:13 PM, Sandy Ryza > wrote: > > > >> This will solve the issue for jars a

Re: Calling external classes added by sc.addJar needs to be through reflection

2014-05-21 Thread Sandy Ryza
mary! We fixed it in branch 0.9 since our production is still in > >> 0.9. I'm porting it to 1.0 now, and hopefully will submit PR for 1.0 > >> tonight. > >> > >> > >> Sincerely, > >> > >> DB Tsai > >> ---

Re: [VOTE] Release Apache Spark 1.0.0 (RC10)

2014-05-20 Thread Sandy Ryza
+1 On Tue, May 20, 2014 at 5:26 PM, Andrew Or wrote: > +1 > > > 2014-05-20 13:13 GMT-07:00 Tathagata Das : > > > Please vote on releasing the following candidate as Apache Spark version > > 1.0.0! > > > > This has a few bug fixes on top of rc9: > > SPARK-1875: https://github.com/apache/spark/pu

Re: Sorting partitions in Java

2014-05-20 Thread Sandy Ryza
There is: SPARK-545 On Tue, May 20, 2014 at 10:16 AM, Andrew Ash wrote: > Sandy, is there a Jira ticket for that? > > > On Tue, May 20, 2014 at 10:12 AM, Sandy Ryza >wrote: > > > sortByKey currently requires partitions to fit in memory, but there are > &

Re: Sorting partitions in Java

2014-05-20 Thread Sandy Ryza
sortByKey currently requires partitions to fit in memory, but there are plans to add external sort On Tue, May 20, 2014 at 10:10 AM, Madhu wrote: > Thanks Sean, I had seen that post you mentioned. > > What you suggest looks like an in-memory sort, which is fine if each partition > is > small enough

Re: Calling external classes added by sc.addJar needs to be through reflection

2014-05-19 Thread Sandy Ryza
ually > >>> > the ClassLoader that loaded whatever it is that first referenced Foo > >>> > and caused it to be loaded -- usually the ClassLoader holding your > >>> > other app classes. > >>> > > >>> > ClassLoaders can have a

Re: Calling external classes added by sc.addJar needs to be through reflection

2014-05-18 Thread Sandy Ryza
s to create an object in that > way. Since the jars are already in distributed cache before the > executor starts, is there any reason we cannot add the locally cached > jars to classpath directly? > > Best, > Xiangrui > > On Sun, May 18, 2014 at 4:00 PM, Sandy Ryza > wrot

Re: Calling external classes added by sc.addJar needs to be through reflection

2014-05-18 Thread Sandy Ryza
I spoke with DB offline about this a little while ago and he confirmed that he was able to access the jar from the driver. The issue appears to be a general Java issue: you can't directly instantiate a class from a dynamically loaded jar. I reproduced it locally outside of Spark with: --- URL

Re: [VOTE] Release Apache Spark 1.0.0 (rc9)

2014-05-17 Thread Sandy Ryza
+1 Reran my tests from rc5: * Built the release from source. * Compiled Java and Scala apps that interact with HDFS against it. * Ran them in local mode. * Ran them against a pseudo-distributed YARN cluster in both yarn-client mode and yarn-cluster mode. On Sat, May 17, 2014 at 10:08 AM, Andrew

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-15 Thread Sandy Ryza
+1 (non-binding) * Built the release from source. * Compiled Java and Scala apps that interact with HDFS against it. * Ran them in local mode. * Ran them against a pseudo-distributed YARN cluster in both yarn-client mode and yarn-cluster mode. On Tue, May 13, 2014 at 9:09 PM, witgo wrote: > Yo

Re: Apache Spark running out of the spark shell

2014-05-03 Thread Sandy Ryza
Hi AJ, You might find this helpful - http://blog.cloudera.com/blog/2014/04/how-to-run-a-simple-apache-spark-app-in-cdh-5/ -Sandy On Sat, May 3, 2014 at 8:42 AM, Ajay Nair wrote: > Hi, > > I have written a code that works just about fine in the spark shell on EC2. > The ec2 script helped me co

Re: Any plans for new clustering algorithms?

2014-04-22 Thread Sandy Ryza
1, 2014 at 9:09 PM, Xiangrui Meng wrote: > > > >> The markdown files are under spark/docs. You can submit a PR for > >> changes. -Xiangrui > >> > >> On Mon, Apr 21, 2014 at 6:01 PM, Sandy Ryza > >> sandy.r...@cloudera.com)> wrote: &

Re: Any plans for new clustering algorithms?

2014-04-21 Thread Sandy Ryza
are under spark/docs. You can submit a PR for > changes. -Xiangrui > > On Mon, Apr 21, 2014 at 6:01 PM, Sandy Ryza > wrote: > > How do I get permissions to edit the wiki? > > > > > > On Mon, Apr 21, 2014 at 3:19 PM, Xiangrui Meng wrote: > > > >>

Re: Any plans for new clustering algorithms?

2014-04-21 Thread Sandy Ryza
ted (citations and > concrete use cases as examples of this), scalable and parallelizable, well > documented and with reasonable expectation of dev support > > > > Sent from my iPhone > > > >> On 21 Apr 2014, at 19:59, Sandy Ryza wrote: > >> > >> If it&#

Re: Any plans for new clustering algorithms?

2014-04-21 Thread Sandy Ryza
If it's not done already, would it make sense to codify this philosophy somewhere? I imagine this won't be the first time this discussion comes up, and it would be nice to have a doc to point to. I'd be happy to take a stab at this. On Mon, Apr 21, 2014 at 10:54 AM, Xiangrui Meng wrote: > +1
