Re: Vector size mismatch in logistic regression - Spark ML 2.0
Hi, Just after I sent the mail, I realized that the error might be with the training dataset, not the test dataset. 1. It might be that you are feeding the full Y vector for training. 2. That could mean you are using a ~50-50 training-test split. 3. Take a close look at the code that does the data split and at which datasets the splits are assigned to. Cheers On Sun, Aug 21, 2016 at 4:37 PM, Krishna Sankar <ksanka...@gmail.com> wrote: > Hi, > Looks like the test-dataset has different sizes for X & Y. Possible > steps: > >1. What is the test-data-size ? > - If it is 15,909, check the prediction variable vector - it is now > 29,471, should be 15,909 > - If you expect it to be 29,471, then the X Matrix is not right. > 2. It is also probable that the size of the test-data is something >else. If so, check the data pipeline. >3. If you print the count() of the various vectors, I think you can >find the error. > > Cheers & Good Luck > > > On Sun, Aug 21, 2016 at 3:16 PM, janardhan shetty <janardhan...@gmail.com> > wrote: > >> Hi, >> >> I have built the logistic regression model using training-dataset. >> When I am predicting on a test-dataset, it is throwing the below error of >> size mismatch. >> >> Steps done: >> 1. String indexers on categorical features. >> 2. One hot encoding on these indexed features. >> >> Any help is appreciated to resolve this issue or is it a bug ? 
>> >> SparkException: *Job aborted due to stage failure: Task 0 in stage 635.0 >> failed 1 times, most recent failure: Lost task 0.0 in stage 635.0 (TID >> 19421, localhost): java.lang.IllegalArgumentException: requirement failed: >> BLAS.dot(x: Vector, y:Vector) was given Vectors with non-matching sizes: >> x.size = 15909, y.size = 29471* at scala.Predef$.require(Predef.scala:224) >> at org.apache.spark.ml.linalg.BLAS$.dot(BLAS.scala:104) at >> org.apache.spark.ml.classification.LogisticRegressionModel$$ >> anonfun$19.apply(LogisticRegression.scala:505) at org.apache.spark.ml >> .classification.LogisticRegressionModel$$anonfun$19.apply(LogisticRegression.scala:504) >> at org.apache.spark.ml.classification.LogisticRegressionModel.p >> redictRaw(LogisticRegression.scala:594) at org.apache.spark.ml.classifica >> tion.LogisticRegressionModel.predictRaw(LogisticRegression.scala:484) at >> org.apache.spark.ml.classification.ProbabilisticClassificati >> onModel$$anonfun$1.apply(ProbabilisticClassifier.scala:112) at >> org.apache.spark.ml.classification.ProbabilisticClassificati >> onModel$$anonfun$1.apply(ProbabilisticClassifier.scala:111) at >> org.apache.spark.sql.catalyst.expressions.GeneratedClass$Spe >> cificUnsafeProjection.evalExpr137$(Unknown Source) at >> org.apache.spark.sql.catalyst.expressions.GeneratedClass$Spe >> cificUnsafeProjection.apply(Unknown Source) at >> org.apache.spark.sql.catalyst.expressions.GeneratedClass$Spe >> cificUnsafeProjection.apply(Unknown Source) at >> scala.collection.Iterator$$anon$11.next(Iterator.scala:409) >> > >
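A frequent root cause of this particular mismatch (an assumption on my part, not confirmed in the thread) is fitting the StringIndexer/OneHotEncoder stages separately on the training and test sets, so the encoded feature vectors end up with different lengths. A plain-Python sketch of the effect, not the poster's Spark code:

```python
# Sketch: why one-hot vector sizes differ when the encoder vocabulary is
# built per-dataset instead of once on the training data.

def build_vocab(rows):
    """Map each distinct category to an index (roughly what StringIndexer fits)."""
    return {cat: i for i, cat in enumerate(sorted(set(rows)))}

def one_hot(cat, vocab):
    """Encode a category as a fixed-length 0/1 vector (roughly what OneHotEncoder does)."""
    vec = [0] * len(vocab)
    if cat in vocab:
        vec[vocab[cat]] = 1
    return vec

train = ["a", "b", "c"]
test = ["a", "b", "c", "d", "e"]

vocab_train = build_vocab(train)
vocab_test = build_vocab(test)        # wrong: refit on the test data

len(one_hot("a", vocab_train))        # 3 features, matches the model
len(one_hot("a", vocab_test))         # 5 features -> BLAS.dot size mismatch
```

In Spark ML the remedy is to fit one Pipeline on the training data and call transform() on both datasets with that same fitted model, so both get vectors of the same length.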
Re: Spark 1.6.2 version displayed as 1.6.1
This intrigued me as well. - Just for sure, I downloaded the 1.6.2 code and recompiled. - spark-shell and pyspark both show 1.6.2 as expected. Cheers On Mon, Jul 25, 2016 at 1:45 AM, Daniel Darabos < daniel.dara...@lynxanalytics.com> wrote: > Another possible explanation is that by accident you are still running > Spark 1.6.1. Which download are you using? This is what I see: > > $ ~/spark-1.6.2-bin-hadoop2.6/bin/spark-shell > log4j:WARN No appenders could be found for logger > (org.apache.hadoop.metrics2.lib.MutableMetricsFactory). > log4j:WARN Please initialize the log4j system properly. > log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for > more info. > Using Spark's repl log4j profile: > org/apache/spark/log4j-defaults-repl.properties > To adjust logging level use sc.setLogLevel("INFO") > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 1.6.2 > /_/ > > > On Mon, Jul 25, 2016 at 7:45 AM, Sean Owenwrote: > >> Are you certain? looks like it was correct in the release: >> >> >> https://github.com/apache/spark/blob/v1.6.2/core/src/main/scala/org/apache/spark/package.scala >> >> >> >> On Mon, Jul 25, 2016 at 12:33 AM, Ascot Moss >> wrote: >> > Hi, >> > >> > I am trying to upgrade spark from 1.6.1 to 1.6.2, from 1.6.2 >> spark-shell, I >> > found the version is still displayed 1.6.1 >> > >> > Is this a minor typo/bug? >> > >> > Regards >> > >> > >> > >> > ### >> > >> > Welcome to >> > >> > __ >> > >> > / __/__ ___ _/ /__ >> > >> > _\ \/ _ \/ _ `/ __/ '_/ >> > >> >/___/ .__/\_,_/_/ /_/\_\ version 1.6.1 >> > >> > /_/ >> > >> > >> > >> >> - >> To unsubscribe e-mail: user-unsubscr...@spark.apache.org >> >> >
Re: Spark ml.ALS question -- RegressionEvaluator .evaluate giving ~1.5 output for same train and predict data
Thanks Nick. I also ran into this issue. VG, One workaround is to drop the NaN from predictions (df.na.drop()) and then use the dataset for the evaluator. In real life, you would probably detect the NaN and fall back to recommending the most popular items over some window. HTH. Cheers On Sun, Jul 24, 2016 at 12:49 PM, Nick Pentreath wrote: > It seems likely that you're running into > https://issues.apache.org/jira/browse/SPARK-14489 - this occurs when the > test dataset in the train/test split contains users or items that were not > in the training set. Hence the model doesn't have computed factors for > those ids, and ALS 'transform' currently returns NaN for those ids. This in > turn results in NaN for the evaluator result. > > I have a PR open on that issue that will hopefully address this soon. > > > On Sun, 24 Jul 2016 at 17:49 VG wrote: > >> ping. Anyone has some suggestions/advice for me . >> It will be really helpful. >> >> VG >> On Sun, Jul 24, 2016 at 12:19 AM, VG wrote: >> >>> Sean, >>> >>> I did this just to test the model. When I do a split of my data as >>> training to 80% and test to be 20% >>> >>> I get a Root-mean-square error = NaN >>> >>> So I am wondering where I might be going wrong >>> >>> Regards, >>> VG >>> >>> On Sun, Jul 24, 2016 at 12:12 AM, Sean Owen wrote: >>> No, that's certainly not to be expected. ALS works by computing a much lower-rank representation of the input. It would not reproduce the input exactly, and you don't want it to -- this would be seriously overfit. This is why in general you don't evaluate a model on the training set. On Sat, Jul 23, 2016 at 7:37 PM, VG wrote: > I am trying to run ml.ALS to compute some recommendations. > > Just to test I am using the same dataset for training using ALSModel and for > predicting the results based on the model . > > When I evaluate the result using RegressionEvaluator I get a > Root-mean-square error = 1.5544064263236066 > > I think this should be 0. Any suggestions what might be going wrong. > > Regards, > Vipul >>> >>>
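The workaround above, sketched in plain Python with made-up predictions (in Spark, df.na.drop() on the prediction column would do the filtering):

```python
# Sketch: drop NaN predictions (cold-start users/items) before computing
# RMSE, so the evaluator result is not itself NaN.
import math

# (id, actual rating, predicted rating) triples; id 2 is a cold-start case
predictions = [(1, 4.0, 3.8), (2, 5.0, float("nan")), (3, 2.0, 2.4)]
clean = [(i, r, p) for (i, r, p) in predictions if not math.isnan(p)]

mse = sum((r - p) ** 2 for (_, r, p) in clean) / len(clean)
rmse = math.sqrt(mse)   # ~0.316 on this toy data, instead of NaN
```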
Thanks For a Job Well Done !!!
Hi all, Just wanted to thank all for the dataset API - most of the time we see only bugs in these lists ;o). - To put this in context, this weekend I was updating the SQL chapters of my book - it had all the ugliness of SchemaRDD, registerTempTable, take(10).foreach(println) and take(30).foreach(e=>println("%15s | %9.2f |".format(e(0),e(1)))) ;o) - I remember Hossein Falaki chiding me about the ugly println statements ! - Took me a little while to grok the dataset, sparksession, spark.read.option("header","true").option("inferSchema","true").csv(...) et al. - I am a big R fan and know the language pretty decently - so the constructs are familiar - Once I got it (I am sure there are still more mysteries to uncover ...) it was just beautiful - well done folks !!! - One sees the contrast a lot better while teaching or writing books, because one has to think through the old, the new and the transitional arc - I even remember the good old days when we were discussing whether Spark would get the dataframes like R at one of Paco's sessions ! - And now, it looks very decent for data wrangling. Cheers & keep up the good work P.S.: My next chapter is the MLlib - need to convert to ml. Should be interesting ... I am a glutton for punishment - of the Spark kind, of course !
Re: Is Spark right for us?
Good question. It comes down to computational complexity, computational scale, and data volume. 1. If you can store the data in a single server or a small cluster of db servers (say MySQL), then HDFS/Spark might be overkill. 2. If you can run the computation/process the data on a single machine (remember, servers with 512 GB memory and quad-core CPUs can do a lot of stuff), then Spark is overkill. 3. Even if you can do computations #1 & #2 above in a pipeline and tolerate the elapsed time, Spark might be overkill. 4. But if you require data/computation parallelism or distributed processing of data due to computational complexity, data volume, or time constraints, including real-time analytics, Spark is the right stack. 5. Taking a quick look at what you have described so far, Spark is probably not needed. Cheers & HTH On Sun, Mar 6, 2016 at 9:17 AM, Laumegui Deaulobi < guillaume.bilod...@gmail.com> wrote: > Our problem space is survey analytics. Each survey comprises a set of > questions, with each question having a set of possible answers. Survey > fill-out tasks are sent to users, who have until a certain date to complete > it. Based on these survey fill-outs, reports need to be generated. Each > report deals with a subset of the survey fill-outs, and comprises a set of > data points (average rating for question 1, min/max for question 2, etc.) > > We are dealing with rather large data sets - although reading the internet > we get the impression that everyone is analyzing petabytes of data... > > Users: up to 100,000 > Surveys: up to 100,000 > Questions per survey: up to 100 > Possible answers per question: up to 10 > Survey fill-outs / user: up to 10 > Reports: up to 100,000 > Data points per report: up to 100 > > Data is currently stored in a relational database but a migration to a > different kind of store is possible. 
> > The naive algorithm for report generation can be summed up as this: > > for each report to be generated { > for each report data point to be calculated { > calculate data point > add data point to report > } > publish report > } > > In order to deal with the upper limits of these values, we will need to > distribute this algorithm to a compute / data cluster as much as possible. > > I've read about frameworks such as Apache Spark but also Hadoop, GridGain, > HazelCast and several others, and am still confused as to how each of these > can help us and how they fit together. > > Is Spark the right framework for us? > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/Is-Spark-right-for-us-tp26412.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > >
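A back-of-envelope check of the upper bounds quoted in the post (my arithmetic and my assumed row size, not from the thread) supports point 5 above: the raw data is a few GB, well within one machine:

```python
# Rough sizing from the stated upper bounds.
users = 100_000
fillouts_per_user = 10
questions_per_survey = 100
answers = users * fillouts_per_user * questions_per_survey   # 100,000,000 answers

bytes_per_answer = 50                 # assumed: answer value plus keys/ids
raw_gb = answers * bytes_per_answer / 1e9                    # ~5 GB of raw answers

reports = 100_000
points_per_report = 100
data_points = reports * points_per_report                    # 10,000,000 aggregates
```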
Re: HDP 2.3 support for Spark 1.5.x
Thanks Guys. Yep, now I would install 1.5.1 over HDP 2.3, if that works. Cheers On Mon, Sep 28, 2015 at 9:47 AM, Ted Yu <yuzhih...@gmail.com> wrote: > Krishna: > If you want to query ORC files, see the following JIRA: > > [SPARK-10623] [SQL] Fixes ORC predicate push-down > > which is in the 1.5.1 release. > > FYI > > On Mon, Sep 28, 2015 at 9:42 AM, Fabien Martin <fabien.marti...@gmail.com> > wrote: > >> Hi Krishna, >> >>- Take a look at >> http://hortonworks.com/hadoop-tutorial/apache-spark-1-4-1-technical-preview-with-hdp/ >>- Or you can specify your 1.5.x jar as the Spark one using something >>like : >> >> --conf >> "spark.yarn.jar=hdfs://master:8020/spark-assembly-1.5.0-hadoop2.6.0.jar" >> >> The main drawback is : >> >> *Known Issues* >> >> *Spark YARN ATS integration does not work in this tech preview. You will >> not see the history of Spark jobs in the Jobs server after a job is >> finished.* >> >> 2015-09-23 1:31 GMT+02:00 Zhan Zhang <zzh...@hortonworks.com>: >> >>> Hi Krishna, >>> >>> For the time being, you can download from upstream, and it should be >>> running OK for HDP2.3. For HDP-specific problems, you can ask in the >>> Hortonworks forum. >>> >>> Thanks. >>> >>> Zhan Zhang >>> >>> On Sep 22, 2015, at 3:42 PM, Krishna Sankar <ksanka...@gmail.com> wrote: >>> >>> Guys, >>> >>>- We have HDP 2.3 installed just now. It comes with Spark 1.3.x. The >>>current wisdom is that it will support the 1.4.x train (which is good, >>> need >>>DataFrame et al). >>>- What is the plan to support Spark 1.5.x ? Can we install 1.5.0 on >>>HDP 2.3 ? Or will Spark 1.5.x support be in HDP 2.3.x and if so ~when ? >>> >>> Cheers & Thanks >>> >>> >>> >>> >> >
Re: Spark MLib v/s SparkR
A few points to consider: a) SparkR gives the union of R_in_a_single_machine and the distributed_computing_of_Spark. b) It also gives the ability to wrangle, in R, data that is in the Spark ecosystem. c) Coming to MLlib, the question is MLlib and R (not MLlib or R) - depending on the scale, data location et al. d) As Ali mentioned, some of MLlib might not be supported in R (I haven't looked at it that carefully, but that can be resolved by the APIs); OTOH, 1.5 is on its way. e) So it all depends on the algorithms that one wants to use and whether one needs R for pre- or post-processing. HTH. Cheers k/ On Wed, Aug 5, 2015 at 11:24 AM, praveen S mylogi...@gmail.com wrote: I was wondering when one should go for MLlib or SparkR. What are the criteria, or what should be considered, before choosing either of the solutions for data analysis? Or what are the advantages of Spark MLlib over SparkR, or of SparkR over MLlib?
Re: Sum elements of an iterator inside an RDD
Looks like reduceByKey() should work here. Cheers k/ On Sat, Jul 11, 2015 at 11:02 AM, leonida.gianfagna leonida.gianfa...@gmail.com wrote: Thanks a lot oubrik, I got your point, my consideration is that sum() should already be a built-in function for iterators in python. Anyway I tried your approach

def mysum(iter):
    count = sum = 0
    for item in iter:
        count += 1
        sum += item
    return sum

wordCountsGrouped = wordsGrouped.groupByKey().map(lambda (w,iterator): (w, mysum(iterator)))
print wordCountsGrouped.collect()

but i get the error below, any idea? TypeError: unsupported operand type(s) for +=: 'int' and 'ResultIterable' at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:135) at org.apache.spark.api.python.PythonRDD$$anon$1.init(PythonRDD.scala:176) at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:94) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:64) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) thx Leonida -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Sum-elements-of-an-iterator-inside-an-RDD-tp23775p23778.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
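What reduceByKey() does, sketched in plain Python on toy pairs (the real operation is distributed and merges within each partition first):

```python
# Sketch: merge values per key with a binary function, instead of first
# materializing a grouped iterable and summing it by hand.
from collections import defaultdict
from functools import reduce

pairs = [("a", 1), ("b", 1), ("a", 1), ("a", 1)]

def reduce_by_key(pairs, f):
    """Plain-Python analogue of rdd.reduceByKey(f)."""
    buckets = defaultdict(list)
    for k, v in pairs:
        buckets[k].append(v)
    return {k: reduce(f, vs) for k, vs in buckets.items()}

counts = reduce_by_key(pairs, lambda x, y: x + y)   # {"a": 3, "b": 1}
```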
import pyspark.sql.Row gives error in 1.4.1
Error - ImportError: No module named Row. Cheers, enjoy the long weekend k/
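For reference, `from pyspark.sql import Row` is the working form: `import pyspark.sql.Row` fails because Row is a class inside the pyspark.sql module, not a module itself. The same Python rule, shown with the standard library instead of pyspark:

```python
# `import a.b.C` only works when C is a module; a class must be pulled
# in with `from a.b import C`.
failed = False
try:
    import collections.OrderedDict      # fails: OrderedDict is a class
except ImportError:
    failed = True

from collections import OrderedDict     # works
ok = OrderedDict is not None
```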
Re: making dataframe for different types using spark-csv
- use .cast(...).alias('...') after the DataFrame is read. - sql.functions.udf for any domain-specific conversions. Cheers k/ On Wed, Jul 1, 2015 at 11:03 AM, Hafiz Mujadid hafizmujadi...@gmail.com wrote: Hi experts! I am using spark-csv to load csv data into a dataframe. By default it makes the type of each column string. Is there some way to get a dataframe of the actual types like int, double etc.? Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/making-dataframe-for-different-types-using-spark-csv-tp23570.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
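The idea behind .cast(...).alias(...), mimicked in plain Python on assumed column names (this is a stand-in to show the per-column conversion, not Spark code):

```python
# Sketch: CSV readers hand back strings; convert each column to its real
# type with a per-column mapping, the way df.col.cast("int").alias("col")
# would in Spark SQL. Column names and schema here are made up.
rows = [{"age": "31", "score": "4.5"}, {"age": "19", "score": "3.2"}]
schema = {"age": int, "score": float}   # column -> target type

typed = [{col: schema[col](val) for col, val in row.items()} for row in rows]
```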
Re: SparkSQL built in functions
Interesting. Looking at the definitions, sql.functions.pow is defined only for (col,col). Just as an experiment, create a column with value 2 and see if that works. Cheers k/ On Mon, Jun 29, 2015 at 1:34 PM, Bob Corsaro rcors...@gmail.com wrote: 1.4 and I did set the second parameter. The DSL works fine but trying out with SQL doesn't. On Mon, Jun 29, 2015, 4:32 PM Salih Oztop soz...@yahoo.com wrote: Hi Bob, I tested your scenario with Spark 1.3 and I assumed you did not miss the second parameter of pow(x,y) from pyspark.sql import SQLContext sqlContext = SQLContext(sc) df = sqlContext.jsonFile(/vagrant/people.json) # Displays the content of the DataFrame to stdout df.show() #These are all fine df.select(name, (df.age)*(df.age)).show() name(age * age) Michael null Andy900 Justin 361 df.select(name, (df.age)+1).show() name(age + 1) Michael null Andy31 Justin 20 However the following tests give the same error. df.select(name, pow(df.age,2)).show() ---TypeError Traceback (most recent call last)ipython-input-27-ce7299d3ef76 in module() 1 df.select(name, pow(df.age,2)).show() TypeError: unsupported operand type(s) for ** or pow(): 'Column' and 'int' df.select(name, (df.age)**2).show() ---TypeError Traceback (most recent call last)ipython-input-24-29540c3536bf in module() 1 df.select(name, (df.age)**2).show() TypeError: unsupported operand type(s) for ** or pow(): 'Column' and 'int' Moreover testing the functions individually they are working fine. pow(2,4) 16 2**4 16 Kind Regards Salih Oztop -- *From:* Bob Corsaro rcors...@gmail.com *To:* user user@spark.apache.org *Sent:* Monday, June 29, 2015 7:27 PM *Subject:* SparkSQL built in functions I'm having trouble using select pow(col) from table It seems the function is not registered for SparkSQL. Is this on purpose or an oversight? I'm using pyspark.
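The quoted TypeError is Python operator dispatch rather than Spark SQL itself: `**` and the builtin pow() need an operand that implements `__pow__`/`__rpow__`, and per the error message the Column class in these versions defined `*` but not `**`. A toy illustration with my own stand-in class, not Spark's:

```python
# Minimal illustration of why pow(df.age, 2) raised TypeError while
# (df.age)*(df.age) worked: the class defines __mul__ but no __pow__.
class ToyColumn:                  # stand-in, NOT Spark's Column
    def __mul__(self, other):     # multiplication is defined...
        return "col * col"

c = ToyColumn()

works = (c * c) is not None       # the (df.age)*(df.age) style workaround
raised = False
try:
    c ** 2                        # ...but pow is not -> TypeError
except TypeError:
    raised = True
```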
Re: Kmeans Labeled Point RDD
You can predict and then zip it with the points RDD to get approx. same as LP. Cheers k/ On Thu, May 21, 2015 at 6:19 PM, anneywarlord anneywarl...@gmail.com wrote: Hello, New to Spark. I wanted to know if it is possible to use a Labeled Point RDD in org.apache.spark.mllib.clustering.KMeans. After I cluster my data I would like to be able to identify which observations were grouped with each centroid. Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Kmeans-Labeled-Point-RDD-tp22989.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
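The zip pattern suggested above, sketched in plain Python with toy 2-D points (in Spark, model.predict on the points RDD followed by points.zip(predictions) gives the same pairing):

```python
# Sketch: assign each point its nearest centroid, then zip points with
# labels to get (point, cluster) pairs, approximating a labeled point.
def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])))

points = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1)]
centroids = [(0.0, 0.0), (5.0, 5.0)]

labels = [nearest(p, centroids) for p in points]   # like model.predict(rdd)
labeled = list(zip(points, labels))                # like points.zip(predictions)
```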
Re: Trouble working with Spark-CSV package (error: object databricks is not a member of package com)
Do you have commons-csv-1.1-bin.jar in your path somewhere ? I had to download and add this. Cheers k/ On Wed, Apr 22, 2015 at 11:01 AM, Mohammed Omer beancinemat...@gmail.com wrote: Afternoon all, I'm working with Scala 2.11.6, and Spark 1.3.1 built from source via: `mvn -Pyarn -Phadoop-2.4 -Dscala-2.11 -DskipTests clean package` The error is encountered when running spark shell via: `spark-shell --packages com.databricks:spark-csv_2.11:1.0.3` The full trace of the commands can be found at https://gist.github.com/momer/9d1ca583f9978ec9739d Not sure if I've done something wrong, or if the documentation is outdated, or...? Would appreciate any input or push in the right direction! Thank you, Mo
Re: Dataset announcement
Thanks Olivier. Good work. Interesting in more than one way - including training, benchmarking, testing new releases et al. One quick question - do you plan to make it available as an S3 bucket ? Cheers k/ On Wed, Apr 15, 2015 at 5:58 PM, Olivier Chapelle oliv...@chapelle.cc wrote: Dear Spark users, I would like to draw your attention to a dataset that we recently released, which is as of now the largest machine learning dataset ever released; see the following blog announcements: - http://labs.criteo.com/2015/03/criteo-releases-its-new-dataset/ - http://blogs.technet.com/b/machinelearning/archive/2015/04/01/now-available-on-azure-ml-criteo-39-s-1tb-click-prediction-dataset.aspx The characteristics of this dataset are: - 1 TB of data - binary classification - 13 integer features - 26 categorical features, some of them taking millions of values. - 4B rows Hopefully this dataset will be useful to assess and push further the scalability of Spark and MLlib. Cheers, Olivier -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Dataset-announcement-tp22507.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: IPyhon notebook command for spark need to be updated?
Yep the command-option is gone. No big deal, just add the '%pylab inline' command as part of your notebook. Cheers k/ On Fri, Mar 20, 2015 at 3:45 PM, cong yue yuecong1...@gmail.com wrote: Hello : I tried ipython notebook with the following command in my enviroment. PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS=notebook --pylab inline ./bin/pyspark But it shows --pylab inline support is removed from ipython newest version. the log is as : --- $ PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS=notebook --pylab inline ./bin/pyspark [E 15:29:43.076 NotebookApp] Support for specifying --pylab on the command line has been removed. [E 15:29:43.077 NotebookApp] Please use `%pylab inline` or `%matplotlib inline` in the notebook itself. -- I am using IPython 3.0.0. and only IPython works in my enviroment. -- $ PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS=notebook --pylab inline ./bin/pyspark -- Does somebody have the same issue as mine? How do you solve it? Thanks, Cong - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
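For the record, the working invocation drops --pylab from the driver options and moves the magic into the notebook (paths and environment assumed; adjust for your install):

```shell
# Launch pyspark under the IPython notebook without the removed flag
PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook" ./bin/pyspark

# then, in the first notebook cell:
#   %pylab inline        (or the now-preferred %matplotlib inline)
```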
Re: General Purpose Spark Cluster Hardware Requirements?
Without knowing the data size, computation and storage requirements ... : - Dual 6 or 8 core machines, 256 GB memory each, 12-15 TB per machine. Probably 5-10 machines. - Don't go for the most exotic machines; OTOH don't go for the cheapest ones either. - Find a sweet spot with your vendor i.e. if dual 6 cores are a lot cheaper than dual 10 cores then go with the less expensive ones. Same with disks - maybe 2 TB is a lot cheaper than 3 TB. - Decide if these are going to be storage intensive or compute intensive (I assume the latter) and configure accordingly. - Make sure you can add storage to the machines - i.e. have free storage bays. - The other way is to add more machines and buy smaller-specced machines. - Unless one has very firm I/O and compute requirements, I have found that FLOPS, and things of that nature, do not make that much sense. - Think in terms of RAM, CPU and storage - that is what will become the initial limitations. - Once there are enough production jobs, you can then figure out the FLOPS et al. - A 10 G network is a better choice, so price-in a 24-48 port TOR switch. - Be more concerned with the bandwidth between the cluster nodes, for shuffles et al. Cheers k/ On Sun, Mar 8, 2015 at 2:29 PM, Nasir Khan nasirkhan.onl...@gmail.com wrote: HI, I am going to submit a proposal to my University to setup my Standalone Spark Cluster, what hardware should i include in my proposal? I will be Working on classification (Spark MLlib) of Data streams (Spark Streams) If some body can fill up this answers, that will be great! Thanks *Cores *= (example 64 nodes, 1024 cores, your figures) ? *Performance**= (example= ~5.12TFlops, ~2TFlops, your figures) ___? *GPU*= YES/NO ___? *Fat Node* = YES/NO ___? *CPU Hrs/ Yr* = (example 2000, 8000, your figures) ___? *RAM/CPU* = (example 256GB, your figures) ___? * Storage Processing* = (example 200TB, your figures) ___? *Storage Output* = (example 5TB, 4TB HHD/SSD, your figures) ___? 
*Most processors today carryout 4 FLOPS per cycle, thus a single-core 2.5 GHz processor has a theoretical performance of 10 billion FLOPS = 10GFLOPS Note:I Need a *general purpose* cluster, not very high end nor very low specs. It will not be dedicated to just one project i guess. You people already have experience in setting up clusters, that's the reason i posted it here :) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/General-Purpose-Spark-Cluster-Hardware-Requirements-tp21963.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
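The footnote's arithmetic, written out (my restatement, using the 64-node / 1024-core example figures from the questionnaire):

```python
# 4 FLOPs/cycle at 2.5 GHz gives 10 GFLOPS per core; 1024 such cores
# would total ~10.24 TFLOPS theoretical peak.
flops_per_cycle = 4
clock_hz = 2.5e9
per_core_flops = flops_per_cycle * clock_hz      # 1e10 = 10 GFLOPS

cores = 1024
peak_tflops = per_core_flops * cores / 1e12      # 10.24 TFLOPS peak
```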
Re: Movie Recommendation tutorial
Yep, much better with 0.1. The best model was trained with rank = 12 and lambda = 0.1, and numIter = 20, and its RMSE on the test set is 0.869092 (Spark 1.3.0) Question: What is the intuition behind an RMSE of 0.86 vs 1.3? I know the smaller the better, but is it that much better? And what is a good number for a recommendation engine? Cheers k/ On Tue, Feb 24, 2015 at 1:03 AM, Guillaume Charhon guilla...@databerries.com wrote: I am using Spark 1.2.1. Thank you Krishna, I am getting almost the same results as you so it must be an error in the tutorial. Xiangrui, I made some additional tests with lambda to 0.1 and I am getting a much better rmse: RMSE (validation) = 0.868981 for the model trained with rank = 8, lambda = 0.1, and numIter = 10. RMSE (validation) = 0.869628 for the model trained with rank = 8, lambda = 0.1, and numIter = 20. RMSE (validation) = 1.361321 for the model trained with rank = 8, lambda = 1.0, and numIter = 10. RMSE (validation) = 1.361321 for the model trained with rank = 8, lambda = 1.0, and numIter = 20. RMSE (validation) = 3.755870 for the model trained with rank = 8, lambda = 10.0, and numIter = 10. RMSE (validation) = 3.755870 for the model trained with rank = 8, lambda = 10.0, and numIter = 20. RMSE (validation) = 0.866605 for the model trained with rank = 12, lambda = 0.1, and numIter = 10. RMSE (validation) = 0.867498 for the model trained with rank = 12, lambda = 0.1, and numIter = 20. RMSE (validation) = 1.361321 for the model trained with rank = 12, lambda = 1.0, and numIter = 10. RMSE (validation) = 1.361321 for the model trained with rank = 12, lambda = 1.0, and numIter = 20. RMSE (validation) = 3.755870 for the model trained with rank = 12, lambda = 10.0, and numIter = 10. RMSE (validation) = 3.755870 for the model trained with rank = 12, lambda = 10.0, and numIter = 20. The best model was trained with rank = 12 and lambda = 0.1, and numIter = 10, and its RMSE on the test set is 0.865407. 
On Tue, Feb 24, 2015 at 7:23 AM, Xiangrui Meng men...@gmail.com wrote: Try to set lambda to 0.1. -Xiangrui On Mon, Feb 23, 2015 at 3:06 PM, Krishna Sankar ksanka...@gmail.com wrote: The RSME varies a little bit between the versions. Partitioned the training,validation,test set like so: training = ratings_rdd_01.filter(lambda x: (x[3] % 10) 6) validation = ratings_rdd_01.filter(lambda x: (x[3] % 10) = 6 and (x[3] % 10) 8) test = ratings_rdd_01.filter(lambda x: (x[3] % 10) = 8) Validation MSE : # 1.3.0 Mean Squared Error = 0.871456869392 # 1.2.1 Mean Squared Error = 0.877305629074 Itertools results: 1.3.0 - RSME = 1.354839 (rank = 8 and lambda = 1.0, and numIter = 20) 1.1.1 - RSME = 1.335831 (rank = 8 and lambda = 1.0, and numIter = 10) Cheers k/ On Mon, Feb 23, 2015 at 12:37 PM, Xiangrui Meng men...@gmail.com wrote: Which Spark version did you use? Btw, there are three datasets from MovieLens. The tutorial used the medium one (1 million). -Xiangrui On Mon, Feb 23, 2015 at 8:36 AM, poiuytrez guilla...@databerries.com wrote: What do you mean? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Movie-Recommendation-tutorial-tp21769p21771.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
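One way to build the intuition asked for above (my arithmetic, on assumed 1-5 star ratings, not from the thread): RMSE is in rating units, so 0.87 means predictions are typically off by under a star, while 1.35 means off by well over a star, on a scale that is only 4 stars wide:

```python
# Toy RMSE comparison on a 1-5 star scale: per-star error magnitudes map
# directly to the RMSE figures in the thread.
import math

def rmse(actual, predicted):
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted))
                     / len(actual))

actual = [4.0, 3.0, 5.0, 2.0]
good = [3.2, 3.5, 4.4, 2.9]    # each prediction off by under a star
bad = [2.5, 4.5, 3.5, 3.5]     # each prediction off by 1.5 stars

rmse(actual, good)   # ~0.72
rmse(actual, bad)    # 1.5
```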
Re: randomSplit instead of a huge map reduce ?
- Divide and conquer with reduceByKey (as Ashish mentioned, each pair being the key) would work - it looks like a map-reduce with combiners problem. I think reduceByKey would use combiners while aggregateByKey wouldn't. - Could we optimize this further by using combineByKey directly? Cheers k/ On Fri, Feb 20, 2015 at 6:39 PM, Ashish Rangole arang...@gmail.com wrote: Is there a check you can put in place to not create pairs that aren't in your set of 20M pairs? Additionally, once you have your arrays converted to pairs you can do aggregateByKey with each pair being the key. On Feb 20, 2015 1:57 PM, shlomib shl...@summerhq.com wrote: Hi, I am new to Spark and I think I missed something very basic. I have the following use case (I use Java and run Spark locally on my laptop): I have a JavaRDD<String[]> - The RDD contains around 72,000 arrays of strings (String[]) - Each array contains 80 words (on average). What I want to do is to convert each array into a new array/list of pairs, for example: Input: String[] words = ['a', 'b', 'c'] Output: List<(String, String)> pairs = [('a', 'b'), ('a', 'c'), ('b', 'c')] and then I want to count the number of times each pair appeared, so my final output should be something like: Output: List<(String, String, Integer)> result = [('a', 'b', 3), ('a', 'c', 8), ('b', 'c', 10)] The problem: Since each array contains around 80 words, it returns around 3,200 pairs, so after “mapping” my entire RDD I get 3,200 * 72,000 = *230,400,000* pairs to reduce, which requires way too much memory. (I know I have only around *20,000,000* unique pairs!) I already modified my code and used 'mapPartitions' instead of 'map'. It definitely improved the performance, but I still feel I'm doing something completely wrong.
I was wondering if this is the right 'Spark way' to solve this kind of problem, or maybe I should do something like splitting my original RDD into smaller parts (by using randomSplit), then iterate over each part, aggregate the results into some result RDD (by using 'union') and move on to the next part. Can anyone please explain to me which solution is better? Thank you very much, Shlomi.
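The pre-aggregation idea discussed above can be sketched in plain Python (not Spark): generate pairs per document, combine locally per partition, then merge the partial counts - the roles played by mapPartitions/combiners and reduceByKey respectively. The function name and data are illustrative, not from the thread:

```python
from itertools import combinations
from collections import Counter

def count_pairs(partitions):
    """Count co-occurring word pairs across partitions of documents."""
    total = Counter()
    for docs in partitions:
        local = Counter()          # per-partition pre-aggregation (combiner role)
        for words in docs:
            # sorted() canonicalises each pair so ('a','b') == ('b','a');
            # set() drops duplicate words within one document
            local.update(combinations(sorted(set(words)), 2))
        total.update(local)        # global merge, as reduceByKey would do
    return total

counts = count_pairs([[["a", "b", "c"]], [["a", "b"]]])
```

Because each partition emits only its distinct pairs, the shuffle carries at most (unique pairs x partitions) records instead of the full 230M raw pairs - the same effect reduceByKey's map-side combine gives on an RDD.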
Re: spark-shell working in scala-2.11
Stephen, Scala 2.11 worked fine for me. Did the dev/change-versions script and then compiled. Not using it in production, but I go back and forth between 2.10 and 2.11. Cheers k/ On Wed, Jan 28, 2015 at 12:18 PM, Stephen Haberman stephen.haber...@gmail.com wrote: Hey, I recently compiled Spark master against scala-2.11 (by running the dev/change-versions script), but when I run spark-shell, it looks like the sc variable is missing. Is this a known/unknown issue? Are others successfully using Spark with scala-2.11, and specifically spark-shell? It is possible I did something dumb while compiling master, but I'm not sure what it would be. Thanks, Stephen
[no subject]
Guys, registerTempTable("Employees") gives me the error Exception in thread "main" scala.ScalaReflectionException: class org.apache.spark.sql.catalyst.ScalaReflection in JavaMirror with primordial classloader with boot classpath [/Applications/eclipse/plugins/org.scala-lang.scala-library_2.11.4.v20141023-110636-d783face36.jar:/Applications/eclipse/plugins/org.scala-lang.scala-reflect_2.11.4.v20141023-110636-d783face36.jar:/Applications/eclipse/plugins/org.scala-lang.scala-actors_2.11.4.v20141023-110636-d783face36.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_60.jdk/Contents/Home/jre/lib/resources.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_60.jdk/Contents/Home/jre/lib/rt.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_60.jdk/Contents/Home/jre/lib/sunrsasign.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_60.jdk/Contents/Home/jre/lib/jsse.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_60.jdk/Contents/Home/jre/lib/jce.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_60.jdk/Contents/Home/jre/lib/charsets.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_60.jdk/Contents/Home/jre/lib/jfr.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_60.jdk/Contents/Home/jre/classes] not found. Probably something obvious I am missing. Everything else works fine, so far. Any easy fix? Cheers k/
Re: DeepLearning and Spark ?
I am also looking at this domain. We could potentially use the broadcast capability in Spark to distribute the parameters. Haven't thought it thru yet. Cheers k/ On Fri, Jan 9, 2015 at 2:56 PM, Andrei faithlessfri...@gmail.com wrote: Does it make sense to use Spark's actor system (e.g. via SparkContext.env.actorSystem) to create a parameter server? On Fri, Jan 9, 2015 at 10:09 PM, Peng Cheng rhw...@gmail.com wrote: You are not the first :) probably not the fifth to have the question. A parameter server is not included in the spark framework and I've seen all kinds of hacking to improvise it: REST api, HDFS, tachyon, etc. Not sure if an 'official' benchmark implementation will be released soon On 9 January 2015 at 10:59, Marco Shaw marco.s...@gmail.com wrote: Pretty vague on details: http://www.datasciencecentral.com/m/blogpost?id=6448529%3ABlogPost%3A227199 On Jan 9, 2015, at 11:39 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Hi all, DeepLearning algorithms are popular and achieve state-of-the-art performance in several real-world machine learning problems. Currently there are no DL implementations in spark and I wonder if there is ongoing work on this topic. We can do DL in spark with Sparkling Water and H2O but this adds an additional software stack. Deeplearning4j seems to implement a distributed version of many popular DL algorithms. Porting DL4j to Spark can be interesting. Google describes an implementation of large-scale DL in this paper http://research.google.com/archive/large_deep_networks_nips2012.html, based on model parallelism and data parallelism. So, I'm trying to imagine what would be a good design for DL algorithms in Spark. Spark already has RDDs (for data parallelism). Can GraphX be used for the model parallelism (as DNNs are generally designed as DAGs)? And what about using GPUs for local parallelism (a mechanism to push partitions into GPU memory)? What do you think about this? Cheers, Jao
Re: Re: I think I am almost lost in the internals of Spark
Interestingly Google Chrome translates the materials. Cheers k/ On Tue, Jan 6, 2015 at 7:26 PM, Boromir Widas vcsub...@gmail.com wrote: I do not understand Chinese but the diagrams on that page are very helpful. On Tue, Jan 6, 2015 at 9:46 PM, eric wong win19...@gmail.com wrote: A good beginning if you are Chinese. https://github.com/JerryLead/SparkInternals/tree/master/markdown 2015-01-07 10:13 GMT+08:00 bit1...@163.com bit1...@163.com: Thank you, Tobias. I will look into the Spark paper. But it looks like the paper has been moved from http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf. A web page is returned ("Resource not found") when I access it. -- bit1...@163.com *From:* Tobias Pfeiffer t...@preferred.jp *Date:* 2015-01-07 09:24 *To:* Todd bit1...@163.com *CC:* user user@spark.apache.org *Subject:* Re: I think I am almost lost in the internals of Spark Hi, On Tue, Jan 6, 2015 at 11:24 PM, Todd bit1...@163.com wrote: I am a bit new to Spark, except that I tried simple things like word count, and the examples given in the spark sql programming guide. Now, I am investigating the internals of Spark, but I think I am almost lost, because I could not grasp a whole picture of what Spark does when it executes the word count. I recommend understanding what an RDD is and how it is processed, using http://spark.apache.org/docs/latest/programming-guide.html#resilient-distributed-datasets-rdds and probably also http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf (once the server is back). Understanding how an RDD is processed is probably most helpful to understand the whole of Spark. Tobias -- 王海华
Re: Spark for core business-logic? - Replacing: MongoDB?
Alec, Good questions. Suggestions: 1. Refactor the problem into layers viz. DFS, Data Store, DB, SQL Layer, Cache, Queue, App Server, App (Interface), App (backend ML) et al. 2. Then slot-in the appropriate technologies - maybe even multiple technologies for the same layer - and then work thru the pros and cons. 3. Looking at the layers (moving from the easy to the difficult, the mundane to the esoteric ;o)): - Cache & Queue - stick with what you are comfortable with ie Redis et al. Also take a look at Kafka - App Server - Tomcat et al - App (Interface) - JavaScript et al - DB, SQL Layer - Better off with MongoDB. You can explore HBase, but it is not the same. - The same way as MongoDB != mySQL, HBase != MongoDB - Machine Learning Server/Layer - Spark would fit very well here. - Machine Learning DFS, Data Store - HDFS - The idea of pushing the data to Hadoop for ML is good - But you need to think thru things like incremental data load, semantics like at-least-once, at-most-once et al. 4. You could architect it all with the Hadoop ecosystem. It might work, depending on the system. - But I would use caution. Most probably many of the elements would rather be implemented in appropriate technologies. 5. Double-click a couple more times on the design, think thru the functionality, scaling requirements et al - Draw 3 or 4 alternatives and jot down the top 5 requirements, pros and cons, the knowns and the unknowns - The optimum design will fall thru Cheers k/ On Sat, Jan 3, 2015 at 4:43 PM, Alec Taylor alec.tayl...@gmail.com wrote: In the middle of doing the architecture for a new project, which has various machine learning and related components, including: recommender systems, search engines and sequence [common intersection] matching. Usually I use: MongoDB (as db), Redis (as cache) and celery (as queue, backed by Redis). Though I don't have experience with Hadoop, I was thinking of using Hadoop for the machine-learning (as this will become a Big Data problem quite quickly).
To push the data into Hadoop, I would use a connector of some description, or push the MongoDB backups into HDFS at set intervals. However I was thinking that it might be better to put the whole thing in Hadoop, store all persistent data in Hadoop, and maybe do all the layers in Apache Spark (with caching remaining in Redis). Is that a viable option? - Most of what I see discusses Spark (and Hadoop in general) for analytics only. Apache Phoenix exposes a nice interface for read/write over HBase, so I might use that if Spark ends up being the wrong solution. Thanks for all suggestions, Alec Taylor PS: I need this for both Big and Small data. Note that I am using the Cloudera definition of Big Data referring to processing/storage across more than 1 machine. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Calling ALS-MlLib from desktop application/ Training ALS
a) There is no absolute RMSE - it depends on the domain. Also RMSE is the error based on what you have seen so far, a snapshot of a slice of the domain. b) My suggestion is put the system in place, see what happens when users interact with the system and then you can think of reducing the RMSE as needed. For all you know, RMSE could go up with another set of data c) I would prefer Scala, but Java would work as well. d) For a desktop app, you have two ways to go. Either run Spark in the local machine and build an app, or have Spark run in a server/cluster and build a browser app. This depends on the data size and scaling requirements. e) I haven't seen any C# interfaces. Might be a good request candidate. Cheers k/ On Sat, Dec 13, 2014 at 7:17 PM, Saurabh Agrawal saurabh.agra...@markit.com wrote: Requesting guidance on my queries in the trailing email. -Original Message- *From: *Saurabh Agrawal *Sent: *Saturday, December 13, 2014 07:06 PM GMT Standard Time *To: *user@spark.apache.org *Subject: *Building Desktop application for ALS-MlLib/ Training ALS Hi, I am a newbie in the spark and scala world I have been trying to implement Collaborative filtering using MlLib supplied out of the box with Spark and Scala I have 2 problems 1. The best model was trained with rank = 20 and lambda = 5.0, and numIter = 10, and its RMSE on the test set is 25.718710831912485. The best model improves the baseline by 18.29%. Is there a scientific way in which RMSE could be brought down? What is a decent acceptable value for RMSE? 2. I picked up the Collaborative filtering algorithm from http://ampcamp.berkeley.edu/5/exercises/movie-recommendation-with-mllib.html and executed the given code with my dataset. Now, I want to build a desktop application around it. a. What is the best language to do this Java/ Scala? Any possibility to do this using C#? b. Can somebody please share any relevant documents/ source or any helper links to help me get started on this? Your help is greatly appreciated Thanks!!
Regards, Saurabh Agrawal
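For reference, the RMSE discussed in this thread is simply the root of the mean squared prediction error. A minimal implementation (plain Python, illustrative only):

```python
def rmse(predictions, actuals):
    """Root-mean-square error between predicted and observed ratings."""
    assert len(predictions) == len(actuals) and predictions
    se = sum((p - a) ** 2 for p, a in zip(predictions, actuals))
    return (se / len(predictions)) ** 0.5
```

This makes point (a) above concrete: RMSE is in the units of the rating itself, so 25.7 would be enormous on a 1-5 star scale but small on a 0-1000 scale - there is no absolute "good" value.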
Re: Spark or MR, Scala or Java?
Good point. On the positive side, whether we choose the most efficient mechanism in Scala might not be as important, as the Spark framework mediates the distributed computation. Even if there is some declarative part in Spark, we can still choose an inefficient computation path that is not apparent to the framework. Cheers k/ P.S: Now Reply to ALL On Sun, Nov 23, 2014 at 11:44 AM, Ognen Duzlevski ognen.duzlev...@gmail.com wrote: On Sun, Nov 23, 2014 at 1:03 PM, Ashish Rangole arang...@gmail.com wrote: Java or Scala: I knew Java already, yet I learnt Scala when I came across Spark. As others have said, you can get started with a little bit of Scala and learn more as you progress. Once you have started using Scala for a few weeks you would want to stay with it instead of going back to Java. Scala is arguably more elegant and less verbose than Java, which translates into higher developer productivity and more maintainable code. Scala is arguably more elegant and less verbose than Java. However, Scala is also a complex language with a lot of details and tidbits and one-offs that you just have to remember. It is sometimes difficult to make a decision whether what you wrote is using the language features most effectively or if you missed out on an available feature that could have made the code better or more concise. For Spark you really do not need to know that much Scala but you do need to understand the essence of it. Thanks for the good discussion! :-) Ognen
Re: Spark or MR, Scala or Java?
A very timely article: http://rahulkavale.github.io/blog/2014/11/16/scrap-your-map-reduce/ Cheers k/ P.S: Now reply to ALL.
Re: Spark or MR, Scala or Java?
Adding to already interesting answers: - Is there any case where MR is better than Spark? I don't know what cases I should be used Spark by MR. When is MR faster than Spark? - Many. MR would be better (am not saying faster ;o)) for - Very large datasets, - Multistage map-reduce flows, - Complex map-reduce semantics - Spark is definitely better for the classic iterative, interactive workloads. - Spark is very effective for implementing the concepts of in-memory datasets & real-time analytics - Take a look at the Lambda architecture - Also check out how Ooyala is using Spark in multiple layer configurations. They also have MR in many places - In our case, we found Spark very effective for ELT - we would have used MR earlier - I know Java, is it worth it to learn Scala for programming to Spark or it's okay just with Java? - Java will work fine. Especially when Java 8 becomes the norm, we will get back some of the elegance - I, personally, like Scala & Python a lot better than Java. Scala is a lot more elegant, but compilations, IDE integration et al are still clunky - One word of caution - stick with one language as much as possible - shuffling between Java & Scala is not fun Cheers & HTH k/ On Sat, Nov 22, 2014 at 8:26 AM, Sean Owen so...@cloudera.com wrote: MapReduce is simpler and narrower, which also means it is generally lighter weight, with less to know and configure, and runs more predictably. If you have a job that is truly just a few maps, with maybe one reduce, MR will likely be more efficient. Until recently its shuffle has been more developed and offers some semantics the Spark shuffle does not. I suppose it integrates with tools like Oozie, that Spark does not. I suggest learning enough Scala to use Spark in Scala. The amount you need to know is not large. (Mahout MR based implementations do not run on Spark and will not. They have been removed instead.)
On Nov 22, 2014 3:36 PM, Guillermo Ortiz konstt2...@gmail.com wrote: Hello, I'm a newbie with Spark but I've been working with Hadoop for a while. I have two questions. Is there any case where MR is better than Spark? I don't know what cases I should use Spark over MR. When is MR faster than Spark? The other question is, I know Java, is it worth it to learn Scala for programming to Spark or it's okay just with Java? I have done a little piece of code with Java because I feel more confident with it, but it seems that I'm missing something
Re: Breaking the previous large-scale sort record with Spark
Well done guys. MapReduce sort at that time was a good feat and Spark now has raised the bar with the ability to sort a PB. Like some of the folks in the list, a summary of what worked (and didn't) as well as the monitoring practices would be good. Cheers k/ P.S: What are you folks planning next ? On Fri, Oct 10, 2014 at 7:54 AM, Matei Zaharia matei.zaha...@gmail.com wrote: Hi folks, I interrupt your regularly scheduled user / dev list to bring you some pretty cool news for the project, which is that we've been able to use Spark to break MapReduce's 100 TB and 1 PB sort records, sorting data 3x faster on 10x fewer nodes. There's a detailed writeup at http://databricks.com/blog/2014/10/10/spark-breaks-previous-large-scale-sort-record.html. Summary: while Hadoop MapReduce held last year's 100 TB world record by sorting 100 TB in 72 minutes on 2100 nodes, we sorted it in 23 minutes on 206 nodes; and we also scaled up to sort 1 PB in 234 minutes. I want to thank Reynold Xin for leading this effort over the past few weeks, along with Parviz Deyhim, Xiangrui Meng, Aaron Davidson and Ali Ghodsi. In addition, we'd really like to thank Amazon's EC2 team for providing the machines to make this possible. Finally, this result would of course not be possible without the many many other contributions, testing and feature requests from throughout the community. For an engine to scale from these multi-hour petabyte batch jobs down to 100-millisecond streaming and interactive queries is quite uncommon, and it's thanks to all of you folks that we are able to make this happen. Matei - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Can not see any spark metrics on ganglia-web
Hi, Yes - you use the -Pspark-ganglia-lgpl switch to enable Ganglia. The -Phadoop-2.3, -Pyarn and -Phive flags only add the support for Hadoop, YARN and Hive in the Spark executable; no need to include them if one is not using them. Cheers k/ On Thu, Oct 2, 2014 at 12:29 PM, danilopds danilob...@gmail.com wrote: Hi tsingfu, I want to see metrics in ganglia too. But I don't understand this step: ./make-distribution.sh --tgz --skip-java-test -Phadoop-2.3 -Pyarn -Phive -Pspark-ganglia-lgpl Are you installing the hadoop, yarn, hive AND ganglia?? If I want to install just ganglia? Can you suggest me something? Thanks!
Re: MLlib Linear Regression Mismatch
Thanks Burak. Step size 0.01 worked for b) and step=0.0001 for c) ! Cheers k/ On Wed, Oct 1, 2014 at 3:00 PM, Burak Yavuz bya...@stanford.edu wrote: Hi, It appears that the step size is too high that the model is diverging with the added noise. Could you try by setting the step size to be 0.1 or 0.01? Best, Burak - Original Message - From: Krishna Sankar ksanka...@gmail.com To: user@spark.apache.org Sent: Wednesday, October 1, 2014 12:43:20 PM Subject: MLlib Linear Regression Mismatch Guys, Obviously I am doing something wrong. May be 4 points are too small a dataset. Can you help me to figure out why the following doesn't work ? a) This works : data = [ LabeledPoint(0.0, [0.0]), LabeledPoint(10.0, [10.0]), LabeledPoint(20.0, [20.0]), LabeledPoint(30.0, [30.0]) ] lrm = LinearRegressionWithSGD.train(sc.parallelize(data), initialWeights=array([1.0])) print lrm print lrm.weights print lrm.intercept lrm.predict([40]) output: pyspark.mllib.regression.LinearRegressionModel object at 0x109813d50 [ 1.] 0.0 40.0 b) By perturbing the y a little bit, the model gives wrong results: data = [ LabeledPoint(0.0, [0.0]), LabeledPoint(9.0, [10.0]), LabeledPoint(22.0, [20.0]), LabeledPoint(32.0, [30.0]) ] lrm = LinearRegressionWithSGD.train(sc.parallelize(data), initialWeights=array([1.0])) # should be 1.09x -0.60 print lrm print lrm.weights print lrm.intercept lrm.predict([40]) Output: pyspark.mllib.regression.LinearRegressionModel object at 0x109666590 [ -8.20487463e+203] 0.0 -3.2819498532740317e+205 c) Same story here - wrong results. Actually nan: data = [ LabeledPoint(18.9, [3910.0]), LabeledPoint(17.0, [3860.0]), LabeledPoint(20.0, [4200.0]), LabeledPoint(16.6, [3660.0]) ] lrm = LinearRegressionWithSGD.train(sc.parallelize(data), initialWeights=array([1.0])) # should be ~ 0.006582x -7.595170 print lrm print lrm.weights print lrm.intercept lrm.predict([4000]) Output:pyspark.mllib.regression.LinearRegressionModel object at 0x109666b90 [ nan] 0.0 nan Cheers Thanks k/
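The step-size effect in this thread can be reproduced without Spark. A toy one-weight gradient-descent loop (illustrative only - not the MLlib implementation, and using its own step-size scale rather than MLlib's defaults) shows why case (b) diverges until the step is shrunk:

```python
def fit_sgd(points, step, iters=100):
    """Gradient descent for y ~ w*x (single feature, no intercept)."""
    w = 1.0
    n = len(points)
    for _ in range(iters):
        # gradient of the mean squared error with respect to w
        grad = sum(2.0 * (w * x - y) * x for x, y in points) / n
        w -= step * grad
    return w

# The perturbed data from case (b): (feature, label) pairs
data_b = [(0.0, 0.0), (10.0, 9.0), (20.0, 22.0), (30.0, 32.0)]

w_ok  = fit_sgd(data_b, step=0.001)  # converges to the least-squares slope
w_bad = fit_sgd(data_b, step=0.01)   # step too large for the feature scale: diverges
```

The update is w <- w(1 - step*2*mean(x^2)) + step*2*mean(xy); once |1 - step*2*mean(x^2)| exceeds 1, each iteration multiplies the error, producing the huge weights (and eventually nan) seen in cases (b) and (c).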
Re: MLlib 1.2 New Interesting Features
Thanks Xiangrui. Appreciate the insights. I have uploaded the initial version of my presentation at http://goo.gl/1nBD8N Cheers k/ On Mon, Sep 29, 2014 at 12:17 AM, Xiangrui Meng men...@gmail.com wrote: Hi Krishna, Some planned features for MLlib 1.2 can be found via Spark JIRA: http://bit.ly/1ywotkm , though this list is not fixed. The feature freeze will happen by the end of Oct. Then we will cut branch-1.2 and start QA. I don't recommend using branch-1.2 for a hands-on tutorial around Oct 29th because that branch is not fully tested at that time. You should use 1.1 instead. Its binary packages and documentation can be easily found on spark.apache.org, which is important for making a hands-on tutorial. Best, Xiangrui
MLlib 1.2 New Interesting Features
Guys, - Need help in terms of the interesting features coming up in MLlib 1.2. - I have a 2-part, ~3 hr hands-on tutorial at the Big Data Tech Con - The Hitchhiker's Guide to Machine Learning with Python & Apache Spark [2] - At minimum, it would be good to take the last 30 min to elaborate the new features in MLlib coming up in 1.2. - If the features are stable, I might use 1.2 for the tutorial. - It is hands-on, so I want to use a stable Spark version. 1. What are the salient ML features slated to be part of 1.2? 2. Which branch should I look at? 3. Will it be stable enough by Oct 29th for the attendees to download? Then I can plan the materials around it. Cheers k/ [1] My two talks: http://www.bigdatatechcon.com/speakers.html#KrishnaSankar [2] Spark Talk: http://goo.gl/4Pcvuq
Re: Out of any idea
Probably you have - if not, try a very simple app in the docker container and make sure it works. Sometimes resource contention/allocation can get in the way. This happened to me in the YARN container. Also try single worker thread. Cheers k/ On Sat, Jul 19, 2014 at 2:39 PM, boci boci.b...@gmail.com wrote: Hi guys! I run out of ideas... I created a spark streaming job (kafka - spark - ES). If I start my app local machine (inside the editor, but connect to the real kafka and ES) the application work correctly. If I start it in my docker container (same kafka and ES, local mode (local[4]) like inside my editor) the application connect to kafka, receive the message but after that nothing happened (I put config/log4j.properties to debug mode and I see BlockGenerator receive the data bu after that nothing happened with that. (first step I simply run a map to print the received data with log4j) I hope somebody can help... :( b0c1 -- Skype: boci13, Hangout: boci.b...@gmail.com
Re: Need help on spark Hbase
One vector to check is the HBase libraries in the --jars, as in: spark-submit --class <your class> --master <master url> --jars hbase-client-0.98.3-hadoop2.jar,commons-csv-1.0-SNAPSHOT.jar,hbase-common-0.98.3-hadoop2.jar,hbase-hadoop2-compat-0.98.3-hadoop2.jar,hbase-it-0.98.3-hadoop2.jar,hbase-protocol-0.98.3-hadoop2.jar,hbase-server-0.98.3-hadoop2.jar,htrace-core-2.04.jar,spark-assembly-1.0.0-hadoop2.2.0.jar badwclient.jar This worked for us. Cheers k/ On Tue, Jul 15, 2014 at 6:47 AM, Madabhattula Rajesh Kumar mrajaf...@gmail.com wrote: Hi Team, Could you please help me to resolve the issue. *Issue*: I'm not able to connect to HBase from spark-submit. Below is my code. When I execute the below program standalone, I'm able to connect to HBase and do the operation. When I execute the below program using the spark-submit ( ./bin/spark-submit ) command, I'm not able to connect to HBase. Am I missing anything? import java.util.HashMap; import java.util.List; import java.util.Map; import java.util.Properties; import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.hbase.HBaseConfiguration; import org.apache.hadoop.hbase.client.Put; import org.apache.log4j.Logger; import org.apache.spark.SparkConf; import org.apache.spark.api.java.JavaRDD; import org.apache.spark.api.java.function.Function; import org.apache.spark.streaming.Duration; import org.apache.spark.streaming.api.java.JavaDStream; import org.apache.spark.streaming.api.java.JavaStreamingContext; import org.apache.hadoop.hbase.HTableDescriptor; import org.apache.hadoop.hbase.client.HBaseAdmin; public class Test { public static void main(String[] args) throws Exception { JavaStreamingContext ssc = new JavaStreamingContext("local", "Test", new Duration(4), sparkHome, ""); JavaDStream<String> lines_2 = ssc.textFileStream(hdfsfolderpath); Configuration configuration = HBaseConfiguration.create(); configuration.set("hbase.zookeeper.property.clientPort", "2181"); configuration.set("hbase.zookeeper.quorum", "localhost");
configuration.set("hbase.master", "localhost:60"); HBaseAdmin hBaseAdmin = new HBaseAdmin(configuration); if (hBaseAdmin.tableExists(HABSE_TABLE)) { System.out.println(" ANA_DATA table exists .."); } System.out.println(" HELLO HELLO HELLO "); ssc.start(); ssc.awaitTermination(); } } Thank you for your help and support. Regards, Rajesh
Re: Requirements for Spark cluster
I rsync the spark-1.0.1 directory to all the nodes. Yep, one needs Spark in all the nodes irrespective of Hadoop/YARN. Cheers k/ On Tue, Jul 8, 2014 at 6:24 PM, Robert James srobertja...@gmail.com wrote: I have a Spark app which runs well on local master. I'm now ready to put it on a cluster. What needs to be installed on the master? What needs to be installed on the workers? If the cluster already has Hadoop or YARN or Cloudera, does it still need an install of Spark?
Re: Unable to run Spark 1.0 SparkPi on HDP 2.0
Konstantin, 1. You need to install the hadoop RPMs on all nodes. If it is Hadoop 2, the nodes would have hdfs & YARN. 2. Then you need to install Spark on all nodes. I haven't had experience with HDP, but the tech preview might have installed Spark as well. 3. In the end, one should have hdfs, yarn & spark installed on all the nodes. 4. After installations, check the web console to make sure hdfs, yarn & spark are running. 5. Then you are ready to start experimenting with/developing spark applications. HTH. Cheers k/ On Mon, Jul 7, 2014 at 2:34 AM, Konstantin Kudryavtsev kudryavtsev.konstan...@gmail.com wrote: guys, I'm not talking about running spark on a VM, I don't have a problem with that. I am confused about the following: 1) Hortonworks describes the installation process as RPMs on each node 2) the spark home page says that everything I need is YARN And I'm stuck with understanding what I need to do to run spark on yarn (do I need RPM installations or only to build spark on an edge node?) Thank you, Konstantin Kudryavtsev On Mon, Jul 7, 2014 at 4:34 AM, Robert James srobertja...@gmail.com wrote: I can say from my experience that getting Spark to work with Hadoop 2 is not for the beginner; after solving one problem after another (dependencies, scripts, etc.), I went back to Hadoop 1. Spark's Maven, ec2 scripts, and others all use Hadoop 1 - not sure why, but, given so, Hadoop 2 has too many bumps On 7/6/14, Marco Shaw marco.s...@gmail.com wrote: That is confusing based on the context you provided. This might take more time than I can spare to try to understand. For sure, you need to add Spark to run it in/on the HDP 2.1 express VM. Cloudera's CDH 5 express VM includes Spark, but the service isn't running by default. I can't remember for MapR...
Marco On Jul 6, 2014, at 6:33 PM, Konstantin Kudryavtsev kudryavtsev.konstan...@gmail.com wrote: Marco, Hortonworks provides a Tech Preview of Spark 0.9.1 with HDP 2.1 that you can try from http://hortonworks.com/wp-content/uploads/2014/05/SparkTechnicalPreview.pdf HDP 2.1 means YARN; at the same time they propose to install RPMs. On the other hand, http://spark.apache.org/ says "Integrated with Hadoop - Spark can run on Hadoop 2's YARN cluster manager, and can read any existing Hadoop data. If you have a Hadoop 2 cluster, you can run Spark without any installation needed." And this is confusing for me... do I need the RPM installation or not?... Thank you, Konstantin Kudryavtsev On Sun, Jul 6, 2014 at 10:56 PM, Marco Shaw marco.s...@gmail.com wrote: Can you provide links to the sections that are confusing? My understanding: the HDP1 binaries do not need YARN, while the HDP2 binaries do. Now, you can also install the Hortonworks Spark RPM... For production, in my opinion, RPMs are better for manageability. On Jul 6, 2014, at 5:39 PM, Konstantin Kudryavtsev kudryavtsev.konstan...@gmail.com wrote: Hello, thanks for your message... I'm confused: Hortonworks suggests installing the Spark RPM on each node, but the Spark main page says that YARN is enough and I don't need to install it... What is the difference? sent from my HTC On Jul 6, 2014 8:34 PM, vs vinayshu...@gmail.com wrote: Konstantin, HWRK provides a Tech Preview of Spark 0.9.1 with HDP 2.1 that you can try from http://hortonworks.com/wp-content/uploads/2014/05/SparkTechnicalPreview.pdf Let me know if you see issues with the tech preview.
Spark Pi example on HDP 2.0

I downloaded the Spark 1.0 pre-built package (for HDP2) from http://spark.apache.org/downloads.html and ran the example from the Spark web-site:

./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --num-executors 3 --driver-memory 2g --executor-memory 2g --executor-cores 1 ./lib/spark-examples-1.0.0-hadoop2.2.0.jar 2

I got this error: Application application_1404470405736_0044 failed 3 times due to AM Container for appattempt_1404470405736_0044_03 exited with exitCode: 1 due to: Exception from container-launch: org.apache.hadoop.util.Shell$ExitCodeException: at org.apache.hadoop.util.Shell.runCommand(Shell.java:464) at org.apache.hadoop.util.Shell.run(Shell.java:379) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:283) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:79) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at
Re: Spark Processing Large Data Stuck
Hi, - I have seen similar behavior before. As far as I can tell, the root cause is an out-of-memory error - I verified this by monitoring the memory. - I had a 30 GB file and was running on a single machine with 16 GB, so I knew it would fail. - But instead of raising an exception, some part of the system keeps on churning. - My suggestion is to review the memory settings for the JVM (try bigger settings), make sure the settings are propagated to all the workers, and finally monitor the memory while the job is running. - Another approach is to split the file and try progressively increasing sizes. - I also see symptoms of failed connections. While I can't positively say that it is a problem, check your topology & network connectivity. - Out of curiosity, what kind of machines are you running ? Bare metal ? EC2 ? How much memory ? 64 bit OS ? - I assume these are big machines and so the resources themselves might not be a problem. Cheers k/ On Sat, Jun 21, 2014 at 12:55 PM, yxzhao yxz...@ualr.edu wrote: I ran the pagerank example processing a large data set, 5GB in size, using 48 machines. The job got stuck at the time point 14/05/20 21:32:17, as the attached log shows. It was stuck there for more than 10 hours and then I killed it at last. But I did not find any information explaining why it was stuck. Any suggestions? Thanks. Spark_OK_48_pagerank.log http://apache-spark-user-list.1001560.n3.nabble.com/file/n8075/Spark_OK_48_pagerank.log -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Processing-Large-Data-Stuck-tp8075.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
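As a minimal sketch of the kind of memory settings suggested above (the values, master URL and app name are illustrative assumptions, not recommendations for this particular workload):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical settings -- sizes must be tuned to the actual machines/dataset,
// and the same settings have to reach every worker, not just the driver.
val conf = new SparkConf()
  .setMaster("spark://master:7077")            // assumed master URL
  .setAppName("PageRankTuning")                // assumed app name
  .set("spark.executor.memory", "12g")         // per-executor JVM heap
  .set("spark.storage.memoryFraction", "0.5")  // fraction of heap for cached RDDs

val sc = new SparkContext(conf)
```

The same values can also be set cluster-wide via the workers' configuration; the key point from the thread is to verify that the setting actually propagated, e.g. by watching memory on the workers while the job runs.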
Re: Spark streaming RDDs to Parquet records
Mahesh, - One direction could be: create a Parquet schema, convert & save the records to HDFS. - This might help: https://github.com/massie/spark-parquet-example/blob/master/src/main/scala/com/zenfractal/SparkParquetExample.scala Cheers k/ On Tue, Jun 17, 2014 at 12:52 PM, maheshtwc mahesh.padmanab...@twc-contractor.com wrote: Hello, Is there an easy way to convert RDDs within a DStream into Parquet records? Here is some incomplete pseudo code: // Create streaming context val ssc = new StreamingContext(...) // Obtain a DStream of events val ds = KafkaUtils.createStream(...) // Get Spark context to get to the SQL context val sc = ds.context.sparkContext val sqlContext = new org.apache.spark.sql.SQLContext(sc) // For each RDD ds.foreachRDD((rdd: RDD[Array[Byte]]) => { // What do I do next? }) Thanks, Mahesh -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-streaming-RDDs-to-Parquet-records-tp7762.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
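One hedged way to fill in the "What do I do next?" gap, using Spark SQL's Parquet support from that era (the Event case class, the parse function, and the output path are assumptions about the application; the real schema depends on what Kafka delivers):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream

// Hypothetical event type -- stands in for whatever the Kafka bytes decode to.
case class Event(id: String, value: Int)

def saveStreamAsParquet(ssc: StreamingContext,
                        ds: DStream[Array[Byte]],
                        parse: Array[Byte] => Event): Unit = {
  val sqlContext = new org.apache.spark.sql.SQLContext(ssc.sparkContext)
  import sqlContext.createSchemaRDD   // implicit RDD[Event] -> SchemaRDD

  ds.foreachRDD { (rdd: RDD[Array[Byte]]) =>
    val events: RDD[Event] = rdd.map(parse)
    // One Parquet directory per batch; a real job would partition by time
    // or append into a partitioned layout instead.
    events.saveAsParquetFile(s"hdfs:///events/batch-${System.currentTimeMillis}")
  }
}
```

The `createSchemaRDD` implicit turns the case-class RDD into a SchemaRDD, whose `saveAsParquetFile` writes the Parquet schema derived from the case class.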
Re: GroupByKey results in OOM - Any other alternative
Ian, Yep, HLL is an appropriate mechanism. countApproxDistinctByKey is a wrapper around com.clearspring.analytics.stream.cardinality.HyperLogLogPlus. Cheers k/ On Sun, Jun 15, 2014 at 4:50 PM, Ian O'Connell i...@ianoconnell.com wrote: Depending on your requirements, when doing hourly metrics calculating distinct cardinality, a much more scalable method would be to use a HyperLogLog data structure. A Scala impl people have used with Spark would be https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/HyperLogLog.scala On Sun, Jun 15, 2014 at 6:16 AM, Surendranauth Hiraman suren.hira...@velos.io wrote: Vivek, If the foldByKey solution doesn't work for you, my team uses RDD.persist(DISK_ONLY) to avoid OOM errors. It's slower, of course, and requires tuning other config parameters. It can also be a problem if you do not have enough disk space, meaning that you have to unpersist at the right points if you are running long flows. For us, even though the disk writes are a performance hit, we prefer the Spark programming model to Hadoop M/R. But we are still working on getting this to work end to end on 100s of GB of data on our 16-node cluster. Suren On Sun, Jun 15, 2014 at 12:08 AM, Vivek YS vivek...@gmail.com wrote: Thanks for the input. I will give foldByKey a shot. The way I am doing it is: data is partitioned hourly, so I am computing distinct values hourly. Then I use unionRDD to merge them and compute distinct on the overall data. Is there a way to know which (key, value) pair is resulting in the OOM? Is there a way to set parallelism in the map stage so that each worker will process one key at a time? I didn't realise countApproxDistinctByKey is using HyperLogLogPlus. This should be interesting. --Vivek On Sat, Jun 14, 2014 at 11:37 PM, Sean Owen so...@cloudera.com wrote: Grouping by key is always problematic since a key might have a huge number of values.
You can do a little better than grouping *all* values and *then* finding distinct values by using foldByKey, putting values into a Set. At least you end up with only distinct values in memory. (You don't need two maps either, right?) If the number of distinct values is still huge for some keys, consider the experimental method countApproxDistinctByKey: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L285 This should be much more performant at the cost of some accuracy. On Sat, Jun 14, 2014 at 1:58 PM, Vivek YS vivek...@gmail.com wrote: Hi, For the last couple of days I have been trying hard to get around this problem. Please share any insights on solving this problem. Problem: There is a huge list of (key, value) pairs. I want to transform this to (key, distinct values) and then eventually to (key, distinct values count). On a small dataset: groupByKey().map(x => (x._1, x._2.distinct)) ... map(x => (x._1, x._2.distinct.count)) On a large data set I am getting OOM. Is there a way to represent the Seq of values from groupByKey as an RDD and then perform distinct over it? Thanks Vivek -- SUREN HIRAMAN, VP TECHNOLOGY Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR NEW YORK, NY 10001 O: (917) 525-2466 ext. 105 F: 646.349.4063 E: suren.hira...@velos.io W: www.velos.io
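A sketch of the set-per-key idea above, alongside the approximate alternative (the input path and the assumption that the key and value are the first two CSV columns are illustrative; note that with very high-cardinality keys the exact version still holds all distinct values in memory, which is exactly where countApproxDistinctByKey helps):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._   // brings PairRDDFunctions into scope

def distinctCounts(sc: SparkContext, path: String) = {
  // (key, value) pairs -- assumed to be the first two CSV columns.
  val pairs = sc.textFile(path).map(_.split(",")).map(f => (f(0), f(1)))

  // Exact: fold each value into a per-key Set, then count.
  // Only distinct values survive the set union, so memory is bounded by
  // the number of *distinct* values per key rather than all values.
  val exact = pairs
    .mapValues(v => Set(v))
    .reduceByKey(_ ++ _)
    .mapValues(_.size)

  // Approximate: HyperLogLog-based, bounded memory per key.
  val approx = pairs.countApproxDistinctByKey(0.01)   // ~1% relative error

  (exact, approx)
}
```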
Re: Multi-dimensional Uniques over large dataset
And got the first cut: val res = pairs.groupByKey().map((x) => (x._1, x._2.size, x._2.toSet.size)) gives the total & unique. The question: is it scalable & efficient? Would appreciate insights. Cheers k/ On Fri, Jun 13, 2014 at 10:51 PM, Krishna Sankar ksanka...@gmail.com wrote: Answered one of my questions (#5): val pairs = new PairRDDFunctions(RDD) works fine locally. Now I can do groupByKey et al. Am not sure if it is scalable for millions of records & memory efficient. Cheers k/ On Fri, Jun 13, 2014 at 8:52 PM, Krishna Sankar ksanka...@gmail.com wrote: Hi, Would appreciate insights and wisdom on a problem we are working on: 1. Context: - Given a csv file like: - d1,c1,a1 - d1,c1,a2 - d1,c2,a1 - d1,c1,a1 - d2,c1,a3 - d2,c2,a1 - d3,c1,a1 - d3,c3,a1 - d3,c2,a1 - d3,c3,a2 - d5,c1,a3 - d5,c2,a2 - d5,c3,a2 - Want to find uniques and totals (of the d_ across the c_ and a_ dimensions): - Tot Unique - c1 6 4 - c2 4 4 - c3 2 2 - a1 7 3 - a2 4 3 - a3 2 2 - c1-a1 ... - c1-a2 ... - c1-a3 ... - c2-a1 ... - c2-a2 ... - ... - c3-a3 - Obviously there are millions of records and more attributes/dimensions, so scalability is key. 2. We think Spark is a good stack for this problem. Have a few questions: 3. From a Spark substrate perspective, what are some of the optimal transformations & things to watch out for? 4. Is PairRDD the best data representation? GroupByKey et al are only available for PairRDD. 5. On a pragmatic level, file.map().map() results in an RDD. How do I transform it to a PairRDD? 1. .map(fields => (fields(1), fields(0)) - results in Unit 2. .map(fields => fields(1) -> fields(0)) also is not working 3. Both these do not result in a PairRDD 4. Am missing something fundamental. Cheers Have a nice weekend k/
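For question #5 above, a sketch of getting from the raw lines to a PairRDD and the Tot/Unique columns without a full groupByKey (emitting one pair per c_ dimension, a_ dimension, and c-a combination is an assumption about the desired keys; the lambdas need `=>`, and the implicit conversion from SparkContext._ is what makes reduceByKey available):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._   // implicit conversion to PairRDDFunctions

def dimensionStats(sc: SparkContext, path: String) = {
  val rows = sc.textFile(path).map(_.split(","))   // columns: d, c, a

  // One (dimensionKey, d) pair per row, for c, a and the c-a combination.
  val pairs = rows.flatMap { f =>
    Seq((f(1), f(0)), (f(2), f(0)), (f(1) + "-" + f(2), f(0)))
  }

  // (key, (total, unique d count)) without materializing full value groups.
  pairs
    .mapValues(d => (1L, Set(d)))
    .reduceByKey { case ((n1, s1), (n2, s2)) => (n1 + n2, s1 ++ s2) }
    .mapValues { case (total, ds) => (total, ds.size) }
}
```

Compared to the groupByKey first cut, this keeps only the distinct d values per key in memory, and the counting happens during the reduce rather than after collecting all values.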
Multi-dimensional Uniques over large dataset
Hi, Would appreciate insights and wisdom on a problem we are working on: 1. Context: - Given a csv file like: - d1,c1,a1 - d1,c1,a2 - d1,c2,a1 - d1,c1,a1 - d2,c1,a3 - d2,c2,a1 - d3,c1,a1 - d3,c3,a1 - d3,c2,a1 - d3,c3,a2 - d5,c1,a3 - d5,c2,a2 - d5,c3,a2 - Want to find uniques and totals (of the d_ across the c_ and a_ dimensions): - Tot Unique - c1 6 4 - c2 4 4 - c3 2 2 - a1 7 3 - a2 4 3 - a3 2 2 - c1-a1 ... - c1-a2 ... - c1-a3 ... - c2-a1 ... - c2-a2 ... - ... - c3-a3 - Obviously there are millions of records and more attributes/dimensions, so scalability is key. 2. We think Spark is a good stack for this problem. Have a few questions: 3. From a Spark substrate perspective, what are some of the optimal transformations & things to watch out for? 4. Is PairRDD the best data representation? GroupByKey et al are only available for PairRDD. 5. On a pragmatic level, file.map().map() results in an RDD. How do I transform it to a PairRDD? 1. .map(fields => (fields(1), fields(0)) - results in Unit 2. .map(fields => fields(1) -> fields(0)) also is not working 3. Both these do not result in a PairRDD 4. Am missing something fundamental. Cheers Have a nice weekend k/
Re: Multi-dimensional Uniques over large dataset
Answered one of my questions (#5): val pairs = new PairRDDFunctions(RDD) works fine locally. Now I can do groupByKey et al. Am not sure if it is scalable for millions of records & memory efficient. Cheers k/ On Fri, Jun 13, 2014 at 8:52 PM, Krishna Sankar ksanka...@gmail.com wrote: Hi, Would appreciate insights and wisdom on a problem we are working on: 1. Context: - Given a csv file like: - d1,c1,a1 - d1,c1,a2 - d1,c2,a1 - d1,c1,a1 - d2,c1,a3 - d2,c2,a1 - d3,c1,a1 - d3,c3,a1 - d3,c2,a1 - d3,c3,a2 - d5,c1,a3 - d5,c2,a2 - d5,c3,a2 - Want to find uniques and totals (of the d_ across the c_ and a_ dimensions): - Tot Unique - c1 6 4 - c2 4 4 - c3 2 2 - a1 7 3 - a2 4 3 - a3 2 2 - c1-a1 ... - c1-a2 ... - c1-a3 ... - c2-a1 ... - c2-a2 ... - ... - c3-a3 - Obviously there are millions of records and more attributes/dimensions, so scalability is key. 2. We think Spark is a good stack for this problem. Have a few questions: 3. From a Spark substrate perspective, what are some of the optimal transformations & things to watch out for? 4. Is PairRDD the best data representation? GroupByKey et al are only available for PairRDD. 5. On a pragmatic level, file.map().map() results in an RDD. How do I transform it to a PairRDD? 1. .map(fields => (fields(1), fields(0)) - results in Unit 2. .map(fields => fields(1) -> fields(0)) also is not working 3. Both these do not result in a PairRDD 4. Am missing something fundamental. Cheers Have a nice weekend k/
Re: problem starting the history server on EC2
Yep, it gives tons of errors. I was able to make it work with sudo. Looks like an ownership issue. Cheers k/ On Tue, Jun 10, 2014 at 6:29 PM, zhen z...@latrobe.edu.au wrote: I created a Spark 1.0 cluster on EC2 using the provided scripts. However, I do not seem to be able to start the history server on the master node. I used the following command: ./start-history-server.sh /root/spark_log The error message says that the logging directory /root/spark_log does not exist. But I have definitely created the directory and made sure everyone can read/write/execute in the directory. Can you tell me why it does not work? Thank you Zhen -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/problem-starting-the-history-server-on-EC2-tp7361.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: How to compile a Spark project in Scala IDE for Eclipse?
Project > Properties > Java Build Path > Add External Jars. Add the /spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar Cheers K/ On Sun, Jun 8, 2014 at 8:06 AM, Carter gyz...@hotmail.com wrote: Hi All, I just downloaded the Scala IDE for Eclipse. After I created a Spark project and clicked Run, there was an error on the line "import org.apache.spark.SparkContext": object apache is not a member of package org. I guess I need to import the Spark dependency into Scala IDE for Eclipse, can anyone tell me how to do it? Thanks a lot. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-compile-a-Spark-project-in-Scala-IDE-for-Eclipse-tp7197.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Trouble launching EC2 Cluster with Spark
chmod 600 path/FinalKey.pem Cheers k/ On Wed, Jun 4, 2014 at 12:49 PM, Sam Taylor Steyer sste...@stanford.edu wrote: Also, once my friend logged in to his cluster he received the error "Permissions 0644 for 'FinalKey.pem' are too open." This sounds like the other problem described. How do we make the permissions more private? Thanks very much, Sam - Original Message - From: Sam Taylor Steyer sste...@stanford.edu To: user@spark.apache.org Sent: Wednesday, June 4, 2014 12:42:04 PM Subject: Re: Trouble launching EC2 Cluster with Spark Thank you! The regions advice solved the problem for my friend who was getting the "key pair does not exist" problem. I am still getting the error: ERROR:boto:400 Bad Request ERROR:boto:<?xml version="1.0" encoding="UTF-8"?> <Response><Errors><Error><Code>InvalidParameterValue</Code><Message>Invalid value 'null' for protocol. VPC security group rules must specify protocols explicitly.</Message></Error></Errors><RequestID>7ff92687-b95a-4a39-94cb-e2d00a6928fd</RequestID></Response> This sounds like it could have to do with the access settings of the security group, but I don't know how to change them. Any advice would be much appreciated! Sam - Original Message - From: Krishna Sankar ksanka...@gmail.com To: user@spark.apache.org Sent: Wednesday, June 4, 2014 8:52:59 AM Subject: Re: Trouble launching EC2 Cluster with Spark One reason could be that the keys are in a different region. Need to create the keys in us-east-1-North Virginia. Cheers k/ On Wed, Jun 4, 2014 at 7:45 AM, Sam Taylor Steyer sste...@stanford.edu wrote: Hi, I am trying to launch an EC2 cluster from spark using the following command: ./spark-ec2 -k HackerPair -i [path]/HackerPair.pem -s 2 launch HackerCluster I set my access key id and secret access key. I have been getting an error in the "setting up security groups..." phase: Invalid value 'null' for protocol. VPC security groups must specify protocols explicitly.
My project partner gets one step further and then gets the error The key pair 'JamesAndSamTest' does not exist. Any thoughts as to how we could fix these problems? Thanks a lot! Sam
Re: Spark Usecase
Shahab, Interesting question. A couple of points (based on the information from your e-mail): 1. One can support the use case in Spark as a set of transformations on a WIP RDD over a span of time, with the final transformation outputting to a processed RDD. - Spark Streaming would be a good data ingestion mechanism - look at the system as a pipeline that spans a time window - Depending on the cardinality, you would need a correlation id to transform the pipeline as you get more data 2. Having said that, you do have to understand what value Spark provides, then design the topology to support that. - For example, you could potentially keep all the WIP in HBase & the final transformations in Spark RDDs. - Or maybe you keep all the WIP in Spark and the final processed records in HBase. There is nothing wrong in keeping WIP in Spark, if response time to process the incoming data set is important. 3. Naturally, start with a set of ideas, make a few assumptions and do an e2e POC. That will clear many of the questions and firm up the design. HTH. Cheers k/ On Wed, Jun 4, 2014 at 6:57 AM, Shahab Yunus shahab.yu...@gmail.com wrote: Hello All. I have a newbie question. We have a use case where a huge amount of data will be coming in streams or micro-batches of streams and we want to process these streams according to some business logic. We don't have to provide extremely low latency guarantees but batch M/R will still be slow. Now the business logic is such that at the time of emitting the data, we might have to hold on to some tuples until we get more information. This 'more' information will essentially be coming in future streams. You can say that this is kind of a *word count* use case where we have to *aggregate and maintain state across batches of streams.* One thing different here is that we might have to *maintain the state or data for a day or two* until the rest of the data comes in and then we can complete our output.
1- The question is: are such use cases supported in Spark and/or Spark Streaming? 2- Will we be able to persist partially aggregated data until the rest of the information comes in later in time? I mention *persistence* here because, given that the delay can span a day or two, we won't want to keep the partial data in memory for so long. I know this can be done in Storm but I am really interested in Spark because of its close integration with Hadoop. We might not even want to use Spark Streaming (which is more of a direct comparison with Storm/Trident) given our application does not have to be real-time to the split-second. Feel free to direct me to any document or resource. Thanks a lot. Regards, Shahab
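A sketch of the cross-batch aggregation discussed in this thread, using Spark Streaming's updateStateByKey (the correlation-id key, the socket source, and the checkpoint directory are assumptions for illustration; state held this way lives in memory and checkpoints, so the day-long retention concern raised above still needs an eviction/flush strategy, e.g. to HBase):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

val conf = new SparkConf().setAppName("CrossBatchAggregation")
val ssc = new StreamingContext(conf, Seconds(60))
ssc.checkpoint("hdfs:///checkpoints/agg")   // required for stateful operations

// Assumed input: lines of "correlationId,partialRecord" from some receiver.
val events = ssc.socketTextStream("host", 9999)
  .map(_.split(","))
  .map(f => (f(0), f(1)))

// Carry partial data forward across batches until all pieces for a
// correlation id have arrived; emitting/clearing completed ids is left out.
val state = events.updateStateByKey[Seq[String]] {
  (newValues: Seq[String], partial: Option[Seq[String]]) =>
    Some(partial.getOrElse(Seq.empty) ++ newValues)
}

state.print()
ssc.start()
ssc.awaitTermination()
```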
Re: Why Scala?
Nicholas, Good question. A couple of thoughts from my practical experience: - Coming from R, Scala feels more natural than other languages. The functional succinctness of Scala is more suited for Data Science than other languages. In short, Scala-Spark makes sense for Data Science, ML, Data Exploration et al. - Having said that, occasionally practicality does trump the choice of a language - last time I really wanted to use Scala but ended up writing in Python! Hope to get a better result this time. - Language evolution is more of a long-term granularity - we do underestimate the velocity impact. Have seen evolutions through languages starting from Cobol, CP/M Basic, Turbo Pascal, ... I think Scala will find its equilibrium sooner than we think. Cheers k/ On Thu, May 29, 2014 at 5:54 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Thank you for the specific points about the advantages Scala provides over other languages. Looking at several code samples, the reduction of boilerplate code over Java is one of the biggest plusses, to me. On Thu, May 29, 2014 at 8:10 PM, Marek Kolodziej mkolod@gmail.com wrote: I would advise others to form their opinions based on experiencing it for themselves, rather than reading what random people say on Hacker News. :) Just a nitpick here: What I said was "It looks like the language is fairly controversial on [Hacker News]." That was just an observation of what I saw on HN, not a statement of my opinion. I know very little about Scala (or Java, for that matter) and definitely don't have a well-formed opinion on the matter. Nick
Re: K-nearest neighbors search in Spark
Carter, Just as a quick, simple starting point for Spark (caveats: lots of improvements required for scaling, graceful and efficient handling of RDDs et al):

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import scala.collection.immutable.ListMap
import scala.collection.immutable.SortedMap

object TopK {
  // def getCurrentDirectory = new java.io.File(".").getCanonicalPath
  //
  def distance(x1: List[Int], x2: List[Int]): Double = {
    val dist: Double = math.sqrt(math.pow(x1(1) - x2(1), 2) + math.pow(x1(2) - x2(2), 2))
    dist
  }
  //
  def main(args: Array[String]): Unit = {
    // println(getCurrentDirectory)
    val sc = new SparkContext("local", "TopK", "spark://USS-Defiant.local:7077")
    println(s"Running Spark Version ${sc.version}")
    val file = sc.textFile("data01.csv")
    //
    val data = file
      .map(line => line.split(","))
      .map(x1 => List(x1(0).toInt, x1(1).toInt, x1(2).toInt))
    // val data1 = data.collect
    println(data)
    for (d <- data) {
      println(d)
      println(d(0))
    }
    //
    val distList = for (d <- data) yield { d(0) }
    // for (d <- distList) (println(d))
    val zipList = for (a <- distList.collect; b <- distList.collect) yield { List(a, b) }
    zipList.foreach(println(_))
    //
    val dist = for (l <- zipList) yield {
      println(s"${l(0)} = ${l(1)}")
      val x1a: Array[List[Int]] = data.filter(d => d(0) == l(0)).collect
      val x2a: Array[List[Int]] = data.filter(d => d(0) == l(1)).collect
      val x1: List[Int] = x1a(0)
      val x2: List[Int] = x2a(0)
      val dist = distance(x1, x2)
      Map(dist -> l)
    }
    dist.foreach(println(_))
    // sort this for topK
  }
}

data01.csv:
1,68,93
2,12,90
3,45,76
4,86,54

HTH. Cheers k/ On Tue, May 27, 2014 at 4:10 AM, Carter gyz...@hotmail.com wrote: Any suggestion is very much appreciated. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/K-nearest-neighbors-search-in-Spark-tp6393p6421.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: How to Run Machine Learning Examples
I couldn't find the classification.SVM class.
- Most probably the command is something of the order of: bin/spark-submit --class org.apache.spark.examples.mllib.BinaryClassification examples/target/scala-*/spark-examples-*.jar --algorithm SVM train.csv
- For more details try: ./bin/run-example mllib.BinaryClassification <spark url>
  Usage: BinaryClassification [options] input
    --numIterations value   number of iterations
    --stepSize value        initial step size, default: 1.0
    --algorithm value       algorithm (SVM,LR), default: LR
    --regType value         regularization type (L1,L2), default: L2
    --regParam value        regularization parameter, default: 0.1
    input                   input paths to labeled examples in LIBSVM format
HTH. Cheers k/ P.S: I am using 1.0.0 rc10. Even for an earlier release, just run the classification class and it will tell you what the parameters are. Most probably SVM is an algorithm parameter, not a class by itself. On Thu, May 22, 2014 at 2:12 PM, yxzhao yxz...@ualr.edu wrote: Thanks Stephen, I used the following command line to run the SVM, but it seems that the path is not correct. What should the right path or command line be? Thanks. *./bin/run-example org.apache.spark.mllib.classification.SVM spark://100.1.255.193:7077 train.csv 20* Exception in thread main java.lang.NoClassDefFoundError: org/apache/spark/mllib/classification/SVM Caused by: java.lang.ClassNotFoundException: org.apache.spark.mllib.classification.SVM at java.net.URLClassLoader$1.run(URLClassLoader.java:202) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:190) at java.lang.ClassLoader.loadClass(ClassLoader.java:306) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301) at java.lang.ClassLoader.loadClass(ClassLoader.java:247) Could not find the main class: org.apache.spark.mllib.classification.SVM. Program will exit.
On Thu, May 22, 2014 at 3:05 PM, Stephen Boesch [via Apache Spark User List] [hidden email] wrote: There is a bin/run-example.sh example-class [args] 2014-05-22 12:48 GMT-07:00 yxzhao [hidden email]: I want to run the LR, SVM, and NaiveBayes algorithms implemented in the following directory on my data set. But I did not find the sample command line to run them. Anybody help? Thanks. spark-0.9.0-incubating/mllib/src/main/scala/org/apache/spark/mllib/classification -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-Run-Machine-Learning-Examples-tp6277.html Sent from the Apache Spark User List mailing list archive at Nabble.com. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-Run-Machine-Learning-Examples-tp6277p6287.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Run Apache Spark on Mini Cluster
It depends on what stack you want to run. A quick cut: - Worker Machines (DataNode, HBase Region Servers, Spark Worker Nodes) - Dual 6-core CPU - 64 to 128 GB RAM - 3 x 3TB disk (JBOD) - Master Node (NameNode, HBase Master, Spark Master) - Dual 6-core CPU - 64 to 128 GB RAM - 2 x 3TB disk (RAID 1+0) - Start with a 5-node setup and scale out as needed - If your load is MapReduce over HDFS, then run YARN - If your load is HBase over HDFS, scale depending on the computational and storage needs - If you are running Spark over HDFS, scale appropriately - you might need more memory in the worker nodes - In any case, have a topology and the processes that they would run. As Soumya suggests, you can prototype at an appropriate scale using AWS. Cheers k/ On Wed, May 21, 2014 at 5:14 PM, Upender Nimbekar upent...@gmail.com wrote: Hi, I would like to set up the Apache platform on a mini cluster. Is there any recommendation for the hardware that I can buy to set it up? I am thinking about processing a significant amount of data, in the range of a few terabytes. Thanks Upender