Re: Vector size mismatch in logistic regression - Spark ML 2.0

2016-08-21 Thread Krishna Sankar
Hi,
   Just after I sent the mail, I realized that the error might be with the
training-dataset, not the test-dataset.

   1. It might be that you are feeding the full Y vector for training.
   2. Which could mean you are effectively using a ~50-50 training-test split.
   3. Take a good look at the code that does the data split and at which
   datasets the pieces are allocated to (see the sketch below).
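
A minimal sketch of the shape that code usually needs to have (the DataFrame and
column names here are hypothetical, not from the original post) - the indexers,
encoder and model are fit once on the training split and the same fitted pipeline
transforms the test split, so both sides get feature vectors of the same size:

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    train, test = df.randomSplit([0.7, 0.3], seed=42)

    pipeline = Pipeline(stages=[
        StringIndexer(inputCol="category", outputCol="category_idx"),
        OneHotEncoder(inputCol="category_idx", outputCol="category_vec"),
        VectorAssembler(inputCols=["category_vec"], outputCol="features"),
        LogisticRegression(featuresCol="features", labelCol="label")])

    model = pipeline.fit(train)            # feature space is fixed by the training data
    predictions = model.transform(test)    # test rows get vectors of the same size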

Cheers


On Sun, Aug 21, 2016 at 4:37 PM, Krishna Sankar <ksanka...@gmail.com> wrote:

> Hi,
>   Looks like the test-dataset has different sizes for X & Y. Possible
> steps:
>
>1. What is the test-data size?
>   - If it is 15,909, check the prediction variable vector - it is now
>   29,471, but it should be 15,909.
>   - If you expect it to be 29,471, then the X matrix is not right.
>   2. It is also probable that the size of the test-data is something
>   else. If so, check the data pipeline.
>3. If you print the count() of the various vectors, I think you can
>find the error.
>
> Cheers & Good Luck
> 
>
> On Sun, Aug 21, 2016 at 3:16 PM, janardhan shetty <janardhan...@gmail.com>
> wrote:
>
>> Hi,
>>
>> I have built the logistic regression model using training-dataset.
>> When I am predicting on a test-dataset, it is throwing the below error of
>> size mismatch.
>>
>> Steps done:
>> 1. String indexers on categorical features.
>> 2. One hot encoding on these indexed features.
>>
>> Any help is appreciated to resolve this issue or is it a bug ?
>>
>> SparkException: *Job aborted due to stage failure: Task 0 in stage 635.0
>> failed 1 times, most recent failure: Lost task 0.0 in stage 635.0 (TID
>> 19421, localhost): java.lang.IllegalArgumentException: requirement failed:
>> BLAS.dot(x: Vector, y:Vector) was given Vectors with non-matching sizes:
>> x.size = 15909, y.size = 29471* at scala.Predef$.require(Predef.scala:224)
>> at org.apache.spark.ml.linalg.BLAS$.dot(BLAS.scala:104) at
>> org.apache.spark.ml.classification.LogisticRegressionModel$$
>> anonfun$19.apply(LogisticRegression.scala:505) at org.apache.spark.ml
>> .classification.LogisticRegressionModel$$anonfun$19.apply(LogisticRegression.scala:504)
>> at org.apache.spark.ml.classification.LogisticRegressionModel.p
>> redictRaw(LogisticRegression.scala:594) at org.apache.spark.ml.classifica
>> tion.LogisticRegressionModel.predictRaw(LogisticRegression.scala:484) at
>> org.apache.spark.ml.classification.ProbabilisticClassificati
>> onModel$$anonfun$1.apply(ProbabilisticClassifier.scala:112) at
>> org.apache.spark.ml.classification.ProbabilisticClassificati
>> onModel$$anonfun$1.apply(ProbabilisticClassifier.scala:111) at
>> org.apache.spark.sql.catalyst.expressions.GeneratedClass$Spe
>> cificUnsafeProjection.evalExpr137$(Unknown Source) at
>> org.apache.spark.sql.catalyst.expressions.GeneratedClass$Spe
>> cificUnsafeProjection.apply(Unknown Source) at
>> org.apache.spark.sql.catalyst.expressions.GeneratedClass$Spe
>> cificUnsafeProjection.apply(Unknown Source) at
>> scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>>
>
>


Re: Vector size mismatch in logistic regression - Spark ML 2.0

2016-08-21 Thread Krishna Sankar
Hi,
  Looks like the test-dataset has different sizes for X & Y. Possible steps:

   1. What is the test-data size?
   - If it is 15,909, check the prediction variable vector - it is now
   29,471, but it should be 15,909.
   - If you expect it to be 29,471, then the X matrix is not right.
   2. It is also probable that the size of the test-data is something
   else. If so, check the data pipeline.
   3. If you print the count() of the various vectors, I think you can find
   the error (a quick check is sketched below).
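
A quick way to run that check (the DataFrame and column names are assumptions - a
train/test pair whose "features" column was produced by the same fitted pipeline):

    print("train rows:", train.count(), " test rows:", test.count())
    print("train feature size:", train.first()["features"].size)
    print("test feature size:", test.first()["features"].size)   # must match the training size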

Cheers & Good Luck


On Sun, Aug 21, 2016 at 3:16 PM, janardhan shetty 
wrote:

> Hi,
>
> I have built the logistic regression model using training-dataset.
> When I am predicting on a test-dataset, it is throwing the below error of
> size mismatch.
>
> Steps done:
> 1. String indexers on categorical features.
> 2. One hot encoding on these indexed features.
>
> Any help is appreciated to resolve this issue or is it a bug ?
>
> SparkException: *Job aborted due to stage failure: Task 0 in stage 635.0
> failed 1 times, most recent failure: Lost task 0.0 in stage 635.0 (TID
> 19421, localhost): java.lang.IllegalArgumentException: requirement failed:
> BLAS.dot(x: Vector, y:Vector) was given Vectors with non-matching sizes:
> x.size = 15909, y.size = 29471* at scala.Predef$.require(Predef.scala:224)
> at org.apache.spark.ml.linalg.BLAS$.dot(BLAS.scala:104) at
> org.apache.spark.ml.classification.LogisticRegressionModel$$
> anonfun$19.apply(LogisticRegression.scala:505) at org.apache.spark.ml.
> classification.LogisticRegressionModel$$anonfun$19.apply(LogisticRegression.scala:504)
> at org.apache.spark.ml.classification.LogisticRegressionModel.
> predictRaw(LogisticRegression.scala:594) at org.apache.spark.ml.
> classification.LogisticRegressionModel.predictRaw(LogisticRegression.scala:484)
> at org.apache.spark.ml.classification.ProbabilisticClassificationMod
> el$$anonfun$1.apply(ProbabilisticClassifier.scala:112) at
> org.apache.spark.ml.classification.ProbabilisticClassificationMod
> el$$anonfun$1.apply(ProbabilisticClassifier.scala:111) at
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$
> SpecificUnsafeProjection.evalExpr137$(Unknown Source) at
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$
> SpecificUnsafeProjection.apply(Unknown Source) at
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$
> SpecificUnsafeProjection.apply(Unknown Source) at
> scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>


Re: Spark 1.6.2 version displayed as 1.6.1

2016-07-25 Thread Krishna Sankar
This intrigued me as well.

   - Just to be sure, I downloaded the 1.6.2 code and recompiled.
   - spark-shell and pyspark both show 1.6.2 as expected (quick check below).
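
A quick sanity check of which version is actually being launched (sc.version is a
standard SparkContext field, available the same way in spark-shell and pyspark):

    print(sc.version)   # should print 1.6.2 when the 1.6.2 binaries are the ones on the path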

Cheers

On Mon, Jul 25, 2016 at 1:45 AM, Daniel Darabos <
daniel.dara...@lynxanalytics.com> wrote:

> Another possible explanation is that by accident you are still running
> Spark 1.6.1. Which download are you using? This is what I see:
>
> $ ~/spark-1.6.2-bin-hadoop2.6/bin/spark-shell
> log4j:WARN No appenders could be found for logger
> (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
> log4j:WARN Please initialize the log4j system properly.
> log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for
> more info.
> Using Spark's repl log4j profile:
> org/apache/spark/log4j-defaults-repl.properties
> To adjust logging level use sc.setLogLevel("INFO")
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 1.6.2
>   /_/
>
>
> On Mon, Jul 25, 2016 at 7:45 AM, Sean Owen  wrote:
>
>> Are you certain? looks like it was correct in the release:
>>
>>
>> https://github.com/apache/spark/blob/v1.6.2/core/src/main/scala/org/apache/spark/package.scala
>>
>>
>>
>> On Mon, Jul 25, 2016 at 12:33 AM, Ascot Moss 
>> wrote:
>> > Hi,
>> >
>> > I am trying to upgrade Spark from 1.6.1 to 1.6.2; in the 1.6.2
>> > spark-shell, I found the version is still displayed as 1.6.1
>> >
>> > Is this a minor typo/bug?
>> >
>> > Regards
>> >
>> >
>> >
>> > ###
>> >
>> > Welcome to
>> >
>> >     __
>> >
>> >  / __/__  ___ _/ /__
>> >
>> > _\ \/ _ \/ _ `/ __/  '_/
>> >
>> >/___/ .__/\_,_/_/ /_/\_\   version 1.6.1
>> >
>> >   /_/
>> >
>> >
>> >
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>


Re: Spark ml.ALS question -- RegressionEvaluator .evaluate giving ~1.5 output for same train and predict data

2016-07-24 Thread Krishna Sankar
Thanks Nick. I also ran into this issue.
VG, one workaround is to drop the NaN rows from the predictions (df.na.drop())
and then use that dataset for the evaluator. In real life, you would probably
detect the NaN and fall back to recommending the most popular items over some window.
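
A minimal sketch of that workaround, assuming a fitted ALS model `model` and a
`test` DataFrame with a "rating" column (the names are illustrative):

    from pyspark.ml.evaluation import RegressionEvaluator

    predictions = model.transform(test).na.drop()   # drop rows where ALS returned NaN for unseen ids
    evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating",
                                    predictionCol="prediction")
    print(evaluator.evaluate(predictions))          # no longer NaN once the NaN rows are gone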
HTH.
Cheers


On Sun, Jul 24, 2016 at 12:49 PM, Nick Pentreath 
wrote:

> It seems likely that you're running into
> https://issues.apache.org/jira/browse/SPARK-14489 - this occurs when the
> test dataset in the train/test split contains users or items that were not
> in the training set. Hence the model doesn't have computed factors for
> those ids, and ALS 'transform' currently returns NaN for those ids. This in
> turn results in NaN for the evaluator result.
>
> I have a PR open on that issue that will hopefully address this soon.
>
>
> On Sun, 24 Jul 2016 at 17:49 VG  wrote:
>
>> ping. Anyone has some suggestions/advice for me .
>> It will be really helpful.
>>
>> VG
>> On Sun, Jul 24, 2016 at 12:19 AM, VG  wrote:
>>
>>> Sean,
>>>
>>> I did this just to test the model. When I do a split of my data as
>>> training to 80% and test to be 20%
>>>
>>> I get a Root-mean-square error = NaN
>>>
>>> So I am wondering where I might be going wrong
>>>
>>> Regards,
>>> VG
>>>
>>> On Sun, Jul 24, 2016 at 12:12 AM, Sean Owen  wrote:
>>>
 No, that's certainly not to be expected. ALS works by computing a much
 lower-rank representation of the input. It would not reproduce the
 input exactly, and you don't want it to -- this would be seriously
 overfit. This is why in general you don't evaluate a model on the
 training set.

 On Sat, Jul 23, 2016 at 7:37 PM, VG  wrote:
 > I am trying to run ml.ALS to compute some recommendations.
 >
 > Just to test I am using the same dataset for training using ALSModel
 and for
 > predicting the results based on the model .
 >
 > When I evaluate the result using RegressionEvaluator I get a
 > Root-mean-square error = 1.5544064263236066
 >
 > I think this should be 0. Any suggestions on what might be going wrong?
 >
 > Regards,
 > Vipul

>>>
>>>


Thanks For a Job Well Done !!!

2016-06-18 Thread Krishna Sankar
Hi all,
   Just wanted to thank you all for the dataset API - most of the time we see
only bugs on these lists ;o).

   - Putting some context, this weekend I was updating the SQL chapters of
   my book - it had all the ugliness of SchemaRDD, registerTempTable,
   take(10).foreach(println) and
   take(30).foreach(e=>println("%15s | %9.2f |".format(e(0),e(1)))) ;o)
   - I remember Hossein Falaki chiding me about the ugly println statements!
   - Took me a little while to grok the dataset, sparksession, and
   spark.read.option("header","true").option("inferSchema","true").csv(...)
   et al (see the snippet after this list).
   - I am a big R fan and know the language pretty well - so the
   constructs are familiar
   - Once I got it (I am sure there are still more mysteries to uncover
   ...) it was just beautiful - well done folks !!!
   - One sees the contrast a lot better while teaching or writing books,
   because one has to think thru the old, the new and the transitional arc
   - I even remember the good old days when we were discussing, at one of
   Paco's sessions, whether Spark would get dataframes like R!
   - And now, it looks very decent for data wrangling.
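
For context, the reader mentioned above looks roughly like this in 2.0 (the file
path here is made up):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("csv-demo").getOrCreate()
    df = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("data/employees.csv"))   # hypothetical path
    df.show(10)                         # no more take(n).foreach(println) ugliness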

Cheers & keep up the good work

P.S: My next chapter is MLlib - I need to convert it to ml. Should be
interesting ... I am a glutton for punishment - of the Spark kind, of
course!


Re: Is Spark right for us?

2016-03-06 Thread Krishna Sankar
Good question. It comes down to computational complexity, computational scale,
and data volume.

   1. If you can store the data in a single server or a small cluster of DB
   servers (say MySQL), then HDFS/Spark might be overkill.
   2. If you can run the computation/process the data on a single machine
   (remember, servers with 512 GB memory and quad-core CPUs can do a lot of
   stuff), then Spark is overkill.
   3. Even if you can do computations #1 & #2 above in a pipeline and
   tolerate the elapsed time, Spark might be overkill.
   4. But if you require data/computation parallelism or distributed
   processing of data due to computational complexity, data volume or time
   constraints (incl. real-time analytics), Spark is the right stack.
   5. Taking a quick look at what you have described so far, Spark is
   probably not needed.

Cheers & HTH


On Sun, Mar 6, 2016 at 9:17 AM, Laumegui Deaulobi <
guillaume.bilod...@gmail.com> wrote:

> Our problem space is survey analytics.  Each survey comprises a set of
> questions, with each question having a set of possible answers.  Survey
> fill-out tasks are sent to users, who have until a certain date to complete
> it.  Based on these survey fill-outs, reports need to be generated.  Each
> report deals with a subset of the survey fill-outs, and comprises a set of
> data points (average rating for question 1, min/max for question 2, etc.)
>
> We are dealing with rather large data sets - although reading the internet
> we get the impression that everyone is analyzing petabytes of data...
>
> Users: up to 100,000
> Surveys: up to 100,000
> Questions per survey: up to 100
> Possible answers per question: up to 10
> Survey fill-outs / user: up to 10
> Reports: up to 100,000
> Data points per report: up to 100
>
> Data is currently stored in a relational database but a migration to a
> different kind of store is possible.
>
> The naive algorithm for report generation can be summed up as this:
>
> for each report to be generated {
>   for each report data point to be calculated {
> calculate data point
> add data point to report
>   }
>   publish report
> }
>
> In order to deal with the upper limits of these values, we will need to
> distribute this algorithm to a compute / data cluster as much as possible.
>
> I've read about frameworks such as Apache Spark but also Hadoop, GridGain,
> HazelCast and several others, and am still confused as to how each of these
> can help us and how they fit together.
>
> Is Spark the right framework for us?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Is-Spark-right-for-us-tp26412.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: HDP 2.3 support for Spark 1.5.x

2015-09-28 Thread Krishna Sankar
Thanks Guys. Yep, now I would install 1.5.1 over HDP 2.3, if that works.
Cheers


On Mon, Sep 28, 2015 at 9:47 AM, Ted Yu <yuzhih...@gmail.com> wrote:

> Krishna:
> If you want to query ORC files, see the following JIRA:
>
> [SPARK-10623] [SQL] Fixes ORC predicate push-down
>
> which is in the 1.5.1. release.
>
> FYI
>
> On Mon, Sep 28, 2015 at 9:42 AM, Fabien Martin <fabien.marti...@gmail.com>
> wrote:
>
>> Hi Krishna,
>>
>>- Take a look at
>> http://hortonworks.com/hadoop-tutorial/apache-spark-1-4-1-technical-preview-with-hdp/
>>- Or you can specify your 1.5.x jar as the Spark one using something
>>like :
>>
>> --conf
>> "spark.yarn.jar=hdfs://master:8020/spark-assembly-1.5.0-hadoop2.6.0.jar"
>>
>> The main drawback is :
>>
>> *Known Issues*
>>
>> *Spark YARN ATS integration does not work in this tech preview. You will
>> not see the history of Spark jobs in the Jobs server after a job is
>> finished.*
>>
>> 2015-09-23 1:31 GMT+02:00 Zhan Zhang <zzh...@hortonworks.com>:
>>
>>> Hi Krishna,
>>>
>>> For the time being, you can download from upstream, and it should be
>>> running OK for HDP2.3.  For hdp specific problem, you can ask in
>>> Hortonworks forum.
>>>
>>> Thanks.
>>>
>>> Zhan Zhang
>>>
>>> On Sep 22, 2015, at 3:42 PM, Krishna Sankar <ksanka...@gmail.com> wrote:
>>>
>>> Guys,
>>>
>>>- We have HDP 2.3 installed just now. It comes with Spark 1.3.x. The
>>>current wisdom is that it will support the 1.4.x train (which is good, 
>>> need
>>>DataFrame et al).
>>>- What is the plan to support Spark 1.5.x ? Can we install 1.5.0 on
>>>HDP 2.3 ? Or will Spark 1.5.x support be in HDP 2.3.x and if so ~when ?
>>>
>>> Cheers & Thanks
>>> 
>>>
>>>
>>>
>>
>


Re: Spark MLib v/s SparkR

2015-08-05 Thread Krishna Sankar
A few points to consider:
a) SparkR gives the union of R_in_a_single_machine and the
distributed_computing_of_Spark.
b) It also gives the ability to wrangle with data in R that is in the
Spark ecosystem.
c) Coming to MLlib, the question is MLlib and R (not MLlib or R) -
depending on the scale, data location et al.
d) As Ali mentioned, some of MLlib might not be supported in R (I
haven't looked at it that carefully, but it can be resolved via the APIs);
OTOH, 1.5 is on its way.
e) So it all depends on the algorithms that one wants to use and whether
one needs R for pre- or post-processing.
HTH.
Cheers
k/

On Wed, Aug 5, 2015 at 11:24 AM, praveen S mylogi...@gmail.com wrote:

 I was wondering when one should go for MLlib or SparkR. What are the
 criteria, or what should be considered, before choosing either of the
 solutions for data analysis?
 Or what are the advantages of Spark MLlib over SparkR, or of
 SparkR over MLlib?



Re: Sum elements of an iterator inside an RDD

2015-07-11 Thread Krishna Sankar
Looks like reduceByKey() should work here.
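
A minimal sketch, assuming wordsGrouped (from the code quoted below) is an RDD of
(word, count) pairs:

    wordCounts = wordsGrouped.reduceByKey(lambda a, b: a + b)   # sums per key, no grouped iterators needed
    print(wordCounts.collect())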
Cheers
k/

On Sat, Jul 11, 2015 at 11:02 AM, leonida.gianfagna 
leonida.gianfa...@gmail.com wrote:

 Thanks a lot oubrik,

 I got your point; my thinking was that sum() should already be a
 built-in function for iterators in Python.
 Anyway, I tried your approach

 def mysum(iter):
     count = sum = 0
     for item in iter:
         count += 1
         sum += item
     return sum

 wordCountsGrouped = wordsGrouped.groupByKey().map(lambda (w, iterator): (w, mysum(iterator)))
 print wordCountsGrouped.collect()

 but i get the error below, any idea?

 TypeError: unsupported operand type(s) for +=: 'int' and 'ResultIterable'

 at
 org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:135)
 at
 org.apache.spark.api.python.PythonRDD$$anon$1.init(PythonRDD.scala:176)
 at
 org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:94)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
 at
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
 at org.apache.spark.scheduler.Task.run(Task.scala:64)
 at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
 at

 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at

 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)

 thx
 Leonida



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Sum-elements-of-an-iterator-inside-an-RDD-tp23775p23778.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




import pyspark.sql.Row gives error in 1.4.1

2015-07-02 Thread Krishna Sankar
Error - ImportError: No module named Row
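
Row is a class inside the pyspark.sql module rather than a submodule, so the
usual fix is to import the class from the module:

    from pyspark.sql import Row   # `import pyspark.sql.Row` fails because Row is not a module

    r = Row(name="Alice", age=1)
    print(r.name, r.age)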
Cheers & enjoy the long weekend
k/


Re: making dataframe for different types using spark-csv

2015-07-01 Thread Krishna Sankar
   - Use .cast(...).alias('...') after the DataFrame is read (see the sketch below).
   - Use sql.functions.udf for any domain-specific conversions.
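
A minimal sketch of both points (the column names are hypothetical), applied after
spark-csv has loaded everything as strings:

    from pyspark.sql.functions import udf
    from pyspark.sql.types import DoubleType

    strip_currency = udf(lambda s: float(s.replace("$", "")), DoubleType())   # domain-specific conversion

    typed = df.select(
        df["id"].cast("int").alias("id"),
        df["price"].cast("double").alias("price"),
        strip_currency(df["amount"]).alias("amount"))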

Cheers
k/

On Wed, Jul 1, 2015 at 11:03 AM, Hafiz Mujadid hafizmujadi...@gmail.com
wrote:

 Hi experts!


 I am using spark-csv to load csv data into a dataframe. By default it makes
 the type of each column string. Is there some way to get a dataframe with the
 actual types like int, double etc.?


 Thanks



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/making-dataframe-for-different-types-using-spark-csv-tp23570.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: SparkSQL built in functions

2015-06-29 Thread Krishna Sankar
Interesting. Looking at the definitions, sql.functions.pow is defined only
for (col,col). Just as an experiment, create a column with value 2 and see
if that works.
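
One way to run that experiment (assuming a df with name/age columns, as in the
example quoted below): use the SQL pow with a literal column instead of Python's
built-in pow:

    from pyspark.sql import functions as F

    df.select("name", F.pow(df.age, F.lit(2)).alias("age_squared")).show()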
Cheers
k/

On Mon, Jun 29, 2015 at 1:34 PM, Bob Corsaro rcors...@gmail.com wrote:

 1.4 and I did set the second parameter. The DSL works fine but trying out
 with SQL doesn't.

 On Mon, Jun 29, 2015, 4:32 PM Salih Oztop soz...@yahoo.com wrote:

 Hi Bob,
 I tested your scenario with Spark 1.3 and I assumed you did not miss the
 second parameter of pow(x,y)

 from pyspark.sql import SQLContext
 sqlContext = SQLContext(sc)
 df = sqlContext.jsonFile("/vagrant/people.json")
 # Displays the content of the DataFrame to stdout
 df.show()
 # These are all fine
 df.select("name", (df.age)*(df.age)).show()

 name    (age * age)
 Michael null
 Andy    900
 Justin  361


 df.select("name", (df.age)+1).show()

 name    (age + 1)
 Michael null
 Andy    31
 Justin  20


 However the following tests give the same error.

 df.select("name", pow(df.age,2)).show()

 ---------------------------------------------------------------------------
 TypeError                                 Traceback (most recent call last)
 <ipython-input-27-ce7299d3ef76> in <module>()
 ----> 1 df.select("name", pow(df.age,2)).show()
 TypeError: unsupported operand type(s) for ** or pow(): 'Column' and 'int'


 df.select("name", (df.age)**2).show()

 ---------------------------------------------------------------------------
 TypeError                                 Traceback (most recent call last)
 <ipython-input-24-29540c3536bf> in <module>()
 ----> 1 df.select("name", (df.age)**2).show()
 TypeError: unsupported operand type(s) for ** or pow(): 'Column' and 'int'


 Moreover testing the functions individually they are working fine.

 pow(2,4)

 16

 2**4

 16



 Kind Regards
 Salih Oztop

   --
  *From:* Bob Corsaro rcors...@gmail.com
 *To:* user user@spark.apache.org
 *Sent:* Monday, June 29, 2015 7:27 PM
 *Subject:* SparkSQL built in functions

 I'm having trouble using "select pow(col) from table". It seems the
 function is not registered for SparkSQL. Is this on purpose or an
 oversight? I'm using pyspark.





Re: Kmeans Labeled Point RDD

2015-05-21 Thread Krishna Sankar
You can run predict and then zip the predictions with the points RDD to get
approximately the same thing as a LabeledPoint.
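
A minimal sketch of that idea, assuming `points` is an RDD of feature vectors:

    from pyspark.mllib.clustering import KMeans

    model = KMeans.train(points, k=3, maxIterations=10)
    clustered = points.map(lambda p: model.predict(p)).zip(points)   # (clusterId, point) pairs, LabeledPoint-like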
Cheers
k/

On Thu, May 21, 2015 at 6:19 PM, anneywarlord anneywarl...@gmail.com
wrote:

 Hello,

 New to Spark. I wanted to know if it is possible to use a Labeled Point RDD
 in org.apache.spark.mllib.clustering.KMeans. After I cluster my data I
 would
 like to be able to identify which observations were grouped with each
 centroid.

 Thanks



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Kmeans-Labeled-Point-RDD-tp22989.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Trouble working with Spark-CSV package (error: object databricks is not a member of package com)

2015-04-23 Thread Krishna Sankar
Do you have commons-csv-1.1-bin.jar in your path somewhere ? I had to
download and add this.
Cheers
k/

On Wed, Apr 22, 2015 at 11:01 AM, Mohammed Omer beancinemat...@gmail.com
wrote:

 Afternoon all,

 I'm working with Scala 2.11.6, and Spark 1.3.1 built from source via:

 `mvn -Pyarn -Phadoop-2.4 -Dscala-2.11 -DskipTests clean package`

 The error is encountered when running spark shell via:

 `spark-shell --packages com.databricks:spark-csv_2.11:1.0.3`

 The full trace of the commands can be found at
 https://gist.github.com/momer/9d1ca583f9978ec9739d

 Not sure if I've done something wrong, or if the documentation is
 outdated, or...?

 Would appreciate any input or push in the right direction!

 Thank you,

 Mo



Re: Dataset announcement

2015-04-15 Thread Krishna Sankar
Thanks Olivier. Good work.
Interesting in more than one way - including training, benchmarking,
and testing new releases et al.
One quick question - do you plan to make it available as an S3 bucket?

Cheers
k/

On Wed, Apr 15, 2015 at 5:58 PM, Olivier Chapelle oliv...@chapelle.cc
wrote:

 Dear Spark users,

 I would like to draw your attention to a dataset that we recently released,
 which is as of now the largest machine learning dataset ever released; see
 the following blog announcements:
  - http://labs.criteo.com/2015/03/criteo-releases-its-new-dataset/
  -

 http://blogs.technet.com/b/machinelearning/archive/2015/04/01/now-available-on-azure-ml-criteo-39-s-1tb-click-prediction-dataset.aspx

 The characteristics of this dataset are:
  - 1 TB of data
  - binary classification
  - 13 integer features
  - 26 categorical features, some of them taking millions of values.
  - 4B rows

 Hopefully this dataset will be useful to assess and push further the
 scalability of Spark and MLlib.

 Cheers,
 Olivier



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Dataset-announcement-tp22507.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: IPyhon notebook command for spark need to be updated?

2015-03-20 Thread Krishna Sankar
Yep, the command-line option is gone. No big deal - just run the '%pylab inline'
magic in the notebook itself.
Cheers
k/

On Fri, Mar 20, 2015 at 3:45 PM, cong yue yuecong1...@gmail.com wrote:

 Hello :

 I tried ipython notebook with the following command in my enviroment.

 PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook
 --pylab inline" ./bin/pyspark

 But it shows that "--pylab inline" support is removed from the newest IPython
 version.
 the log is as :
 ---
 $ PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook
 --pylab inline" ./bin/pyspark
 [E 15:29:43.076 NotebookApp] Support for specifying --pylab on the
 command line has been removed.
 [E 15:29:43.077 NotebookApp] Please use `%pylab inline` or
 `%matplotlib inline` in the notebook itself.
 --
 I am using IPython 3.0.0. and only IPython works in my enviroment.
 --
 $ PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook
 --pylab inline" ./bin/pyspark
 --

 Does somebody have the same issue as mine? How do you solve it?

 Thanks,
 Cong

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: General Purpose Spark Cluster Hardware Requirements?

2015-03-08 Thread Krishna Sankar
Without knowing the data size, computation & storage requirements ... :

   - Dual 6 or 8 core machines, 256 GB memory each, 12-15 TB per machine.
   Probably 5-10 machines.
   - Don't go for the most exotic machines; OTOH don't go for the cheapest ones
   either.
   - Find a sweet spot with your vendor, i.e. if dual 6-cores are a lot
   cheaper than dual 10-cores then go with the less expensive ones. Same with
   disks - maybe 2 TB is a lot cheaper than 3 TB.
   - Decide if these are going to be storage intensive or compute intensive
   (I assume the latter) and configure accordingly
   - Make sure you can add storage to the machines - ie have free storage
   bays.
   - Or, the other way is to add more machines and buy smaller-spec'd
   machines.
   - Unless one has very firm I/O and compute requirements, I have found
   that FLOPS, and things of that nature, do not make that much sense.
  - Think in terms of RAM, CPU and storage - that is what will become
  the initial limitations.
  - Once there are enough production jobs, you can then figure out the
  FLOPS et al
   - A 10 G network is a better choice, so price in a 24-48 port TOR switch.
   - Be more concerned with the bandwidth between the cluster nodes, for
   shuffles et al.

Cheers
k/

On Sun, Mar 8, 2015 at 2:29 PM, Nasir Khan nasirkhan.onl...@gmail.com
wrote:

 Hi, I am going to submit a proposal to my university to set up my standalone
 Spark cluster. What hardware should I include in my proposal?

 I will be Working on classification (Spark MLlib) of Data streams (Spark
 Streams)

 If some body can fill up this answers, that will be great! Thanks

 *Cores *= (example 64 nodes, 1024 cores, your figures) ?

 *Performance* = (example ~5.12 TFlops, ~2 TFlops, your figures) ___?

 *GPU*= YES/NO ___?

 *Fat Node* = YES/NO ___?

 *CPU Hrs/ Yr* = (example 2000, 8000, your figures) ___?

 *RAM/CPU* = (example 256GB, your figures) ___?

 *Storage Processing* = (example 200TB, your figures) ___?

 *Storage Output* = (example 5TB, 4TB HHD/SSD, your figures) ___?

 *Most processors today carry out 4 FLOPS per cycle, thus a single-core 2.5
 GHz processor has a theoretical performance of 10 billion FLOPS = 10 GFLOPS

 Note:I Need a *general purpose* cluster, not very high end nor very low
 specs. It will not be dedicated to just one project i guess. You people
 already have experience in setting up clusters, that's the reason i posted
 it here :)





 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/General-Purpose-Spark-Cluster-Hardware-Requirements-tp21963.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Movie Recommendation tutorial

2015-02-24 Thread Krishna Sankar
Yep, much better with 0.1.

The best model was trained with rank = 12 and lambda = 0.1, and numIter =
20, and its RMSE on the test set is 0.869092 (Spark 1.3.0)

Question: What is the intuition behind an RMSE of 0.86 vs 1.3? I know the
smaller the better. But is it that much better? And what is a good number for a
recommendation engine?

Cheers
k/

On Tue, Feb 24, 2015 at 1:03 AM, Guillaume Charhon 
guilla...@databerries.com wrote:

 I am using Spark 1.2.1.

 Thank you Krishna, I am getting almost the same results as you so it must
 be an error in the tutorial. Xiangrui, I made some additional tests with
 lambda to 0.1 and I am getting a much better rmse:

 RMSE (validation) = 0.868981 for the model trained with rank = 8, lambda =
 0.1, and numIter = 10.


 RMSE (validation) = 0.869628 for the model trained with rank = 8, lambda =
 0.1, and numIter = 20.


 RMSE (validation) = 1.361321 for the model trained with rank = 8, lambda =
 1.0, and numIter = 10.


 RMSE (validation) = 1.361321 for the model trained with rank = 8, lambda =
 1.0, and numIter = 20.


 RMSE (validation) = 3.755870 for the model trained with rank = 8, lambda =
 10.0, and numIter = 10.


 RMSE (validation) = 3.755870 for the model trained with rank = 8, lambda =
 10.0, and numIter = 20.


 RMSE (validation) = 0.866605 for the model trained with rank = 12, lambda
 = 0.1, and numIter = 10.


 RMSE (validation) = 0.867498 for the model trained with rank = 12, lambda
 = 0.1, and numIter = 20.


 RMSE (validation) = 1.361321 for the model trained with rank = 12, lambda
 = 1.0, and numIter = 10.


 RMSE (validation) = 1.361321 for the model trained with rank = 12, lambda
 = 1.0, and numIter = 20.


 RMSE (validation) = 3.755870 for the model trained with rank = 12, lambda
 = 10.0, and numIter = 10.


 RMSE (validation) = 3.755870 for the model trained with rank = 12, lambda
 = 10.0, and numIter = 20.


 The best model was trained with rank = 12 and lambda = 0.1, and numIter =
 10, and its RMSE on the test set is 0.865407.


 On Tue, Feb 24, 2015 at 7:23 AM, Xiangrui Meng men...@gmail.com wrote:

 Try to set lambda to 0.1. -Xiangrui

 On Mon, Feb 23, 2015 at 3:06 PM, Krishna Sankar ksanka...@gmail.com
 wrote:
   The RMSE varies a little bit between the versions.
   Partitioned the training, validation, test set like so:
  
   training = ratings_rdd_01.filter(lambda x: (x[3] % 10) < 6)
   validation = ratings_rdd_01.filter(lambda x: (x[3] % 10) >= 6 and (x[3] % 10) < 8)
   test = ratings_rdd_01.filter(lambda x: (x[3] % 10) >= 8)
  Validation MSE :
 
  # 1.3.0 Mean Squared Error = 0.871456869392
  # 1.2.1 Mean Squared Error = 0.877305629074
 
  Itertools results:
 
   1.3.0 - RMSE = 1.354839 (rank = 8 and lambda = 1.0, and numIter = 20)
   1.1.1 - RMSE = 1.335831 (rank = 8 and lambda = 1.0, and numIter = 10)
 
  Cheers
  k/
 
  On Mon, Feb 23, 2015 at 12:37 PM, Xiangrui Meng men...@gmail.com
 wrote:
 
  Which Spark version did you use? Btw, there are three datasets from
  MovieLens. The tutorial used the medium one (1 million). -Xiangrui
 
  On Mon, Feb 23, 2015 at 8:36 AM, poiuytrez guilla...@databerries.com
  wrote:
   What do you mean?
  
  
  
   --
   View this message in context:
  
 http://apache-spark-user-list.1001560.n3.nabble.com/Movie-Recommendation-tutorial-tp21769p21771.html
   Sent from the Apache Spark User List mailing list archive at
 Nabble.com.
  
   -
   To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
   For additional commands, e-mail: user-h...@spark.apache.org
  
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 
 





Re: Movie Recommendation tutorial

2015-02-23 Thread Krishna Sankar
   1. The RMSE varies a little bit between the versions.
   2. Partitioned the training, validation, test set like so:
      - training = ratings_rdd_01.filter(lambda x: (x[3] % 10) < 6)
      - validation = ratings_rdd_01.filter(lambda x: (x[3] % 10) >= 6 and
        (x[3] % 10) < 8)
      - test = ratings_rdd_01.filter(lambda x: (x[3] % 10) >= 8)
      - Validation MSE:
         - # 1.3.0 Mean Squared Error = 0.871456869392
         - # 1.2.1 Mean Squared Error = 0.877305629074
   3. Itertools results:
      - 1.3.0 - RMSE = 1.354839 (rank = 8 and lambda = 1.0, and numIter = 20)
      - 1.1.1 - RMSE = 1.335831 (rank = 8 and lambda = 1.0, and numIter = 10)

Cheers
k/

On Mon, Feb 23, 2015 at 12:37 PM, Xiangrui Meng men...@gmail.com wrote:

 Which Spark version did you use? Btw, there are three datasets from
 MovieLens. The tutorial used the medium one (1 million). -Xiangrui

 On Mon, Feb 23, 2015 at 8:36 AM, poiuytrez guilla...@databerries.com
 wrote:
  What do you mean?
 
 
 
  --
  View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Movie-Recommendation-tutorial-tp21769p21771.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: randomSplit instead of a huge map reduce ?

2015-02-21 Thread Krishna Sankar
   - Divide and conquer with reduceByKey (like Ashish mentioned, with each pair
   being the key) would work - it looks like a mapReduce-with-combiners
   problem (a sketch follows this list). I think reduceByKey would use
   combiners while aggregateByKey wouldn't.
   - Could we optimize this further by using combineByKey directly ?
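
A rough sketch of the reduceByKey approach (shown on the Python side for brevity;
`docs` stands in for the RDD of String arrays from the question):

    from itertools import combinations

    pairCounts = (docs
        .flatMap(lambda words: combinations(sorted(set(words)), 2))  # each unordered pair once per array
        .map(lambda pair: (pair, 1))
        .reduceByKey(lambda a, b: a + b))   # map-side combine keeps most of the ~230M raw pairs off the network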

Cheers
k/

On Fri, Feb 20, 2015 at 6:39 PM, Ashish Rangole arang...@gmail.com wrote:

 Is there a check you can put in place to not create pairs that aren't in
 your set of 20M pairs? Additionally, once you have your arrays converted to
 pairs you can do aggregateByKey with each pair being the key.
 On Feb 20, 2015 1:57 PM, shlomib shl...@summerhq.com wrote:

 Hi,

 I am new to Spark and I think I missed something very basic.

 I have the following use case (I use Java and run Spark locally on my
 laptop):


 I have a JavaRDD<String[]>

 - The RDD contains around 72,000 arrays of strings (String[])

 - Each array contains 80 words (on average).


 What I want to do is to convert each array into a new array/list of pairs,
 for example:

 Input: String[] words = ['a', 'b', 'c']

 Output: List[String, String] pairs = [('a', 'b'), ('a', 'c'), ('b', 'c')]

 and then I want to count the number of times each pair appeared, so my
 final
 output should be something like:

 Output: List[String, String, Integer] result = [('a', 'b', 3), ('a', 'c',
 8), ('b', 'c', 10)]


 The problem:

 Since each array contains around 80 words, it returns around 3,200 pairs,
 so
 after “mapping” my entire RDD I get 3,200 * 72,000 = *230,400,000* pairs
 to
 reduce which require way too much memory.

 (I know I have only around *20,000,000* unique pairs!)

 I already modified my code and used 'mapPartitions' instead of 'map'. It
 definitely improved the performance, but I still feel I'm doing something
 completely wrong.


 I was wondering if this is the right 'Spark way' to solve this kind of
 problem, or maybe I should do something like splitting my original RDD
 into
 smaller parts (by using randomSplit), then iterate over each part,
 aggregate
 the results into some result RDD (by using 'union') and move on to the
 next
 part.


 Can anyone please explain me which solution is better?


 Thank you very much,

 Shlomi.




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/randomSplit-instead-of-a-huge-map-reduce-tp21744.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: spark-shell working in scala-2.11

2015-01-28 Thread Krishna Sankar
Stephen,
   Scala 2.11 worked fine for me. I did the dev change and then compiled. Not
using it in production, but I go back and forth between 2.10 & 2.11.
Cheers
k/

On Wed, Jan 28, 2015 at 12:18 PM, Stephen Haberman 
stephen.haber...@gmail.com wrote:

 Hey,

 I recently compiled Spark master against scala-2.11 (by running
 the dev/change-versions script), but when I run spark-shell,
 it looks like the sc variable is missing.

 Is this a known/unknown issue? Are others successfully using
 Spark with scala-2.11, and specifically spark-shell?

 It is possible I did something dumb while compiling master,
 but I'm not sure what it would be.

 Thanks,
 Stephen

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




[no subject]

2015-01-10 Thread Krishna Sankar
Guys,

registerTempTable("Employees")

gives me the error

Exception in thread "main" scala.ScalaReflectionException: class
org.apache.spark.sql.catalyst.ScalaReflection in JavaMirror with primordial
classloader with boot classpath
[/Applications/eclipse/plugins/org.scala-lang.scala-library_2.11.4.v20141023-110636-d783face36.jar:/Applications/eclipse/plugins/org.scala-lang.scala-reflect_2.11.4.v20141023-110636-d783face36.jar:/Applications/eclipse/plugins/org.scala-lang.scala-actors_2.11.4.v20141023-110636-d783face36.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_60.jdk/Contents/Home/jre/lib/resources.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_60.jdk/Contents/Home/jre/lib/rt.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_60.jdk/Contents/Home/jre/lib/sunrsasign.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_60.jdk/Contents/Home/jre/lib/jsse.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_60.jdk/Contents/Home/jre/lib/jce.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_60.jdk/Contents/Home/jre/lib/charsets.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_60.jdk/Contents/Home/jre/lib/jfr.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_60.jdk/Contents/Home/jre/classes]
not found.


Probably something obvious I am missing.

Everything else works fine, so far.

Any easy fix ?

Cheers

k/


Re: DeepLearning and Spark ?

2015-01-09 Thread Krishna Sankar
I am also looking at this domain. We could potentially use the broadcast
capability in Spark to distribute the parameters. Haven't thought it through
yet - a rough sketch of the idea is below.
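
A very rough sketch of that idea (every name is hypothetical - `data` is an RDD of
training examples and `local_gradient` is a stand-in for whatever the model computes):

    import numpy as np

    weights = np.zeros(1000)                       # hypothetical parameter vector
    for step in range(10):
        bw = sc.broadcast(weights)                 # ship the current parameters to the executors once
        grad = (data.map(lambda x: local_gradient(bw.value, x))
                    .reduce(lambda a, b: a + b))   # aggregate partial gradients on the driver
        weights = weights - 0.01 * grad
        bw.unpersist()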
Cheers
k/

On Fri, Jan 9, 2015 at 2:56 PM, Andrei faithlessfri...@gmail.com wrote:

 Does it makes sense to use Spark's actor system (e.g. via
 SparkContext.env.actorSystem) to create parameter server?

 On Fri, Jan 9, 2015 at 10:09 PM, Peng Cheng rhw...@gmail.com wrote:

 You are not the first :) probably not the fifth to have the question.
 parameter server is not included in spark framework and I've seen all
 kinds of hacking to improvise it: REST api, HDFS, tachyon, etc.
 Not sure if an 'official' benchmark  implementation will be released soon

 On 9 January 2015 at 10:59, Marco Shaw marco.s...@gmail.com wrote:

 Pretty vague on details:


 http://www.datasciencecentral.com/m/blogpost?id=6448529%3ABlogPost%3A227199


 On Jan 9, 2015, at 11:39 AM, Jaonary Rabarisoa jaon...@gmail.com
 wrote:

 Hi all,

 DeepLearning algorithms are popular and achieve many state of the art
 performance in several real world machine learning problems. Currently
 there are no DL implementation in spark and I wonder if there is an ongoing
 work on this topics.

 We can do DL in spark Sparkling water and H2O but this adds an
 additional software stack.

 Deeplearning4j seems to implement a distributed version of many popular
 DL algorithms. Porting DL4j to Spark could be interesting.

 Google describes an implementation of a large scale DL in this paper
 http://research.google.com/archive/large_deep_networks_nips2012.html.
 Based on model parallelism and data parallelism.

 So, I'm trying to imagine what a good design for a DL algorithm
 in Spark would be. Spark already has RDDs (for data parallelism). Can GraphX be
 used for the model parallelism (as DNNs are generally designed as DAGs)? And
 what about using GPUs for local parallelism (a mechanism to push partitions
 into GPU memory)?


 What do you think about this ?


 Cheers,

 Jao






Re: Re: I think I am almost lost in the internals of Spark

2015-01-06 Thread Krishna Sankar
Interestingly Google Chrome translates the materials.
Cheers
k/

On Tue, Jan 6, 2015 at 7:26 PM, Boromir Widas vcsub...@gmail.com wrote:

 I do not understand Chinese but the diagrams on that page are very helpful.

 On Tue, Jan 6, 2015 at 9:46 PM, eric wong win19...@gmail.com wrote:

 A good beginning if you are chinese.

 https://github.com/JerryLead/SparkInternals/tree/master/markdown

 2015-01-07 10:13 GMT+08:00 bit1...@163.com bit1...@163.com:

 Thank you, Tobias. I will look into  the Spark paper. But it looks that
 the paper has been moved,
 http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf.
 A web page is returned (Resource not found)when I access it.

 --
 bit1...@163.com


 *From:* Tobias Pfeiffer t...@preferred.jp
 *Date:* 2015-01-07 09:24
 *To:* Todd bit1...@163.com
 *CC:* user user@spark.apache.org
 *Subject:* Re: I think I am almost lost in the internals of Spark
 Hi,

 On Tue, Jan 6, 2015 at 11:24 PM, Todd bit1...@163.com wrote:

 I am a bit new to Spark, except that I tried simple things like word
 count, and the examples given in the spark sql programming guide.
 Now, I am investigating the internals of Spark, but I think I am almost
 lost, because I could not grasp a whole picture what spark does when it
 executes the word count.


 I recommend understanding what an RDD is and how it is processed, using

 http://spark.apache.org/docs/latest/programming-guide.html#resilient-distributed-datasets-rdds
 and probably also
   http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
   (once the server is back).
 Understanding how an RDD is processed is probably most helpful to
 understand the whole of Spark.

 Tobias




 --
 王海华





Re: Spark for core business-logic? - Replacing: MongoDB?

2015-01-03 Thread Krishna Sankar
Alec,
   Good questions. Suggestions:

   1. Refactor the problem into layers viz. DFS, Data Store, DB, SQL Layer,
   Cache, Queue, App Server, App (Interface), App (backend ML) et al.
   2. Then slot-in the appropriate technologies - may be even multiple
   technologies for the same layer and then work thru the pros and cons.
   3. Looking at the layers (moving from the easy to difficult, the mundane
   to the esoteric ;o)):
   - Cache & Queue - stick with what you are comfortable with, i.e. Redis
   et al. Also take a look at Kafka
  - App Server - Tomcat et al
  - App (Interface) - JavaScript et al
   - DB, SQL Layer - Better off with MongoDB. You can explore
   HBase, but it is not the same.
 - The same way as MongoDB != mySQL, HBase != MongoDB
  - Machine Learning Server/Layer - Spark would fit very well here.
  - Machine Learning DFS, Data Store - HDFS
  - The idea of pushing the data to Hadoop for ML is good
 - But you need to think thru things like incremental data load,
 semantics like at least once, at most once et al.
  4. You could architect all with the Hadoop eco system. It might work,
   depending on the system.
  - But I would use caution. Most probably many of the elements would
  rather be implemented in appropriate technologies.
   5. Double-click a couple more times on the design; think thru the
   functionality, scaling requirements et al
   - Draw 3 or 4 alternatives and jot down the top 5 requirements, pros
   and cons, the knowns and the unknowns
   - The optimum design will fall out

Cheers
k/

On Sat, Jan 3, 2015 at 4:43 PM, Alec Taylor alec.tayl...@gmail.com wrote:

 In the middle of doing the architecture for a new project, which has
 various machine learning and related components, including:
 recommender systems, search engines and sequence [common intersection]
 matching.

 Usually I use: MongoDB (as db), Redis (as cache) and celery (as queue,
 backed by Redis).

 Though I don't have experience with Hadoop, I was thinking of using
 Hadoop for the machine-learning (as this will become a Big Data
 problem quite quickly). To push the data into Hadoop, I would use a
 connector of some description, or push the MongoDB backups into HDFS
 at set intervals.

 However I was thinking that it might be better to put the whole thing
 in Hadoop, store all persistent data in Hadoop, and maybe do all the
 layers in Apache Spark (with caching remaining in Redis).

 Is that a viable option? - Most of what I see discusses Spark (and
 Hadoop in general) for analytics only. Apache Phoenix exposes a nice
 interface for read/write over HBase, so I might use that if Spark ends
 up being the wrong solution.

 Thanks for all suggestions,

 Alec Taylor

 PS: I need this for both Big and Small data. Note that I am using
 the Cloudera definition of Big Data referring to processing/storage
 across more than 1 machine.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Calling ALS-MlLib from desktop application/ Training ALS

2014-12-13 Thread Krishna Sankar
a) There is no absolute RMSE - it depends on the domain. Also, the RMSE is the
error based on what you have seen so far, a snapshot of a slice of the
domain.
b) My suggestion is to put the system in place, see what happens when users
interact with the system, and then you can think of reducing the RMSE as
needed. For all you know, the RMSE could go up with another set of data.
c) I would prefer Scala, but Java would work as well.
d) For a desktop app, you have two ways to go:
either run Spark on the local machine and build an app, or
have Spark run on a server/cluster and build a browser app. This
depends on the data size and scaling requirements (a sketch of the local
option follows).
e) I haven't seen any C# interfaces. Might be a good request candidate.
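
A minimal sketch of the run-Spark-locally-inside-the-app option (Python used here
for brevity; the same idea applies from Java/Scala, and the tiny ratings set is
made up):

    from pyspark import SparkConf, SparkContext
    from pyspark.mllib.recommendation import ALS, Rating

    sc = SparkContext(conf=SparkConf().setMaster("local[*]").setAppName("als-desktop"))
    ratings = sc.parallelize([Rating(1, 10, 4.0), Rating(1, 20, 3.0), Rating(2, 10, 5.0)])
    model = ALS.train(ratings, rank=10, iterations=10, lambda_=0.1)
    print(model.predict(2, 20))                    # predicted score for (user 2, product 20)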
Cheers
k/

On Sat, Dec 13, 2014 at 7:17 PM, Saurabh Agrawal saurabh.agra...@markit.com
 wrote:


 Requesting guidance on my queries in trail email.



 -Original Message-
 *From: *Saurabh Agrawal
 *Sent: *Saturday, December 13, 2014 07:06 PM GMT Standard Time
 *To: *user@spark.apache.org
 *Subject: *Building Desktop application for ALS-MlLib/ Training ALS



 Hi,



 I am a new bee in spark and scala world



 I have been trying to implement Collaborative filtering using MlLib
 supplied out of the box with Spark and Scala



 I have 2 problems



  1.   The best model was trained with rank = 20 and lambda = 5.0, and
  numIter = 10, and its RMSE on the test set is 25.718710831912485. The best
  model improves the baseline by 18.29%. Is there a scientific way in which
  the RMSE could be brought down? What is a decent acceptable value for RMSE?

 2.   I picked up the Collaborative filtering algorithm from
 http://ampcamp.berkeley.edu/5/exercises/movie-recommendation-with-mllib.html
 and executed the given code with my dataset. Now, I want to build a
 desktop application around it.

 a.   What is the best language to do this Java/ Scala? Any
 possibility to do this using C#?

 b.  Can somebody please share any relevant documents/ source or any
 helper links to help me get started on this?



 Your help is greatly appreciated



 Thanks!!



 Regards,

 Saurabh Agrawal

 --
 This e-mail, including accompanying communications and attachments, is
 strictly confidential and only for the intended recipient. Any retention,
 use or disclosure not expressly authorised by Markit is prohibited. This
 email is subject to all waivers and other terms at the following link:
 http://www.markit.com/en/about/legal/email-disclaimer.page

 Please visit http://www.markit.com/en/about/contact/contact-us.page? for
 contact information on our offices worldwide.

 MarkitSERV Limited has its registered office located at Level 4, Ropemaker
 Place, 25 Ropemaker Street, London, EC2Y 9LY and is authorized and
 regulated by the Financial Conduct Authority with registration number 207294



Re: Spark or MR, Scala or Java?

2014-11-23 Thread Krishna Sankar
Good point.
On the positive side, whether we choose the most efficient mechanism in
Scala might not be as important, as the Spark framework mediates the
distributed computation. Even if there is some declarative part in Spark,
we can still choose an inefficient computation path that is not apparent to
the framework.
Cheers
k/
P.S: Now Reply to ALL

On Sun, Nov 23, 2014 at 11:44 AM, Ognen Duzlevski ognen.duzlev...@gmail.com
 wrote:

 On Sun, Nov 23, 2014 at 1:03 PM, Ashish Rangole arang...@gmail.com
 wrote:

 Java or Scala : I knew Java already yet I learnt Scala when I came across
 Spark. As others have said, you can get started with a little bit of Scala
 and learn more as you progress. Once you have started using Scala for a few
 weeks you would want to stay with it instead of going back to Java. Scala
 is arguably more elegant and less verbose than Java which translates into
 higher developer productivity and more maintainable code.


 Scala is arguably more elegant and less verbose than Java. However, Scala
 is also a complex language with a lot of details and tidbits and one-offs
 that you just have to remember.  It is sometimes difficult to make a
 decision whether what you wrote is the using the language features most
 effectively or if you missed out on an available feature that could have
 made the code better or more concise. For Spark you really do not need to
 know that much Scala but you do need to understand the essence of it.

 Thanks for the good discussion! :-)
 Ognen



Re: Spark or MR, Scala or Java?

2014-11-23 Thread Krishna Sankar
A very timely article
http://rahulkavale.github.io/blog/2014/11/16/scrap-your-map-reduce/
Cheers
k/
P.S: Now reply to ALL.

On Sun, Nov 23, 2014 at 7:16 PM, Krishna Sankar ksanka...@gmail.com wrote:

 Good point.
 On the positive side, whether we choose the most efficient mechanism in
 Scala might not be as important, as the Spark framework mediates the
 distributed computation. Even if there is some declarative part in Spark,
 we can still choose an inefficient computation path that is not apparent to
 the framework.
 Cheers
 k/
 P.S: Now Reply to ALL

 On Sun, Nov 23, 2014 at 11:44 AM, Ognen Duzlevski 
 ognen.duzlev...@gmail.com wrote:

 On Sun, Nov 23, 2014 at 1:03 PM, Ashish Rangole arang...@gmail.com
 wrote:

 Java or Scala : I knew Java already yet I learnt Scala when I came
 across Spark. As others have said, you can get started with a little bit of
 Scala and learn more as you progress. Once you have started using Scala for
 a few weeks you would want to stay with it instead of going back to Java.
 Scala is arguably more elegant and less verbose than Java which translates
 into higher developer productivity and more maintainable code.


 Scala is arguably more elegant and less verbose than Java. However, Scala
 is also a complex language with a lot of details and tidbits and one-offs
 that you just have to remember.  It is sometimes difficult to make a
  decision whether what you wrote is using the language features most
 effectively or if you missed out on an available feature that could have
 made the code better or more concise. For Spark you really do not need to
 know that much Scala but you do need to understand the essence of it.

 Thanks for the good discussion! :-)
 Ognen





Re: Spark or MR, Scala or Java?

2014-11-22 Thread Krishna Sankar
Adding to already interesting answers:

   - Is there any case where MR is better than Spark? I don't know in which
   cases I should use Spark vs. MR. When is MR faster than Spark?
   - Many. MR would be better (am not saying faster ;o)) for:
 - Very large dataset,
 - Multistage map-reduce flows,
 - Complex map-reduce semantics
  - Spark is definitely better for the classic iterative,interactive
  workloads.
   - Spark is very effective for implementing the concepts of in-memory
   datasets & real-time analytics
  - Take a look at the Lambda architecture
   - Also check out how Ooyala is using Spark in multiple layers &
   configurations. They also have MR in many places
  - In our case, we found Spark very effective for ELT - we would have
  used MR earlier
    -  I know Java; is it worth it to learn Scala for programming Spark,
    or is it okay with just Java?
    - Java will work fine. Especially when Java 8 becomes the norm, we will
   get back some of the elegance
   - I, personally, like Scala & Python a lot better than Java. Scala is a
   lot more elegant, but compilation, IDE integration et al are still clunky
   - One word of caution - stick with one language as much as
   possible; shuffling between Java & Scala is not fun

Cheers & HTH
k/

On Sat, Nov 22, 2014 at 8:26 AM, Sean Owen so...@cloudera.com wrote:

 MapReduce is simpler and narrower, which also means it is generally
 lighter weight, with less to know and configure, and runs more predictably.
 If you have a job that is truly just a few maps, with maybe one reduce, MR
 will likely be more efficient. Until recently its shuffle has been more
 developed and offers some semantics the Spark shuffle does not.

 I suppose it integrates with tools like Oozie, that Spark does not.

 I suggest learning enough Scala to use Spark in Scala. The amount you need
 to know is not large.

 (Mahout MR based implementations do not run on Spark and will not. They
 have been removed instead.)
 On Nov 22, 2014 3:36 PM, Guillermo Ortiz konstt2...@gmail.com wrote:

 Hello,

 I'm a newbie with Spark but I've been working with Hadoop for a while.
 I have two questions.

  Is there any case where MR is better than Spark? I don't know in which
  cases I should use Spark vs. MR. When is MR faster than Spark?

  The other question is: I know Java; is it worth it to learn Scala for
  programming Spark, or is it okay with just Java? I have done a little
  piece of code in Java because I feel more confident with it, but it
  seems that I'm missing something.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Breaking the previous large-scale sort record with Spark

2014-10-13 Thread Krishna Sankar
Well done guys. MapReduce sort at that time was a good feat and Spark now
has raised the bar with the ability to sort a PB.
Like some of the folks on the list have asked, a summary of what worked (and
what didn't), as well as the monitoring practices, would be good.
Cheers
k/
P.S: What are you folks planning next ?

On Fri, Oct 10, 2014 at 7:54 AM, Matei Zaharia matei.zaha...@gmail.com
wrote:

 Hi folks,

 I interrupt your regularly scheduled user / dev list to bring you some
 pretty cool news for the project, which is that we've been able to use
 Spark to break MapReduce's 100 TB and 1 PB sort records, sorting data 3x
 faster on 10x fewer nodes. There's a detailed writeup at
 http://databricks.com/blog/2014/10/10/spark-breaks-previous-large-scale-sort-record.html.
 Summary: while Hadoop MapReduce held last year's 100 TB world record by
 sorting 100 TB in 72 minutes on 2100 nodes, we sorted it in 23 minutes on
 206 nodes; and we also scaled up to sort 1 PB in 234 minutes.

 I want to thank Reynold Xin for leading this effort over the past few
 weeks, along with Parviz Deyhim, Xiangrui Meng, Aaron Davidson and Ali
 Ghodsi. In addition, we'd really like to thank Amazon's EC2 team for
 providing the machines to make this possible. Finally, this result would of
 course not be possible without the many many other contributions, testing
 and feature requests from throughout the community.

 For an engine to scale from these multi-hour petabyte batch jobs down to
 100-millisecond streaming and interactive queries is quite uncommon, and
 it's thanks to all of you folks that we are able to make this happen.

 Matei
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Can not see any spark metrics on ganglia-web

2014-10-02 Thread Krishna Sankar
Hi,
   I am sure you can use just the -Pspark-ganglia-lgpl switch to enable Ganglia.
The other flags only add support for Hadoop, YARN, Hive et al in the Spark
executable; there is no need to include them if one is not using them.
Cheers
k/

On Thu, Oct 2, 2014 at 12:29 PM, danilopds danilob...@gmail.com wrote:

 Hi tsingfu,

 I want to see metrics in ganglia too.
 But I don't understand this step:
 ./make-distribution.sh --tgz --skip-java-test -Phadoop-2.3 -Pyarn -Phive
 -Pspark-ganglia-lgpl

 Are you installing the hadoop, yarn, hive AND ganglia??

 If I want to install just ganglia?
 Can you suggest me something?

 Thanks!



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Can-not-see-any-spark-metrics-on-ganglia-web-tp14981p15631.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: MLlib Linear Regression Mismatch

2014-10-01 Thread Krishna Sankar
Thanks Burak. Step size 0.01 worked for b) and step=0.0001 for c) !
Cheers
k/
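
P.S: For anyone hitting the same divergence from Scala, a minimal sketch with an
explicit (smaller) step size - the data, values and names below are only
illustrative, not the poster's code:

import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

val sc = new SparkContext("local", "LRStepSize")
val data = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(0.0)),
  LabeledPoint(9.0, Vectors.dense(10.0)),
  LabeledPoint(22.0, Vectors.dense(20.0)),
  LabeledPoint(32.0, Vectors.dense(30.0))))
// train(input, numIterations, stepSize) - a smaller step keeps SGD from diverging
val model = LinearRegressionWithSGD.train(data, 100, 0.01)
println(model.weights)
println(model.intercept)
println(model.predict(Vectors.dense(40.0)))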

On Wed, Oct 1, 2014 at 3:00 PM, Burak Yavuz bya...@stanford.edu wrote:

 Hi,

 It appears that the step size is too high that the model is diverging with
 the added noise.
 Could you try by setting the step size to be 0.1 or 0.01?

 Best,
 Burak

 - Original Message -
 From: Krishna Sankar ksanka...@gmail.com
 To: user@spark.apache.org
 Sent: Wednesday, October 1, 2014 12:43:20 PM
 Subject: MLlib Linear Regression Mismatch

 Guys,
Obviously I am doing something wrong. May be 4 points are too small a
 dataset.
 Can you help me to figure out why the following doesn't work ?
 a) This works :

 data = [
LabeledPoint(0.0, [0.0]),
LabeledPoint(10.0, [10.0]),
LabeledPoint(20.0, [20.0]),
LabeledPoint(30.0, [30.0])
 ]
 lrm = LinearRegressionWithSGD.train(sc.parallelize(data),
 initialWeights=array([1.0]))
 print lrm
 print lrm.weights
 print lrm.intercept
 lrm.predict([40])

 output:
 <pyspark.mllib.regression.LinearRegressionModel object at 0x109813d50>

 [ 1.]
 0.0

 40.0

 b) By perturbing the y a little bit, the model gives wrong results:

 data = [
LabeledPoint(0.0, [0.0]),
LabeledPoint(9.0, [10.0]),
LabeledPoint(22.0, [20.0]),
LabeledPoint(32.0, [30.0])
 ]
 lrm = LinearRegressionWithSGD.train(sc.parallelize(data),
 initialWeights=array([1.0])) # should be 1.09x -0.60
 print lrm
 print lrm.weights
 print lrm.intercept
 lrm.predict([40])

 Output:
 <pyspark.mllib.regression.LinearRegressionModel object at 0x109666590>

 [ -8.20487463e+203]
 0.0

 -3.2819498532740317e+205

 c) Same story here - wrong results. Actually nan:

 data = [
LabeledPoint(18.9, [3910.0]),
LabeledPoint(17.0, [3860.0]),
LabeledPoint(20.0, [4200.0]),
LabeledPoint(16.6, [3660.0])
 ]
 lrm = LinearRegressionWithSGD.train(sc.parallelize(data),
 initialWeights=array([1.0])) # should be ~ 0.006582x -7.595170
 print lrm
 print lrm.weights
 print lrm.intercept
 lrm.predict([4000])

 Output: <pyspark.mllib.regression.LinearRegressionModel object at
 0x109666b90>

 [ nan]
 0.0

 nan

 Cheers  Thanks
 k/




Re: MLlib 1.2 New Interesting Features

2014-09-29 Thread Krishna Sankar
Thanks Xiangrui. Appreciate the insights.
I have uploaded the initial version of my presentation at
http://goo.gl/1nBD8N
Cheers
k/

On Mon, Sep 29, 2014 at 12:17 AM, Xiangrui Meng men...@gmail.com wrote:

 Hi Krishna,

 Some planned features for MLlib 1.2 can be found via Spark JIRA:
 http://bit.ly/1ywotkm , though this list is not fixed. The feature
 freeze will happen by the end of Oct. Then we will cut branch-1.2 and
 start QA. I don't recommend using branch-1.2 for hands-on tutorial
 around Oct 29th because that branch is not full tested at that time.
 You should use 1.1 instead. Its binary packages and documentation can
 be easily found on spark.apache.org, which is important for making
 hands-on tutorial.

 Best,
 Xiangrui

 On Sat, Sep 27, 2014 at 12:15 PM, Krishna Sankar ksanka...@gmail.com
 wrote:
  Guys,
 
  Need help in terms of the interesting features coming up in MLlib 1.2.
  I have a 2 Part, ~3 hr hands-on tutorial at the Big Data Tech Con
 
  The Hitchhiker's Guide to Machine Learning with Python & Apache
 Spark [2]
 
  At minimum, it would be good to take the last 30 min to elaborate the new
  features in MLlib coming up in 1.2.
 
  If the features are stable, I might use 1.2 for the tutorial.
  It is hands-on, so want to use a stable Spark version.
 
  What are the salient ML features slated to be part of 1.2 ?
  Which branch should I look at ?
  Will it be stable enough by Oct 29th for the attendees to download ?
 Then I
  can plan the materials around it.
 
  Cheers
  k/
 
 
  [1] My two talks :
 http://www.bigdatatechcon.com/speakers.html#KrishnaSankar
  [2] Spark Talk : http://goo.gl/4Pcvuq
 



MLlib 1.2 New Interesting Features

2014-09-27 Thread Krishna Sankar
Guys,

   - Need help in terms of the interesting features coming up in MLlib 1.2.
   - I have a 2 Part, ~3 hr hands-on tutorial at the Big Data Tech Con
   - The Hitchhiker's Guide to Machine Learning with Python & Apache
   Spark [2]
   - At minimum, it would be good to take the last 30 min to elaborate the
   new features in MLlib coming up in 1.2.
  - If the features are stable, I might use 1.2 for the tutorial.
  - It is hands-on, so want to use a stable Spark version.


   1. What are the salient ML features slated to be part of 1.2 ?
   2. Which branch should I look at ?
   3. Will it be stable enough by Oct 29th for the attendees to download ?
   Then I can plan the materials around it.

Cheers
k/


[1] My two talks : http://www.bigdatatechcon.com/speakers.html#KrishnaSankar
[2] Spark Talk : http://goo.gl/4Pcvuq


Re: Out of any idea

2014-07-19 Thread Krishna Sankar
Probably you have - if not, try a very simple app in the docker container
and make sure it works. Sometimes resource contention/allocation can get in
the way. This happened to me in the YARN container.
Also try single worker thread.
Cheers
k/


On Sat, Jul 19, 2014 at 2:39 PM, boci boci.b...@gmail.com wrote:

 Hi guys!

 I run out of ideas... I created a spark streaming job (kafka - spark -
 ES).
 If I start my app local machine (inside the editor, but connect to the
 real kafka and ES) the application work correctly.
 If I start it in my docker container (same kafka and ES, local mode
 (local[4]) like inside my editor) the application connects to kafka and
 receives the message, but after that nothing happens. (I set
 config/log4j.properties to debug mode and I see BlockGenerator receive the
 data, but after that nothing happens with it.)
 (As a first step I simply run a map to print the received data with log4j.)

 I hope somebody can help... :(

 b0c1

 --
 Skype: boci13, Hangout: boci.b...@gmail.com



Re: Need help on spark Hbase

2014-07-15 Thread Krishna Sankar
One vector to check is the HBase libraries in the --jars, as in:
spark-submit --class <your class> --master <master url> --jars
hbase-client-0.98.3-hadoop2.jar,commons-csv-1.0-SNAPSHOT.jar,hbase-common-0.98.3-hadoop2.jar,hbase-hadoop2-compat-0.98.3-hadoop2.jar,hbase-it-0.98.3-hadoop2.jar,hbase-protocol-0.98.3-hadoop2.jar,hbase-server-0.98.3-hadoop2.jar,htrace-core-2.04.jar,spark-assembly-1.0.0-hadoop2.2.0.jar
badwclient.jar
This worked for us.
Cheers
k/


On Tue, Jul 15, 2014 at 6:47 AM, Madabhattula Rajesh Kumar 
mrajaf...@gmail.com wrote:

 Hi Team,

 Could you please help me to resolve the issue.

 *Issue *: I'm not able to connect HBase from Spark-submit. Below is my
 code.  When i execute below program in standalone, i'm able to connect to
 Hbase and doing the operation.

 When i execute below program using spark submit ( ./bin/spark-submit )
 command, i'm not able to connect to hbase. Am i missing any thing?


 import java.util.HashMap;
 import java.util.List;
 import java.util.Map;
 import java.util.Properties;

 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.hbase.HBaseConfiguration;
 import org.apache.hadoop.hbase.client.Put;
 import org.apache.log4j.Logger;
 import org.apache.spark.SparkConf;
 import org.apache.spark.api.java.JavaRDD;
 import org.apache.spark.api.java.function.Function;
 import org.apache.spark.streaming.Duration;
 import org.apache.spark.streaming.api.java.JavaDStream;
 import org.apache.spark.streaming.api.java.JavaStreamingContext;
 import org.apache.hadoop.hbase.HTableDescriptor;
 import org.apache.hadoop.hbase.client.HBaseAdmin;

 public class Test {


 public static void main(String[] args) throws Exception {

 JavaStreamingContext ssc = new
 JavaStreamingContext("local", "Test", new Duration(4), sparkHome, "");

 JavaDStream<String> lines_2 = ssc.textFileStream(hdfsfolderpath);

 Configuration configuration = HBaseConfiguration.create();
 configuration.set("hbase.zookeeper.property.clientPort", "2181");
 configuration.set("hbase.zookeeper.quorum", "localhost");
 configuration.set("hbase.master", "localhost:60");

 HBaseAdmin hBaseAdmin = new HBaseAdmin(configuration);

 if (hBaseAdmin.tableExists("HABSE_TABLE")) {
 System.out.println(" ANA_DATA table exists ..");
 }

 System.out.println(" HELLO HELLO HELLO ");

 ssc.start();
 ssc.awaitTermination();

 }
 }

 Thank you for your help and support.

 Regards,
 Rajesh



Re: Requirements for Spark cluster

2014-07-09 Thread Krishna Sankar
I rsync the spark-1.0.1 directory to all the nodes. Yep, one needs Spark in
all the nodes irrespective of Hadoop/YARN.
Cheers
k/


On Tue, Jul 8, 2014 at 6:24 PM, Robert James srobertja...@gmail.com wrote:

 I have a Spark app which runs well on local master.  I'm now ready to
 put it on a cluster.  What needs to be installed on the master? What
 needs to be installed on the workers?

 If the cluster already has Hadoop or YARN or Cloudera, does it still
 need an install of Spark?



Re: Unable to run Spark 1.0 SparkPi on HDP 2.0

2014-07-07 Thread Krishna Sankar
Konstantin,

   1. You need to install the hadoop rpms on all nodes. If it is Hadoop 2,
   the nodes would have hdfs & YARN.
   2. Then you need to install Spark on all nodes. I haven't had experience
   with HDP, but the tech preview might have installed Spark as well.
   3. In the end, one should have hdfs, yarn & spark installed on all the
   nodes.
   4. After installations, check the web console to make sure hdfs, yarn &
   spark are running.
   5. Then you are ready to start experimenting/developing spark
   applications.

HTH.
Cheers
k/


On Mon, Jul 7, 2014 at 2:34 AM, Konstantin Kudryavtsev 
kudryavtsev.konstan...@gmail.com wrote:

 guys, I'm not talking about running spark on VM, I don have problem with
 it.

 I confused in the next:
 1) Hortonworks describe installation process as RPMs on each node
 2) spark home page said that everything I need is YARN

 And I'm stuck with understanding what I need to do to run spark on yarn
 (do I need RPM installations or only to build spark on an edge node?)


 Thank you,
 Konstantin Kudryavtsev


 On Mon, Jul 7, 2014 at 4:34 AM, Robert James srobertja...@gmail.com
 wrote:

 I can say from my experience that getting Spark to work with Hadoop 2
 is not for the beginner; after solving one problem after another
 (dependencies, scripts, etc.), I went back to Hadoop 1.

 Spark's Maven, ec2 scripts, and others all use Hadoop 1 - not sure
 why, but, given so, Hadoop 2 has too many bumps

 On 7/6/14, Marco Shaw marco.s...@gmail.com wrote:
  That is confusing based on the context you provided.
 
  This might take more time than I can spare to try to understand.
 
  For sure, you need to add Spark to run it in/on the HDP 2.1 express VM.
 
  Cloudera's CDH 5 express VM includes Spark, but the service isn't
 running by
  default.
 
  I can't remember for MapR...
 
  Marco
 
  On Jul 6, 2014, at 6:33 PM, Konstantin Kudryavtsev
  kudryavtsev.konstan...@gmail.com wrote:
 
  Marco,
 
  Hortonworks provides a Tech Preview of Spark 0.9.1 with HDP 2.1 that
 you
  can try
  from
 
 http://hortonworks.com/wp-content/uploads/2014/05/SparkTechnicalPreview.pdf
   HDP 2.1 means YARN; at the same time they propose to install rpm
 
  On the other hand, http://spark.apache.org/ said:
  Integrated with Hadoop
  Spark can run on Hadoop 2's YARN cluster manager, and can read any
  existing Hadoop data.
 
  If you have a Hadoop 2 cluster, you can run Spark without any
 installation
  needed. 
 
  And this is confusing for me... do I need rpm installation on not?...
 
 
  Thank you,
  Konstantin Kudryavtsev
 
 
  On Sun, Jul 6, 2014 at 10:56 PM, Marco Shaw marco.s...@gmail.com
  wrote:
  Can you provide links to the sections that are confusing?
 
  My understanding, the HDP1 binaries do not need YARN, while the HDP2
  binaries do.
 
  Now, you can also install Hortonworks Spark RPM...
 
  For production, in my opinion, RPMs are better for manageability.
 
  On Jul 6, 2014, at 5:39 PM, Konstantin Kudryavtsev
  kudryavtsev.konstan...@gmail.com wrote:
 
  Hello, thanks for your message... I'm confused: Hortonworks suggests
  installing the spark rpm on each node, but the Spark main page says that yarn
  is enough and I don't need to install it... What's the difference?
 
  sent from my HTC
 
  On Jul 6, 2014 8:34 PM, vs vinayshu...@gmail.com wrote:
  Konstantin,
 
  HWRK provides a Tech Preview of Spark 0.9.1 with HDP 2.1 that you
 can
  try
  from
 
 http://hortonworks.com/wp-content/uploads/2014/05/SparkTechnicalPreview.pdf
 
  Let me know if you see issues with the tech preview.
 
  spark PI example on HDP 2.0
 
  I downloaded spark 1.0 pre-build from
  http://spark.apache.org/downloads.html
  (for HDP2)
  The run example from spark web-site:
  ./bin/spark-submit --class org.apache.spark.examples.SparkPi
  --master
  yarn-cluster --num-executors 3 --driver-memory 2g --executor-memory
 2g
  --executor-cores 1 ./lib/spark-examples-1.0.0-hadoop2.2.0.jar 2
 
  I got error:
  Application application_1404470405736_0044 failed 3 times due to AM
  Container for appattempt_1404470405736_0044_03 exited with
  exitCode: 1
  due to: Exception from container-launch:
  org.apache.hadoop.util.Shell$ExitCodeException:
  at org.apache.hadoop.util.Shell.runCommand(Shell.java:464)
  at org.apache.hadoop.util.Shell.run(Shell.java:379)
  at
 
 org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
  at
 
 org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
  at
 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:283)
  at
 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:79)
  at java.util.concurrent.FutureTask.run(FutureTask.java:262)
  at
 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  at
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  at 

Re: Spark Processing Large Data Stuck

2014-06-21 Thread Krishna Sankar
Hi,

   - I have seen similar behavior before. As far as I can tell, the root
   cause is the out of memory error - verified this by monitoring the memory.
  - I had a 30 GB file and was running on a single machine with 16GB.
  So I knew it would fail.
  - But instead of raising an exception, some part of the system keeps
  on churning.
   - My suggestion is to follow the memory settings for the JVM (try bigger
   settings - see the snippet after this list), make sure the settings are
   propagated to all the workers and finally monitor the memory while the job
   is running.
   - Another vector is to split the file, try with progressively increasing
   size.
   - I also see symptoms of failed connections. While I can't positively
   say that it is a problem, check your topology & network connectivity.
   - Out of curiosity, what kind of machines are you running ? Bare metal ?
   EC2 ? How much memory ? 64 bit OS ?
  - I assume these are big machines and so the resources themselves
  might not be a problem.
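
A minimal sketch of bumping the executor memory from the driver side (the value
is just a placeholder - size it to your workers; the same property can also be
set via spark-submit or spark-defaults.conf):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("PageRank")
  .set("spark.executor.memory", "8g")   // placeholder; match your worker RAM
val sc = new SparkContext(conf)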

Cheers
k/


On Sat, Jun 21, 2014 at 12:55 PM, yxzhao yxz...@ualr.edu wrote:

 I run the pagerank example processing a large data set, 5GB in size, using
 48
 machines. The job got stuck at the time point: 14/05/20 21:32:17, as the
 attached log shows. It was stuck there for more than 10 hours and then I
 killed it at last. But I did not find any information explaining why it was
 stuck. Any suggestions? Thanks.

 Spark_OK_48_pagerank.log
 
 http://apache-spark-user-list.1001560.n3.nabble.com/file/n8075/Spark_OK_48_pagerank.log
 



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Processing-Large-Data-Stuck-tp8075.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: Spark streaming RDDs to Parquet records

2014-06-17 Thread Krishna Sankar
Mahesh,

   - One direction could be: create a parquet schema, convert & save the
   records to hdfs (a rough sketch follows below).
   - This might help
   
https://github.com/massie/spark-parquet-example/blob/master/src/main/scala/com/zenfractal/SparkParquetExample.scala
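
For illustration only - a minimal Spark 1.0 SQL sketch, assuming a case class
Event and a hypothetical parseEvent function that turns the raw Kafka bytes
into Events (both are placeholders, not part of the original question):

case class Event(id: String, value: Long)

ds.foreachRDD { rdd =>
  val sqlContext = new org.apache.spark.sql.SQLContext(rdd.sparkContext)
  import sqlContext.createSchemaRDD            // implicit RDD[Event] => SchemaRDD

  val events = rdd.map(bytes => parseEvent(bytes))   // parseEvent is yours to write
  // each micro-batch lands in its own parquet directory on hdfs
  events.saveAsParquetFile("hdfs:///events/parquet/batch-" + System.currentTimeMillis)
}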

Cheers
k/


On Tue, Jun 17, 2014 at 12:52 PM, maheshtwc 
mahesh.padmanab...@twc-contractor.com wrote:

 Hello,

 Is there an easy way to convert RDDs within a DStream into Parquet records?
 Here is some incomplete pseudo code:

 // Create streaming context
 val ssc = new StreamingContext(...)

 // Obtain a DStream of events
 val ds = KafkaUtils.createStream(...)

 // Get Spark context to get to the SQL context
 val sc = ds.context.sparkContext

 val sqlContext = new org.apache.spark.sql.SQLContext(sc)

 // For each RDD
 ds.foreachRDD((rdd: RDD[Array[Byte]]) => {

 // What do I do next?
 })

 Thanks,
 Mahesh



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-streaming-RDDs-to-Parquet-records-tp7762.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: GroupByKey results in OOM - Any other alternative

2014-06-15 Thread Krishna Sankar
Ian,
   Yep, HLL is an appropriate mechanism. The countApproxDistinctByKey is a
wrapper around the
com.clearspring.analytics.stream.cardinality.HyperLogLogPlus.
Cheers
k/
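
A minimal sketch of both routes, assuming pairs is an RDD[(String, String)] of
(key, value) - the names and the accuracy value are placeholders:

import org.apache.spark.SparkContext._   // PairRDDFunctions implicits

// exact distinct count per key, folding values into a Set (only distinct
// values per key are held in memory, per Sean's suggestion)
val exactDistinct = pairs
  .mapValues(v => Set(v))
  .foldByKey(Set.empty[String])(_ ++ _)
  .mapValues(_.size)

// approximate distinct count per key via HyperLogLog (experimental API)
val approxDistinct = pairs.countApproxDistinctByKey(0.05)   // 0.05 = relative std deviation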


On Sun, Jun 15, 2014 at 4:50 PM, Ian O'Connell i...@ianoconnell.com wrote:

 Depending on your requirements when doing hourly metrics calculating
 distinct cardinality, a much more scalable method would be to use a hyper
 log log data structure.
 a scala impl people have used with spark would be
 https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/HyperLogLog.scala


 On Sun, Jun 15, 2014 at 6:16 AM, Surendranauth Hiraman 
 suren.hira...@velos.io wrote:

 Vivek,

 If the foldByKey solution doesn't work for you, my team uses
 RDD.persist(DISK_ONLY) to avoid OOM errors.

 It's slower, of course, and requires tuning other config parameters. It
 can also be a problem if you do not have enough disk space, meaning that
 you have to unpersist at the right points if you are running long flows.

 For us, even though the disk writes are a performance hit, we prefer the
 Spark programming model to Hadoop M/R. But we are still working on getting
 this to work end to end on 100s of GB of data on our 16-node cluster.

 Suren



 On Sun, Jun 15, 2014 at 12:08 AM, Vivek YS vivek...@gmail.com wrote:

 Thanks for the input. I will give foldByKey a shot.

 The way I am doing is, data is partitioned hourly. So I am computing
 distinct values hourly. Then I use unionRDD to merge them and compute
 distinct on the overall data.

  Is there a way to know which key,value pair is resulting in the OOM ?
  Is there a way to set parallelism in the map stage so that, each
 worker will process one key at time. ?

 I didn't realise countApproxDistinctByKey is using hyperloglogplus.
 This should be interesting.

 --Vivek


 On Sat, Jun 14, 2014 at 11:37 PM, Sean Owen so...@cloudera.com wrote:

 Grouping by key is always problematic since a key might have a huge
 number of values. You can do a little better than grouping *all* values and
 *then* finding distinct values by using foldByKey, putting values into a
 Set. At least you end up with only distinct values in memory. (You don't
 need two maps either, right?)

 If the number of distinct values is still huge for some keys, consider
 the experimental method countApproxDistinctByKey:
 https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L285

 This should be much more performant at the cost of some accuracy.


 On Sat, Jun 14, 2014 at 1:58 PM, Vivek YS vivek...@gmail.com wrote:

 Hi,
For last couple of days I have been trying hard to get around this
 problem. Please share any insights on solving this problem.

 Problem :
 There is a huge list of (key, value) pairs. I want to transform this
 to (key, distinct values) and then eventually to (key, distinct values
 count)

 On small dataset

 groupByKey().map( x => (x._1, x._2.distinct)) ...map(x => (x._1,
 x._2.distinct.count))

 On large data set I am getting OOM.

 Is there a way to represent Seq of values from groupByKey as RDD and
 then perform distinct over it ?

 Thanks
 Vivek






 --

 SUREN HIRAMAN, VP TECHNOLOGY
 Velos
 Accelerating Machine Learning

 440 NINTH AVENUE, 11TH FLOOR
 NEW YORK, NY 10001
 O: (917) 525-2466 ext. 105
 F: 646.349.4063
 E: suren.hiraman@v suren.hira...@sociocast.comelos.io
 W: www.velos.io





Re: Multi-dimensional Uniques over large dataset

2014-06-14 Thread Krishna Sankar
And got the first cut:

val res = pairs.groupByKey().map((x) => (x._1, x._2.size, x._2.toSet.size))
gives the total & unique.

The question: is it scalable & efficient? Would appreciate insights.

Cheers

k/
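
One alternative that avoids materializing all the values for a key at once - a
sketch only, assuming pairs is an RDD[(String, String)] of (c_/a_ dimension key,
d_ value):

import org.apache.spark.SparkContext._   // PairRDDFunctions implicits

// total occurrences per key
val totals  = pairs.mapValues(_ => 1L).reduceByKey(_ + _)
// distinct d_ values per key: deduplicate the (key, value) pairs, then count
val uniques = pairs.distinct().mapValues(_ => 1L).reduceByKey(_ + _)

val res = totals.join(uniques)   // (key, (total, unique))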


On Fri, Jun 13, 2014 at 10:51 PM, Krishna Sankar ksanka...@gmail.com
wrote:

 Answered one of my questions (#5) : val pairs = new
 PairRDDFunctions(RDD) works fine locally. Now I can do groupByKey et al.
 Am not sure if it is scalable for millions of records  memory efficient.
 heers
 k/


 On Fri, Jun 13, 2014 at 8:52 PM, Krishna Sankar ksanka...@gmail.com
 wrote:

 Hi,
Would appreciate insights and wisdom on a problem we are working on:

1. Context:
   - Given a csv file like:
   - d1,c1,a1
   - d1,c1,a2
   - d1,c2,a1
   - d1,c1,a1
   - d2,c1,a3
   - d2,c2,a1
   - d3,c1,a1
   - d3,c3,a1
   - d3,c2,a1
   - d3,c3,a2
   - d5,c1,a3
   - d5,c2,a2
- d5,c3,a2
   - Want to find uniques and totals (of the d_ across the c_ and a_
   dimensions):
   - Tot   Unique
  - c1  6  4
  - c2  4  4
  - c3  2  2
  - a1  7  3
  - a2  4  3
  - a3  2  2
  - c1-a1  ...
  - c1-a2 ...
  - c1-a3 ...
  - c2-a1 ...
  - c2-a2 ...
  - ...
  - c3-a3
   - Obviously there are millions of records and more
   attributes/dimensions. So scalability is key
   2. We think Spark is a good stack for this problem: Have a few
questions:
3. From a Spark substrate perspective, what are some of the optimum
transformations & things to watch out for ?
4. Is PairRDD the best data representation ? GroupByKey et al is only
available for PairRDD.
5. On a pragmatic level, file.map().map() results in RDD. How do I
transform it to a PairRDD ?
   1. .map(fields => (fields(1), fields(0))) - results in Unit
   2. .map(fields => fields(1) -> fields(0)) also is not working
   3. Both these do not result in a PairRDD
   4. Am missing something fundamental.

 Cheers & Have a nice weekend
 k/





Multi-dimensional Uniques over large dataset

2014-06-13 Thread Krishna Sankar
Hi,
   Would appreciate insights and wisdom on a problem we are working on:

   1. Context:
  - Given a csv file like:
  - d1,c1,a1
  - d1,c1,a2
  - d1,c2,a1
  - d1,c1,a1
  - d2,c1,a3
  - d2,c2,a1
  - d3,c1,a1
  - d3,c3,a1
  - d3,c2,a1
  - d3,c3,a2
  - d5,c1,a3
  - d5,c2,a2
  - d5,c3,a2
  - Want to find uniques and totals (of the d_ across the c_ and a_
  dimensions):
  - Tot   Unique
 - c1  6  4
 - c2  4  4
 - c3  2  2
 - a1  7  3
 - a2  4  3
 - a3  2  2
 - c1-a1  ...
 - c1-a2 ...
 - c1-a3 ...
 - c2-a1 ...
 - c2-a2 ...
 - ...
 - c3-a3
  - Obviously there are millions of records and more
  attributes/dimensions. So scalability is key
  2. We think Spark is a good stack for this problem: Have a few
   questions:
   3. From a Spark substrate perspective, what are some of the optimum
   transformations & things to watch out for ?
   4. Is PairRDD the best data representation ? GroupByKey et al is only
   available for PairRDD.
   5. On a pragmatic level, file.map().map() results in RDD. How do I
   transform it to a PairRDD ?
  1. .map(fields => (fields(1), fields(0))) - results in Unit
  2. .map(fields => fields(1) -> fields(0)) also is not working
  3. Both these do not result in a PairRDD
  4. Am missing something fundamental.

Cheers & Have a nice weekend
k/


Re: Multi-dimensional Uniques over large dataset

2014-06-13 Thread Krishna Sankar
Answered one of my questions (#5) : val pairs = new PairRDDFunctions(RDD)
works fine locally. Now I can do groupByKey et al. Am not sure if it is
scalable for millions of records & memory efficient.
Cheers
k/
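
For the record, the more idiomatic route for #5 is to let the implicit
conversion in SparkContext do the wrapping - a minimal sketch (the file name is
a placeholder):

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._   // implicit RDD[(K, V)] => PairRDDFunctions

val sc = new SparkContext("local", "Uniques")
val pairs = sc.textFile("uniques.csv")
  .map(_.split(","))
  .map(fields => (fields(1), fields(0)))   // note the => arrow and the closing paren
val grouped = pairs.groupByKey()           // groupByKey, reduceByKey et al now resolve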


On Fri, Jun 13, 2014 at 8:52 PM, Krishna Sankar ksanka...@gmail.com wrote:

 Hi,
Would appreciate insights and wisdom on a problem we are working on:

1. Context:
   - Given a csv file like:
   - d1,c1,a1
   - d1,c1,a2
   - d1,c2,a1
   - d1,c1,a1
   - d2,c1,a3
   - d2,c2,a1
   - d3,c1,a1
   - d3,c3,a1
   - d3,c2,a1
   - d3,c3,a2
   - d5,c1,a3
   - d5,c2,a2
   - d5,c3,a2
   - Want to find uniques and totals (of the d_ across the c_ and a_
   dimensions):
   - Tot   Unique
  - c1  6  4
  - c2  4  4
  - c3  2  2
  - a1  7  3
  - a2  4  3
  - a3  2  2
  - c1-a1  ...
  - c1-a2 ...
  - c1-a3 ...
  - c2-a1 ...
  - c2-a2 ...
  - ...
  - c3-a3
   - Obviously there are millions of records and more
   attributes/dimensions. So scalability is key
   2. We think Spark is a good stack for this problem: Have a few
questions:
3. From a Spark substrate perspective, what are some of the optimum
transformations & things to watch out for ?
4. Is PairRDD the best data representation ? GroupByKey et al is only
available for PairRDD.
5. On a pragmatic level, file.map().map() results in RDD. How do I
transform it to a PairRDD ?
   1. .map(fields => (fields(1), fields(0))) - results in Unit
   2. .map(fields => fields(1) -> fields(0)) also is not working
   3. Both these do not result in a PairRDD
   4. Am missing something fundamental.

 Cheers & Have a nice weekend
 k/



Re: problem starting the history server on EC2

2014-06-10 Thread Krishna Sankar
Yep, it gives tons of errors. I was able to make it work with sudo. Looks
like an ownership issue.
Cheers
k/


On Tue, Jun 10, 2014 at 6:29 PM, zhen z...@latrobe.edu.au wrote:

 I created a Spark 1.0 cluster on EC2 using the provided scripts. However, I
 do not seem to be able to start the history server on the master node. I
 used the following command:

 ./start-history-server.sh /root/spark_log


 The error message says that the logging directory /root/spark_log does not
 exist. But I have definitely created the directory and made sure everyone
 can read/write/execute in the directory.

 Can you tell me why it  does not work?

 Thank you

 Zhen



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/problem-starting-the-history-server-on-EC2-tp7361.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: How to compile a Spark project in Scala IDE for Eclipse?

2014-06-08 Thread Krishna Sankar
Project -> Properties -> Java Build Path -> Add External Jars
Add the /spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar
Cheers
K/
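
P.S: If you would rather manage the dependency with sbt instead of pointing
Eclipse at the assembly jar by hand, a minimal build.sbt is roughly (versions
are whatever you are targeting):

name := "spark-app"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0"

You can then generate the Eclipse project files with the sbteclipse plugin and
import the project into Scala IDE.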


On Sun, Jun 8, 2014 at 8:06 AM, Carter gyz...@hotmail.com wrote:

 Hi All,

 I just downloaded the Scala IDE for Eclipse. After I created a Spark
 project
 and clicked Run there was an error on this line of code import
 org.apache.spark.SparkContext: object apache is not a member of package
 org. I guess I need to import the Spark dependency into Scala IDE for
 Eclipse, can anyone tell me how to do it? Thanks a lot.





 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/How-to-compile-a-Spark-project-in-Scala-IDE-for-Eclipse-tp7197.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: Trouble launching EC2 Cluster with Spark

2014-06-04 Thread Krishna Sankar
chmod 600 path/FinalKey.pem

Cheers

k/


On Wed, Jun 4, 2014 at 12:49 PM, Sam Taylor Steyer sste...@stanford.edu
wrote:

 Also, once my friend logged in to his cluster he received the error
 Permissions 0644 for 'FinalKey.pem' are too open. This sounds like the
 other problem described. How do we make the permissions more private?

 Thanks very much,
 Sam

 - Original Message -
 From: Sam Taylor Steyer sste...@stanford.edu
 To: user@spark.apache.org
 Sent: Wednesday, June 4, 2014 12:42:04 PM
 Subject: Re: Trouble launching EC2 Cluster with Spark

 Thanks you! The regions advice solved the problem for my friend who was
 getting the key pair does not exist problem. I am still getting the error:

 ERROR:boto:400 Bad Request
 ERROR:boto:<?xml version="1.0" encoding="UTF-8"?>
 <Response><Errors><Error><Code>InvalidParameterValue</Code><Message>Invalid
 value 'null' for protocol. VPC security group rules must specify protocols
 explicitly.</Message></Error></Errors><RequestID>7ff92687-b95a-4a39-94cb-e2d00a6928fd</RequestID></Response>

 This sounds like it could have to do with the access settings of the
 security group, but I don't know how to change. Any advice would be much
 appreciated!

 Sam

 - Original Message -
 From: Krishna Sankar ksanka...@gmail.com
 To: user@spark.apache.org
 Sent: Wednesday, June 4, 2014 8:52:59 AM
 Subject: Re: Trouble launching EC2 Cluster with Spark

 One reason could be that the keys are in a different region. Need to create
 the keys in us-east-1-North Virginia.
 Cheers
 k/


 On Wed, Jun 4, 2014 at 7:45 AM, Sam Taylor Steyer sste...@stanford.edu
 wrote:

  Hi,
 
  I am trying to launch an EC2 cluster from spark using the following
  command:
 
  ./spark-ec2 -k HackerPair -i [path]/HackerPair.pem -s 2 launch
  HackerCluster
 
  I set my access key id and secret access key. I have been getting an
 error
  in the setting up security groups... phase:
 
  Invalid value 'null' for protocol. VPC security groups must specify
  protocols explicitly.
 
  My project partner gets one step further and then gets the error
 
  The key pair 'JamesAndSamTest' does not exist.
 
  Any thoughts as to how we could fix these problems? Thanks a lot!
  Sam
 



Re: Spark Usecase

2014-06-04 Thread Krishna Sankar
Shahab,
   Interesting question. Couple of points (based on the information from
your e-mail)

   1. One can support the use case in Spark as a set of transformations on
   a WIP RDD over a span of time, with the final transformation outputting to a
   processed RDD
  - Spark streaming would be a good data ingestion mechanism - look at
  the system as a pipeline that spans a time window
  - Depending on the cardinality, you would need a correlation id to
  transform the pipeline as you get more data (see the sketch after this list)
  2.  Having said that, you do have to understand what value Spark
   provides, & then design the topology to support that.
  - For example, you could potentially keep all the WIP in HBase & the
  final transformations in Spark RDDs.
  - Or maybe you keep all the WIP in Spark and the final processed
  records in HBase. There is nothing wrong in keeping WIP in Spark, if
  response time to process the incoming data set is important.
   3. Naturally start with a set of ideas, make a few assumptions and do an
   e2e POC. That will clear many of the questions and firm up the design.
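
A minimal Spark Streaming sketch of the "hold partial records until the rest
arrives" idea from point 1 - the source, batch interval, key extraction and
names are placeholders, assuming an existing SparkContext sc:

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._   // pair DStream implicits
import org.apache.spark.streaming.dstream.DStream

val ssc = new StreamingContext(sc, Seconds(60))
ssc.checkpoint("hdfs:///checkpoints/usecase")          // required for stateful ops

// hypothetical: (correlationId, fragment) pairs from the ingest layer
val events: DStream[(String, String)] = ssc.socketTextStream("host", 9999)
  .map(line => (line.split(",")(0), line))

// accumulate fragments per correlation id across batches until complete
val partials = events.updateStateByKey[Seq[String]] {
  (newFragments: Seq[String], state: Option[Seq[String]]) =>
    Some(state.getOrElse(Seq.empty) ++ newFragments)
}

Once a correlation id is complete, write the assembled record out (e.g. to
HBase/HDFS) and return None from the update function so the state for that key
is dropped rather than held in memory for the full day or two.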

HTH.
Cheers
k/


On Wed, Jun 4, 2014 at 6:57 AM, Shahab Yunus shahab.yu...@gmail.com wrote:

 Hello All.

 I have a newbie question.

 We have a use case where huge amount of data will be coming in streams or
 micro-batches of streams and we want to process these streams according to
 some business logic. We don't have to provide extremely low latency
 guarantees but batch M/R will still be slow.

 Now the business logic is such that at the time of emitting the data, we
 might have to hold on to some tuples until we get more information. This
 'more' information is essentially will be coming in streams of future
 streams.

 You can say that this is kind of *word count* use case where we have to 
 *aggregate
 and maintain state across batches of streams.* One thing different here
 is that we might have to* maintain the state or data for a day or two*
 until rest of the data comes in and then we can complete our output.

 1- Questions is that is such is use cases supported in Spark and/or Spark
 Streaming?
 2- Will we be able to persist partially aggregated data until the rest of
 the information comes in later in time? I am mentioning *persistence*
 here that given that the delay can be spanned over a day or two we won't
 want to keep the partial data in memory for so long.

 I know this can be done in Storm but I am really interested in Spark
 because of its close integration with Hadoop. We might not even want to use
 Spark Streaming (which is more of a direct comparison with Storm/Trident)
 given our  application does not have to be real-time in split-second.

 Feel free to direct me to any document or resource.

 Thanks a lot.

 Regards,
 Shahab



Re: Why Scala?

2014-05-29 Thread Krishna Sankar
Nicholas,
   Good question. Couple of thoughts from my practical experience:

   - Coming from R, Scala feels more natural than other languages. The
   functional style & succinctness of Scala are more suited for Data Science than
   other languages. In short, Scala-Spark makes sense for Data Science, ML,
   Data Exploration et al
   - Having said that, occasionally practicality does trump the choice of a
   language - last time I really wanted to use Scala but ended up writing
   in Python! Hope to get a better result this time
   - Language evolution is more of a long-term granularity - we do
   underestimate the velocity & impact. Have seen evolutions through languages
   starting from Cobol, C & CP/M Basic, Turbo Pascal, ... I think Scala will find
   its equilibrium sooner than we think ...

Cheers
k/


On Thu, May 29, 2014 at 5:54 PM, Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 Thank you for the specific points about the advantages Scala provides over
 other languages. Looking at several code samples, the reduction of
 boilerplate code over Java is one of the biggest plusses, to me.

 On Thu, May 29, 2014 at 8:10 PM, Marek Kolodziej mkolod@gmail.com
 wrote:

 I would advise others to form their opinions based on experiencing it for
 themselves, rather than reading what random people say on Hacker News. :)


 Just a nitpick here: What I said was "It looks like the language is fairly
 controversial on [Hacker News]." That was just an observation of what I saw
 on HN, not a statement of my opinion. I know very little about Scala (or
 Java, for that matter) and definitely don't have a well-formed opinion on
 the matter.

 Nick



Re: K-nearest neighbors search in Spark

2014-05-27 Thread Krishna Sankar
Carter,
   Just as a quick & simple starting point for Spark (caveats - lots of
improvements required for scaling, graceful and efficient handling of RDDs et
al):

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

import scala.collection.immutable.ListMap
import scala.collection.immutable.SortedMap

object TopK {
  //
  def getCurrentDirectory = new java.io.File(".").getCanonicalPath
  //
  def distance(x1: List[Int], x2: List[Int]): Double = {
    val dist: Double = math.sqrt(math.pow(x1(1) - x2(1), 2) + math.pow(x1(2) - x2(2), 2))
    dist
  }
  //
  def main(args: Array[String]): Unit = {
    //
    println(getCurrentDirectory)
    val sc = new SparkContext("local", "TopK", "spark://USS-Defiant.local:7077")
    println(s"Running Spark Version ${sc.version}")
    val file = sc.textFile("data01.csv")
    //
    val data = file
      .map(line => line.split(","))
      .map(x1 => List(x1(0).toInt, x1(1).toInt, x1(2).toInt))
    //val data1 = data.collect
    println(data)
    for (d <- data) {
      println(d)
      println(d(0))
    }
    //
    val distList = for (d <- data) yield { d(0) }
    //for (d <- distList) (println(d))
    val zipList = for (a <- distList.collect; b <- distList.collect) yield { List(a, b) }
    zipList.foreach(println(_))
    //
    val dist = for (l <- zipList) yield {
      println(s"${l(0)} => ${l(1)}")
      val x1a: Array[List[Int]] = data.filter(d => d(0) == l(0)).collect
      val x2a: Array[List[Int]] = data.filter(d => d(0) == l(1)).collect
      val x1: List[Int] = x1a(0)
      val x2: List[Int] = x2a(0)
      val dist = distance(x1, x2)
      Map(dist -> l)
    }
    dist.foreach(println(_)) // sort this for topK
    //
  }
}

data01.csv

1,68,93

2,12,90

3,45,76

4,86,54

HTH.

Cheers
k/


On Tue, May 27, 2014 at 4:10 AM, Carter gyz...@hotmail.com wrote:

 Any suggestion is very much appreciated.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/K-nearest-neighbors-search-in-Spark-tp6393p6421.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: How to Run Machine Learning Examples

2014-05-22 Thread Krishna Sankar
I couldn't find the classification.SVM class.

   - Most probably the command is something of the order of:
   - bin/spark-submit --class
  org.apache.spark.examples.mllib.BinaryClassification
  examples/target/scala-*/spark-examples-*.jar --algorithm SVM train.csv
   - For more details try
  - ./bin/run-example mllib.BinaryClassification <spark url>
 - Usage: BinaryClassification [options] <input>
 -   --numIterations <value>   number of iterations
 -   --stepSize <value>        initial step size, default: 1.0
 -   --algorithm <value>       algorithm (SVM,LR), default: LR
 -   --regType <value>         regularization type (L1,L2), default: L2
 -   --regParam <value>        regularization parameter, default: 0.1
 -   <input>                   input paths to labeled examples in LIBSVM format

HTH.

Cheers

k/

P.S: I am using 1.0.0 rc10. Even for an earlier release, just run the
classification class and it will tell you what the parameters are. Most
probably SVM is an algorithm parameter, not a class by itself.


On Thu, May 22, 2014 at 2:12 PM, yxzhao yxz...@ualr.edu wrote:

 Thanks Stephen,
 I used the following command line to run the SVM, but it seems that the
 path is not correct. What should the right path or command line be? Thanks.
 ./bin/run-example org.apache.spark.mllib.classification.SVM
 spark://100.1.255.193:7077 train.csv 20
 Exception in thread main java.lang.NoClassDefFoundError:
 org/apache/spark/mllib/classification/SVM
 Caused by: java.lang.ClassNotFoundException:
 org.apache.spark.mllib.classification.SVM
 at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
 Could not find the main class: org.apache.spark.mllib.classification.SVM.
  Program will exit.







 On Thu, May 22, 2014 at 3:05 PM, Stephen Boesch [via Apache Spark User
 List] [hidden email] wrote:
 
  There is a bin/run-example.sh example-class [args]
 
 
  2014-05-22 12:48 GMT-07:00 yxzhao [hidden email]:
 
  I want to run the LR, SVM, and NaiveBayes algorithms implemented in the
  following directory on my data set. But I did not find the sample
 command
  line to run them. Anybody help? Thanks.
 
 
 spark-0.9.0-incubating/mllib/src/main/scala/org/apache/spark/mllib/classification
 
 
 
  --
  View this message in context:
 
 http://apache-spark-user-list.1001560.n3.nabble.com/How-to-Run-Machine-Learning-Examples-tp6277.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
 
 
 
  



 --
 View this message in context: Re: How to Run Machine Learning Examples
 http://apache-spark-user-list.1001560.n3.nabble.com/How-to-Run-Machine-Learning-Examples-tp6277p6287.html

 Sent from the Apache Spark User List mailing list archive
 http://apache-spark-user-list.1001560.n3.nabble.com/ at Nabble.com.



Re: Run Apache Spark on Mini Cluster

2014-05-21 Thread Krishna Sankar
It depends on what stack you want to run. A quick cut:

   - Worker Machines (DataNode, HBase Region Servers, Spark Worker Nodes)
  - Dual 6 core CPU
  - 64 to 128 GB RAM
  - 3 X 3TB disk (JBOD)
   - Master Node (Name Node, HBase Master,Spark Master)
  - Dual 6 core CPU
  - 64 to 128 GB RAM
  - 2 X 3TB disk (RAID 1+0)
   - Start with a 5 node setup and scale out as needed
   - If your load is MapReduce over HDFS, then run YARN
   - If your load is HBase over HDFS, scale depending on the computational
   and storage needs
   - If you are running Spark over HDFS, scale appropriately - you might
   need more memory in the worker nodes
   - In any case, have a topology and the processes that they would run. As
   Soumya suggests, you can prototype at an appropriate  scale using AWS.

Cheers
k/.


On Wed, May 21, 2014 at 5:14 PM, Upender Nimbekar upent...@gmail.comwrote:

 Hi,
 I would like to setup apache platform on a mini cluster. Is there any
 recommendation for the hardware that I can buy to set it up. I am thinking
 about processing significant amount of data like in the range of few
 terabytes.

 Thanks
 Upender