Mllib / kalman

2018-12-17 Thread robin . east
Pretty sure there is nothing in MLlib. This seems to be the most comprehensive 
coverage of implementing a Kalman filter in Spark: 
https://dzone.com/articles/kalman-filter-with-apache-spark-streaming-and-kafk. 
I’ve skimmed it rather than read it in detail, but it looks useful.

Sent from Polymail

On Mon, 17 Dec 2018 at 14:00 Laurent Thiebaud wrote:

> 
> 
> Hi everybody,
> 
> 
> Is there any built-in implementation of Kalman filter with spark mllib? Or
> any other filter to achieve the same result? What's the state of the art
> about it?
> 
> 
> Thanks.
>

Re: Positive log-likelihood with Gaussian mixture

2018-05-30 Thread robin . east
Positive log likelihoods for continuous distributions are not unusual. You are 
evaluating a pdf, not a probability. For example, a univariate Gaussian pdf is 
greater than 1 at the mean once the standard deviation drops below about 0.4 
(1/sqrt(2*pi)), at which point the log pdf is positive.
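
To make this concrete, here is a minimal SciPy check (an illustrative sketch added here, not part of the original thread; it assumes SciPy is available):

from scipy.stats import norm

# density at the mean of a univariate Gaussian is 1 / (sigma * sqrt(2 * pi)),
# so it exceeds 1 (and the log pdf turns positive) once sigma drops below ~0.3989
print(norm.logpdf(0.0, loc=0.0, scale=1.0))   # approx -0.919 (negative)
print(norm.logpdf(0.0, loc=0.0, scale=0.1))   # approx  1.384 (positive)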

Sent from Polymail

On Tue, 29 May 2018 at 12:08 Simon Dirmeier wrote:

> 
> 
> Hey,
> 
> sorry for the late reply. I cannot share the data but the problem can be
> reproduced easily, like below.
> I wanted to check with sklearn and observe a similar behaviour, i.e. a
> positive per-sample average log-likelihood ( 
> http://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html#sklearn.mixture.GaussianMixture.score
> ).
> 
> I don't think it is necessarily an issue with the implementation, but
> maybe due to parameter identifiability or so?
> As far as I can tell, the variances seem to be ok.
> 
> Thanks for looking into this.
> 
> Best,
> Simon
> 
> 
> 
> import scipy
> import sklearn.mixture
> from scipy.stats import multivariate_normal
> from sklearn.mixture import GaussianMixture
> # imports needed for the Spark part below (assumes an active SparkSession named `spark`)
> import pyspark.ml.clustering
> from pyspark.ml.linalg import Vectors
> 
> scipy.random.seed(23)
> X = multivariate_normal.rvs(mean=scipy.ones(10), size=100)
> 
> dff = map(lambda x: (int(x[0]), Vectors.dense(x[0:])), X)
> df = spark.createDataFrame(dff, schema=["label", "features"])
> 
> for i in [100, 90, 80, 70, 60, 50]:
>     km = pyspark.ml.clustering.GaussianMixture(k=10,
> seed=23).fit(df.limit(i))
>     sk_gmm = sklearn.mixture.GaussianMixture(10,
> random_state=23).fit(X[:i, :])
>     print(df.limit(i).count(), X[:i, :].shape[0],
> km.summary.logLikelihood, sk_gmm.score(X[:i, :]))
> 
> 100 100 368.37475644171036 -1.54949312502
> 90 90 1026.084529101155 1.16196607062
> 80 80 2245.427539835042 4.25769131857
> 70 70 1940.0122633489268 10.0949992881
> 60 60 2255.002313247103 14.0497823725
> 50 50 -140.82605873444814 21.2423016046
>

Re: [Spark-Core] sc.textFile() explicit minPartitions did not work

2017-07-25 Thread Robin East
sc.textFile will use the Hadoop TextInputFormat (I believe), which uses the HDFS 
block size to split the records it reads. Most likely the block size is 128MB, so 
the number of read tasks is driven by the number of blocks; the minPartitions 
argument is only a suggested minimum, not an exact count. I'm not sure you can do 
anything about the number of tasks generated to read from HDFS.
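
If the goal is simply to have fewer partitions for the downstream stages, one option (a hedged PySpark sketch; the path is a placeholder) is to coalesce after reading:

input_file = "hdfs:///path/to/input"   # placeholder path
rdd = sc.textFile(input_file)          # partition count follows the HDFS splits
print(rdd.getNumPartitions())          # e.g. ~2290
smaller = rdd.coalesce(500)            # merges read splits without a shuffle
print(smaller.getNumPartitions())      # 500
smaller.count()
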
---
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action





> On 25 Jul 2017, at 13:21, Gokula Krishnan D <email2...@gmail.com> wrote:
> 
> In addition to that, 
> 
> tried to read the same file with 3000 partitions but it used 3070 partitions. 
> And took more time than previous please refer the attachment. 
> 
> Thanks & Regards, 
> Gokula Krishnan (Gokul)
> 
> On Tue, Jul 25, 2017 at 8:15 AM, Gokula Krishnan D <email2...@gmail.com 
> <mailto:email2...@gmail.com>> wrote:
> Hello All, 
> 
> I have a HDFS file with approx. 1.5 Billion records with 500 Part files 
> (258.2GB Size) and when I tried to execute the following I could see that it 
> used 2290 tasks but it supposed to be 500 as like HDFS File, isn't it?
> 
> val inputFile = 
> val inputRdd = sc.textFile(inputFile)
> inputRdd.count()
> 
> I was hoping that I can do the same with the fewer partitions so tried the 
> following 
> 
> val inputFile = 
> val inputRddNew = sc.textFile(inputFile, 500)
> inputRddNew.count()
> 
> But still it used 2290 tasks.
> 
> As per scala doc, it supposed use as like the HDFS file i.e 500. 
> 
> It would be great if you could throw some insight on this. 
> 
> Thanks & Regards, 
> Gokula Krishnan (Gokul)
> 
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org 
> <mailto:user-unsubscr...@spark.apache.org>


Re: Question on Spark's graph libraries

2017-03-10 Thread Robin East
I would love to know the answer to that too.
---
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action





> On 9 Mar 2017, at 17:42, enzo <e...@smartinsightsfromdata.com> wrote:
> 
> I am a bit confused by the current roadmap for graph and graph analytics in 
> Apache Spark.
> 
> I understand that we have had for some time two libraries (the following is 
> my understanding - please amend as appropriate!):
> 
> . GraphX, part of Spark project.  This library is based on RDD and it is only 
> accessible via Scala.  It doesn’t look that this library has been enhanced 
> recently.
> . GraphFrames, independent (at the moment?) library for Spark.  This library 
> is based on Spark DataFrames and accessible by Scala & Python. Last commit on 
> GitHub was 2 months ago.
> 
> GraphFrames came about with the promise at some point to be integrated into 
> Apache Spark.
> 
> I can see other projects coming up with interesting libraries and ideas (e.g. 
> Graphulo on Accumulo, a new project with the goal of implementing the 
> GraphBlas building blocks for graph algorithms on top of Accumulo).
> 
> Where is Apache Spark going?
> 
> Where are graph libraries in the roadmap?
> 
> 
> 
> Thanks for any clarity brought to this matter.
> 
> Enzo



Re: Which streaming platform is best? Kafka or Spark Streaming?

2017-03-10 Thread Robin East
As Jorn says there is no best. I would start with 
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101. This will 
help you form some meaningful questions about what tools suit which use cases. 
Most places have a selection of tools such as spark, kafka, flink, storm, flume 
and so on. 
---
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action





> On 9 Mar 2017, at 20:04, Jörn Franke <jornfra...@gmail.com> wrote:
> 
> I find this question strange. There is no best tool for every use case. For 
> example, both tools mentioned below are suitable for different purposes, 
> sometimes also complementary.
> 
>> On 9 Mar 2017, at 20:37, Gaurav1809 <gauravhpan...@gmail.com> wrote:
>> 
>> Hi All, Would you please let me know which streaming platform is best. Be it
>> server log processing, social media feeds or any such streaming data. I want
>> to know the comparison between Kafka & Spark Streaming.
>> 
>> 
>> 
>> --
>> View this message in context: 
>> http://apache-spark-user-list.1001560.n3.nabble.com/Which-streaming-platform-is-best-Kafka-or-Spark-Streaming-tp28474.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>> 
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>> 
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 



Re: Pretrained Word2Vec models

2016-12-05 Thread Robin East
There is a JIRA and a pull request for this - 
https://issues.apache.org/jira/browse/SPARK-15328 - however there has been no 
movement on this for a few months. As you’ll see from the pull request review I 
was able to load the Google News Model but could not get it to run.
---
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action





> On 5 Dec 2016, at 21:34, Lee Becker <lee.bec...@hapara.com> wrote:
> 
> Hi all,
> 
> Is there a way for Spark to load Word2Vec models trained using gensim 
> <https://radimrehurek.com/gensim/> or the original C implementation 
> <https://code.google.com/archive/p/word2vec/> of Word2Vec?  Specifically I'd 
> like to play with the Google News model 
> <https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing>
>  or the Freebase model 
> <https://docs.google.com/file/d/0B7XkCwpI5KDYaDBDQm1tZGNDRHc/edit?usp=sharing>
>  to see how they perform before training my own.
> 
> Thanks,
> Lee



Re: LinearRegressionWithSGD and Rank Features By Importance

2016-11-07 Thread Robin East
If you have to use SGD then scaling will usually help your algorithm to 
converge quicker. If possible you should try using Linear Regression in the 
newer ml library: 
http://spark.apache.org/docs/latest/ml-classification-regression.html#linear-regression
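
If you do move to the ml LinearRegression, a minimal PySpark sketch of scaling and then fitting (train_df and the column names are assumptions for illustration) would look like:

from pyspark.ml import Pipeline
from pyspark.ml.feature import StandardScaler
from pyspark.ml.regression import LinearRegression

# scale each feature to zero mean / unit variance before fitting
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures",
                        withMean=True, withStd=True)
lr = LinearRegression(featuresCol="scaledFeatures", labelCol="label")

model = Pipeline(stages=[scaler, lr]).fit(train_df)
print(model.stages[-1].coefficients)   # weights are now on comparable scales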


---
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action





> On 7 Nov 2016, at 15:47, Carlo.Allocca <carlo.allo...@open.ac.uk> wrote:
> 
> Hi Masood, 
> 
> thank you very much for the reply. It is a very good point as I am getting 
> very bad results so far. 
> 
> If I understood well, what you suggest is to scale the data below (it is part 
> of my dataset) before applying linear regression SGD.
> 
> is it correct?
> 
> Many Thanks in advance. 
> 
> Best Regards,
> Carlo 
> 
> 
> 
>> On 7 Nov 2016, at 15:31, Masood Krohy <masood.kr...@intact.net 
>> <mailto:masood.kr...@intact.net>> wrote:
>> 
>> If you go down this route (look at actual coefficients/weights), then make 
>> sure your features are scaled first and have more or less the same mean when 
>> feeding them into the algo. If not, then actual coefficients/weights 
>> wouldn't tell you much. In any case, SGD performs badly with unscaled 
>> features, so you gain if you scale the features beforehand.
>> Masood 
>> 
>> --
>> Masood Krohy, Ph.D. 
>> Data Scientist, Intact Lab-R 
>> Intact Financial Corporation 
>> http://ca.linkedin.com/in/masoodkh <http://ca.linkedin.com/in/masoodkh> 
>> 
>> 
>> 
>> From: Carlo.Allocca <carlo.allo...@open.ac.uk> 
>> To: Mohit Jaggi <mohitja...@gmail.com> 
>> Cc: Carlo.Allocca <carlo.allo...@open.ac.uk>, "user@spark.apache.org" <user@spark.apache.org> 
>> Date: 2016-11-04 03:39 
>> Subject: Re: LinearRegressionWithSGD and Rank Features By Importance 
>> 
>> 
>> 
>> Hi Mohit, 
>> 
>> Thank you for your reply. 
>> OK. it means coefficient with high score are more important that other with 
>> low score…
>> 
>> Many Thanks,
>> Best Regards,
>> Carlo
>> 
>> 
>> > On 3 Nov 2016, at 20:41, Mohit Jaggi <mohitja...@gmail.com 
>> > <mailto:mohitja...@gmail.com>> wrote:
>> > 
>> > For linear regression, it should be fairly easy. Just sort the 
>> > co-efficients :)
>> > 
>> > Mohit Jaggi
>> > Founder,
>> > Data Orchard LLC
>> > www.dataorchardllc.com 
>> > 
>> > 
>> > 
>> > 
>> >> On Nov 3, 2016, at 3:35 AM, Carlo.Allocca <carlo.allo...@open.ac.uk 
>> >> <mailto:carlo.allo...@open.ac.uk>> wrote:
>> >> 
>> >> Hi All,
>> >> 
>> >> I am using SPARK and in particular the MLib library.
>> >> 
>> >> import org.apache.spark.mllib.regression.LabeledPoint;
>> >> import org.apache.spark.mllib.regression.LinearRegressionModel;
>> >> import org.apache.spark.mllib.regression.LinearRegressionWithSGD;
>> >> 
>> >> For my problem I am using the LinearRegressionWithSGD and I would like to 
>> >> perform a “Rank Features By Importance”.
>> >> 
>> >> I checked the documentation and it seems that does not provide such 
>> >> methods.
>> >> 
>> >> Am I missing anything?  Please, could you provide any help on this?
>> >> Should I change the approach?
>> >> 
>> >> Many Thanks in advance,
>> >> 
>> >> Best Regards,
>> >> Carlo
>> >> 
>> >> 
>> >> -- The Open University is incorporated by Royal Charter (RC 000391), an 
>> >> exempt charity in England & Wales and a charity registered in Scotland 
>> >> (SC 038302). The Open University is authorised and regulated by the 
>> >> Financial Conduct Authority.
>> >> 
>> >> -
>> >> To unsubscribe e-mail: user-unsubscr...@spark.apache.org 
>> >> <mailto:user-unsubscr...@spark.apache.org>
>> >> 
>> > 
>> 
>> 
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org 
>> <mailto:user-unsubscr...@spark.apache.org>
>> 
>> 
>> 
> 



Re: LinearRegressionWithSGD and Rank Features By Importance

2016-11-04 Thread Robin East
Hi 

Do you mean the tests of significance that you usually get with R output? I 
don’t think there is anything implemented in the standard MLlib libraries; 
however, I believe that the SparkR version provides that. See 
http://spark.apache.org/docs/1.6.2/sparkr.html#gaussian-glm-model

---
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action





> On 4 Nov 2016, at 07:38, Carlo.Allocca <carlo.allo...@open.ac.uk> wrote:
> 
> Hi Mohit, 
> 
> Thank you for your reply. 
> OK. it means coefficient with high score are more important that other with 
> low score…
> 
> Many Thanks,
> Best Regards,
> Carlo
> 
> 
>> On 3 Nov 2016, at 20:41, Mohit Jaggi <mohitja...@gmail.com> wrote:
>> 
>> For linear regression, it should be fairly easy. Just sort the co-efficients 
>> :)
>> 
>> Mohit Jaggi
>> Founder,
>> Data Orchard LLC
>> www.dataorchardllc.com
>> 
>> 
>> 
>> 
>>> On Nov 3, 2016, at 3:35 AM, Carlo.Allocca <carlo.allo...@open.ac.uk> wrote:
>>> 
>>> Hi All,
>>> 
>>> I am using SPARK and in particular the MLib library.
>>> 
>>> import org.apache.spark.mllib.regression.LabeledPoint;
>>> import org.apache.spark.mllib.regression.LinearRegressionModel;
>>> import org.apache.spark.mllib.regression.LinearRegressionWithSGD;
>>> 
>>> For my problem I am using the LinearRegressionWithSGD and I would like to 
>>> perform a “Rank Features By Importance”.
>>> 
>>> I checked the documentation and it seems that does not provide such methods.
>>> 
>>> Am I missing anything?  Please, could you provide any help on this?
>>> Should I change the approach?
>>> 
>>> Many Thanks in advance,
>>> 
>>> Best Regards,
>>> Carlo
>>> 
>>> 
>>> -- The Open University is incorporated by Royal Charter (RC 000391), an 
>>> exempt charity in England & Wales and a charity registered in Scotland (SC 
>>> 038302). The Open University is authorised and regulated by the Financial 
>>> Conduct Authority.
>>> 
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>> 
>> 
> 
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 



Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-03 Thread Robin East
I agree with Koert. Relying on something because it appears to work when you 
test it can be dangerous if there is nothing in the API that guarantees it. 

Going back quite a few years it used to be the case that Oracle would always 
return a GROUP BY with the rows in the order of the grouping key. This was a 
side effect of the implementation of GROUP BY. Then at some point Oracle 
introduced a new hash-based GROUP BY mechanism that could be chosen by the cost 
based optimizer, and all of a sudden lots of people’s applications ‘broke’ 
because they had been relying on behaviour that had always worked in the 
past but wasn’t actually guaranteed.

TLDR - don’t rely on functionality that isn’t specified


> On 3 Nov 2016, at 14:37, Koert Kuipers  wrote:
> 
> i did not check the claim in that blog post that the data is ordered, but i 
> wouldnt rely on that behavior since it is not something the api guarantees 
> and could change in future versions
> 
> On Thu, Nov 3, 2016 at 9:59 AM, Rabin Banerjee  > wrote:
> Hi Koert & Robin ,
> 
>   Thanks ! But if you go through the blog 
> https://bzhangusc.wordpress.com/2015/05/28/groupby-on-dataframe-is-not-the-groupby-on-rdd/
>  
> 
>  and check the comments under the blog it's actually working, although I am 
> not sure how . And yes I agree a custom aggregate UDAF is a good option . 
> 
> Can anyone share the best way to implement this in Spark .?
> 
> Regards,
> Rabin Banerjee 
> 
> On Thu, Nov 3, 2016 at 6:59 PM, Koert Kuipers  > wrote:
> Just realized you only want to keep first element. You can do this without 
> sorting by doing something similar to min or max operation using a custom 
> aggregator/udaf or reduceGroups on Dataset. This is also more efficient.
> 
> 
> On Nov 3, 2016 7:53 AM, "Rabin Banerjee"  > wrote:
> Hi All ,
> 
>   I want to do a dataframe operation to find the rows having the latest 
> timestamp in each group using the below operation 
> df.orderBy(desc("transaction_date")).groupBy("mobileno").agg(first("customername").as("customername"),first("service_type").as("service_type"),first("cust_addr").as("cust_abbr"))
> .select("customername","service_type","mobileno","cust_addr")
> 
> Spark Version :: 1.6.x
> My Question is "Will Spark guarantee the Order while doing the groupBy , if 
> DF is ordered using OrderBy previously in Spark 1.6.x"??
> 
> I referred a blog here :: 
> https://bzhangusc.wordpress.com/2015/05/28/groupby-on-dataframe-is-not-the-groupby-on-rdd/
>  
> 
> Which claims it will work except in Spark 1.5.1 and 1.5.2 .
> 
> I need a bit elaboration of how internally spark handles it ? also is it more 
> efficient than using a Window function ?
> 
> Thanks in Advance ,
> Rabin Banerjee
> 
> 
> 
> 



Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-03 Thread Robin East
I don’t think the semantics of groupBy necessarily preserve ordering - whatever 
the implementation details or the observed behaviour. I would use a Window 
operation and order within the group.
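
For example, a hedged PySpark sketch of the Window approach (column names taken from the query quoted below; it assumes a SQL context that supports window functions, e.g. HiveContext on 1.6):

from pyspark.sql import Window
from pyspark.sql import functions as F

# rank the rows within each mobileno group, newest transaction first
w = Window.partitionBy("mobileno").orderBy(F.desc("transaction_date"))

latest = (df.withColumn("rn", F.row_number().over(w))
            .filter(F.col("rn") == 1)
            .select("customername", "service_type", "mobileno", "cust_addr"))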




> On 3 Nov 2016, at 11:53, Rabin Banerjee  wrote:
> 
> Hi All ,
> 
>   I want to do a dataframe operation to find the rows having the latest 
> timestamp in each group using the below operation 
> df.orderBy(desc("transaction_date")).groupBy("mobileno").agg(first("customername").as("customername"),first("service_type").as("service_type"),first("cust_addr").as("cust_abbr"))
> .select("customername","service_type","mobileno","cust_addr")
> 
> Spark Version :: 1.6.x
> My Question is "Will Spark guarantee the Order while doing the groupBy , if 
> DF is ordered using OrderBy previously in Spark 1.6.x"??
> 
> I referred a blog here :: 
> https://bzhangusc.wordpress.com/2015/05/28/groupby-on-dataframe-is-not-the-groupby-on-rdd/
>  
> 
> Which claims it will work except in Spark 1.5.1 and 1.5.2 .
> 
> I need a bit elaboration of how internally spark handles it ? also is it more 
> efficient than using a Window function ?
> 
> Thanks in Advance ,
> Rabin Banerjee
> 
> 



Re: Spark ML - Is IDF model reusable

2016-11-01 Thread Robin East
Fit it on training data to evaluate the model. You can either use that model to 
apply to unseen data or you can re-train (fit) on all your data before applying 
it to unseen data.

fit and transform are 2 different things: fit creates a model, transform 
applies a model to data to create transformed output. If you are using your 
training data in a subsequent step (e.g. running logistic regression or some 
other machine learning algorithm) then you need to transform your training data 
using the IDF model before passing it through the next step.
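
As a hedged illustration of that split (a PySpark sketch; train_df, test_df and the column names are assumptions):

from pyspark.ml.feature import IDF

idf = IDF(inputCol="rawFeatures", outputCol="features")
idf_model = idf.fit(train_df)               # fit: learns document frequencies from the training data only

train_vec = idf_model.transform(train_df)   # transform: weight the training set for the next stage
test_vec  = idf_model.transform(test_df)    # the same fitted weights applied to the test set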

---
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action





> On 1 Nov 2016, at 11:18, Nirav Patel <npa...@xactlycorp.com> wrote:
> 
> Just to re-iterate what you said, I should fit IDF model only on training 
> data and then re-use it for both test data and then later on unseen data to 
> make predictions.
> 
> On Tue, Nov 1, 2016 at 3:49 AM, Robin East <robin.e...@xense.co.uk 
> <mailto:robin.e...@xense.co.uk>> wrote:
> The point of setting aside a portion of your data as a test set is to try and 
> mimic applying your model to unseen data. If you fit your IDF model to all 
> your data, any evaluation you perform on your test set is likely to over 
> perform compared to ‘real’ unseen data. Effectively you would have overfit 
> your model.
> -------
> Robin East
> Spark GraphX in Action Michael Malak and Robin East
> Manning Publications Co.
> http://www.manning.com/books/spark-graphx-in-action
> 
> 
> 
> 
> 
>> On 1 Nov 2016, at 10:15, Nirav Patel <npa...@xactlycorp.com 
>> <mailto:npa...@xactlycorp.com>> wrote:
>> 
>> FYI, I do reuse IDF model while making prediction against new unlabeled data 
>> but not between training and test data while training a model. 
>> 
>> On Tue, Nov 1, 2016 at 3:10 AM, Nirav Patel <npa...@xactlycorp.com 
>> <mailto:npa...@xactlycorp.com>> wrote:
>> I am using IDF estimator/model (TF-IDF) to convert text features into 
>> vectors. Currently, I fit IDF model on all sample data and then transform 
>> them. I read somewhere that I should split my data into training and test 
>> before fitting IDF model; Fit IDF only on training data and then use same 
>> transformer to transform training and test data. 
>> This raise more questions:
>> 1) Why would you do that? What exactly do IDF learn during fitting process 
>> that it can reuse to transform any new dataset. Perhaps idea is to keep same 
>> value for |D| and DF|t, D| while use new TF|t, D| ?
>> 2) If not then fitting and transforming seems redundant for IDF model
>> 
>> 
>> 
>> 
> 
> 
> 
> 


Re: Spark ML - Is IDF model reusable

2016-11-01 Thread Robin East
The point of setting aside a portion of your data as a test set is to try and 
mimic applying your model to unseen data. If you fit your IDF model to all your 
data, any evaluation you perform on your test set is likely to over perform 
compared to ‘real’ unseen data. Effectively you would have overfit your model.
---
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action





> On 1 Nov 2016, at 10:15, Nirav Patel <npa...@xactlycorp.com> wrote:
> 
> FYI, I do reuse IDF model while making prediction against new unlabeled data 
> but not between training and test data while training a model. 
> 
> On Tue, Nov 1, 2016 at 3:10 AM, Nirav Patel <npa...@xactlycorp.com 
> <mailto:npa...@xactlycorp.com>> wrote:
> I am using IDF estimator/model (TF-IDF) to convert text features into 
> vectors. Currently, I fit IDF model on all sample data and then transform 
> them. I read somewhere that I should split my data into training and test 
> before fitting IDF model; Fit IDF only on training data and then use same 
> transformer to transform training and test data. 
> This raise more questions:
> 1) Why would you do that? What exactly do IDF learn during fitting process 
> that it can reuse to transform any new dataset. Perhaps idea is to keep same 
> value for |D| and DF|t, D| while use new TF|t, D| ?
> 2) If not then fitting and transforming seems redundant for IDF model
> 
> 
> 
> 


Re: Need help with SVM

2016-10-26 Thread Robin East
It looks like the training is over-regularised - dropping the regParam to 0.1 
or 0.01 should resolve the problem.
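
In terms of the code further down the thread, the only change needed is the regParam value, e.g. (a sketch based on that snippet):

from pyspark.mllib.classification import SVMWithSGD

# weaker regularisation, so the weights are not shrunk so far that every prediction collapses to the same class
model = SVMWithSGD.train(data_rdd, iterations=1000, regParam=0.01)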

---
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action





> On 26 Oct 2016, at 11:05, Aseem Bansal <asmbans...@gmail.com> wrote:
> 
> He replied to me. Forwarding to the mailing list. 
> 
> -- Forwarded message --
> From: Aditya Vyas <adityavya...@gmail.com <mailto:adityavya...@gmail.com>>
> Date: Tue, Oct 25, 2016 at 8:16 PM
> Subject: Re: Need help with SVM
> To: Aseem Bansal <asmbans...@gmail.com <mailto:asmbans...@gmail.com>>
> 
> 
> Hello,
> Here is the public gist: 
> https://gist.github.com/aditya1702/760cd5c95a6adf2447347e0b087bc318
> 
> Do tell if you need more information
> 
> Regards,
> Aditya
> 
> On Tue, Oct 25, 2016 at 8:11 PM, Aseem Bansal <asmbans...@gmail.com 
> <mailto:asmbans...@gmail.com>> wrote:
> Is there any labeled point with label 0 in your dataset? 
> 
> On Tue, Oct 25, 2016 at 2:13 AM, aditya1702 <adityavya...@gmail.com 
> <mailto:adityavya...@gmail.com>> wrote:
> Hello,
> I am using linear SVM to train my model and generate a line through my data.
> However my model always predicts 1 for all the feature examples. Here is my
> code:
> 
> print data_rdd.take(5)
> [LabeledPoint(1.0, [1.9643,4.5957]), LabeledPoint(1.0, [2.2753,3.8589]),
> LabeledPoint(1.0, [2.9781,4.5651]), LabeledPoint(1.0, [2.932,3.5519]),
> LabeledPoint(1.0, [3.5772,2.856])]
> 
> 
> from pyspark.mllib.classification import SVMWithSGD
> from pyspark.mllib.linalg import Vectors
> from pyspark.mllib.regression import LabeledPoint  # missing in the original snippet
> from sklearn.svm import SVC
> data_rdd=x_df.map(lambda x:LabeledPoint(x[1],x[0]))
> 
> model = SVMWithSGD.train(data_rdd, iterations=1000,regParam=1)
> 
> X=x_df.map(lambda x:x[0]).collect()
> Y=x_df.map(lambda x:x[1]).collect()
> 
> 
> pred=[]
> for i in X:
>   pred.append(model.predict(i))
> print pred
> 
> [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
> 1]
> 
> 
> My dataset is as follows:
> <http://apache-spark-user-list.1001560.n3.nabble.com/file/n27955/Screen_Shot_2016-10-25_at_2.png>
> 
> 
> Can someone please help?
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Need-help-with-SVM-tp27955.html
>  
> <http://apache-spark-user-list.1001560.n3.nabble.com/Need-help-with-SVM-tp27955.html>
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org 
> <mailto:user-unsubscr...@spark.apache.org>
> 
> 
> 
> 



Re: Need help with SVM

2016-10-26 Thread Robin East
As per Aseem's point, what do you get from 
data_rdd.toDF.groupBy("label").count.show
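
Since the code in the thread is PySpark, a rough equivalent of that check (assuming data_rdd is the RDD of LabeledPoints from the snippet quoted below) is:

label_counts = data_rdd.map(lambda lp: lp.label).countByValue()
print(label_counts)   # e.g. {1.0: 51} would confirm that only one class is present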




> On 25 Oct 2016, at 15:41, Aseem Bansal  wrote:
> 
> Is there any labeled point with label 0 in your dataset? 
> 
> On Tue, Oct 25, 2016 at 2:13 AM, aditya1702  > wrote:
> Hello,
> I am using linear SVM to train my model and generate a line through my data.
> However my model always predicts 1 for all the feature examples. Here is my
> code:
> 
> print data_rdd.take(5)
> [LabeledPoint(1.0, [1.9643,4.5957]), LabeledPoint(1.0, [2.2753,3.8589]),
> LabeledPoint(1.0, [2.9781,4.5651]), LabeledPoint(1.0, [2.932,3.5519]),
> LabeledPoint(1.0, [3.5772,2.856])]
> 
> 
> from pyspark.mllib.classification import SVMWithSGD
> from pyspark.mllib.linalg import Vectors
> from pyspark.mllib.regression import LabeledPoint  # missing in the original snippet
> from sklearn.svm import SVC
> data_rdd=x_df.map(lambda x:LabeledPoint(x[1],x[0]))
> 
> model = SVMWithSGD.train(data_rdd, iterations=1000,regParam=1)
> 
> X=x_df.map(lambda x:x[0]).collect()
> Y=x_df.map(lambda x:x[1]).collect()
> 
> 
> pred=[]
> for i in X:
>   pred.append(model.predict(i))
> print pred
> 
> [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
> 1]
> 
> 
> My dataset is as follows: (screenshot linked in the original Nabble post)
> 
> 
> Can someone please help?
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Need-help-with-SVM-tp27955.html
>  
> 
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org 
> 
> 
> 



Re: K-Mean retrieving Cluster Members

2016-10-19 Thread Robin East

or alternatively this should work (assuming parsedData is an RDD[Vector]):

clusters.predict(parsedData)
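
To list the members of each cluster, a hedged PySpark sketch (assuming parsedData is an RDD of feature vectors and clusters is the fitted KMeansModel) is:

predictions = clusters.predict(parsedData)   # RDD of cluster ids, in the same order as parsedData
members = parsedData.zip(predictions)        # (vector, clusterId) pairs
by_cluster = members.map(lambda vc: (vc[1], vc[0])).groupByKey()
for cluster_id, vectors in by_cluster.collect():
    print(cluster_id, list(vectors))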





> On 18 Oct 2016, at 00:35, Reth RM  wrote:
> 
> I think I got it
>  
>   parsedData.foreach(
> new VoidFunction() {
> @Override
> public void call(Vector vector) throws Exception {
> System.out.println(clusters.predict(vector));
> 
> }
> }
> );
> 
> 
> On Mon, Oct 17, 2016 at 10:56 AM, Reth RM  > wrote:
> Could you please point me to sample code to retrieve the cluster members of K 
> mean?
> 
> The below code prints cluster centers.  I needed cluster members belonging to 
> each center.
> 
>  
> val clusters = KMeans.train(parsedData, numClusters, numIterations) 
> clusters.clusterCenters.foreach(println)
> 



Re: Graphhopper/routing in Spark

2016-09-09 Thread Robin East
It’s not obvious to me how that would work. In principle I imagine you could 
have your source data loaded into HDFS and read by GraphHopper instances 
running on Spark workers. But a graph, by its nature, has items that can be 
connected to potentially any other item, so GraphHopper instances would need 
a way of dealing with that, and I presume GraphHopper is not designed that 
way. Spark’s graph processing library, GraphX, was designed that way: plenty 
of thought has gone into how to distribute a graph across machines and still 
have a way of running algorithms.
---
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action





> On 8 Sep 2016, at 22:45, kodonnell <kane.odonn...@datamine.com> wrote:
> 
> Just wondering if anyone has experience at running Graphhopper (or similar)
> in Spark?
> 
> In short, I can get it running in the master, but not in worker nodes. The
> key trouble seems to be that Graphhopper depends on a pre-processed graph,
> which it obtains from OSM data. In normal (desktop) use, it pre-processes,
> and then caches to disk. My current thinking is that I could create the
> cache locally, and then put it in HDFS, and tweak Graphhopper to read from
> the HDFS source. Alternatively I could try to broadcast the cache (or the
> entire Graphhopper instance) - though I believe that would require both
> being serializable (which I've got little clue about). Does anyone have any
> recommendations on the above?
> 
> In addition, I'm not quite sure how to structure it to minimise the cache
> reading - I don't want to have to read the cache (and initialise
> Graphhopper) for e.g. every route, as that's likely to be slow. It'd be nice
> if this was only done once (e.g. for each partition) and then all the routes
> in the partition processed with the same Graphhopper instance. Again, any
> thoughts on this?
> 
> FYI, discussion on Graphhoper forum is  here
> <https://discuss.graphhopper.com/t/how-to-use-graphhopper-in-spark/998>  ,
> though no luck there. 
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Graphhopper-routing-in-Spark-tp27682.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 



Re: MLib : Non Linear Optimization

2016-09-08 Thread Robin East
Do you have any particular algorithms in mind? If you state the most common 
algorithms you use then it might stimulate the appropriate comments.



> On 8 Sep 2016, at 05:04, nsareen  wrote:
> 
> Any answer to this question group ?
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/MLib-Non-Linear-Optimization-tp27645p27676.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Forecasting algorithms in spark ML

2016-09-08 Thread Robin East
Spark’s algorithms are summarised on this page (https://spark.apache.org/mllib/) 
and details are available from the MLlib user guide, which is linked from the 
above URL.

Sent from my iPhone

> On 8 Sep 2016, at 05:30, Madabhattula Rajesh Kumar  
> wrote:
> 
> Hi,
> 
> Please let me know supported Forecasting algorithms in spark ML
> 
> Regards,
> Rajesh


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Machine learning question (suing spark)- removing redundant factors while doing clustering

2016-08-08 Thread Robin East
Another approach is to use L1 regularisation, e.g. 
http://spark.apache.org/docs/latest/mllib-linear-methods.html#linear-least-squares-lasso-and-ridge-regression. 
This adds a penalty term to the regression equation to reduce model 
complexity. When you use L1 (as opposed to, say, L2) this tends to promote 
sparsity in the coefficients, i.e. some of the coefficients are pushed to zero, 
effectively deselecting them from the model.
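
A hedged sketch with the newer ml API (elasticNetParam=1.0 selects the pure L1/lasso penalty; train_df and the column names are assumptions):

from pyspark.ml.regression import LinearRegression

lasso = LinearRegression(featuresCol="features", labelCol="label",
                         regParam=0.1, elasticNetParam=1.0)   # pure L1 penalty
model = lasso.fit(train_df)
print(model.coefficients)   # coefficients driven to exactly 0.0 are effectively dropped from the model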

Sent from my iPhone

> On 9 Aug 2016, at 04:19, Peyman Mohajerian  wrote:
> 
> You can try 'feature Importances' or 'feature selection' depending on what 
> else you want to do with the remaining features that's a possibility. Let's 
> say you are trying to do classification then some of the Spark Libraries have 
> a model parameter called 'featureImportances' that tell you which feature(s) 
> are more dominant in you classification, you can then run your model again 
> with the smaller set of features. 
> The two approaches are quite different, what I'm suggesting involves training 
> (supervised learning) in the context of a target function, with SVD you are 
> doing unsupervised learning.
> 
>> On Mon, Aug 8, 2016 at 7:23 PM, Rohit Chaddha  
>> wrote:
>> I would rather have less features to make better inferences on the data 
>> based on the smaller number of factors, 
>> Any suggestions Sean ? 
>> 
>>> On Mon, Aug 8, 2016 at 11:37 PM, Sean Owen  wrote:
>>> Yes, that's exactly what PCA is for as Sivakumaran noted. Do you
>>> really want to select features or just obtain a lower-dimensional
>>> representation of them, with less redundancy?
>>> 
>>> On Mon, Aug 8, 2016 at 4:10 PM, Tony Lane  wrote:
>>> > There must be an algorithmic way to figure out which of these factors
>>> > contribute the least and remove them in the analysis.
>>> > I am hoping same one can throw some insight on this.
>>> >
>>> > On Mon, Aug 8, 2016 at 7:41 PM, Sivakumaran S  wrote:
>>> >>
>>> >> Not an expert here, but the first step would be devote some time and
>>> >> identify which of these 112 factors are actually causative. Some domain
>>> >> knowledge of the data may be required. Then, you can start of with PCA.
>>> >>
>>> >> HTH,
>>> >>
>>> >> Regards,
>>> >>
>>> >> Sivakumaran S
>>> >>
>>> >> On 08-Aug-2016, at 3:01 PM, Tony Lane  wrote:
>>> >>
>>> >> Great question Rohit.  I am in my early days of ML as well and it would 
>>> >> be
>>> >> great if we get some idea on this from other experts on this group.
>>> >>
>>> >> I know we can reduce dimensions by using PCA, but i think that does not
>>> >> allow us to understand which factors from the original are we using in 
>>> >> the
>>> >> end.
>>> >>
>>> >> - Tony L.
>>> >>
>>> >> On Mon, Aug 8, 2016 at 5:12 PM, Rohit Chaddha 
>>> >> 
>>> >> wrote:
>>> >>>
>>> >>>
>>> >>> I have a data-set where each data-point has 112 factors.
>>> >>>
>>> >>> I want to remove the factors which are not relevant, and say reduce to 
>>> >>> 20
>>> >>> factors out of these 112 and then do clustering of data-points using 
>>> >>> these
>>> >>> 20 factors.
>>> >>>
>>> >>> How do I do these and how do I figure out which of the 20 factors are
>>> >>> useful for analysis.
>>> >>>
>>> >>> I see SVD and PCA implementations, but I am not sure if these give which
>>> >>> elements are removed and which are remaining.
>>> >>>
>>> >>> Can someone please help me understand what to do here
>>> >>>
>>> >>> thanks,
>>> >>> -Rohit
>>> >>>
>>> >>
>>> >>
>>> >
> 


Re: Questions about ml.random forest (only one decision tree?)

2016-08-04 Thread Robin East
All supervised learning algorithms in Spark work the same way. You provide a 
set of ‘features’ (X) and a corresponding label (y) as part of a pipeline and 
call the fit method on the pipeline. The output of this is a model. You can 
then provide new examples (new Xs) to a transform method on the model that will 
give you a prediction for those examples. This means that the code for running 
different algorithms often looks very similar. The details of the algorithm are 
hidden behind the fit/transform interface.

In the case of Random Forest the implementation in Spark (i.e. behind the 
interface) is to create a number of different decision tree models (often quite 
simple models) and then ensemble the results of each decision tree. You don’t 
need to ‘create’ the decision trees yourself, that is handled by the 
implementation.
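
For example, a minimal sketch with the ml API (the DataFrames and column names are placeholders; numTrees is the parameter that controls how many trees the forest builds internally):

from pyspark.ml.classification import RandomForestClassifier

rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=20)
model = rf.fit(training_df)               # fits all 20 trees and the ensembling in one call
predictions = model.transform(test_df)    # predictions combine the votes of the individual trees
print(len(model.trees))                   # the individual decision trees, built for you internally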

Hope that helps

Robin
---
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action





> On 4 Aug 2016, at 09:48, 陈哲 <czhenj...@gmail.com> wrote:
> 
> Hi all
>  I'm trying to use spark ml to do some prediction with random forest. By 
> reading the example code 
> https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/ml/JavaRandomForestClassifierExample.java
>  
> <https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/ml/JavaRandomForestClassifierExample.java>
>  , I can only find out it's similar to 
> https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/ml/JavaDecisionTreeClassificationExample.java
>  
> <https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/ml/JavaDecisionTreeClassificationExample.java>.
>  Is random forest algorithm suppose to use multiple decision trees to work. 
>  I'm new about spark and ml. Is there  anyone help me, maybe provide 
> example about using multiple decision trees in random forest in spark
> 
> Thanks
> Best Regards
> Patrick



Re: Is RowMatrix missing in org.apache.spark.ml package?

2016-07-27 Thread Robin East
Can you use the version from mllib? 
---
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action





> On 26 Jul 2016, at 18:20, Rohit Chaddha <rohitchaddha1...@gmail.com> wrote:
> 
> It is present in mlib but I don't seem to find it in ml package.
> Any suggestions please ? 
> 
> -Rohit



Re: ML PipelineModel to be scored locally

2016-07-21 Thread Robin East
MLeap is another option (Apache licensed) https://github.com/TrueCar/mleap


---
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action





> On 21 Jul 2016, at 06:47, Simone <simone.mirag...@gmail.com> wrote:
> 
> Thanks for your reply. 
> 
> I cannot rely on jpmml due licensing stuff.
> I can evaluate writing my own prediction code, but I am looking for a more 
> general purpose approach. 
> 
> Any other thoughts?
> Best
> Simone
> From: Peyman Mohajerian <mailto:mohaj...@gmail.com>
> Sent: 20/07/2016 21:55
> To: Simone Miraglia <mailto:simone.mirag...@gmail.com>
> Cc: User <mailto:user@spark.apache.org>
> Subject: Re: ML PipelineModel to be scored locally
> 
> One option is to save the model in parquet or json format and then build your 
> own prediction code. Some also use: 
> https://github.com/jpmml/jpmml-sparkml 
> <https://github.com/jpmml/jpmml-sparkml>
> It depends on the model, e.g. ml v mllib and other factors whether this works 
> on or not. Couple of weeks ago there was a long discussion on this topic.
> 
> 
> On Wed, Jul 20, 2016 at 7:08 AM, Simone Miraglia <simone.mirag...@gmail.com 
> <mailto:simone.mirag...@gmail.com>> wrote:
> Hi all,
> 
> I am working on the following use case involving ML Pipelines.
> 
> 1. I created a Pipeline composed from a set of stages
> 2. I called "fit" method on my training set
> 3. I validated my model by calling "transform" on my test set
> 4. I stored my fitted Pipeline to a shared folder
> 
> Then I have a very low latency interactive application (say a kinda of web 
> service), that should work as follows:
> 1. The app receives a request
> 2. A scoring needs to be made, according to my fitted PipelineModel
> 3. The app sends the score to the caller, in a synchronous fashion
> 
> Is there a way to call the .transform method of the PipelineModel over a 
> single Row?
> 
> I will definitely not want to parallelize a single record to a DataFrame, nor 
> relying on Spark Streaming due to latency requirements.
> I would like to use something similar to mllib .predict(Vector) method which 
> does not rely on Spark Context performing all the computation locally.
> 
> Thanks in advance
> Best
> 



Re: Saving data using tempTable versus save() method

2016-06-21 Thread Robin East
If you are able to trace the underlying Oracle session you can see whether a 
commit has been called or not.




> On 21 Jun 2016, at 09:57, Robin East <robin.e...@xense.co.uk> wrote:
> 
> I’m not sure - I don’t know what those APIs do under the hood. It simply rang 
> a bell with something I have fallen foul of in the past (not with Spark 
> though) - have wasted many hours forgetting to commit and then scratching my 
> head as why my data is not persisting.
> 
> 
> 
> 
>> On 21 Jun 2016, at 09:20, Mich Talebzadeh <mich.talebza...@gmail.com 
>> <mailto:mich.talebza...@gmail.com>> wrote:
>> 
>> that is a very interesting point. I am not sure. how can I do that with
>> 
>> sorted.save("oraclehadoop.sales2")
>> 
>> like .. commit?
>> 
>> thanks
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>  
>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>>  
>> 
>> On 21 June 2016 at 08:56, Robin East <robin.e...@xense.co.uk 
>> <mailto:robin.e...@xense.co.uk>> wrote:
>> random thought - do you need an explicit commit with the 2nd method?
>> 
>> 
>> 
>> 
>>> On 20 Jun 2016, at 21:35, Mich Talebzadeh <mich.talebza...@gmail.com 
>>> <mailto:mich.talebza...@gmail.com>> wrote:
>>> 
>>> Hi,
>>> 
>>> I have a DF based on a table and sorted and shown below
>>> 
>>> This is fine and when I register as tempTable I can populate the underlying 
>>> table sales 2 in Hive. That sales2 is an ORC table 
>>> 
>>>  val s = HiveContext.table("sales_staging")
>>>   val sorted = s.sort("prod_id","cust_id","time_id","channel_id","promo_id")
>>>   sorted.registerTempTable("tmp")
>>>   sqltext = """
>>>   INSERT INTO TABLE oraclehadoop.sales2
>>>   SELECT
>>>   PROD_ID
>>> , CUST_ID
>>> , TIME_ID
>>> , CHANNEL_ID
>>> , PROMO_ID
>>> , QUANTITY_SOLD
>>> , AMOUNT_SOLD
>>>   FROM tmp
>>>   """
>>>   HiveContext.sql(sqltext)
>>>   HiveContext.sql("select count(1) from oraclehadoop.sales2").show
>>>   HiveContext.sql("truncate table oraclehadoop.sales2")
>>> 
>>>   sorted.save("oraclehadoop.sales2")
>>>   HiveContext.sql("select count(1) from oraclehadoop.sales2").show
>>> 
>>> When I truncate the Hive table and use sorted.save("oraclehadoop.sales2")
>>> 
>>> It does not save any data
>>> 
>>> Started at
>>> [20/06/2016 21:21:57.57]
>>> +--+
>>> |   _c0|
>>> +--+
>>> |918843|// This works
>>> +--+
>>> [Stage 7:>  (3 + 1) 
>>> / 4]SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
>>> SLF4J: Defaulting to no-operation (NOP) logger implementation
>>> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder 
>>> <http://www.slf4j.org/codes.html#StaticLoggerBinder> for further details.
>>> +---+
>>> |_c0|
>>> +---+
>>> |  0|  // This does not
>>> +---+
>>> Finished at
>>> [20/06/2016 21:22:30.30]
>>> 
>>> Any ideas if anyone has seen this before?
>>> 
>>> 
>>> The issue is saving data. Saving through tempTable works but the other one 
>>> does not work.
>>> 
>>> 
>>> Thanks
>>> 
>>> Dr Mich Talebzadeh
>>>  
>>> LinkedIn  
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>  
>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>>  
>>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>>>  
>> 
>> 
> 



Re: Saving data using tempTable versus save() method

2016-06-21 Thread Robin East
I’m not sure - I don’t know what those APIs do under the hood. It simply rang a 
bell with something I have fallen foul of in the past (not with Spark though) - 
I have wasted many hours forgetting to commit and then scratching my head as to 
why my data is not persisting.




> On 21 Jun 2016, at 09:20, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> 
> that is a very interesting point. I am not sure. how can I do that with
> 
> sorted.save("oraclehadoop.sales2")
> 
> like .. commit?
> 
> thanks
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>  
> 
> On 21 June 2016 at 08:56, Robin East <robin.e...@xense.co.uk 
> <mailto:robin.e...@xense.co.uk>> wrote:
> random thought - do you need an explicit commit with the 2nd method?
> 
> 
> 
> 
>> On 20 Jun 2016, at 21:35, Mich Talebzadeh <mich.talebza...@gmail.com 
>> <mailto:mich.talebza...@gmail.com>> wrote:
>> 
>> Hi,
>> 
>> I have a DF based on a table and sorted and shown below
>> 
>> This is fine and when I register as tempTable I can populate the underlying 
>> table sales 2 in Hive. That sales2 is an ORC table 
>> 
>>  val s = HiveContext.table("sales_staging")
>>   val sorted = s.sort("prod_id","cust_id","time_id","channel_id","promo_id")
>>   sorted.registerTempTable("tmp")
>>   sqltext = """
>>   INSERT INTO TABLE oraclehadoop.sales2
>>   SELECT
>>   PROD_ID
>> , CUST_ID
>> , TIME_ID
>> , CHANNEL_ID
>> , PROMO_ID
>> , QUANTITY_SOLD
>> , AMOUNT_SOLD
>>   FROM tmp
>>   """
>>   HiveContext.sql(sqltext)
>>   HiveContext.sql("select count(1) from oraclehadoop.sales2").show
>>   HiveContext.sql("truncate table oraclehadoop.sales2")
>> 
>>   sorted.save("oraclehadoop.sales2")
>>   HiveContext.sql("select count(1) from oraclehadoop.sales2").show
>> 
>> When I truncate the Hive table and use sorted.save("oraclehadoop.sales2")
>> 
>> It does not save any data
>> 
>> Started at
>> [20/06/2016 21:21:57.57]
>> +--+
>> |   _c0|
>> +--+
>> |918843|// This works
>> +--+
>> [Stage 7:>  (3 + 1) 
>> / 4]SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
>> SLF4J: Defaulting to no-operation (NOP) logger implementation
>> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder 
>> <http://www.slf4j.org/codes.html#StaticLoggerBinder> for further details.
>> +---+
>> |_c0|
>> +---+
>> |  0|  // This does not
>> +---+
>> Finished at
>> [20/06/2016 21:22:30.30]
>> 
>> Any ideas if anyone has seen this before?
>> 
>> 
>> The issue is saving data. Saving through tempTable works but the other one 
>> does not work.
>> 
>> 
>> Thanks
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>  
>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>>  
> 
> 



Re: Saving data using tempTable versus save() method

2016-06-21 Thread Robin East
random thought - do you need an explicit commit with the 2nd method?




> On 20 Jun 2016, at 21:35, Mich Talebzadeh  wrote:
> 
> Hi,
> 
> I have a DF based on a table and sorted and shown below
> 
> This is fine and when I register as tempTable I can populate the underlying 
> table sales 2 in Hive. That sales2 is an ORC table 
> 
>  val s = HiveContext.table("sales_staging")
>   val sorted = s.sort("prod_id","cust_id","time_id","channel_id","promo_id")
>   sorted.registerTempTable("tmp")
>   sqltext = """
>   INSERT INTO TABLE oraclehadoop.sales2
>   SELECT
>   PROD_ID
> , CUST_ID
> , TIME_ID
> , CHANNEL_ID
> , PROMO_ID
> , QUANTITY_SOLD
> , AMOUNT_SOLD
>   FROM tmp
>   """
>   HiveContext.sql(sqltext)
>   HiveContext.sql("select count(1) from oraclehadoop.sales2").show
>   HiveContext.sql("truncate table oraclehadoop.sales2")
> 
>   sorted.save("oraclehadoop.sales2")
>   HiveContext.sql("select count(1) from oraclehadoop.sales2").show
> 
> When I truncate the Hive table and use sorted.save("oraclehadoop.sales2")
> 
> It does not save any data
> 
> Started at
> [20/06/2016 21:21:57.57]
> +--+
> |   _c0|
> +--+
> |918843|// This works
> +--+
> [Stage 7:>  (3 + 1) / 
> 4]SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder 
>  for further details.
> +---+
> |_c0|
> +---+
> |  0|  // This does not
> +---+
> Finished at
> [20/06/2016 21:22:30.30]
> 
> Any ideas if anyone has seen this before?
> 
> 
> The issue is saving data. Saving through tempTable works but the other one 
> does not work.
> 
> 
> Thanks
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
>  



Re: What is the interpretation of Cores in Spark doc

2016-06-17 Thread Robin East
Agreed it’s a worthwhile discussion (and interesting IMO)

This is a section from your original post:

> It is about the terminology or interpretation of that in Spark doc.
> 
> This is my understanding of cores and threads.
> 
>  Cores are physical cores. Threads are virtual cores.

At least as far as Spark doc is concerned Threads are not synonymous with 
virtual cores; they are closely related concepts of course. So any time we want 
to have a discussion about architecture, performance, tuning, configuration etc 
we do need to be clear about the concepts and how they are defined.

Granted, CPU hardware implementations also use the term ‘threads’. In fact 
Oracle/Sun seem unclear as to what they mean by a thread - in various documents 
they define threads as:

A software entity that can be executed on hardware (e.g. Oracle SPARC 
Architecture 2011)

At other times as:

A thread is a hardware strand. Each thread, or strand, enjoys a unique set of 
resources in support of its … (e.g. OpenSPARC T1 Microarchitecture 
Specification)

So unless the documentation you are writing is very specific to your 
environment, and the idea that a thread is a logical processor is generally 
accepted, I would not be inclined to treat threads as if they are logical 
processors.



> On 16 Jun 2016, at 15:45, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> 
> Thanks all.
> 
> I think we are diverging but IMO it is a worthwhile discussion
> 
> Actually, threads are a hardware implementation - hence the whole notion of 
> “multi-threaded cores”.   What happens is that the cores often have duplicate 
> registers, etc. for holding execution state.   While it is correct that only 
> a single process is executing at a time, a single core will have execution 
> states of multiple processes preserved in these registers. In addition, it is 
> the core (not the OS) that determines when the thread is executed. The 
> approach often varies according to the CPU manufacturer, but the most simple 
> approach is when one thread of execution executes a multi-cycle operation 
> (e.g. a fetch from main memory, etc.), the core simply stops processing that 
> thread saves the execution state to a set of registers, loads instructions 
> from the other set of registers and goes on.  On the Oracle SPARC chips, it 
> will actually check the next thread to see if the reason it was ‘parked’ has 
> completed and if not, skip it for the subsequent thread. The OS is only aware 
> of what are cores and what are logical processors - and dispatches 
> accordingly.  Execution is up to the cores. .
> Cheers
> 
> 
> 
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>  
> 
> On 16 June 2016 at 13:02, Robin East <robin.e...@xense.co.uk 
> <mailto:robin.e...@xense.co.uk>> wrote:
> Mich
> 
> >> A core may have one or more threads
> It would be more accurate to say that a core could run one or more threads 
> scheduled for execution. Threads are a software/OS concept that represent 
> executable code that is scheduled to run by the OS; A CPU, core or virtual 
> core/virtual processor execute that code. Threads are not CPUs or cores 
> whether physical or logical - any Spark documentation that implies this is 
> mistaken. I’ve looked at the documentation you mention and I don’t read it to 
> mean that threads are logical processors.
> 
> To go back to your original question, if you set local[6] and you have 12 
> logical processors then you are likely to have half your CPU resources unused 
> by Spark.
> 
> 
>> On 15 Jun 2016, at 23:08, Mich Talebzadeh <mich.talebza...@gmail.com 
>> <mailto:mich.talebza...@gmail.com>> wrote:
>> 
>> I think it is slightly more than that.
>> 
>> These days  software is licensed by core (generally speaking).   That is the 
>> physical processor.A core may have one or more threads - or logical 
>> processors. Virtualization adds some fun to the mix.   Generally what they 
>> present is ‘virtual processors’.   What that equates to depends on the 
>> virtualization layer itself.   In some simpler VM’s - it is virtual=logical. 
>>   In others, virtual=logical but they are constrained to be from the same 
>> cores - e.g. if you get 6 virtual processors, it really is 3 full cores with 
>> 2 threads each.   Rational is due to the way OS dispatching works on 
>> ‘logical’ processors vs. cores and POSIX threaded applications.
>> 
>> HTH
>> 
>> Dr Mich Taleb

Re: What is the interpretation of Cores in Spark doc

2016-06-16 Thread Robin East
Mich

>> A core may have one or more threads
It would be more accurate to say that a core could run one or more threads 
scheduled for execution. Threads are a software/OS concept that represent 
executable code that is scheduled to run by the OS; A CPU, core or virtual 
core/virtual processor execute that code. Threads are not CPUs or cores whether 
physical or logical - any Spark documentation that implies this is mistaken. 
I’ve looked at the documentation you mention and I don’t read it to mean that 
threads are logical processors.

To go back to your original question, if you set local[6] and you have 12 
logical processors then you are likely to have half your CPU resources unused 
by Spark.


> On 15 Jun 2016, at 23:08, Mich Talebzadeh  wrote:
> 
> I think it is slightly more than that.
> 
> These days  software is licensed by core (generally speaking).   That is the 
> physical processor.A core may have one or more threads - or logical 
> processors. Virtualization adds some fun to the mix.   Generally what they 
> present is ‘virtual processors’.   What that equates to depends on the 
> virtualization layer itself.   In some simpler VM’s - it is virtual=logical.  
>  In others, virtual=logical but they are constrained to be from the same 
> cores - e.g. if you get 6 virtual processors, it really is 3 full cores with 
> 2 threads each.   The rationale is the way OS dispatching works on 
> ‘logical’ processors vs. cores and POSIX threaded applications.
> 
> HTH
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
>  
> 
> On 13 June 2016 at 18:17, Mark Hamstra  > wrote:
> I don't know what documentation you were referring to, but this is clearly an 
> erroneous statement: "Threads are virtual cores."  At best it is terminology 
> abuse by a hardware manufacturer.  Regardless, Spark can't get too concerned 
> about how any particular hardware vendor wants to refer to the specific 
> components of their CPU architecture.  For us, a core is a logical execution 
> unit, something on which a thread of execution can run.  That can map in 
> different ways to different physical or virtual hardware. 
> 
> On Mon, Jun 13, 2016 at 12:02 AM, Mich Talebzadeh  > wrote:
> Hi,
> 
> It is not the issue of testing anything. I was referring to documentation 
> that clearly use the term "threads". As I said and showed before, one line is 
> using the term "thread" and the next one "logical cores".
> 
> 
> HTH
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
>  
> 
> On 12 June 2016 at 23:57, Daniel Darabos  > wrote:
> Spark is a software product. In software a "core" is something that a process 
> can run on. So it's a "virtual core". (Do not call these "threads". A 
> "thread" is not something a process can run on.)
> 
> local[*] uses java.lang.Runtime.availableProcessors().
>  Since Java is software, this also returns the number of virtual cores. (You 
> can test this easily.)
> 
> 
> On Sun, Jun 12, 2016 at 9:23 PM, Mich Talebzadeh  > wrote:
> 
> Hi,
> 
> I was writing some docs on Spark P and came across this.
> 
> It is about the terminology or interpretation of that in Spark doc.
> 
> This is my understanding of cores and threads.
> 
>  Cores are physical cores. Threads are virtual cores. Cores with 2 threads is 
> called hyper threading technology so 2 threads per core makes the core work 
> on two loads at same time. In other words, every thread takes care of one 
> load.
> 
> Core has its own memory. So if you have a dual core with hyper threading, the 
> core works with 2 loads each at same time because of the 2 threads per core, 
> but this 2 threads will share memory in that core.
> 
> Some vendors as I am sure most of you aware charge licensing per core.
> 
> For example on the same host that I have Spark, I have a SAP product that 
> checks the licensing and shuts the application down if the license does not 
> agree with the cores speced.
> 
> This is what it says
> 
> ./cpuinfo
> License hostid:00e04c69159a 0050b60fd1e7
> Detected 12 logical processor(s), 6 core(s), in 1 chip(s)
> 
> So here I have 12 logical 

Re: SPARK-13900 - Join with simple OR conditions take too long

2016-04-01 Thread Robin East
Yes and even today CBO (e.g. in Oracle) will still require hints in some cases 
so I think it is more like:

RBO -> RBO + Hints -> CBO + Hints. Most relational databases meet significant 
numbers of corner cases where CBO plans simply don’t do what you would want. I 
don’t know enough about Spark SQL to comment on whether the same problems would 
afflict Spark.
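
For what it's worth, Spark SQL does already accept one hint of this kind - the 
broadcast hint on joins. A small sketch (my own; the table and column names are 
invented):

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.broadcast

def joinWithHint(sqlContext: SQLContext) = {
  val big   = sqlContext.table("big_table")     // hypothetical tables
  val small = sqlContext.table("small_table")

  // broadcast() asks the planner to ship `small` to every executor
  // instead of shuffling both sides
  big.join(broadcast(small), big("key") === small("key"))
}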




> On 31 Mar 2016, at 15:54, Yong Zhang  wrote:
> 
> I agree that there won't be a generic solution for these kind of cases.
> 
> Without the CBO from Spark or Hadoop ecosystem in short future, maybe Spark 
> DataFrame/SQL should support more hints from the end user, as in these cases, 
> end users will be smart enough to tell the engine what is the correct way to 
> do.
> 
> Weren't the relational DBs doing exactly same path? RBO -> RBO + Hints -> CBO?
> 
> Yong
> 
> Date: Thu, 31 Mar 2016 16:07:14 +0530
> Subject: Re: SPARK-13900 - Join with simple OR conditions take too long
> From: hemant9...@gmail.com 
> To: ashokkumar.rajend...@gmail.com 
> CC: user@spark.apache.org 
> 
> Hi Ashok,
> 
> That's interesting. 
> 
> As I understand, on table A and B, a nested loop join (that will produce m X 
> n rows) is performed and then each row is evaluated to see if any of the 
> condition is met. You are asking that Spark should instead do a 
> BroadcastHashJoin on the equality conditions in parallel and then union the 
> results like you are doing in a different query. 
> 
> If we leave aside parallelism for a moment, theoretically, time taken for 
> nested loop join would vary little when the number of conditions are 
> increased while the time taken for the solution that you are suggesting would 
> increase linearly with the number of conditions. So, when the number of conditions 
> is large, the nested loop join would be faster than the solution that you 
> suggest. Now the question is, how should Spark decide when to do what? 
> 
> 
> Hemant Bhanawat 
> www.snappydata.io  
> 
> On Thu, Mar 31, 2016 at 2:28 PM, ashokkumar rajendran 
> > 
> wrote:
> Hi,
> 
> I have filed ticket SPARK-13900. There was an initial reply from a developer 
> but did not get any reply on this. How can we do multiple hash joins together 
> for OR conditions based joins? Could someone please guide on how can we fix 
> this? 
> 
> Regards
> Ashok
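
For reference, the manual rewrite being discussed above - one equi-join per OR 
condition, unioned and de-duplicated - looks roughly like this sketch (my own; 
table and column names are invented):

import org.apache.spark.sql.{DataFrame, SQLContext}

def orJoinAsUnion(sqlContext: SQLContext): DataFrame = {
  val a = sqlContext.table("table_a")   // hypothetical tables
  val b = sqlContext.table("table_b")

  // each branch is an equality join, so it can use a hash join
  val byK1 = a.join(b, a("k1") === b("k1"))
  val byK2 = a.join(b, a("k2") === b("k2"))

  // unionAll + distinct plays the role of the OR: rows that satisfy
  // both conditions appear only once
  byK1.unionAll(byK2).distinct()
}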



Re: Recommendation for a good book on Spark, beginner to moderate knowledge

2016-03-01 Thread Robin East
Mohammed and I both obviously have a certain bias here but I have to agree with 
him - the documentation is pretty good but other sources are necessary to 
supplement. (Good) books are a curated source of information that can short-cut 
a lot of the learning. 
---
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action 
<http://www.manning.com/books/spark-graphx-in-action>





> On 1 Mar 2016, at 16:13, Mohammed Guller <moham...@glassbeam.com> wrote:
> 
> I agree that the Spark official documentation is pretty good. However, a book 
> also serves a useful purpose. It provides a structured roadmap for learning a 
> new technology. Everything is nicely organized for the reader. For somebody 
> who has just started learning Spark, the amount of material on the Internet 
> can be overwhelming. There are ton of blogs and presentations on the 
> Internet. A beginner could easily spend months reading them and still be 
> lost. If you are experienced, it is easy to figure out what to read and what 
> to skip.
>  
> I also agree that a book becomes outdated at some point, but not right away. 
> For example, a book covering DataFrames and Spark ML is not outdated yet.
>  
> Mohammed
> Author: Big Data Analytics with Spark 
> <http://www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656/>
>  
> From: charles li [mailto:charles.up...@gmail.com 
> <mailto:charles.up...@gmail.com>] 
> Sent: Monday, February 29, 2016 1:39 AM
> To: Ashok Kumar
> Cc: User
> Subject: Re: Recommendation for a good book on Spark, beginner to moderate 
> knowledge
>  
> since spark is under actively developing, so take a book to learn it is 
> somehow outdated to some degree.
>  
> I would like to suggest learn it from several ways as bellow:
>  
> spark official document, trust me, you will go through this for several time 
> if you want to learn in well : http://spark.apache.org/ 
> <http://spark.apache.org/>
> spark summit, lots of videos and slide, high quality : 
> https://spark-summit.org/ <https://spark-summit.org/>
> databricks' blog : https://databricks.com/blog <https://databricks.com/blog>
> attend spark meetup : http://www.meetup.com/ <http://www.meetup.com/>
> try spark 3-party package if needed and convenient : 
> http://spark-packages.org/ <http://spark-packages.org/>
> and I just start to blog my spark learning memo on my blog: 
> http://litaotao.github.io <http://litaotao.github.io/> 
>  
> in a word, I think the best way to learn it is official document + databricks 
> blog + others' blog ===>>> your blog [ tutorial by you or just memo for your 
> learning ]
>  
> On Mon, Feb 29, 2016 at 4:50 PM, Ashok Kumar <ashok34...@yahoo.com.invalid 
> <mailto:ashok34...@yahoo.com.invalid>> wrote:
> Thank you all for valuable advice. Much appreciated
>  
> Best
>  
> 
> On Sunday, 28 February 2016, 21:48, Ashok Kumar <ashok34...@yahoo.com 
> <mailto:ashok34...@yahoo.com>> wrote:
>  
> 
>   Hi Gurus,
>  
> Appreciate if you recommend me a good book on Spark or documentation for 
> beginner to moderate knowledge
>  
> I very much like to skill myself on transformation and action methods.
>  
> FYI, I have already looked at examples on net. However, some of them not 
> clear at least to me.
>  
> Warmest regards
>  
> 
> 
> 
>  
> -- 
> --
> a spark lover, a quant, a developer and a good man.
>  
> http://github.com/litaotao <http://github.com/litaotao>


Re: Get all vertexes with outDegree equals to 0 with GraphX

2016-02-26 Thread Robin East
Whilst I can think of other ways to do it I don’t think they would be 
conceptually or syntactically any simpler. GraphX doesn’t have the concept of 
built-in vertex properties which would make this simpler - a vertex in GraphX 
is a Vertex ID (Long) and a bunch of custom attributes that you assign. This 
means you have to find a way of ‘pushing’ the vertex degree into the graph so 
you can do comparisons (cf a join in relational databases) or as you have done 
create a list and filter against that (cf filtering against a sub-query in 
relational database). 

One thing I would point out is that you probably want to avoid 
finalVertexes.collect() for a large-scale system - this will pull all the 
vertices into the driver and then push them out to the executors again as part 
of the filter operation. A better strategy for large graphs would be:

1. build a graph based on the existing graph where the vertex attribute is the 
vertex degree - the GraphX documentation shows how to do this
2. filter this “degrees” graph to just give you 0 degree vertices
3 use graph.mask passing in the 0-degree graph to get the original graph with 
just 0 degree vertices

Just one variation on several possibilities, the key point is that everything 
is just a graph transformation until you call an action on the resulting graph
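
A minimal sketch of those three steps (my own code, assuming out-degree is the 
degree you care about):

import scala.reflect.ClassTag
import org.apache.spark.graphx._

def zeroOutDegreeVertices[VD: ClassTag, ED: ClassTag](graph: Graph[VD, ED]): Graph[VD, ED] = {
  // 1. a graph whose vertex attribute is the out-degree (0 where there is no entry)
  val degreeGraph = graph.outerJoinVertices(graph.outDegrees) {
    (vid, attr, deg) => deg.getOrElse(0)
  }
  // 2. keep only the vertices whose out-degree is 0
  val zeroDegree = degreeGraph.subgraph(vpred = (vid, deg) => deg == 0)
  // 3. mask the original graph to get those vertices with their original attributes
  graph.mask(zeroDegree)
}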
---
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action 
<http://www.manning.com/books/spark-graphx-in-action>





> On 26 Feb 2016, at 11:59, Guillermo Ortiz <konstt2...@gmail.com> wrote:
> 
> I'm new with graphX. I need to get the vertex without out edges..
> I guess that it's pretty easy but I did it pretty complicated.. and 
> inefficienct 
> 
> val vertices: RDD[(VertexId, (List[String], List[String]))] =
>   sc.parallelize(Array((1L, (List("a"), List[String]())),
> (2L, (List("b"), List[String]())),
> (3L, (List("c"), List[String]())),
> (4L, (List("d"), List[String]())),
> (5L, (List("e"), List[String]())),
> (6L, (List("f"), List[String]()
> 
> // Create an RDD for edges
> val relationships: RDD[Edge[Boolean]] =
>   sc.parallelize(Array(Edge(1L, 2L, true), Edge(2L, 3L, true), Edge(3L, 4L, 
> true), Edge(5L, 2L, true)))
> val out = minGraph.outDegrees.map(vertex => vertex._1)
> val finalVertexes = minGraph.vertices.keys.subtract(out)
> //It must be something better than this way..
> val nodes = finalVertexes.collect()
> val result = minGraph.vertices.filter(v => nodes.contains(v._1))
> 
> What's the good way to do this operation? It seems that it should be pretty 
> easy.



Re: Reindexing in graphx

2016-02-24 Thread Robin East
It looks like you are adding vertices one-by-one; you definitely don’t want to do 
that. What happens when you batch together 400 vertices into an RDD and then 
add 400 in one go?
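
Something along these lines (a rough sketch, not tested against your code; I 
have assumed the edge attribute type is Int and kept your (100, 100) default 
vertex attribute):

import org.apache.spark.SparkContext
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

def addVerticesBatched(sc: SparkContext,
                       gVertices: RDD[(VertexId, (Int, Int))],
                       gEdges: RDD[Edge[Int]],
                       newIds: Seq[Long]): Graph[(Int, Int), Int] = {
  // one RDD for the whole batch instead of one parallelize/union per vertex
  val newVertices: RDD[(VertexId, (Int, Int))] =
    sc.parallelize(newIds.map(id => (id, (100, 100))))

  // one union and one Graph construction per batch
  Graph(gVertices.union(newVertices), gEdges, (0, 0))
}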
---
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action 
<http://www.manning.com/books/spark-graphx-in-action>





> On 24 Feb 2016, at 05:49, Udbhav Agarwal <udbhav.agar...@syncoms.com> wrote:
> 
> Thank you Robin for your reply.
> Actually I am adding bunch of vertices in a graph in graphx using the 
> following method . I am facing the problem of latency. First time an addition 
> of say 400 vertices to a graph with 100,000 nodes takes around 7 seconds. 
> next time its taking 15 seconds. So every subsequent adds are taking more 
> time than the previous one. Hence I tried to do reindex() so the subsequent 
> operations can also be performed fast. 
> FYI My cluster is presently having one machine with 8 core and 8 gb ram. I am 
> running in local mode.
>  
> def addVertex(rdd: RDD[String], sc: SparkContext, session: String): Long = {
> val defaultUser = (0, 0)
> rdd.collect().foreach { x =>
>   {
> val aVertex: RDD[(VertexId, (Int, Int))] = 
> sc.parallelize(Array((x.toLong, (100, 100))))
> gVertices = gVertices.union(aVertex)
>   }
> }
> inputGraph = Graph(gVertices, gEdges, defaultUser)
> inputGraph.cache()
> gVertices = inputGraph.vertices
> gVertices.cache()
>     val count = gVertices.count
> println(count);
> 
> return 1;
>   }
>  
>  
> From: Robin East [mailto:robin.e...@xense.co.uk] 
> Sent: Tuesday, February 23, 2016 8:15 PM
> To: Udbhav Agarwal <udbhav.agar...@syncoms.com>
> Subject: Re: Reindexing in graphx
>  
> Hi
>  
> Well this is the line that is failing in VertexRDDImpl:
>  
> require(partitionsRDD.partitioner.isDefined)
>  
> But really you shouldn’t need to be calling the reindex() function as it 
> deals with some internals of the GraphX implementation - it looks to me like 
> it ought to be a private method. Perhaps you could explain what you are 
> trying to achieve.
> ---
> Robin East
> Spark GraphX in Action Michael Malak and Robin East
> Manning Publications Co.
> http://www.manning.com/books/spark-graphx-in-action 
> <http://www.manning.com/books/spark-graphx-in-action>
>  
>  
>  
> 
>  
> On 23 Feb 2016, at 12:18, Udbhav Agarwal <udbhav.agar...@syncoms.com 
> <mailto:udbhav.agar...@syncoms.com>> wrote:
>  
> Hi,
> I am trying to add vertices to a graph in graphx and I want to do reindexing 
> in the graph. I can see there is an option of vertices.reindex() in graphX. 
> But when I am doing graph.vertices.reindex() am getting 
> Java.lang.IllegalArgumentException: requirement failed.
> Please help me know what I am missing with the syntax as I have seen the API 
> documentation where only vertices.reindex() is mentioned.
>  
> Thanks,
> Udbhav Agarwal



Re: Constantly increasing Spark streaming heap memory

2016-02-22 Thread Robin East
Hi

What you describe looks like normal behaviour for almost any Java/Scala 
application - objects are created on the heap until a limit point is reached 
and then GC clears away memory allocated to objects that are no longer 
referenced. Is there an issue you are experiencing?
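
If you want to see what the GC is actually doing without attaching jconsole, 
one option (my suggestion, not something from this thread) is to turn on GC 
logging for the driver and executors; the app name below is a placeholder:

import org.apache.spark.SparkConf

// the sawtooth pattern then shows up in the executor/driver logs
val conf = new SparkConf()
  .setAppName("kafka-to-es")   // placeholder app name
  .set("spark.executor.extraJavaOptions",
    "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
  .set("spark.driver.extraJavaOptions",
    "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")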





> On 21 Feb 2016, at 00:37, Walid LEZZAR  wrote:
> 
> Hi,
> 
> I'm running a Spark Streaming job that pulls data from Kafka (using the 
> direct approach method - without receiver) and pushes it into elasticsearch. 
> The job is running fine but I was suprised once I opened jconsole to monitor 
> it : I noticed that the heap memory is constantly increasing until the GC 
> triggers, and then it restarts increasing again and so on.
> 
> I tried to use a profiler to understand what is happening in the heap. All I 
> found is a byte[] object that is constantly increasing. But no more details. 
> 
> Is there an explanation to that ? Is this behaviour inherent to Spark 
> Streaming jobs ?
> 
> Thanks for your help.


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Feature importance for RandomForestRegressor in Spark 1.5

2016-01-15 Thread Robin East
re 1.
The pull requests reference the JIRA ticket in this case 
https://issues.apache.org/jira/browse/SPARK-5133 
<https://issues.apache.org/jira/browse/SPARK-5133>. The JIRA says it was 
released in 1.5.
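
re 2. As far as I can tell the importances are exposed on the spark.ml ensemble 
models (e.g. RandomForestRegressionModel) rather than on the older mllib 
RandomForestModel from the 1.2 docs you linked. A rough sketch, assuming Spark 
1.5+ and placeholder column names:

import org.apache.spark.ml.regression.RandomForestRegressor
import org.apache.spark.sql.DataFrame

def rfImportances(training: DataFrame): Unit = {
  val rf = new RandomForestRegressor()
    .setLabelCol("label")        // placeholder column names
    .setFeaturesCol("features")
    .setNumTrees(50)

  val model = rf.fit(training)
  // one weight per feature, normalised to sum to 1
  println(model.featureImportances)
}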


---
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action 
<http://www.manning.com/books/spark-graphx-in-action>





> On 15 Jan 2016, at 16:06, Scott Imig <si...@richrelevance.com> wrote:
> 
> Hello,
> 
> I have a couple of quick questions about this pull request, which adds 
> feature importance calculations to the random forests in MLLib.
> 
> https://github.com/apache/spark/pull/7838 
> <https://github.com/apache/spark/pull/7838>
> 
> 1. Can someone help me determine the Spark version where this is first 
> available?  (1.5.0?  1.5.1?)
> 
> 2. Following the templates in this  documentation to construct a 
> RandomForestModel, should I be able to retrieve model.featureImportances?  Or 
> is there a different pattern for random forests in more recent spark versions?
> 
> https://spark.apache.org/docs/1.2.0/mllib-ensembles.html 
> <https://spark.apache.org/docs/1.2.0/mllib-ensembles.html>
> 
> Thanks for the help!
> Imig
> -- 
> S. Imig | Senior Data Scientist Engineer | richrelevance |m: 425.999.5725
> 
> I support Bip 101 and BitcoinXT <https://bitcoinxt.software/>.



Re: Shared memory between C++ process and Spark

2015-12-07 Thread Robin East
-dev, +user (this is not a question about development of Spark itself so you’ll 
get more answers in the user mailing list)

First up let me say that I don’t really know how this could be done - I’m sure 
it would be possible with enough tinkering but it’s not clear what you are 
trying to achieve. Spark is a distributed processing system, it has multiple 
JVMs running on different machines that each run a small part of the overall 
processing. Unless you have some sort of idea to have multiple C++ processes 
collocated with the distributed JVMs, using named memory-mapped files doesn’t 
make architectural sense. 
---
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action 
<http://www.manning.com/books/spark-graphx-in-action>





> On 6 Dec 2015, at 20:43, Jia <jacqueline...@gmail.com> wrote:
> 
> Dears, for one project, I need to implement something so Spark can read data 
> from a C++ process. 
> To provide high performance, I really hope to implement this through shared 
> memory between the C++ process and Java JVM process.
> It seems it may be possible to use named memory mapped files and JNI to do 
> this, but I wonder whether there is any existing efforts or more efficient 
> approach to do this?
> Thank you very much!
> 
> Best Regards,
> Jia
> 
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
> 



Re: Shared memory between C++ process and Spark

2015-12-07 Thread Robin East
Hi Annabel

I certainly did read your post. My point was that Spark can read from HDFS but 
is in no way tied to that storage layer . A very interesting use case that 
sounds very similar to Jia's (as mentioned by another poster) is contained in 
https://issues.apache.org/jira/browse/SPARK-10399. The comments section 
provides a specific example of processing very large images using a 
pre-existing c++ library.

Robin

Sent from my iPhone

> On 7 Dec 2015, at 18:50, Annabel Melongo <melongo_anna...@yahoo.com.INVALID> 
> wrote:
> 
> Jia,
> 
> I'm so confused on this. The architecture of Spark is to run on top of HDFS. 
> What you're requesting, reading and writing to a C++ process, is not part of 
> that requirement.
> 
> 
> 
> 
> 
> On Monday, December 7, 2015 1:42 PM, Jia <jacqueline...@gmail.com> wrote:
> 
> 
> Thanks, Annabel, but I may need to clarify that I have no intention to write 
> and run Spark UDF in C++, I'm just wondering whether Spark can read and write 
> data to a C++ process with zero copy.
> 
> Best Regards,
> Jia
>  
> 
> 
>> On Dec 7, 2015, at 12:26 PM, Annabel Melongo <melongo_anna...@yahoo.com> 
>> wrote:
>> 
>> My guess is that Jia wants to run C++ on top of Spark. If that's the case, 
>> I'm afraid this is not possible. Spark has support for Java, Python, Scala 
>> and R.
>> 
>> The best way to achieve this is to run your application in C++ and used the 
>> data created by said application to do manipulation within Spark.
>> 
>> 
>> 
>> On Monday, December 7, 2015 1:15 PM, Jia <jacqueline...@gmail.com> wrote:
>> 
>> 
>> Thanks, Dewful!
>> 
>> My impression is that Tachyon is a very nice in-memory file system that can 
>> connect to multiple storages.
>> However, because our data is also hold in memory, I suspect that connecting 
>> to Spark directly may be more efficient in performance.
>> But definitely I need to look at Tachyon more carefully, in case it has a 
>> very efficient C++ binding mechanism.
>> 
>> Best Regards,
>> Jia
>> 
>>> On Dec 7, 2015, at 11:46 AM, Dewful <dew...@gmail.com> wrote:
>>> 
>>> Maybe looking into something like Tachyon would help, I see some sample c++ 
>>> bindings, not sure how much of the current functionality they support...
>>> Hi, Robin, 
>>> Thanks for your reply and thanks for copying my question to user mailing 
>>> list.
>>> Yes, we have a distributed C++ application, that will store data on each 
>>> node in the cluster, and we hope to leverage Spark to do more fancy 
>>> analytics on those data. But we need high performance, that’s why we want 
>>> shared memory.
>>> Suggestions will be highly appreciated!
>>> 
>>> Best Regards,
>>> Jia
>>> 
>>>> On Dec 7, 2015, at 10:54 AM, Robin East <robin.e...@xense.co.uk> wrote:
>>>> 
>>>> -dev, +user (this is not a question about development of Spark itself so 
>>>> you’ll get more answers in the user mailing list)
>>>> 
>>>> First up let me say that I don’t really know how this could be done - I’m 
>>>> sure it would be possible with enough tinkering but it’s not clear what 
>>>> you are trying to achieve. Spark is a distributed processing system, it 
>>>> has multiple JVMs running on different machines that each run a small part 
>>>> of the overall processing. Unless you have some sort of idea to have 
>>>> multiple C++ processes collocated with the distributed JVMs using named 
>>>> memory mapped files doesn’t make architectural sense. 
>>>> ---
>>>> Robin East
>>>> Spark GraphX in Action Michael Malak and Robin East
>>>> Manning Publications Co.
>>>> http://www.manning.com/books/spark-graphx-in-action
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>>> On 6 Dec 2015, at 20:43, Jia <jacqueline...@gmail.com> wrote:
>>>>> 
>>>>> Dears, for one project, I need to implement something so Spark can read 
>>>>> data from a C++ process. 
>>>>> To provide high performance, I really hope to implement this through 
>>>>> shared memory between the C++ process and Java JVM process.
>>>>> It seems it may be possible to use named memory mapped files and JNI to 
>>>>> do this, but I wonder whether there is any existing efforts or more 
>>>>> efficient approach to do this?
>>>>> Thank you very much!
>>>>> 
>>>>> Best Regards,
>>>>> Jia
>>>>> 
>>>>> 
>>>>> -
>>>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>>>> For additional commands, e-mail: dev-h...@spark.apache.org
> 
> 
> 


Re: Shared memory between C++ process and Spark

2015-12-07 Thread Robin East
I guess you could write a custom RDD that can read data from a memory-mapped 
file - not really my area of expertise so I’ll leave it to other members of the 
forum to chip in with comments as to whether that makes sense. 

But if you want ‘fancy analytics’ then won’t the processing time more than 
outweigh the savings from using memory-mapped files? Particularly if your 
analytics involve any kind of aggregation of data across data nodes. Have you 
looked at a Lambda architecture which could involve Spark but doesn’t 
necessarily mean you would go to the trouble of implementing a custom 
memory-mapped file reading feature.
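
If you did go down that road, the low-level building block is fairly standard 
Java NIO; a very rough sketch (mine, not a recommendation) of mapping a named 
file and pulling bytes out of it:

import java.nio.channels.FileChannel
import java.nio.file.{Paths, StandardOpenOption}

def readMapped(path: String, length: Int): Array[Byte] = {
  val channel = FileChannel.open(Paths.get(path), StandardOpenOption.READ)
  try {
    // the OS maps the file into memory; no read() copy into the JVM heap yet
    val buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, length)
    val bytes = new Array[Byte](length)
    buffer.get(bytes) // this copy is only for illustration; real code would parse in place
    bytes
  } finally {
    channel.close()
  }
}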
---
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action 
<http://www.manning.com/books/spark-graphx-in-action>





> On 7 Dec 2015, at 17:32, Jia <jacqueline...@gmail.com> wrote:
> 
> Hi, Robin, 
> Thanks for your reply and thanks for copying my question to user mailing list.
> Yes, we have a distributed C++ application, that will store data on each node 
> in the cluster, and we hope to leverage Spark to do more fancy analytics on 
> those data. But we need high performance, that’s why we want shared memory.
> Suggestions will be highly appreciated!
> 
> Best Regards,
> Jia
> 
> On Dec 7, 2015, at 10:54 AM, Robin East <robin.e...@xense.co.uk 
> <mailto:robin.e...@xense.co.uk>> wrote:
> 
>> -dev, +user (this is not a question about development of Spark itself so 
>> you’ll get more answers in the user mailing list)
>> 
>> First up let me say that I don’t really know how this could be done - I’m 
>> sure it would be possible with enough tinkering but it’s not clear what you 
>> are trying to achieve. Spark is a distributed processing system, it has 
>> multiple JVMs running on different machines that each run a small part of 
>> the overall processing. Unless you have some sort of idea to have multiple 
>> C++ processes collocated with the distributed JVMs using named memory mapped 
>> files doesn’t make architectural sense. 
>> -------
>> Robin East
>> Spark GraphX in Action Michael Malak and Robin East
>> Manning Publications Co.
>> http://www.manning.com/books/spark-graphx-in-action 
>> <http://www.manning.com/books/spark-graphx-in-action>
>> 
>> 
>> 
>> 
>> 
>>> On 6 Dec 2015, at 20:43, Jia <jacqueline...@gmail.com 
>>> <mailto:jacqueline...@gmail.com>> wrote:
>>> 
>>> Dears, for one project, I need to implement something so Spark can read 
>>> data from a C++ process. 
>>> To provide high performance, I really hope to implement this through shared 
>>> memory between the C++ process and Java JVM process.
>>> It seems it may be possible to use named memory mapped files and JNI to do 
>>> this, but I wonder whether there is any existing efforts or more efficient 
>>> approach to do this?
>>> Thank you very much!
>>> 
>>> Best Regards,
>>> Jia
>>> 
>>> 
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org 
>>> <mailto:dev-unsubscr...@spark.apache.org>
>>> For additional commands, e-mail: dev-h...@spark.apache.org 
>>> <mailto:dev-h...@spark.apache.org>
>>> 
>> 
> 



Re: Shared memory between C++ process and Spark

2015-12-07 Thread Robin East
Annabel

Spark works very well with data stored in HDFS but is certainly not tied to it. 
Have a look at the wide variety of connectors to things like Cassandra, HBase, 
etc.

Robin

Sent from my iPhone

> On 7 Dec 2015, at 18:50, Annabel Melongo <melongo_anna...@yahoo.com> wrote:
> 
> Jia,
> 
> I'm so confused on this. The architecture of Spark is to run on top of HDFS. 
> What you're requesting, reading and writing to a C++ process, is not part of 
> that requirement.
> 
> 
> 
> 
> 
> On Monday, December 7, 2015 1:42 PM, Jia <jacqueline...@gmail.com> wrote:
> 
> 
> Thanks, Annabel, but I may need to clarify that I have no intention to write 
> and run Spark UDF in C++, I'm just wondering whether Spark can read and write 
> data to a C++ process with zero copy.
> 
> Best Regards,
> Jia
>  
> 
> 
>> On Dec 7, 2015, at 12:26 PM, Annabel Melongo <melongo_anna...@yahoo.com> 
>> wrote:
>> 
>> My guess is that Jia wants to run C++ on top of Spark. If that's the case, 
>> I'm afraid this is not possible. Spark has support for Java, Python, Scala 
>> and R.
>> 
>> The best way to achieve this is to run your application in C++ and used the 
>> data created by said application to do manipulation within Spark.
>> 
>> 
>> 
>> On Monday, December 7, 2015 1:15 PM, Jia <jacqueline...@gmail.com> wrote:
>> 
>> 
>> Thanks, Dewful!
>> 
>> My impression is that Tachyon is a very nice in-memory file system that can 
>> connect to multiple storages.
>> However, because our data is also hold in memory, I suspect that connecting 
>> to Spark directly may be more efficient in performance.
>> But definitely I need to look at Tachyon more carefully, in case it has a 
>> very efficient C++ binding mechanism.
>> 
>> Best Regards,
>> Jia
>> 
>>> On Dec 7, 2015, at 11:46 AM, Dewful <dew...@gmail.com> wrote:
>>> 
>>> Maybe looking into something like Tachyon would help, I see some sample c++ 
>>> bindings, not sure how much of the current functionality they support...
>>> Hi, Robin, 
>>> Thanks for your reply and thanks for copying my question to user mailing 
>>> list.
>>> Yes, we have a distributed C++ application, that will store data on each 
>>> node in the cluster, and we hope to leverage Spark to do more fancy 
>>> analytics on those data. But we need high performance, that’s why we want 
>>> shared memory.
>>> Suggestions will be highly appreciated!
>>> 
>>> Best Regards,
>>> Jia
>>> 
>>>> On Dec 7, 2015, at 10:54 AM, Robin East <robin.e...@xense.co.uk> wrote:
>>>> 
>>>> -dev, +user (this is not a question about development of Spark itself so 
>>>> you’ll get more answers in the user mailing list)
>>>> 
>>>> First up let me say that I don’t really know how this could be done - I’m 
>>>> sure it would be possible with enough tinkering but it’s not clear what 
>>>> you are trying to achieve. Spark is a distributed processing system, it 
>>>> has multiple JVMs running on different machines that each run a small part 
>>>> of the overall processing. Unless you have some sort of idea to have 
>>>> multiple C++ processes collocated with the distributed JVMs using named 
>>>> memory mapped files doesn’t make architectural sense. 
>>>> ---
>>>> Robin East
>>>> Spark GraphX in Action Michael Malak and Robin East
>>>> Manning Publications Co.
>>>> http://www.manning.com/books/spark-graphx-in-action
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>>> On 6 Dec 2015, at 20:43, Jia <jacqueline...@gmail.com> wrote:
>>>>> 
>>>>> Dears, for one project, I need to implement something so Spark can read 
>>>>> data from a C++ process. 
>>>>> To provide high performance, I really hope to implement this through 
>>>>> shared memory between the C++ process and Java JVM process.
>>>>> It seems it may be possible to use named memory mapped files and JNI to 
>>>>> do this, but I wonder whether there is any existing efforts or more 
>>>>> efficient approach to do this?
>>>>> Thank you very much!
>>>>> 
>>>>> Best Regards,
>>>>> Jia
>>>>> 
>>>>> 
>>>>> -
>>>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>>>> For additional commands, e-mail: dev-h...@spark.apache.org
> 
> 
> 


Re: LDA topic modeling and Spark

2015-12-03 Thread Robin East
What exactly is this probability distribution? For each word in your vocabulary 
it is the probability that a randomly drawn word from a topic is that word. 
Another way to visualise it is a 2-column vector where the 1st column is a word 
in your vocabulary and the 2nd column is the probability of that word 
appearing. All the values in the 2nd-column must be >= 0 and if you add up all 
the values they should sum to 1. That is the definition of a probability 
distribution. 

Clearly for the idea of topics to be at all useful you want different topics to 
exhibit different probability distributions i.e. some words to be more likely 
in 1 topic compared to another topic.

How does it actually infer words and topics? Probably a good idea to google for 
that one if you really want to understand the details - there are some great 
resources available.

How can I connect the output to the actual words in each topic? A typical way 
is to look at the top 5, 10 or 20 words in each topic and use those to infer 
something about what the topic represents.
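
Concretely, describeTopics gives you the term indices and weights per topic; a 
small sketch (mine, not from the linked docs) that maps them back to words, 
assuming vocab is the same array used to build the count vectors:

import org.apache.spark.mllib.clustering.LDAModel

def printTopTerms(model: LDAModel, vocab: Array[String]): Unit = {
  // for each topic: indices of the 10 highest-probability terms and their weights
  val topics = model.describeTopics(10)
  topics.zipWithIndex.foreach { case ((termIndices, termWeights), topicId) =>
    val terms = termIndices.zip(termWeights)
      .map { case (i, w) => f"${vocab(i)}%s=$w%.3f" }
    println(s"Topic $topicId: " + terms.mkString(", "))
  }
}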
---
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action 
<http://www.manning.com/books/spark-graphx-in-action>





> On 3 Dec 2015, at 05:07, Nguyen, Tiffany T <nguye...@grinnell.edu> wrote:
> 
> Hello,
> 
> I have been trying to understand the LDA topic modeling example provided 
> here: 
> https://spark.apache.org/docs/latest/mllib-clustering.html#latent-dirichlet-allocation-lda
>  
> <https://spark.apache.org/docs/latest/mllib-clustering.html#latent-dirichlet-allocation-lda>.
>  In the example, they load word count vectors from a text file that contains 
> these word counts and then they output the topics, which is represented as 
> probability distributions over words. What exactly is this probability 
> distribution? How does it actually infer words and topics and how can I 
> connect the output to the actual words in each topic?
> 
> Thanks!



Re: spark performance - executor computing time

2015-09-16 Thread Robin East
Is this repeatable? Do you always get one or two executors that are 6 times as 
slow? It could be that some of your tasks have more work to do (maybe you are 
filtering some records out?). If it’s always one particular worker node, is there 
something about the machine configuration (e.g. CPU speed) that means the 
processing takes longer.

—
Robin East
Spark GraphX in Action Michael S Malak and Robin East
http://www.manning.com/books/spark-graphx-in-action 
<http://www.manning.com/books/spark-graphx-in-action>

> On 15 Sep 2015, at 12:35, patcharee <patcharee.thong...@uni.no> wrote:
> 
> Hi,
> 
> I was running a job (on Spark 1.5 + Yarn + java 8). In a stage that lookup 
> (org.apache.spark.rdd.PairRDDFunctions.lookup(PairRDDFunctions.scala:873)) 
> there was an executor that took the executor computing time > 6 times of 
> median. This executor had almost the same shuffle read size and low gc time 
> as others.
> 
> What can impact the executor computing time? Any suggestions what parameters 
> I should monitor/configure?
> 
> BR,
> Patcharee
> 
> 
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 



Re: Applying transformations on a JavaRDD using reflection

2015-09-09 Thread Robin East
Have you got some code already that demonstrates the problem?
> On 9 Sep 2015, at 04:45, Nirmal Fernando  wrote:
> 
> Any thoughts?
> 
> On Tue, Sep 8, 2015 at 3:37 PM, Nirmal Fernando  > wrote:
> Hi All,
> 
> I'd like to apply a chain of Spark transformations (map/filter) on a given 
> JavaRDD. I'll have the set of Spark transformations as Function, and 
> even though I can determine the classes of T and A at the runtime, due to the 
> type erasure, I cannot call JavaRDD's transformations as they expect 
> generics. Any idea on how to resolve this?
> 
> -- 
> 
> Thanks & regards,
> Nirmal
> 
> Team Lead - WSO2 Machine Learner
> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
> Mobile: +94715779733 
> Blog: http://nirmalfdo.blogspot.com/ 
> 
> 
> 
> 
> 
> -- 
> 
> Thanks & regards,
> Nirmal
> 
> Team Lead - WSO2 Machine Learner
> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
> Mobile: +94715779733
> Blog: http://nirmalfdo.blogspot.com/ 
> 
> 



Re: Build k-NN graph for large dataset

2015-08-26 Thread Robin East
You could try dimensionality reduction (PCA or SVD) first. I would imagine that 
even if you could successfully compute similarities in the high-dimensional 
space you would probably run into the curse of dimensionality.
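
A sketch of that reduction step only (my own; note that computePrincipalComponents 
builds a d x d covariance matrix on the driver, so it only helps while d is 
modest):

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.rdd.RDD

def reduceDimensions(vectors: RDD[Vector], k: Int): RowMatrix = {
  val mat = new RowMatrix(vectors)
  val pc = mat.computePrincipalComponents(k) // d x k
  mat.multiply(pc)                           // n x k projected rows
}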
 On 26 Aug 2015, at 12:35, Jaonary Rabarisoa jaon...@gmail.com wrote:
 
 Dear all,
 
 I'm trying to find an efficient way to build a k-NN graph for a large 
 dataset. Precisely, I have a large set of high dimensional vector (say d  
 1) and I want to build a graph where those high dimensional points are 
 the vertices and each one is linked to its k nearest neighbors based on some 
 kind of similarity defined on the vertex space. 
 My problem is to implement an efficient algorithm to compute the weight 
 matrix of the graph. I need to compute N*N similarities and the only way I 
 know is to use a cartesian operation followed by a map operation on the RDD. But 
 this is very slow when N is large. Is there a cleverer way to do this 
 for an arbitrary similarity function? 
 
 Cheers,
 
 Jao


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark ec2 lunch problem

2015-08-24 Thread Robin East
spark-ec2 is the way to go however you may need to debug connectivity issues. 
For example, do you know that the servers were correctly set up in AWS, and can 
you access each node using ssh? If not, you need to work out why (it’s not a 
spark issue). If yes then you will need to work out why ssh via the spark-ec2 
script is not working.

I’ve used spark-ec2 successfully many times but have never used the —vpc-id and 
—subnet-id options and that may be the source of your problems, especially 
since it appears to be a hostname resolution issue. If you could confirm the 
above questions then maybe someone on the list can help diagnose the specific 
problem.


---
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/malak/ http://www.manning.com/malak/

 On 24 Aug 2015, at 13:45, Garry Chen g...@cornell.edu wrote:
 
 So what is the best way to deploy spark cluster in EC2 environment any 
 suggestions?
  
 Garry
  
 From: Akhil Das [mailto:ak...@sigmoidanalytics.com 
 mailto:ak...@sigmoidanalytics.com] 
 Sent: Friday, August 21, 2015 4:27 PM
 To: Garry Chen g...@cornell.edu mailto:g...@cornell.edu
 Cc: user@spark.apache.org mailto:user@spark.apache.org
 Subject: Re: Spark ec2 lunch problem
  
 It may happen that the version of spark-ec2 script you are using is buggy or 
 sometime AWS have problem provisioning machines.
 
 On Aug 21, 2015 7:56 AM, Garry Chen g...@cornell.edu 
 mailto:g...@cornell.edu wrote:
 Hi All,
 I am trying to lunch a spark ec2 cluster by running  
 spark-ec2 --key-pair=key --identity-file=my.pem --vpc-id=myvpc 
 --subnet-id=subnet-011 --spark-version=1.4.1 launch spark-cluster but getting 
 following message endless.  Please help.
  
  
 Warning: SSH connection error. (This could be temporary.)
 Host:
 SSH return code: 255
 SSH output: ssh: Could not resolve hostname : Name or service not known



Re: Spark return key value pair

2015-08-19 Thread Robin East
Dawid is right, if you did words.count it would be twice the number of input 
lines. You can use map like this:

words = lines.map(mapper2)

    for i in words.take(10):
        msg = i[0] + ":" + i[1] + "\n"

---
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/malak/ http://www.manning.com/malak/


 On 19 Aug 2015, at 12:19, Dawid Wysakowicz wysakowicz.da...@gmail.com wrote:
 
 I am not 100% sure but probably flatMap unwinds the tuples. Try with map 
 instead.
 
 2015-08-19 13:10 GMT+02:00 Jerry OELoo oylje...@gmail.com 
 mailto:oylje...@gmail.com:
 Hi.
 I want to parse a file and return a key-value pair with pySpark, but
 result is strange to me.
 the test.sql is a big fie and each line is usename and password, with
 # between them, I use below mapper2 to map data, and in my
 understanding, i in words.take(10) should be a tuple, but the result
 is that i is username or password, this is strange for me to
 understand, Thanks for you help.
 
 def mapper2(line):
 
 words = line.split('#')
 return (words[0].strip(), words[1].strip())
 
 def main2(sc):
 
  lines = sc.textFile("hdfs://master:9000/spark/test.sql")
 words = lines.flatMap(mapper2)
 
 for i in words.take(10):
      msg = i + ":" + "\n"
 
 
 --
 Rejoice,I Desire!
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
 mailto:user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org 
 mailto:user-h...@spark.apache.org
 
 



Re: [MLLIB] Anyone tried correlation with RDD[Vector] ?

2015-07-23 Thread Robin East
The OP’s problem is he gets this:

console:47: error: type mismatch;
 found   : org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.DenseVector]
 required: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector]
Note: org.apache.spark.mllib.linalg.DenseVector <: 
org.apache.spark.mllib.linalg.Vector, but class RDD is invariant in type T.
You may wish to define T as +T instead. (SLS 4.5)

The solution is to ensure you have an RDD[Vector], not an RDD[DenseVector]
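
A minimal sketch of what that looks like end to end (the values are invented):

import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.{Matrix, Vector, Vectors}
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD

def corrExample(sc: SparkContext): Matrix = {
  // declare the RDD as RDD[Vector] (the supertype) up front
  val rows: RDD[Vector] = sc.parallelize(Seq(
    Vectors.dense(1.0, 2.0, 3.0),
    Vectors.dense(2.0, 4.1, 5.9),
    Vectors.dense(3.0, 6.2, 9.1)
  ))
  Statistics.corr(rows) // Pearson correlation matrix by default
}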

 On 23 Jul 2015, at 15:30, Rishi Yadav ri...@infoobjects.com wrote:
 
 can you explain what transformation is failing. Here's a simple example.
 
 http://www.infoobjects.com/spark-calculating-correlation-using-rdd-of-vectors/
  
 http://www.infoobjects.com/spark-calculating-correlation-using-rdd-of-vectors/
 
 On Thu, Jul 23, 2015 at 5:37 AM, saif.a.ell...@wellsfargo.com 
 mailto:saif.a.ell...@wellsfargo.com wrote:
 I tried with an RDD[DenseVector], but RDD is not covariant in T, so 
 RDD[DenseVector] is not <: RDD[Vector] and I can’t get to use the RDD input method 
 of correlation.
  
 Thanks,
 Saif
  
 



Re: Performance issue with Spak's foreachpartition method

2015-07-22 Thread Robin East
The first question I would ask is have you determined whether you have a 
performance issue writing to Oracle? In particular how many commits are you 
making? If you are issuing a lot of commits that would be a performance problem.
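
For comparison, a sketch (mine, not your code) of the commit pattern I would 
aim for: autocommit off, executeBatch every batchSize rows, and a single commit 
per partition. The JDBC URL, credentials, table and SQL are placeholders:

import java.sql.DriverManager
import org.apache.spark.rdd.RDD

def writePartitions(results: RDD[(String, String)], batchSize: Int): Unit = {
  results.foreachPartition { iter =>
    val conn = DriverManager.getConnection(
      "jdbc:oracle:thin:@//dbhost:1521/service", "user", "pass") // placeholders
    conn.setAutoCommit(false)
    val stmt = conn.prepareStatement(
      "INSERT INTO results_table (k, v) VALUES (?, ?)") // hypothetical table
    try {
      var count = 0
      iter.foreach { case (k, v) =>
        stmt.setString(1, k)
        stmt.setString(2, v)
        stmt.addBatch()
        count += 1
        if (count % batchSize == 0) stmt.executeBatch()
      }
      stmt.executeBatch()
      conn.commit() // one commit per partition
    } finally {
      stmt.close()
      conn.close()
    }
  }
}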

Robin

 On 22 Jul 2015, at 19:11, diplomatic Guru diplomaticg...@gmail.com wrote:
 
 Hello all,
 
 We are having a major performance issue with the Spark, which is holding us 
 from going live.
 
 We have a job that carries out computation on log files and write the results 
 into Oracle DB.
 
 The reducer 'reduceByKey' has been set to parallelize by 4 as we don't want 
 to establish too many DB connections.
 
 We are then calling the foreachPartition on the RDD pairs that were reduced 
 by the key. Within this foreachPartition method we establish DB connection, 
 then iterate the results, prepare the Oracle statement for batch insertion 
 then we commit the batch and close the connection. All these are working fine.
 
 However, when we execute the job to process 12GB of data, it takes forever to 
 complete, especially at the foreachPartition stage.
 
 We submitted the job with 6 executors, 2 cores, and 6GB memory of which 0.3 
 is assigned to spark.storage.memoryFraction.
 
 The job is taking about 50 minutes to complete, which is not ideal. I'm not 
 sure how we could enhance the performance. I've provided the main body of the 
 codes, please take a look and advice:
 
 From Driver:
 
 reduceResultsRDD.foreachPartition(new DB.InsertFunction( 
 dbuser,dbpass,batchsize));
 
 
 DB class:
 
 public class DB {
   private static final Logger logger = LoggerFactory
   .getLogger(DB.class);
   
 public static class InsertFunction implements
  VoidFunction<Iterator<Tuple2<String, String>>> {
 
   private static final long serialVersionUID = 55766876878L;
   private String dbuser = ;
   private String dbpass = ;
   private int batchsize;
 
   public InsertFunction(String dbuser, String dbpass, int 
 batchsize) {
   super();
   this.dbuser = dbuser;
   this.dbuser = dbuser;
   this.batchsize=batchsize;
   }
 
 @Override
  public void call(Iterator<Tuple2<String, String>> results) {
   Connection connect = null;
   PreparedStatement pstmt = null;
   try {
   connect = getDBConnection(dbuser,
   dbpass);
 
   int count = 0;
 
    if (batchsize <= 0) {
   batchsize = 1;
   }
 
    pstmt = connect
    .prepareStatement("MERGE INTO SOME TABLE IF RECORD FOUND, IF NOT INSERT");
 
   while (results.hasNext()) {
 
    Tuple2<String, String> kv = results.next();

    String[] data = kv._1.concat("," + kv._2).split(",");
 
   
   pstmt.setString(1, data[0].toString());
   pstmt.setString(2, data[1].toString());
.
 
   pstmt.addBatch();
 
   count++;
 
   if (count == batchsize) {
    logger.info("BulkCount : " + count);
   pstmt.executeBatch();
   connect.commit();
   count = 0;
   }
 
   pstmt.executeBatch();
   connect.commit();
 
   }
 
   pstmt.executeBatch();
   connect.commit();
 
   } catch (Exception e) {
    logger.error("InsertFunction error: " + e.getMessage());
   } finally {
 
   if (pstmt != null) {
   pstmt.close();
   }
 
   try {
   
   connect.close();
   } catch (SQLException e) {
    logger.error("InsertFunction Connection Close error: " + e.getMessage());
   }
   }
   }
 
   }
 }



Re: Is spark suitable for real time query

2015-07-22 Thread Robin East
Real-time is, of course, relative but you’ve mentioned microsecond level. Spark 
is designed to process large amounts of data in a distributed fashion. No 
distributed system I know of could give any kind of guarantees at the 
microsecond level.

Robin

 On 22 Jul 2015, at 11:14, Louis Hust louis.h...@gmail.com wrote:
 
 Hi, all
 
 I am using spark jar in standalone mode, fetch data from different mysql 
 instance and do some action, but i found the time is at second level.
 
 So i want to know if spark job is suitable for real time query which at 
 microseconds?


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Research ideas using spark

2015-07-15 Thread Robin East
Well said Will. I would add that you might want to investigate GraphChi which 
claims to be able to run a number of large-scale graph processing tasks on a 
workstation much quicker than a very large Hadoop cluster. It would be 
interesting to know how widely applicable the approach GraphChi takes and what 
implications it has for parallel/distributed computing approaches. A rich seam 
to mine indeed.

Robin
 On 15 Jul 2015, at 14:48, William Temperley willtemper...@gmail.com wrote:
 
 There seems to be a bit of confusion here - the OP (doing the PhD) had the 
 thread hijacked by someone with a similar name asking a mundane question.
 
 It would be a shame to send someone away so rudely, who may do valuable work 
 on Spark.
 
 Sashidar (not Sashid!) I'm personally interested in running graph algorithms 
 for image segmentation using MLib and Spark.  I've got many questions though 
 - like is it even going to give me a speed-up?  
 (http://www.frankmcsherry.org/graph/scalability/cost/2015/01/15/COST.html 
 http://www.frankmcsherry.org/graph/scalability/cost/2015/01/15/COST.html)
 
 It's not obvious to me which classes of graph algorithms can be implemented 
 correctly and efficiently in a highly parallel manner.  There's tons of work 
 to be done here, I'm sure. Also, look at parallel geospatial algorithms - 
 there's a lot of work being done on this.
 
 Best, Will
 
 
 
 On 15 July 2015 at 09:01, Vineel Yalamarthy vineelyalamar...@gmail.com 
 mailto:vineelyalamar...@gmail.com wrote:
 Hi Daniel
 
 Well said
 
 Regards 
 Vineel
 
 
 On Tue, Jul 14, 2015, 6:11 AM Daniel Darabos 
 daniel.dara...@lynxanalytics.com mailto:daniel.dara...@lynxanalytics.com 
 wrote:
 Hi Shahid,
 To be honest I think this question is better suited for Stack Overflow than 
 for a PhD thesis.
 
 On Tue, Jul 14, 2015 at 7:42 AM, shahid ashraf sha...@trialx.com 
 mailto:sha...@trialx.com wrote:
 hi 
 
 I have a 10 node cluster  i loaded the data onto hdfs, so the no. of 
 partitions i get is 9. I am running a spark application , it gets stuck on 
 one of tasks, looking at the UI it seems application is not using all nodes 
 to do calculations. attached is the screen shot of tasks, it seems tasks are 
 put on each node more then once. looking at tasks 8 tasks get completed under 
 7-8 minutes and one task takes around 30 minutes so causing the delay in 
 results. 
 
 
 On Tue, Jul 14, 2015 at 10:48 AM, Shashidhar Rao raoshashidhar...@gmail.com 
 mailto:raoshashidhar...@gmail.com wrote:
 Hi,
 
 I am doing my PHD thesis on large scale machine learning e.g  Online 
 learning, batch and mini batch learning.
 
 Could somebody help me with ideas especially in the context of Spark and to 
 the above learning methods. 
 
 Some ideas like improvement to existing algorithms, implementing new features 
 especially the above learning methods and algorithms that have not been 
 implemented etc.
 
 If somebody could help me with some ideas it would really accelerate my work.
 
 Plus few ideas on research papers regarding Spark or Mahout.
 
 Thanks in advance.
 
 Regards 
 
 
 
 -- 
 with Regards
 Shahid Ashraf
 
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
 mailto:user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org 
 mailto:user-h...@spark.apache.org
 
 



Re: Spark GraphX memory requirements + java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-06-26 Thread Robin East
You’ll get this issue if you just take the first 2000 lines of that file. The 
problem is triangleCount() expects srcId < dstId, which is not the case in the 
file (e.g. vertex 28). You can get round this by calling 
graph.convertToCanonicalEdges(), which removes bi-directional edges and ensures 
srcId < dstId. Which version of Spark are you on? Can’t remember what version 
that method was introduced in.
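
A sketch of the workaround (mine, following your setup, and assuming a Spark 
version that has convertToCanonicalEdges):

import org.apache.spark.SparkContext
import org.apache.spark.graphx._

def countTriangles(sc: SparkContext, path: String): Long = {
  val graph = GraphLoader.edgeListFile(sc, path)
  // enforce srcId < dstId and merge duplicate/bi-directional edges,
  // which is what triangleCount() expects
  val canonical = graph.convertToCanonicalEdges((a, b) => a)
    .partitionBy(PartitionStrategy.RandomVertexCut)
  canonical.triangleCount().vertices.map(_._2.toLong).reduce(_ + _) / 3
}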

Robin
 On 26 Jun 2015, at 09:44, Roman Sokolov ole...@gmail.com wrote:
 
 Ok, but what does it means? I did not change the core files of spark, so is 
 it a bug there?
 PS: on small datasets (500 Mb) I have no problem.
 
 Am 25.06.2015 18:02 schrieb Ted Yu yuzhih...@gmail.com 
 mailto:yuzhih...@gmail.com:
 The assertion failure from TriangleCount.scala corresponds with the following 
 lines:
 
 g.outerJoinVertices(counters) {
    (vid, _, optCounter: Option[Int]) =>
 val dblCount = optCounter.getOrElse(0)
 // double count should be even (divisible by two)
  assert((dblCount & 1) == 0)
 
 Cheers
 
 On Thu, Jun 25, 2015 at 6:20 AM, Roman Sokolov ole...@gmail.com 
 mailto:ole...@gmail.com wrote:
 Hello!
 I am trying to compute number of triangles with GraphX. But get memory error 
 or heap size, even though the dataset is very small (1Gb). I run the code in 
 spark-shell, having 16Gb RAM machine (also tried with 2 workers on separate 
 machines 8Gb RAM each). So I have 15x more memory than the dataset size is, 
 but it is not enough. What should I do with terabytes sized datasets? How do 
 people process it? Read a lot of documentation and 2 Spark books, and still 
 have no clue :(
 
 Tried to run with the options, no effect:
 ./bin/spark-shell --executor-memory 6g --driver-memory 9g 
 --total-executor-cores 100
 
 The code is simple:
 
  val graph = GraphLoader.edgeListFile(sc, 
  "/home/ubuntu/data/soc-LiveJournal1/lj.stdout",  
   edgeStorageLevel = StorageLevel.MEMORY_AND_DISK_SER, 
   vertexStorageLevel = 
 StorageLevel.MEMORY_AND_DISK_SER).partitionBy(PartitionStrategy.RandomVertexCut)
 
 println(graph.numEdges)
 println(graph.numVertices)
 
  val triangleNum = graph.triangleCount().vertices.map(x => x._2).reduce(_ + 
 _)/3
 
 (dataset is from here: 
 http://konect.uni-koblenz.de/downloads/tsv/soc-LiveJournal1.tar.bz2 
 http://konect.uni-koblenz.de/downloads/tsv/soc-LiveJournal1.tar.bz2 first 
 two lines contain % characters, so have to be removed).
 
 
 UPD: today tried on 32Gb machine (from spark shell again), now got another 
 error:
 
 [Stage 8: (0 + 4) / 
 32]15/06/25 13:03:05 WARN ShippableVertexPartitionOps: Joining two 
 VertexPartitions with different indexes is slow.
 15/06/25 13:03:05 ERROR Executor: Exception in task 3.0 in stage 8.0 (TID 227)
 java.lang.AssertionError: assertion failed
   at scala.Predef$.assert(Predef.scala:165)
   at 
 org.apache.spark.graphx.lib.TriangleCount$$anonfun$7.apply(TriangleCount.scala:90)
   at 
 org.apache.spark.graphx.lib.TriangleCount$$anonfun$7.apply(TriangleCount.scala:87)
   at 
 org.apache.spark.graphx.impl.VertexPartitionBaseOps.leftJoin(VertexPartitionBaseOps.scala:140)
   at 
 org.apache.spark.graphx.impl.VertexPartitionBaseOps.leftJoin(VertexPartitionBaseOps.scala:133)
   at 
 org.apache.spark.graphx.impl.VertexRDDImpl$$anonfun$3.apply(VertexRDDImpl.scala:159)
   at 
 org.apache.spark.graphx.impl.VertexRDDImpl$$anonfun$3.apply(VertexRDDImpl.scala:156)
   at 
 org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
   at org.apache.spark.graphx.VertexRDD.compute(VertexRDD.scala:71)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
   at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
   at org.apache.spark.scheduler.Task.run(Task.scala:70)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 
 
 
 
 
 
 -- 
 Best regards, Roman Sokolov
 
 
 



Re: Extracting k-means cluster values along with centers?

2015-06-13 Thread Robin East
trying again
 On 13 Jun 2015, at 10:15, Robin East robin.e...@xense.co.uk wrote:
 
 Here’s typical way to do it:
 
 
 import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
 import org.apache.spark.mllib.linalg.Vectors
  
 // Load and parse the data
  val data = sc.textFile("data/mllib/kmeans_data.txt")
  val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()
  
 // Cluster the data into two classes using KMeans
 val numClusters = 2
 val numIterations = 20
 val model = KMeans.train(parsedData, numClusters, numIterations)
  
 val parsedDataClusters = model.predict(parsedData)
 val dataWithClusters   = parsedData.zip(parsedDataClusters)
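  
  To get the "centre as key" pairing you asked about, you can extend that (my 
  addition, not in the original mail; the output path is a placeholder):
  
  val byCenter = dataWithClusters.map { case (point, clusterId) =>
    (model.clusterCenters(clusterId), point)   // key each point by its centroid
  }
  byCenter.map { case (center, point) => s"$center\t$point" }
    .saveAsTextFile("kmeans_clusters_out")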
 
 
 On 12 Jun 2015, at 23:44, Minnow Noir minnown...@gmail.com 
 mailto:minnown...@gmail.com wrote:
 
 Greetings.
 
 I have been following some of the tutorials online for Spark k-means 
 clustering.  I would like to be able to just dump all the cluster values 
 and their centroids to text file so I can explore the data.  I have the 
 clusters as such:
 
 val clusters = KMeans.train(parsedData, numClusters, numIterations)
 
 clusters
 res2: org.apache.spark.mllib.clustering.KMeansModel = 
 org.apache.spark.mllib.clustering.KMeansModel@59de440b
 
 Is there a way to build something akin to a key value RDD that has the 
 center as the key and the array of values associated with that center as the 
 value? I don't see anything in the tutorials, API docs, or the Learning 
 book for how to do this.  
 
 Thank you
 
 



Re: Linear Regression with SGD

2015-06-09 Thread Robin East
Hi Stephen

How many is a very large number of iterations? SGD is notorious for requiring 
100s or 1000s of iterations, also you may need to spend some time tweaking the 
step-size. In 1.4 there is an implementation of ElasticNet Linear Regression 
which is supposed to compare favourably with an equivalent R implementation.
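
For reference, a sketch of that 1.4+ alternative (my own; column names and 
parameter values are placeholders, not tuned recommendations):

import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.DataFrame

def fitElasticNet(training: DataFrame) = {
  val lr = new LinearRegression()
    .setLabelCol("label")        // placeholder column names
    .setFeaturesCol("features")
    .setMaxIter(100)
    .setRegParam(0.1)            // overall regularisation strength
    .setElasticNetParam(0.5)     // 0 = ridge (L2), 1 = lasso (L1)
  lr.fit(training)
}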
 On 9 Jun 2015, at 22:05, Stephen Carman scar...@coldlight.com wrote:
 
 Hi User group,
 
 We are using spark Linear Regression with SGD as the optimization technique 
 and we are achieving very sub-optimal results.
 
 Can anyone shed some light on why this implementation seems to produce such 
 poor results vs our own implementation?
 
 We are using a very small dataset, but we have to use a very large number of 
  iterations to achieve similar results to our implementation. We’ve tried 
  normalizing the data, not normalizing the data, and tuning every param. Our 
  implementation is a 
 closed form solution so we should be guaranteed convergence but the spark one 
 is not, which is
 understandable, but why is it so far off?
 
 Has anyone experienced this?
 
 Steve Carman, M.S.
 Artificial Intelligence Engineer
 Coldlight-PTC
 scar...@coldlight.com
 This e-mail is intended solely for the above-mentioned recipient and it may 
 contain confidential or privileged information. If you have received it in 
 error, please notify us immediately and delete the e-mail. You must not copy, 
 distribute, disclose or take any action in reliance on it. In addition, the 
 contents of an attachment to this e-mail may contain software viruses which 
 could damage your own computer system. While ColdLight Solutions, LLC has 
 taken every reasonable precaution to minimize this risk, we cannot accept 
 liability for any damage which you sustain as a result of software viruses. 
 You should perform your own virus checks before opening the attachment.
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 





Re: Is LIMIT n in Spark SQL useful?

2015-05-05 Thread Robin East
Michael

Are there plans to add LIMIT push down? It's quite a natural thing to do in 
interactive querying.
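
In the meantime one workaround (a sketch, assuming the table is read through the JDBC data source; the connection details below are placeholders) is to push the LIMIT into the dbtable subquery so PostgreSQL applies it before any rows reach Spark:

val limited = sqlContext.load("jdbc", Map(
  "url"     -> "jdbc:postgresql://dbhost:5432/mydb",         // placeholder connection string
  "dbtable" -> "(SELECT * FROM big_table LIMIT 100) AS t",   // LIMIT runs inside PostgreSQL
  "driver"  -> "org.postgresql.Driver"))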

Sent from my iPhone

 On 4 May 2015, at 22:57, Michael Armbrust mich...@databricks.com wrote:
 
 The JDBC interface for Spark SQL does not support pushing down limits today.
 
 On Mon, May 4, 2015 at 8:06 AM, Robin East robin.e...@xense.co.uk wrote:
 and a further question - have you tried running this query in psql? What's 
 the performance like there?
 
 On 4 May 2015, at 16:04, Robin East robin.e...@xense.co.uk wrote:
 
 What query are you running? It may be the case that your query requires 
 PostgreSQL to do a large amount of work before identifying the first n rows.
 On 4 May 2015, at 15:52, Yi Zhang zhangy...@yahoo.com.INVALID wrote:
 
 I am trying to query PostgreSQL using LIMIT(n) to reduce memory usage and 
 improve query performance, but I found it took just as long as querying 
 without LIMIT, which left me confused. Does anybody know why?
 
 Thanks.
 
 Regards,
 Yi
 


Re: Is LIMIT n in Spark SQL useful?

2015-05-04 Thread Robin East
What query are you running? It may be the case that your query requires 
PostgreSQL to do a large amount of work before identifying the first n rows.
 On 4 May 2015, at 15:52, Yi Zhang zhangy...@yahoo.com.INVALID wrote:
 
 I am trying to query PostgreSQL using LIMIT(n) to reduce memory usage and 
 improve query performance, but I found it took just as long as querying 
 without LIMIT, which left me confused. Does anybody know why?
 
 Thanks.
 
 Regards,
 Yi



Re: Is LIMIT n in Spark SQL useful?

2015-05-04 Thread Robin East
and a further question - have you tried running this query in psql? What's the 
performance like there?
 On 4 May 2015, at 16:04, Robin East robin.e...@xense.co.uk wrote:
 
 What query are you running? It may be the case that your query requires 
 PostgreSQL to do a large amount of work before identifying the first n rows.
 On 4 May 2015, at 15:52, Yi Zhang zhangy...@yahoo.com.INVALID 
 mailto:zhangy...@yahoo.com.INVALID wrote:
 
 I am trying to query PostgreSQL using LIMIT(n) to reduce memory usage and 
 improve query performance, but I found it took just as long as querying 
 without LIMIT, which left me confused. Does anybody know why?
 
 Thanks.
 
 Regards,
 Yi
 



Re: GraphX path traversal

2015-03-04 Thread Robin East
Actually your Pregel code works for me:

import org.apache.spark._
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

val vertexlist = Array((1L,"One"), (2L,"Two"), (3L,"Three"), (4L,"Four"), (5L,"Five"), (6L,"Six"))
val edgelist = Array(Edge(6,5,"6 to 5"), Edge(5,4,"5 to 4"), Edge(4,3,"4 to 3"), Edge(3,2,"3 to 2"), Edge(2,1,"2 to 1"))
val vertices: RDD[(VertexId, String)] =  sc.parallelize(vertexlist)
val edges = sc.parallelize(edgelist)
val graph = Graph(vertices, edges)


val parentGraph = Pregel(
  graph.mapVertices((id, attr) => Set[VertexId]()),
  Set[VertexId](),
  Int.MaxValue,
  EdgeDirection.Out)(
  (id, attr, msg) => msg ++ attr,
  edge =>
    if (edge.srcId != edge.dstId)
      Iterator((edge.dstId, edge.srcAttr + edge.srcId))
    else Iterator.empty,
  (a, b) => a ++ b)
parentGraph.vertices.collect.foreach(println(_))

Output:

(4,Set(6, 5))
(1,Set(5, 6, 2, 3, 4))
(5,Set(6))
(6,Set())
(2,Set(6, 5, 4, 3))
(3,Set(6, 5, 4))

Maybe your data.csv has edges the wrong way round
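
If that is the case, a quick check (a sketch, no change to the Pregel code needed) is to flip every edge and re-run the same computation:

// swaps srcId and dstId on every edge; run the Pregel call above on this instead of graph
val reversedGraph = graph.reverse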

Robin

 On 3 Mar 2015, at 16:32, Madabhattula Rajesh Kumar mrajaf...@gmail.com 
 wrote:
 
 Hi,
 
 I have tried the below program using the Pregel API but I'm not able to get my 
 required output. I'm getting exactly the reverse of the output I'm expecting. 
 
 // Creating the graph from the edge file mentioned in the earlier mail
  val graph: Graph[Int, Int] = GraphLoader.edgeListFile(sc, 
    "/home/rajesh/Downloads/graphdata/data.csv").cache()
 
  val parentGraph = Pregel(
    graph.mapVertices((id, attr) => Set[VertexId]()),
    Set[VertexId](),
    Int.MaxValue,
    EdgeDirection.Out)(
    (id, attr, msg) => msg ++ attr,
    edge =>
      if (edge.srcId != edge.dstId)
        Iterator((edge.dstId, edge.srcAttr + edge.srcId))
      else Iterator.empty,
    (a, b) => a ++ b)
 parentGraph.vertices.collect.foreach(println(_))
 
 Output :
 
 (4,Set(1, 2, 3))
 (1,Set())
 (6,Set(5, 1, 2, 3, 4))
 (3,Set(1, 2))
 (5,Set(1, 2, 3, 4))
 (2,Set(1))
 
 But I'm looking below output. 
 
 (4,Set(5, 6))
 (1,Set(2, 3, 4, 5, 6))
 (6,Set())
 (3,Set(4, 5, 6))
 (5,Set(6))
 (2,Set(3, 4, 5, 6))
 
 Could you please point out where I'm going wrong?
 
 Regards,
 Rajesh
  
 
 On Tue, Mar 3, 2015 at 8:42 PM, Madabhattula Rajesh Kumar 
 mrajaf...@gmail.com mailto:mrajaf...@gmail.com wrote:
 Hi Robin,
 
 Thank you for your response. Please find below my question. I have a below 
 edge file
 
 Source Vertex Destination Vertex
 1 2
 2 3
 3 4
 4 5
 5 6
 6 6
 
 In this graph the 1st vertex is connected to the 2nd vertex, the 2nd vertex is connected 
 to the 3rd vertex, and so on. The 6th vertex is connected to itself, so the 6th vertex is 
 the root node. Please find the graph below.
 
 image.png
 In this graph, how can I compute the 1st vertex's parents as 2,3,4,5,6, the 2nd vertex's 
 parents as 3,4,5,6, and the 6th vertex's parent as 6, since it is the root node?
 
 I'm planning to use the Pregel API but I'm not able to define the messages and the vertex 
 program in that API. Could you please help me with this?
 
 Please let me know if you need more information.
 
 Regards,
 Rajesh
 
 
 On Tue, Mar 3, 2015 at 8:15 PM, Robin East robin.e...@xense.co.uk 
 mailto:robin.e...@xense.co.uk wrote:
 Rajesh
 
 I'm not sure if I can help you, however I don't even understand the question. 
 Could you restate what you are trying to do.
 
 Sent from my iPhone
 
 On 2 Mar 2015, at 11:17, Madabhattula Rajesh Kumar mrajaf...@gmail.com 
 mailto:mrajaf...@gmail.com wrote:
 
 Hi,
 
 I have the edge list below. How do I find the parent path for every vertex?
 
 Example :
 
 Vertex 1 path : 2, 3, 4, 5, 6
 Vertex 2 path : 3, 4, 5, 6
 Vertex 3 path : 4,5,6
 vertex 4 path : 5,6
 vertex 5 path : 6
 
 Could you please let me know how to do this? (or) Any suggestion
 
 Source Vertex    Destination Vertex
 1    2
 2    3
 3    4
 4    5
 5    6
 
 Regards,
 Rajesh
 
 



Re: PRNG in Scala

2015-03-03 Thread Robin East
And this SO post goes into details on the PRNG in Java

http://stackoverflow.com/questions/9907303/does-java-util-random-implementation-differ-between-jres-or-platforms
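
A quick way to see the delegation for yourself (a minimal sketch):

import scala.util.Random

// scala.util.Random wraps java.util.Random, so the same seed gives the same sequence
val scalaRng = new Random(42L)
val javaRng  = new java.util.Random(42L)
assert(scalaRng.nextInt() == javaRng.nextInt())
assert(scalaRng.nextDouble() == javaRng.nextDouble())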

 On 3 Mar 2015, at 16:15, Robin East robin.e...@xense.co.uk wrote:
 
 This is more of a java/scala question than spark - it uses java.util.Random : 
 https://github.com/scala/scala/blob/2.11.x/src/library/scala/util/Random.scala
  
 https://github.com/scala/scala/blob/2.11.x/src/library/scala/util/Random.scala
 
 
 On 3 Mar 2015, at 15:08, Vijayasarathy Kannan kvi...@vt.edu 
 mailto:kvi...@vt.edu wrote:
 
 Hi,
 
 What pseudo-random-number generator does scala.util.Random uses?
 



Re: GraphX path traversal

2015-03-03 Thread Robin East
Have you tried EdgeDirection.In?
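Something along these lines, i.e. sending the message towards the source rather than the destination (a sketch, untested against your data.csv):

val ancestorGraph = Pregel(
  graph.mapVertices((id, attr) => Set[VertexId]()),
  Set[VertexId](),
  Int.MaxValue,
  EdgeDirection.In)(
  (id, attr, msg) => msg ++ attr,
  edge =>
    if (edge.srcId != edge.dstId)
      // flow the destination's set (plus the destination itself) back to the source
      Iterator((edge.srcId, edge.dstAttr + edge.dstId))
    else Iterator.empty,
  (a, b) => a ++ b)
ancestorGraph.vertices.collect.foreach(println(_))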
 On 3 Mar 2015, at 16:32, Robin East robin.e...@xense.co.uk wrote:
 
 What about the following which can be run in spark shell:
 
 import org.apache.spark._
 import org.apache.spark.graphx._
 import org.apache.spark.rdd.RDD
 
 val vertexlist = Array((1L,"One"), (2L,"Two"), (3L,"Three"), (4L,"Four"), (5L,"Five"), (6L,"Six"))
 val edgelist = Array(Edge(6,5,"6 to 5"), Edge(5,4,"5 to 4"), Edge(4,3,"4 to 3"), Edge(3,2,"3 to 2"), Edge(2,1,"2 to 1"))
 val vertices: RDD[(VertexId, String)] =  sc.parallelize(vertexlist)
 val edges = sc.parallelize(edgelist)
 val graph = Graph(vertices, edges)
 
 val triplets = graph.triplets
 
 triplets.foreach(t => println(s"parent for ${t.dstId} is ${t.srcId}"))
 
 It doesn’t set vertex 6 to have parent 6 but you get the idea.
 
 It doesn’t use Pregel but that sounds like overkill for what you are trying 
 to achieve.
 
 Does that answer your question or were you after something different?
 
 
 
 On 3 Mar 2015, at 15:12, Madabhattula Rajesh Kumar mrajaf...@gmail.com 
 mailto:mrajaf...@gmail.com wrote:
 
 Hi Robin,
 
 Thank you for your response. Please find below my question. I have a below 
 edge file
 
 Source Vertex    Destination Vertex
 1    2
 2    3
 3    4
 4    5
 5    6
 6    6
 
 In this graph the 1st vertex is connected to the 2nd vertex, the 2nd vertex is connected 
 to the 3rd vertex, and so on. The 6th vertex is connected to itself, so the 6th vertex is 
 the root node. Please find the graph below.
 
 image.png
 In this graph, how can I compute the 1st vertex's parents as 2,3,4,5,6, the 2nd vertex's 
 parents as 3,4,5,6, and the 6th vertex's parent as 6, since it is the root node?
 
 I'm planning to use the Pregel API but I'm not able to define the messages and the vertex 
 program in that API. Could you please help me with this?
 
 Please let me know if you need more information.
 
 Regards,
 Rajesh
 
 
 On Tue, Mar 3, 2015 at 8:15 PM, Robin East robin.e...@xense.co.uk 
 mailto:robin.e...@xense.co.uk wrote:
 Rajesh
 
 I'm not sure if I can help you, however I don't even understand the 
 question. Could you restate what you are trying to do.
 
 Sent from my iPhone
 
 On 2 Mar 2015, at 11:17, Madabhattula Rajesh Kumar mrajaf...@gmail.com 
 mailto:mrajaf...@gmail.com wrote:
 
 Hi,
 
 I have the edge list below. How do I find the parent path for every vertex?
 
 Example :
 
 Vertex 1 path : 2, 3, 4, 5, 6
 Vertex 2 path : 3, 4, 5, 6
 Vertex 3 path : 4,5,6
 vertex 4 path : 5,6
 vertex 5 path : 6
 
 Could you please let me know how to do this? (or) Any suggestion
 
 Source Vertex   Destination Vertex
 1   2
 2   3
 3   4
 4   5
 5   6
 
 Regards,
 Rajesh



Re: PRNG in Scala

2015-03-03 Thread Robin East
This is more of a java/scala question than spark - it uses java.util.Random : 
https://github.com/scala/scala/blob/2.11.x/src/library/scala/util/Random.scala


 On 3 Mar 2015, at 15:08, Vijayasarathy Kannan kvi...@vt.edu wrote:
 
 Hi,
 
 What pseudo-random-number generator does scala.util.Random uses?



Re: GraphX path traversal

2015-03-03 Thread Robin East
What about the following which can be run in spark shell:

import org.apache.spark._
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

val vertexlist = Array((1L,"One"), (2L,"Two"), (3L,"Three"), (4L,"Four"), (5L,"Five"), (6L,"Six"))
val edgelist = Array(Edge(6,5,"6 to 5"), Edge(5,4,"5 to 4"), Edge(4,3,"4 to 3"), Edge(3,2,"3 to 2"), Edge(2,1,"2 to 1"))
val vertices: RDD[(VertexId, String)] =  sc.parallelize(vertexlist)
val edges = sc.parallelize(edgelist)
val graph = Graph(vertices, edges)

val triplets = graph.triplets

triplets.foreach(t => println(s"parent for ${t.dstId} is ${t.srcId}"))

It doesn’t set vertex 6 to have parent 6 but you get the idea.

It doesn’t use Pregel but that sounds like overkill for what you are trying to 
achieve.

Does that answer your question or were you after something different?



 On 3 Mar 2015, at 15:12, Madabhattula Rajesh Kumar mrajaf...@gmail.com 
 wrote:
 
 Hi Robin,
 
 Thank you for your response. Please find below my question. I have a below 
 edge file
 
 Source Vertex Destination Vertex
 1 2
 2 3
 3 4
 4 5
 5 6
 6 6
 
 In this graph the 1st vertex is connected to the 2nd vertex, the 2nd vertex is connected 
 to the 3rd vertex, and so on. The 6th vertex is connected to itself, so the 6th vertex is 
 the root node. Please find the graph below.
 
 image.png
 In this graph, how can I compute the 1st vertex's parents as 2,3,4,5,6, the 2nd vertex's 
 parents as 3,4,5,6, and the 6th vertex's parent as 6, since it is the root node?
 
 I'm planning to use the Pregel API but I'm not able to define the messages and the vertex 
 program in that API. Could you please help me with this?
 
 Please let me know if you need more information.
 
 Regards,
 Rajesh
 
 
 On Tue, Mar 3, 2015 at 8:15 PM, Robin East robin.e...@xense.co.uk 
 mailto:robin.e...@xense.co.uk wrote:
 Rajesh
 
 I'm not sure if I can help you, however I don't even understand the question. 
 Could you restate what you are trying to do.
 
 Sent from my iPhone
 
 On 2 Mar 2015, at 11:17, Madabhattula Rajesh Kumar mrajaf...@gmail.com 
 mailto:mrajaf...@gmail.com wrote:
 
 Hi,
 
 I have the edge list below. How do I find the parent path for every vertex?
 
 Example :
 
 Vertex 1 path : 2, 3, 4, 5, 6
 Vertex 2 path : 3, 4, 5, 6
 Vertex 3 path : 4,5,6
 vertex 4 path : 5,6
 vertex 5 path : 6
 
 Could you please let me know how to do this? (or) Any suggestion
 
 Source Vertex    Destination Vertex
 1    2
 2    3
 3    4
 4    5
 5    6
 
 Regards,
 Rajesh
 



Re: GraphX path traversal

2015-03-03 Thread Robin East
Rajesh

I'm not sure if I can help you, however I don't even understand the question. 
Could you restate what you are trying to do.

Sent from my iPhone

 On 2 Mar 2015, at 11:17, Madabhattula Rajesh Kumar mrajaf...@gmail.com 
 wrote:
 
 Hi,
 
 I have the edge list below. How do I find the parent path for every vertex?
 
 Example :
 
 Vertex 1 path : 2, 3, 4, 5, 6
 Vertex 2 path : 3, 4, 5, 6
 Vertex 3 path : 4,5,6
 vertex 4 path : 5,6
 vertex 5 path : 6
 
 Could you please let me know how to do this? (or) Any suggestion
 
 Source Vertex Destination Vertex
 1 2
 2 3
 3 4
 4 5
 5 6
 
 Regards,
 Rajesh


Re: Spark SQL odbc on Windows

2015-02-23 Thread Robin East
Have you looked at Kylin? 
http://www.ebaytechblog.com/2014/10/20/announcing-kylin-extreme-olap-engine-for-big-data/#.VOtXUUsqnUk

Pretty new but has the backing of eBay.


On 23 Feb 2015, at 15:38, Denny Lee denny.g@gmail.com wrote:

 Makes complete sense - I became a fan of Spark for pretty much the same 
 reasons.  Best of luck, eh?!  
 
 On Mon Feb 23 2015 at 12:08:49 AM Francisco Orchard forch...@gmail.com 
 wrote:
 Hi Denny  Ashic,
 
 You are pointing us in the right direction. Thanks! 
 
 We will try following your advice and provide feeback to the list.
 
 Regarding your question, Denny: we feel MS is lacking a scalable solution 
 for SSAS (tabular or multidimensional), so when it comes to big data the only answer 
 they have is their expensive appliance (APS), which can be used as a ROLAP 
 engine. We are interested in testing how Spark scales, to check whether it can 
 be offered as a less expensive alternative when a single machine is not 
 enough for our client's needs. The reason we do not go with tabular in the 
 first place is that its ROLAP mode (DirectQuery) is still too limited. 
 And thanks for writing the Klout paper!! We were already using it as a 
 guideline for our tests.
 
 Best regards,
 Francisco
 From: Denny Lee
 Sent: ‎22/‎02/‎2015 17:56
 To: Ashic Mahtab; Francisco Orchard; Apache Spark
 Subject: Re: Spark SQL odbc on Windows
 
 Back to thrift, there was an earlier thread on this topic at 
 http://mail-archives.apache.org/mod_mbox/spark-user/201411.mbox/%3CCABPQxsvXA-ROPeXN=wjcev_n9gv-drqxujukbp_goutvnyx...@mail.gmail.com%3E
  that may be useful as well.
 
 On Sun Feb 22 2015 at 8:42:29 AM Denny Lee denny.g@gmail.com wrote:
 Hi Francisco,
 
 Out of curiosity - why ROLAP mode using multi-dimensional mode (vs tabular) 
 from SSAS to Spark? As a past SSAS guy you've definitely piqued my interest. 
 
 The one thing that you may run into is that the SQL generated by SSAS can be 
 quite convoluted. When we were doing the same thing to try to get SSAS to 
 connect to Hive (ref paper at 
 http://download.microsoft.com/download/D/2/0/D20E1C5F-72EA-4505-9F26-FEF9550EFD44/MOLAP2HIVE_KLOUT.docx)
  that was definitely a blocker. Note that Spark SQL is different than HIVEQL 
 but you may run into the same issue. If so, the trick you may want to use is 
 similar to the paper - use a SQL Server linked server connection and have SQL 
 Server be your translator for the SQL generated by SSAS. 
 
 HTH!
 Denny
 
 On Sun, Feb 22, 2015 at 01:44 Ashic Mahtab as...@live.com wrote:
 Hi Francisco,
 While I haven't tried this, have a look at the contents of 
 start-thriftserver.sh - all it's doing is setting up a few variables and 
 calling:
 
 /bin/spark-submit --class 
 org.apache.spark.sql.hive.thriftserver.HiveThriftServer2
 
 and passing some additional parameters. Perhaps doing the same would work?
 
 I also believe that this hosts a jdbc server (not odbc), but there's a free 
 odbc connector from databricks built by Simba, with which I've been able to 
 connect to a spark cluster hosted on linux.
 
 -Ashic.
 
 To: user@spark.apache.org
 From: forch...@gmail.com
 Subject: Spark SQL odbc on Windows
 Date: Sun, 22 Feb 2015 09:45:03 +0100
 
 
 Hello, 
 I work at an MS consulting company and we are evaluating including Spark in 
 our big data offering. We are particularly interested in testing Spark as a ROLAP 
 engine for SSAS, but we cannot find a way to activate the ODBC server (Thrift) 
 on a Windows cluster. There is no start-thriftserver.sh command available for 
 Windows. 
 
 Does somebody know if there is a way to make this work? 
 
 Thanks in advance!!
 Francisco



Re: obtain cluster assignment in K-means

2015-02-12 Thread Robin East
KMeans.train actually returns a KMeansModel so you can use predict() method of 
the model 

e.g. clusters.predict(pointToPredict)
or

clusters.predict(pointsToPredict)

first is a single Vector, 2nd is RDD[Vector]
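
For example (a sketch, reusing the parsedData and clusters values from your snippet):

// pair each training point with its predicted cluster and look at the cluster sizes
val assignments = parsedData.map(point => (clusters.predict(point), point))
assignments.countByKey().foreach { case (clusterId, n) =>
  println(s"cluster $clusterId has $n points")
}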

Robin
On 12 Feb 2015, at 06:37, Shi Yu shiyu@gmail.com wrote:

 Hi there,
 
 I am new to spark.  When training a model using K-means using the following 
 code, how do I obtain the cluster assignment in the next step?
 
 
 val clusters = KMeans.train(parsedData, numClusters, numIterations)
 
 
  I searched around many examples but they mostly calculate the WSSSE. I am 
 still confused. 
 
 Thanks!
  
 Eilian
 
 



Re: is there a master for spark cluster in ec2

2015-02-02 Thread Robin East
There is a file $SPARK_HOME/conf/spark-env.sh which comes readily configured 
with the MASTER variable. So if you start pyspark or spark-shell from the ec2 
login machine you will connect to the Spark master.
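
If you do want to set it explicitly in code rather than rely on spark-env.sh (a sketch; the host name is a placeholder, use the spark:// URL shown on the master web UI):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("ec2-test")                        // placeholder app name
  .setMaster("spark://<master-hostname>:7077")   // placeholder: take this from the master web UI
val sc = new SparkContext(conf)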


On 29 Jan 2015, at 01:11, Mohit Singh mohit1...@gmail.com wrote:

 Hi,
   Probably a naive question.. But I am creating a spark cluster on ec2 using 
 the ec2 scripts in there..
 But is there a master param I need to set..
 ./bin/pyspark --master [ ] ??
 I don't yet fully understand the ec2 concepts so just wanted to confirm this??
 Thanks
 
 -- 
 Mohit
 
 When you want success as badly as you want the air, then you will get it. 
 There is no other secret of success.
 -Socrates





Re: Is Apache Spark less accurate than Scikit Learn?

2015-01-22 Thread Robin East
Hi

There are many different variants of gradient descent mostly dealing with what 
the step size is and how it might be adjusted as the algorithm proceeds. Also 
if it uses a stochastic variant (as opposed to batch descent) then there are 
variations there too. I don’t know off-hand what MLlib’s detailed 
implementation is but no doubt there are differences between the two - perhaps 
someone with more knowledge of the internals could comment.

As you can tell from playing around with the parameters, step size is vitally 
important to the performance of the algorithm.
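
To make that concrete, a sketch of the kind of tweaking involved (assuming `points` is the RDD[LabeledPoint] built from the generated z = x + y data; the values are illustrative):

import org.apache.spark.mllib.regression.LinearRegressionWithSGD

val lr = new LinearRegressionWithSGD()
lr.optimizer
  .setStepSize(0.5)        // much larger than the step used in the original script
  .setNumIterations(1200)
val model = lr.run(points)

// mean squared error on the training data
val mse = points
  .map(p => math.pow(p.label - model.predict(p.features), 2))
  .mean()
println(s"MSE = $mse")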


On 22 Jan 2015, at 06:44, Jacques Heunis jaaksem...@gmail.com wrote:

 Ah I see, thanks!
 I was just confused because given the same configuration, I would have 
 thought that Spark and Scikit would give more similar results, but I guess 
 this is simply not the case (as in your example, in order to get spark to 
 give an mse sufficiently close to scikit's you have to give it a 
 significantly larger step and iteration count).
 
 Would that then be a result of MLLib and Scikit differing slightly in their 
 exact implementation of the optimizer? Or rather a case of (as you say) 
 Scikit being a far more mature system (and therefore that MLLib would 'get 
 better' over time)? Surely it is far from ideal that to get the same results 
 you need more iterations (IE more computation), or do you think that that is 
 simply coincidence and that given a different model/dataset it may be the 
 other way around?
 
 I ask because I encountered this situation on other, larger datasets, so this 
 is not an isolated case (though being the simplest example I could think of I 
 would imagine that it's somewhat indicative of general behaviour)
 
 On Thu, Jan 22, 2015 at 1:57 AM, Robin East robin.e...@xense.co.uk wrote:
 I don’t get those results. I get:
 
 spark   0.14
 scikit-learn0.85
 
 The scikit-learn mse is due to the very low eta0 setting. Tweak that to 0.1 
 and push iterations to 400 and you get a mse ~= 0. Of course the coefficients 
 are both ~1 and the intercept ~0. Similarly if you change the mllib step size 
 to 0.5 and number of iterations to 1200 you again get a very low mse. One of 
 the issues with SGD is you have to tweak these parameters to tune the 
 algorithm.
 
 FWIW I wouldn’t see Spark MLlib as a replacement for scikit-learn. MLLib is 
 nowhere as mature as scikit learn. However if you have large datasets that 
 won’t sensibly fit the scikit-learn in-core model MLLib is one of the top 
 choices. Similarly if you are running proof of concepts that you are 
 eventually going to scale up to production environments then there is a 
 definite argument for using MLlib at both the PoC and production stages.
 
 
 On 21 Jan 2015, at 20:39, JacquesH jaaksem...@gmail.com wrote:
 
  I've recently been trying to get to know Apache Spark as a replacement for
  Scikit Learn, however it seems to me that even in simple cases, Scikit
  converges to an accurate model far faster than Spark does.
  For example I generated 1000 data points for a very simple linear function
  (z=x+y) with the following script:
 
  http://pastebin.com/ceRkh3nb
 
  I then ran the following Scikit script:
 
  http://pastebin.com/1aECPfvq
 
  And then this Spark script: (with spark-submit filename, no other
  arguments)
 
  http://pastebin.com/s281cuTL
 
  Strangely though, the error given by spark is an order of magnitude larger
  than that given by Scikit (0.185 and 0.045 respectively) despite the two
  models having a nearly identical setup (as far as I can tell)
  I understand that this is using SGD with very few iterations and so the
  results may differ but I wouldn't have thought that it would be anywhere
  near such a large difference or such a large error, especially given the
  exceptionally simple data.
 
  Is there something I'm misunderstanding in Spark? Is it not correctly
  configured? Surely I should be getting a smaller error than that?
 
 
 
  --
  View this message in context: 
  http://apache-spark-user-list.1001560.n3.nabble.com/Is-Apache-Spark-less-accurate-than-Scikit-Learn-tp21301.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
 
 
 



Re: Is Apache Spark less accurate than Scikit Learn?

2015-01-21 Thread Robin East
I don’t get those results. I get:

spark   0.14
scikit-learn0.85

The scikit-learn mse is due to the very low eta0 setting. Tweak that to 0.1 and 
push iterations to 400 and you get a mse ~= 0. Of course the coefficients are 
both ~1 and the intercept ~0. Similarly if you change the mllib step size to 
0.5 and number of iterations to 1200 you again get a very low mse. One of the 
issues with SGD is you have to tweak these parameters to tune the algorithm.

FWIW I wouldn’t see Spark MLlib as a replacement for scikit-learn. MLLib is 
nowhere as mature as scikit learn. However if you have large datasets that 
won’t sensibly fit the scikit-learn in-core model MLLib is one of the top 
choices. Similarly if you are running proof of concepts that you are eventually 
going to scale up to production environments then there is a definite argument 
for using MLlib at both the PoC and production stages.


On 21 Jan 2015, at 20:39, JacquesH jaaksem...@gmail.com wrote:

 I've recently been trying to get to know Apache Spark as a replacement for
 Scikit Learn, however it seems to me that even in simple cases, Scikit
 converges to an accurate model far faster than Spark does.
 For example I generated 1000 data points for a very simple linear function
 (z=x+y) with the following script:
 
 http://pastebin.com/ceRkh3nb
 
 I then ran the following Scikit script:
 
 http://pastebin.com/1aECPfvq
 
 And then this Spark script: (with spark-submit filename, no other
 arguments)
 
 http://pastebin.com/s281cuTL
 
 Strangely though, the error given by spark is an order of magnitude larger
 than that given by Scikit (0.185 and 0.045 respectively) despite the two
 models having a nearly identical setup (as far as I can tell)
 I understand that this is using SGD with very few iterations and so the
 results may differ but I wouldn't have thought that it would be anywhere
 near such a large difference or such a large error, especially given the
 exceptionally simple data.
 
 Is there something I'm misunderstanding in Spark? Is it not correctly
 configured? Surely I should be getting a smaller error than that?
 
 
 
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Is-Apache-Spark-less-accurate-than-Scikit-Learn-tp21301.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
 





Re: Current Build Gives HTTP ERROR

2015-01-13 Thread Robin East
I’ve just pulled down the latest commits from github, and done the following:

1)
mvn clean package -DskipTests

builds fine

2)
./bin/spark-shell works

3)
run SparkPi example with no problems:

./bin/run-example SparkPi 10

4)
Started a master 

./sbin/start-master.sh

grabbed the MasterWebUI from the master log - Started MasterWebUI at 
http://x.x.x.x:8080

Can view the MasterWebUI from local browser

5)
grabbed the spark url from the master log and started a local slave:

./bin/spark-class org.apache.spark.deploy.worker.Worker spark://hostname:7077 


6)
Ran jps to confirm both Master and Worker processes are present.

7)
Ran SparkPi on the mini-cluster:

MASTER=spark://host:7077 ./bin/run-example SparkPi 10

All worked fine, can see information in the MasterWebUI

Which of these steps doesn't work for you? I presume you've tried re-pulling 
from git and doing a clean build again.

Robin
On 13 Jan 2015, at 08:07, Ganon Pierce ganon.pie...@me.com wrote:

 After clean build still receiving the same error.
 
 
 
 On Jan 6, 2015, at 3:59 PM, Sean Owen so...@cloudera.com wrote:
 
 FWIW I do not see any such error, after a mvn -DskipTests clean package 
 and ./bin/spark-shell from master. Maybe double-check you have done a full 
 clean build.
 
 On Tue, Jan 6, 2015 at 9:09 PM, Ganon Pierce ganon.pie...@me.com wrote:
 I’m attempting to build from the latest commit on git and receive the 
 following error upon attempting to access the application web ui:
 
 HTTP ERROR: 500
 
 Problem accessing /jobs/. Reason:
 
 Server Error
 Powered by Jetty://
 
 My driver also prints this error:
 
 java.lang.UnsupportedOperationException: empty.max
  at scala.collection.TraversableOnce$class.max(TraversableOnce.scala:216)
  at scala.collection.AbstractTraversable.max(Traversable.scala:105)
  at 
 org.apache.spark.ui.jobs.AllJobsPage.org$apache$spark$ui$jobs$AllJobsPage$$makeRow$1(AllJobsPage.scala:46)
  at 
 org.apache.spark.ui.jobs.AllJobsPage$$anonfun$jobsTable$1.apply(AllJobsPage.scala:91)
  at 
 org.apache.spark.ui.jobs.AllJobsPage$$anonfun$jobsTable$1.apply(AllJobsPage.scala:91)
  at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
  at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
  at scala.collection.immutable.List.foreach(List.scala:318)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
  at scala.collection.AbstractTraversable.map(Traversable.scala:105)
  at org.apache.spark.ui.jobs.AllJobsPage.jobsTable(AllJobsPage.scala:91)
  at org.apache.spark.ui.jobs.AllJobsPage.render(AllJobsPage.scala:106)
  at org.apache.spark.ui.WebUI$$anonfun$attachPage$1.apply(WebUI.scala:68)
  at org.apache.spark.ui.WebUI$$anonfun$attachPage$1.apply(WebUI.scala:68)
  at org.apache.spark.ui.JettyUtils$$anon$1.doGet(JettyUtils.scala:68)
  at javax.servlet.http.HttpServlet.service(HttpServlet.java:735)
  at javax.servlet.http.HttpServlet.service(HttpServlet.java:848)
  at 
 org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:684)
  at 
 org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:501)
  at 
 org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086)
  at 
 org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:428)
  at 
 org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020)
  at 
 org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
  at 
 org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
  at 
 org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
  at org.eclipse.jetty.server.Server.handle(Server.java:370)
  at 
 org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
  at 
 org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:971)
  at 
 org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1033)
  at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:644)
  at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
  at 
 org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
  at 
 org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:667)
  at 
 org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:52)
  at 
 org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
  at 
 org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
  at java.lang.Thread.run(Thread.java:745)
 
 
 Has the ui been disabled intentionally for development purposes, have I not 
 set something up correctly, or is this a bug?