date:20150628

Re: Accessing Kerberos Secured HDFS Resources from Spark on Mesos

2015-06-28 Thread Iulian Dragoș

This is something we (at Typesafe) also thought about, but didn't start
yet. It would be good to pool efforts.

On Sat, Jun 27, 2015 at 12:44 AM, Dave Ariens dari...@blackberry.com
wrote:

  Fair. I will look into an alternative with a generated delegation token.
   However the same issue exists.   How can I have the executor run some
 arbitrary code when it gets a task assignment and before it proceeds to
 process it's resources?

 *From: *Marcelo Vanzin
 *Sent: *Friday, June 26, 2015 6:20 PM
 *To: *Dave Ariens
 *Cc: *Tim Chen; Olivier Girardot; user@spark.apache.org
 *Subject: *Re: Accessing Kerberos Secured HDFS Resources from Spark on
 Mesos

   On Fri, Jun 26, 2015 at 3:09 PM, Dave Ariens dari...@blackberry.com
 wrote:

  Would there be any way to have the task instances in the slaves call
 the UGI login with a principal/keytab provided to the driver?


  That would only work with a very small number of executors. If you have
 many login requests in a short period of time with the same principal, the
 KDC will start to deny logins. That's why delegation tokens are used
 instead of explicit logins.

  --
 Marcelo




-- 

--
Iulian Dragos

--
Reactive Apps on the JVM
www.typesafe.com

Re: Spark-Submit / Spark-Shell Error Standalone cluster

2015-06-28 Thread Tomas Hudik

/usr/bin/ - looks like strange directory. Did you copy some files to
/usr/bin yourself?
If you download (possible compile) spark - it will never be placed into
/usr/bin

On Sun, Jun 28, 2015 at 9:19 AM, Wojciech Pituła w.pit...@gmail.com wrote:

 I assume that /usr/bin/load-spark-env.sh exists. Have you got the rights
 to execute it?

 niedz., 28.06.2015 o 04:53 użytkownik Ashish Soni asoni.le...@gmail.com
 napisał:

 Not sure what is the issue but when i run the spark-submit or spark-shell
 i am getting below error

 /usr/bin/spark-class: line 24: /usr/bin/load-spark-env.sh: No such file
 or directory

 Can some one please help

 Thanks,

Re: Spark-Submit / Spark-Shell Error Standalone cluster

2015-06-28 Thread Wojciech Pituła

I assume that /usr/bin/load-spark-env.sh exists. Have you got the rights to
execute it?

niedz., 28.06.2015 o 04:53 użytkownik Ashish Soni asoni.le...@gmail.com
napisał:

 Not sure what is the issue but when i run the spark-submit or spark-shell
 i am getting below error

 /usr/bin/spark-class: line 24: /usr/bin/load-spark-env.sh: No such file or
 directory

 Can some one please help

 Thanks,

problem for submitting job

2015-06-28 Thread 郭谦

HI,

I'm a junior user of spark from China.

I have a problem about submit spark job right now. I want to submit job
from code.

In other words ,How to submit spark job from within java program to yarn
cluster without using spark-submit


   I've learnt from official site
http://spark.apache.org/docs/latest/submitting-applications.html

that using  bin/spark-submit script to submit a job to cluster is easy .


   Because the script may does lots of complex work such as setting up
the classpath with Spark and its dependencies.

If I don't use the script ,I have to deal with all complex work by
myself.It makes me feel really frustrated.


   I have search this problem from Google,but the answers may not suit
for me .


   In hadoop developing ,I know that after setting up Configuration
,Job and resources ,

we can submit hadoop job by coding like this:

job.waitForCompletion

It is convenient for users to submit job programmatically


I want to know if there is a schedule( may be in spark 1.5+?)that provide
users variety ways of submitting job like hadoop .

Like monitoring ,In the recent release spark(1.4.0) We can get statements
about spark applications by REST API right now.


Thanks  Regards

GUO QIAN

Re: Matrix Multiplication and mllib.recommendation

2015-06-28 Thread Ilya Ganelin

Ayman - it's really a question of recommending user to products vs products
to users. There will only be a difference if you're not doing All to All.
For example, if you're recommending only the Top N recommendations. Then
you may recommend only the top N products or the top N users which would be
different.
On Sun, Jun 28, 2015 at 8:34 AM Ayman Farahat ayman.fara...@yahoo.com
wrote:

 Thanks Ilya
 Is there an advantage of say partitioning by users /products when you
 train ?
 Here are two alternatives I have

 #Partition by user or Product
 tot = newrdd.map(lambda l:
 (l[1],Rating(int(l[1]),int(l[2]),l[4]))).partitionBy(50).cache()
 ratings = tot.values()
 model = ALS.train(ratings, rank, numIterations)

 #use zipwithIndex

 tot = newrdd.map(lambda l: (l[1],Rating(int(l[1]),int(l[2]),l[4])))
 bob = tot.zipWithIndex().map(lambda x : (x[1] ,x[0])).partitionBy(30)
 ratings = bob.values()
 model = ALS.train(ratings, rank, numIterations)


 On Jun 28, 2015, at 8:24 AM, Ilya Ganelin ilgan...@gmail.com wrote:

 You can also select pieces of your RDD by first doing a zipWithIndex and
 then doing a filter operation on the second element of the RDD.

 For example to select the first 100 elements :

 Val a = rdd.zipWithIndex().filter(s = 1  s  100)
 On Sat, Jun 27, 2015 at 11:04 AM Ayman Farahat 
 ayman.fara...@yahoo.com.invalid wrote:

 How do you partition by product in Python?
 the only API is partitionBy(50)

 On Jun 18, 2015, at 8:42 AM, Debasish Das debasish.da...@gmail.com
 wrote:

 Also in my experiments, it's much faster to blocked BLAS through
 cartesian rather than doing sc.union. Here are the details on the
 experiments:

 https://issues.apache.org/jira/browse/SPARK-4823

 On Thu, Jun 18, 2015 at 8:40 AM, Debasish Das debasish.da...@gmail.com
 wrote:

 Also not sure how threading helps here because Spark puts a partition to
 each core. On each core may be there are multiple threads if you are using
 intel hyperthreading but I will let Spark handle the threading.

 On Thu, Jun 18, 2015 at 8:38 AM, Debasish Das debasish.da...@gmail.com
 wrote:

 We added SPARK-3066 for this. In 1.4 you should get the code to do BLAS
 dgemm based calculation.

 On Thu, Jun 18, 2015 at 8:20 AM, Ayman Farahat 
 ayman.fara...@yahoo.com.invalid wrote:

 Thanks Sabarish and Nick
 Would you happen to have some code snippets that you can share.
 Best
 Ayman

 On Jun 17, 2015, at 10:35 PM, Sabarish Sasidharan 
 sabarish.sasidha...@manthan.com wrote:

 Nick is right. I too have implemented this way and it works just fine.
 In my case, there can be even more products. You simply broadcast blocks 
 of
 products to userFeatures.mapPartitions() and BLAS multiply in there to get
 recommendations. In my case 10K products form one block. Note that you
 would then have to union your recommendations. And if there lots of 
 product
 blocks, you might also want to checkpoint once every few times.

 Regards
 Sab

 On Thu, Jun 18, 2015 at 10:43 AM, Nick Pentreath 
 nick.pentre...@gmail.com wrote:

 One issue is that you broadcast the product vectors and then do a dot
 product one-by-one with the user vector.

 You should try forming a matrix of the item vectors and doing the dot
 product as a matrix-vector multiply which will make things a lot faster.

 Another optimisation that is avalailable on 1.4 is a
 recommendProducts method that blockifies the factors to make use of 
 level 3
 BLAS (ie matrix-matrix multiply). I am not sure if this is available in 
 The
 Python api yet.

 But you can do a version yourself by using mapPartitions over user
 factors, blocking the factors into sub-matrices and doing matrix multiply
 with item factor matrix to get scores on a block-by-block basis.

 Also as Ilya says more parallelism can help. I don't think it's so
 necessary to do LSH with 30,000 items.

 —
 Sent from Mailbox https://www.dropbox.com/mailbox


 On Thu, Jun 18, 2015 at 6:01 AM, Ganelin, Ilya 
 ilya.gane...@capitalone.com wrote:

 Actually talk about this exact thing in a blog post here
 http://blog.cloudera.com/blog/2015/05/working-with-apache-spark-or-how-i-learned-to-stop-worrying-and-love-the-shuffle/.
 Keep in mind, you're actually doing a ton of math. Even with proper 
 caching
 and use of broadcast variables this will take a while defending on the 
 size
 of your cluster. To get real results you may want to look into locality
 sensitive hashing to limit your search space and definitely look into
 spinning up multiple threads to process your product features in 
 parallel
 to increase resource utilization on the cluster.



 Thank you,
 Ilya Ganelin



 -Original Message-
 *From: *afarahat [ayman.fara...@yahoo.com]
 *Sent: *Wednesday, June 17, 2015 11:16 PM Eastern Standard Time
 *To: *user@spark.apache.org
 *Subject: *Matrix Multiplication and mllib.recommendation

 Hello;
 I am trying to get predictions after running the ALS model.
 The model works fine. In the prediction/recommendation , I have
 about 30
 ,000 products and 90

Re: Accessing Kerberos Secured HDFS Resources from Spark on Mesos

2015-06-28 Thread Steve Loughran


On 27 Jun 2015, at 07:56, Tim Chen 
t...@mesosphere.iomailto:t...@mesosphere.io wrote:

Does YARN provide the token through that env variable you mentioned? Or how 
does YARN do this?



Roughly:

1. client-side launcher creates the delegation tokens and adds them as byte[] 
data to the the request.
2. The YARN RM uses the HDFS token for the localisation, so the node managers 
can access the content the user has the rights to.
3. There's some other stuff related to token refresh of restarted app masters, 
essentially guaranteeing that even an AM restarted 3 days after the first 
launch will still have current credentials.
4. It's the duty of the launched App master to download those delegated tokens 
and make use of them. partly through the UGI stuff, also through other 
mechanisms (example, a subset of the tokens are usually passed to the launched 
containers)
5. It's also the duty of the launched AM to deal with token renewal and expiry. 
Short-lived ( 72h) apps don't have to worry about this -making the jump to 
long lived services adds a lot of extra work (which is in Spark 1.4)


Tim

On Fri, Jun 26, 2015 at 3:51 PM, Marcelo Vanzin 
van...@cloudera.commailto:van...@cloudera.com wrote:
On Fri, Jun 26, 2015 at 3:44 PM, Dave Ariens 
dari...@blackberry.commailto:dari...@blackberry.com wrote:
Fair. I will look into an alternative with a generated delegation token.   
However the same issue exists.   How can I have the executor run some arbitrary 
code when it gets a task assignment and before it proceeds to process it's 
resources?

Hmm, good question. If it doesn't already, Mesos could have its own 
implementation of CoarseGrainedExecutorBackend that provides that 
functionality. The only difference is that you'd run something before the 
executor starts up, not before each task.

YARN actually doesn't do it that way; YARN provides the tokens to the executor 
before the process starts, so that when you call 
UserGroupInformation.getCurrentUser() the tokens are already there.

One way of doing that is by writing the tokens to a file and setting the 
KRB5CCNAME env variable when starting the process. You can check the Hadoop 
sources for details. Not sure if there's another way.



From: Marcelo Vanzin
Sent: Friday, June 26, 2015 6:20 PM
To: Dave Ariens
Cc: Tim Chen; Olivier Girardot; 
user@spark.apache.orgmailto:user@spark.apache.org
Subject: Re: Accessing Kerberos Secured HDFS Resources from Spark on Mesos


On Fri, Jun 26, 2015 at 3:09 PM, Dave Ariens 
dari...@blackberry.commailto:dari...@blackberry.com wrote:
Would there be any way to have the task instances in the slaves call the UGI 
login with a principal/keytab provided to the driver?

That would only work with a very small number of executors. If you have many 
login requests in a short period of time with the same principal, the KDC will 
start to deny logins. That's why delegation tokens are used instead of explicit 
logins.

--
Marcelo



--
Marcelo

Re: required: org.apache.spark.streaming.dstream.DStream[org.apache.spark.mllib.linalg.Vector]

2015-06-28 Thread Ted Yu

Can you show us your code around line 100 ?

Which Spark release are you compiling against ?

Cheers

On Sun, Jun 28, 2015 at 5:49 AM, Arthur Chan arthur.hk.c...@gmail.com
wrote:

 Hi,

 I am trying Spark with some sample programs,


 In my code, the following items are imported:

 import
 org.apache.spark.mllib.regression.{StreamingLinearRegressionWithSGD,
 LabeledPoint}

 import org.apache.spark.mllib.regression.{StreamingLinearRegressionWithSGD}

 import org.apache.spark.streaming.{Seconds, StreamingContext}

 import scala.util.Random

 I got following error:

 [error] StreamingModel.scala:100: type mismatch;

 [error]  found   :
 org.apache.spark.streaming.dstream.DStream[org.apache.spark.mllib.regression.LabeledPoint]

 [error]  required:
 org.apache.spark.streaming.dstream.DStream[org.apache.spark.mllib.linalg.Vector]

 [error] model.predictOn(labeledStream).print()

 [error] ^

 [error] one error found

 [error] (compile:compile) Compilation failed


 Any idea?


 Regards

Re: Matrix Multiplication and mllib.recommendation

2015-06-28 Thread Ilya Ganelin

Oops - code should be :

Val a = rdd.zipWithIndex().filter(s = 1 s._2 100)

On Sun, Jun 28, 2015 at 8:24 AM Ilya Ganelin ilgan...@gmail.com wrote:

You can also select pieces of your RDD by first doing a zipWithIndex and
then doing a filter operation on the second element of the RDD.

For example to select the first 100 elements :

Val a = rdd.zipWithIndex().filter(s = 1 s 100)
On Sat, Jun 27, 2015 at 11:04 AM Ayman Farahat
ayman.fara...@yahoo.com.invalid wrote:

How do you partition by product in Python?
the only API is partitionBy(50)

On Jun 18, 2015, at 8:42 AM, Debasish Das debasish.da...@gmail.com
wrote:

Also in my experiments, it's much faster to blocked BLAS through
cartesian rather than doing sc.union. Here are the details on the
experiments:

https://issues.apache.org/jira/browse/SPARK-4823

On Thu, Jun 18, 2015 at 8:40 AM, Debasish Das debasish.da...@gmail.com
wrote:

Also not sure how threading helps here because Spark puts a partition to
each core. On each core may be there are multiple threads if you are using
intel hyperthreading but I will let Spark handle the threading.

On Thu, Jun 18, 2015 at 8:38 AM, Debasish Das debasish.da...@gmail.com
wrote:

We added SPARK-3066 for this. In 1.4 you should get the code to do BLAS
dgemm based calculation.

On Thu, Jun 18, 2015 at 8:20 AM, Ayman Farahat
ayman.fara...@yahoo.com.invalid wrote:

Thanks Sabarish and Nick
Would you happen to have some code snippets that you can share.
Best
Ayman

On Jun 17, 2015, at 10:35 PM, Sabarish Sasidharan
sabarish.sasidha...@manthan.com wrote:

Nick is right. I too have implemented this way and it works just fine.
In my case, there can be even more products. You simply broadcast blocks
of
products to userFeatures.mapPartitions() and BLAS multiply in there to get
recommendations. In my case 10K products form one block. Note that you
would then have to union your recommendations. And if there lots of
product
blocks, you might also want to checkpoint once every few times.

Regards
Sab

On Thu, Jun 18, 2015 at 10:43 AM, Nick Pentreath
nick.pentre...@gmail.com wrote:

One issue is that you broadcast the product vectors and then do a dot
product one-by-one with the user vector.

You should try forming a matrix of the item vectors and doing the dot
product as a matrix-vector multiply which will make things a lot faster.

Another optimisation that is avalailable on 1.4 is a
recommendProducts method that blockifies the factors to make use of
level 3
BLAS (ie matrix-matrix multiply). I am not sure if this is available in
The
Python api yet.

But you can do a version yourself by using mapPartitions over user
factors, blocking the factors into sub-matrices and doing matrix multiply
with item factor matrix to get scores on a block-by-block basis.

Also as Ilya says more parallelism can help. I don't think it's so
necessary to do LSH with 30,000 items.

—
Sent from Mailbox https://www.dropbox.com/mailbox

On Thu, Jun 18, 2015 at 6:01 AM, Ganelin, Ilya
ilya.gane...@capitalone.com wrote:

Actually talk about this exact thing in a blog post here
http://blog.cloudera.com/blog/2015/05/working-with-apache-spark-or-how-i-learned-to-stop-worrying-and-love-the-shuffle/.
Keep in mind, you're actually doing a ton of math. Even with proper
caching
and use of broadcast variables this will take a while defending on the
size
of your cluster. To get real results you may want to look into locality
sensitive hashing to limit your search space and definitely look into
spinning up multiple threads to process your product features in
parallel
to increase resource utilization on the cluster.

Thank you,
Ilya Ganelin

-Original Message-
*From: *afarahat [ayman.fara...@yahoo.com]
*Sent: *Wednesday, June 17, 2015 11:16 PM Eastern Standard Time
*To: *user@spark.apache.org
*Subject: *Matrix Multiplication and mllib.recommendation

Hello;
I am trying to get predictions after running the ALS model.
The model works fine. In the prediction/recommendation , I have
about 30
,000 products and 90 Millions users.
When i try the predict all it fails.
I have been trying to formulate the problem as a Matrix
multiplication where
I first get the product features, broadcast them and then do a dot
product.
Its still very slow. Any reason why
here is a sample code

def doMultiply(x):
a = []
#multiply by
mylen = len(pf.value)
for i in range(mylen) :
myprod = numpy.dot(x,pf.value[i][1])
a.append(myprod)
return a

myModel = MatrixFactorizationModel.load(sc, FlurryModelPath)
#I need to select which products to broadcast but lets try all
m1 = myModel.productFeatures().sample(False, 0.001)
pf = sc.broadcast(m1.collect())
uf = myModel.userFeatures()
f1 = uf.map(lambda x : (x[0], doMultiply(x[1])))

--
View this message in context:

What does Spark is not just MapReduce mean? Isn't every Spark job a form of MapReduce?

2015-06-28 Thread YaoPau

I've heard Spark is not just MapReduce mentioned during Spark talks, but it
seems like every method that Spark has is really doing something like (Map
- Reduce) or (Map - Map - Map - Reduce) etc behind the scenes, with the
performance benefit of keeping RDDs in memory between stages.

Am I wrong about that?  Is Spark doing anything more efficiently than a
series of Maps followed by a Reduce in memory?  What methods does Spark have
that can't easily be mapped (with somewhat similar efficiency) to Map and
Reduce in memory?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/What-does-Spark-is-not-just-MapReduce-mean-Isn-t-every-Spark-job-a-form-of-MapReduce-tp23518.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: Matrix Multiplication and mllib.recommendation

2015-06-28 Thread Ilya Ganelin

You can also select pieces of your RDD by first doing a zipWithIndex and
then doing a filter operation on the second element of the RDD.

For example to select the first 100 elements :

Val a = rdd.zipWithIndex().filter(s = 1 s 100)
On Sat, Jun 27, 2015 at 11:04 AM Ayman Farahat
ayman.fara...@yahoo.com.invalid wrote:

How do you partition by product in Python?
the only API is partitionBy(50)

On Jun 18, 2015, at 8:42 AM, Debasish Das debasish.da...@gmail.com
wrote:

Also in my experiments, it's much faster to blocked BLAS through cartesian
rather than doing sc.union. Here are the details on the experiments:

https://issues.apache.org/jira/browse/SPARK-4823

On Thu, Jun 18, 2015 at 8:40 AM, Debasish Das debasish.da...@gmail.com
wrote:

On Thu, Jun 18, 2015 at 8:38 AM, Debasish Das debasish.da...@gmail.com
wrote:

We added SPARK-3066 for this. In 1.4 you should get the code to do BLAS
dgemm based calculation.

On Thu, Jun 18, 2015 at 8:20 AM, Ayman Farahat
ayman.fara...@yahoo.com.invalid wrote:

Thanks Sabarish and Nick
Would you happen to have some code snippets that you can share.
Best
Ayman

On Jun 17, 2015, at 10:35 PM, Sabarish Sasidharan
sabarish.sasidha...@manthan.com wrote:

Nick is right. I too have implemented this way and it works just fine.
In my case, there can be even more products. You simply broadcast blocks of
products to userFeatures.mapPartitions() and BLAS multiply in there to get
recommendations. In my case 10K products form one block. Note that you
would then have to union your recommendations. And if there lots of product
blocks, you might also want to checkpoint once every few times.

Regards
Sab

On Thu, Jun 18, 2015 at 10:43 AM, Nick Pentreath
nick.pentre...@gmail.com wrote:

One issue is that you broadcast the product vectors and then do a dot
product one-by-one with the user vector.

You should try forming a matrix of the item vectors and doing the dot
product as a matrix-vector multiply which will make things a lot faster.

Another optimisation that is avalailable on 1.4 is a recommendProducts
method that blockifies the factors to make use of level 3 BLAS (ie
matrix-matrix multiply). I am not sure if this is available in The Python
api yet.

Also as Ilya says more parallelism can help. I don't think it's so
necessary to do LSH with 30,000 items.

—
Sent from Mailbox https://www.dropbox.com/mailbox

On Thu, Jun 18, 2015 at 6:01 AM, Ganelin, Ilya
ilya.gane...@capitalone.com wrote:

Actually talk about this exact thing in a blog post here
http://blog.cloudera.com/blog/2015/05/working-with-apache-spark-or-how-i-learned-to-stop-worrying-and-love-the-shuffle/.
Keep in mind, you're actually doing a ton of math. Even with proper
caching
and use of broadcast variables this will take a while defending on the
size
of your cluster. To get real results you may want to look into locality
sensitive hashing to limit your search space and definitely look into
spinning up multiple threads to process your product features in parallel
to increase resource utilization on the cluster.

Thank you,
Ilya Ganelin

Hello;
I am trying to get predictions after running the ALS model.
The model works fine. In the prediction/recommendation , I have about
30
,000 products and 90 Millions users.
When i try the predict all it fails.
I have been trying to formulate the problem as a Matrix
multiplication where
I first get the product features, broadcast them and then do a dot
product.
Its still very slow. Any reason why
here is a sample code

def doMultiply(x):
a = []
#multiply by
mylen = len(pf.value)
for i in range(mylen) :
myprod = numpy.dot(x,pf.value[i][1])
a.append(myprod)
return a

--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Matrix-Multiplication-and-mllib-recommendation-tp23384.html
Sent from the Apache Spark User List mailing list archive at
Nabble.com http://nabble.com/.

Re: Matrix Multiplication and mllib.recommendation

2015-06-28 Thread Ayman Farahat

Thanks Ilya
Is there an advantage of say partitioning by users /products when you train ?
Here are two alternatives I have 

#Partition by user or Product 
tot = newrdd.map(lambda l: 
(l[1],Rating(int(l[1]),int(l[2]),l[4]))).partitionBy(50).cache()
ratings = tot.values()
model = ALS.train(ratings, rank, numIterations)

#use zipwithIndex

tot = newrdd.map(lambda l: (l[1],Rating(int(l[1]),int(l[2]),l[4])))
bob = tot.zipWithIndex().map(lambda x : (x[1] ,x[0])).partitionBy(30)
ratings = bob.values()
model = ALS.train(ratings, rank, numIterations)


On Jun 28, 2015, at 8:24 AM, Ilya Ganelin ilgan...@gmail.com wrote:

 You can also select pieces of your RDD by first doing a zipWithIndex and then 
 doing a filter operation on the second element of the RDD. 
 
 For example to select the first 100 elements :
 
 Val a = rdd.zipWithIndex().filter(s = 1  s  100)
 On Sat, Jun 27, 2015 at 11:04 AM Ayman Farahat 
 ayman.fara...@yahoo.com.invalid wrote:
 How do you partition by product in Python?
 the only API is partitionBy(50)
 
 On Jun 18, 2015, at 8:42 AM, Debasish Das debasish.da...@gmail.com wrote:
 
 Also in my experiments, it's much faster to blocked BLAS through cartesian 
 rather than doing sc.union. Here are the details on the experiments:
 
 https://issues.apache.org/jira/browse/SPARK-4823
 
 On Thu, Jun 18, 2015 at 8:40 AM, Debasish Das debasish.da...@gmail.com 
 wrote:
 Also not sure how threading helps here because Spark puts a partition to 
 each core. On each core may be there are multiple threads if you are using 
 intel hyperthreading but I will let Spark handle the threading.  
 
 On Thu, Jun 18, 2015 at 8:38 AM, Debasish Das debasish.da...@gmail.com 
 wrote:
 We added SPARK-3066 for this. In 1.4 you should get the code to do BLAS 
 dgemm based calculation.
 
 On Thu, Jun 18, 2015 at 8:20 AM, Ayman Farahat 
 ayman.fara...@yahoo.com.invalid wrote:
 Thanks Sabarish and Nick
 Would you happen to have some code snippets that you can share. 
 Best
 Ayman
 
 On Jun 17, 2015, at 10:35 PM, Sabarish Sasidharan 
 sabarish.sasidha...@manthan.com wrote:
 
 Nick is right. I too have implemented this way and it works just fine. In 
 my case, there can be even more products. You simply broadcast blocks of 
 products to userFeatures.mapPartitions() and BLAS multiply in there to get 
 recommendations. In my case 10K products form one block. Note that you 
 would then have to union your recommendations. And if there lots of product 
 blocks, you might also want to checkpoint once every few times.
 
 Regards
 Sab
 
 On Thu, Jun 18, 2015 at 10:43 AM, Nick Pentreath nick.pentre...@gmail.com 
 wrote:
 One issue is that you broadcast the product vectors and then do a dot 
 product one-by-one with the user vector.
 
 You should try forming a matrix of the item vectors and doing the dot 
 product as a matrix-vector multiply which will make things a lot faster.
 
 Another optimisation that is avalailable on 1.4 is a recommendProducts 
 method that blockifies the factors to make use of level 3 BLAS (ie 
 matrix-matrix multiply). I am not sure if this is available in The Python 
 api yet. 
 
 But you can do a version yourself by using mapPartitions over user factors, 
 blocking the factors into sub-matrices and doing matrix multiply with item 
 factor matrix to get scores on a block-by-block basis.
 
 Also as Ilya says more parallelism can help. I don't think it's so 
 necessary to do LSH with 30,000 items.
 
 —
 Sent from Mailbox
 
 
 On Thu, Jun 18, 2015 at 6:01 AM, Ganelin, Ilya 
 ilya.gane...@capitalone.com wrote:
 
 Actually talk about this exact thing in a blog post here 
 http://blog.cloudera.com/blog/2015/05/working-with-apache-spark-or-how-i-learned-to-stop-worrying-and-love-the-shuffle/.
  Keep in mind, you're actually doing a ton of math. Even with proper 
 caching and use of broadcast variables this will take a while defending on 
 the size of your cluster. To get real results you may want to look into 
 locality sensitive hashing to limit your search space and definitely look 
 into spinning up multiple threads to process your product features in 
 parallel to increase resource utilization on the cluster.
 
 
 
 Thank you,
 Ilya Ganelin
 
 
 
 -Original Message-
 From: afarahat [ayman.fara...@yahoo.com]
 Sent: Wednesday, June 17, 2015 11:16 PM Eastern Standard Time
 To: user@spark.apache.org
 Subject: Matrix Multiplication and mllib.recommendation
 
 Hello;
 I am trying to get predictions after running the ALS model.
 The model works fine. In the prediction/recommendation , I have about 30
 ,000 products and 90 Millions users.
 When i try the predict all it fails.
 I have been trying to formulate the problem as a Matrix multiplication where
 I first get the product features, broadcast them and then do a dot product.
 Its still very slow. Any reason why
 here is a sample code
 
 def doMultiply(x):
 a = []
 #multiply by
 mylen = len(pf.value)
 for i in range(mylen) :

required: org.apache.spark.streaming.dstream.DStream[org.apache.spark.mllib.linalg.Vector]

2015-06-28 Thread Arthur Chan

Hi,

I am trying Spark with some sample programs,


In my code, the following items are imported:

import org.apache.spark.mllib.regression.{StreamingLinearRegressionWithSGD,
LabeledPoint}

import org.apache.spark.mllib.regression.{StreamingLinearRegressionWithSGD}

import org.apache.spark.streaming.{Seconds, StreamingContext}

import scala.util.Random

I got following error:

[error] StreamingModel.scala:100: type mismatch;

[error]  found   :
org.apache.spark.streaming.dstream.DStream[org.apache.spark.mllib.regression.LabeledPoint]

[error]  required:
org.apache.spark.streaming.dstream.DStream[org.apache.spark.mllib.linalg.Vector]

[error] model.predictOn(labeledStream).print()

[error] ^

[error] one error found

[error] (compile:compile) Compilation failed


Any idea?


Regards

49 matches

Mail list logo