Re: Broadcasting huge array or persisting on HDFS to read on executors - both not working

2018-04-12 Thread surender kumar
The question was not about what kind of sampling, but about random sampling per
user. There is no value associated with the items from which to create strata.
If you read Matteo's answer, that's the way to go about it.
-Surender


Re: Broadcasting huge array or persisting on HDFS to read on executors - both not working

2018-04-12 Thread Gourav Sengupta
Hi,

There is an option for stratified sampling available in Spark:
https://spark.apache.org/docs/latest/mllib-statistics.html#stratified-sampling.

There is also a method called randomSplit, which may be called on DataFrames
in case we want to split them into training and test data.
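
For illustration, a minimal sketch of both, assuming a pair RDD of
(key, value) records; the keys, fractions, and split weights below are
purely illustrative:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sampling-sketch").getOrCreate()
sc = spark.sparkContext

# A toy pair RDD of (key, value) records; the keys play the role of strata.
pairs = sc.parallelize([("a", 1.0), ("a", 2.0), ("b", 3.0), ("b", 4.0)])

# sampleByKey: one sampling fraction per key (stratum).
fractions = {"a": 0.5, "b": 0.25}
stratified = pairs.sampleByKey(withReplacement=False, fractions=fractions)

# randomSplit on a DataFrame, e.g. an 80/20 train/test split.
df = spark.createDataFrame(pairs, ["key", "value"])
train, test = df.randomSplit([0.8, 0.2], seed=42)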

Please let me know whether using any of these built-in functions helps.


Regards,
Gourav



Re: Broadcasting huge array or persisting on HDFS to read on executors - both not working

2018-04-12 Thread surender kumar
Thanks Matteo, this should work!
-Surender 


Re: Broadcasting huge array or persisting on HDFS to read on executors - both not working

2018-04-12 Thread Matteo Cossu
I don't think it's trivial. Anyway, the naive solution would be a cross
join between users and items, but that can be very expensive. I once
encountered a similar problem; here is how I solved it:

   - create a new RDD of (itemID, index), where the index is a unique
   integer between 0 and the number of items
   - for every user, sample n items by randomly generating n distinct
   integers between 0 and the number of items (e.g. with random.sample(),
   which draws distinct values), giving a new RDD of
   (userID, [sampled indices])
   - flatten the lists in that RDD and join them back with the
   (itemID, index) RDD, using the index as the join attribute

You can do the same thing with DataFrames using UDFs; a sketch of the RDD
version follows.
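
A minimal PySpark sketch of this approach, assuming existing RDDs users
(of userIDs) and items (of itemIDs); n_samples is illustrative:

import random

n_samples = 10  # illustrative: number of items to draw per user

# Step 1: (index, itemID) with a unique integer index per item.
indexed_items = items.zipWithIndex().map(lambda x: (x[1], x[0]))
num_items = indexed_items.count()

# Step 2: for each user, n_samples distinct random indices, emitted as
# (index, userID) pairs so we can join on the index.
user_picks = users.flatMap(
    lambda u: [(i, u) for i in random.sample(range(num_items), n_samples)])

# Step 3: join back on the index to recover the sampled itemIDs, then
# regroup per user: (userID, [itemID, ...]).
samples = (user_picks.join(indexed_items)          # (index, (userID, itemID))
           .map(lambda kv: (kv[1][0], kv[1][1]))   # (userID, itemID)
           .groupByKey()
           .mapValues(list))

The join shuffles only (index, userID) pairs plus the item table itself, so
nothing needs to be broadcast to the driver or the executors.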



Re: Broadcasting huge array or persisting on HDFS to read on executors - both not working

2018-04-11 Thread surender kumar
Right, this is what I did when I said I tried to persist the array and create
an RDD out of it to sample from. But how do I do this for each user? You have
an RDD of users on one hand and an RDD of items on the other. How do I go
from there? Am I missing something trivial?


Re: Broadcasting huge array or persisting on HDFS to read on executors - both not working

2018-04-11 Thread Matteo Cossu
Why broadcast this list, then? You should use an RDD or DataFrame. For
example, RDD has a method sample() that returns a random sample from it.
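
A minimal sketch, assuming an existing SparkContext sc; the fraction is
illustrative:

# The 1M-item list as an RDD rather than a broadcast variable.
items = sc.parallelize(range(1000000))
# Take roughly 1% of the elements, without replacement.
subset = items.sample(withReplacement=False, fraction=0.01, seed=42)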



Broadcasting huge array or persisting on HDFS to read on executors - both not working

2018-04-11 Thread surender kumar
I'm using PySpark. I have a list of 1 million items (all float values) and
1 million users. For each user I want to randomly sample some items from the
item list. Broadcasting the item list results in an OutOfMemory error on the
driver, even after raising driver memory to 10G. I tried to persist this
array on disk, but I can't figure out a way to read it on the workers.

Any suggestion would be appreciated.