Re: Incrementally load big RDD file into Memory

2015-04-09 Thread MUHAMMAD AAMIR
Hi,

Thanks a lot for such a detailed response.

On Wed, Apr 8, 2015 at 8:55 PM, Guillaume Pitel guillaume.pi...@exensa.com
wrote:

  Hi Muhammad,

 There are lots of ways to do it. My company actually develops a text
 mining solution which embeds a very fast approximate-neighbours solution (a
 demo with real-time queries on the Wikipedia dataset can be seen at
 wikinsights.org). For the record, we now prepare a dataset of 4.5 million
 documents for querying in about 2 or 3 minutes on a 32-core cluster, and
 the queries take less than 10 ms when the dataset is in memory.

 But if you just want to precompute everything and don't mind waiting a few
 tens of minutes (or hours), and don't want to bother with an approximate
 neighbour solution, then the best way is probably something like this:

 1 - block your data (i.e. group your items into X large groups). Instead of
 a dataset of N elements, you should now have a dataset of X blocks
 containing N/X elements each.
 2 - do the cartesian product (instead of N*N element pairs, you now have just
 X*X block pairs, which should take less memory)
 3 - for each pair of blocks (blockA, blockB), compute the distances between
 each element of blockA and each element of blockB, but keep
 only the top K best for each element of blockA. Output is
 List((elementOfBlockA, listOfKNearestElementsOfBlockBWithTheDistance), ...)
 4 - reduceByKey (the key is elementOfBlockA), merging the
 listOfNearestElements and always keeping the K nearest.

 This is an exact version of top K. It is only interesting if K << N/X,
 but even if K is large, it may still fit your needs.
 Remember that you will still compute N*N distances (this is the problem
 with exact nearest neighbours); the only difference from what you're doing
 now is that you produce fewer items and duplicate less data. Indeed, if
 one of your elements takes 100 bytes, the per-element cartesian will produce
 N*N*100*2 bytes, while the blocked version will produce X*X*100*2*N/X, i.e.
 X*N*100*2 bytes.
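
 A minimal Scala sketch of this blocked top-K approach follows. It is an
 illustration only, not the code from this thread: the toy 2-D points, the
 distance() helper, and the values of X and K are all assumptions.

 import org.apache.spark.{SparkConf, SparkContext}
 import org.apache.spark.rdd.RDD

 object BlockedTopK {
   // Plain Euclidean distance between two (x, y) points (assumed metric).
   def distance(a: (Double, Double), b: (Double, Double)): Double =
     math.sqrt(math.pow(a._1 - b._1, 2) + math.pow(a._2 - b._2, 2))

   def main(args: Array[String]): Unit = {
     val sc = new SparkContext(
       new SparkConf().setAppName("blocked-topk").setMaster("local[*]"))
     val K = 5   // neighbours to keep per point (illustrative)
     val X = 10  // number of blocks (illustrative)

     // Toy data: (id, (x, y)) pairs standing in for the parsed location file.
     val points: RDD[(Long, (Double, Double))] =
       sc.parallelize((0L until 1000L).map(i => (i, (i.toDouble % 37, i.toDouble % 53))))

     // 1 - block the data: X groups of roughly N/X elements each.
     val blocks: RDD[(Int, Seq[(Long, (Double, Double))])] =
       points.map { case (id, p) => ((id % X).toInt, (id, p)) }
             .groupByKey()
             .mapValues(_.toSeq)

     // 2 - cartesian product of blocks: X*X block pairs instead of N*N element pairs.
     val blockPairs = blocks.cartesian(blocks)

     // 3 - for each (blockA, blockB), keep only the K nearest elements of B per element of A.
     val partialTopK: RDD[(Long, Seq[(Long, Double)])] =
       blockPairs.flatMap { case ((_, blockA), (_, blockB)) =>
         blockA.map { case (idA, pA) =>
           val nearest = blockB
             .filter { case (idB, _) => idB != idA }
             .map { case (idB, pB) => (idB, distance(pA, pB)) }
             .sortBy(_._2)
             .take(K)
           (idA, nearest)
         }
       }

     // 4 - reduceByKey on the element id: merge partial lists, always keeping the K nearest.
     val topK: RDD[(Long, Seq[(Long, Double)])] =
       partialTopK.reduceByKey((a, b) => (a ++ b).sortBy(_._2).take(K))

     topK.take(5).foreach(println)
     sc.stop()
   }
 }

 The groupByKey step assumes each block fits in a single executor's memory;
 X trades off block size against the X*X block pairs produced by the
 cartesian step.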

 Guillaume

 Hi Guillaume,

  Thanks for your reply. Can you please tell me how I can improve the top-K
 nearest points computation?

  P.S. My post was not accepted on the list; that's why I am emailing you
 here.
 I would be really grateful if you could reply.
 Thanks,

 On Wed, Apr 8, 2015 at 1:23 PM, Guillaume Pitel 
 guillaume.pi...@exensa.com wrote:

  This kind of operation is not scalable, no matter what you do, at
 least if you _really_ want to do that.

 However, if what you're looking for is not really to compute all
 distances (for instance, if you're looking only for the top K nearest
 points), then this can be greatly improved.

 It all depends on what you want to do eventually.

 Guillaume

 val locations = filelines.map(line => line.split("\t")).map(t =>
 (t(5).toLong, (t(2).toDouble, t(3).toDouble))).distinct().collect()

 val cartesienProduct = locations.cartesian(locations).map(t =>
 Edge(t._1._1, t._2._1, distanceAmongPoints(t._1._2._1, t._1._2._2, t._2._2._1, t._2._2._2)))

 The code executes perfectly fine up till here, but when I try to use
 cartesienProduct it gets stuck, i.e.

 val count = cartesienProduct.count()

 Any help to do this efficiently would be highly appreciated.






 --
  *Guillaume PITEL, Président*
 +33(0)626 222 431

 eXenSa S.A.S. http://www.exensa.com/
  41, rue Périer - 92120 Montrouge - FRANCE
 Tel +33(0)184 163 677 / Fax +33(0)972 283 705




  --
  Regards,
 Muhammad Aamir





 --
  *Guillaume PITEL, Président*
 +33(0)626 222 431

 eXenSa S.A.S. http://www.exensa.com/
  41, rue Périer - 92120 Montrouge - FRANCE
 Tel +33(0)184 163 677 / Fax +33(0)972 283 705




-- 
Regards,
Muhammad Aamir




Re: Incrementally load big RDD file into Memory

2015-04-08 Thread Guillaume Pitel

Hi Muhammad,

There are lots of ways to do it. My company actually develops a text 
mining solution which embeds a very fast approximate-neighbours solution 
(a demo with real-time queries on the Wikipedia dataset can be seen at 
wikinsights.org). For the record, we now prepare a dataset of 4.5 
million documents for querying in about 2 or 3 minutes on a 32-core 
cluster, and the queries take less than 10 ms when the dataset is in memory.


But if you just want to precompute everything and don't mind waiting a 
few tens of minutes (or hours), and don't want to bother with an 
approximate neighbour solution, then the best way is probably something 
like this:


1 - block your data (i.e. group your items into X large groups). Instead 
of a dataset of N elements, you should now have a dataset of X blocks 
containing N/X elements each.
2 - do the cartesian product (instead of N*N element pairs, you now have 
just X*X block pairs, which should take less memory)
3 - for each pair of blocks (blockA, blockB), compute the distances between 
each element of blockA and each element of blockB, but 
keep only the top K best for each element of blockA. Output is 
List((elementOfBlockA, listOfKNearestElementsOfBlockBWithTheDistance), ...)
4 - reduceByKey (the key is elementOfBlockA), merging the 
listOfNearestElements and always keeping the K nearest.


This is an exact version of top K. It is only interesting if K << N/X, 
but even if K is large, it may still fit your needs. 
Remember that you will still compute N*N distances (this is the problem 
with exact nearest neighbours); the only difference from what you're 
doing now is that you produce fewer items and duplicate less data. 
Indeed, if one of your elements takes 100 bytes, the per-element 
cartesian will produce N*N*100*2 bytes, while the blocked version will 
produce X*X*100*2*N/X, i.e. X*N*100*2 bytes.


Guillaume

Hi Guillaume,

Thanks for your reply. Can you please tell me how I can improve the 
top-K nearest points computation?


P.S. My post was not accepted on the list; that's why I am emailing you 
here.

I would be really grateful if you could reply.
Thanks,

On Wed, Apr 8, 2015 at 1:23 PM, Guillaume Pitel 
guillaume.pi...@exensa.com wrote:


This kind of operation is not scalable, no matter what you do, at
least if you _really_ want to do that.

However, if what you're looking for is not really to compute all
distances (for instance, if you're looking only for the top K
nearest points), then this can be greatly improved.

It all depends on what you want to do eventually.

Guillaume

val locations = filelines.map(line => line.split("\t")).map(t =>
(t(5).toLong, (t(2).toDouble, t(3).toDouble))).distinct().collect()

val cartesienProduct = locations.cartesian(locations).map(t =>
Edge(t._1._1, t._2._1, distanceAmongPoints(t._1._2._1, t._1._2._2, t._2._2._1, t._2._2._2)))

The code executes perfectly fine up till here, but when I try to use
cartesienProduct it gets stuck, i.e.

val count = cartesienProduct.count()

Any help to do this efficiently would be highly appreciated.







-- 



*Guillaume PITEL, Président*
+33(0)626 222 431

eXenSa S.A.S. http://www.exensa.com/
41, rue Périer - 92120 Montrouge - FRANCE
Tel +33(0)184 163 677 / Fax +33(0)972 283 705




--
Regards,
Muhammad Aamir





--


*Guillaume PITEL, Président*
+33(0)626 222 431

eXenSa S.A.S. http://www.exensa.com/
41, rue Périer - 92120 Montrouge - FRANCE
Tel +33(0)184 163 677 / Fax +33(0)972 283 705



RE: Incrementally load big RDD file into Memory

2015-04-07 Thread java8964
cartesian is an expensive operation. If you have M records in locations, then 
locations.cartesian(locations) will generate an MxM result. If locations is a big 
RDD, it is hard to do locations.cartesian(locations) efficiently.

Yong
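
For a rough sense of scale (the numbers below are illustrative assumptions,
not taken from the original post), a back-of-the-envelope estimate in Scala:

// Illustrative only: estimate the size of a full per-element cartesian product.
val m = 1000000L                              // assumed number of records in locations
val bytesPerRecord = 100L                     // assumed average record size
val pairs = m * m                             // 10^12 pairs
val approxBytes = pairs * bytesPerRecord * 2  // each pair carries both records
println(f"pairs = $pairs%,d, approx size = ${approxBytes / 1e12}%.1f TB")  // ~200 TB

At that scale, only blocking (as described earlier in the thread) or an
approximate-neighbour approach is practical.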
 Date: Tue, 7 Apr 2015 10:04:12 -0700
 From: mas.ha...@gmail.com
 To: user@spark.apache.org
 Subject: Incrementally load big RDD file into Memory
 
 
 val locations = filelines.map(line => line.split("\t")).map(t =>
 (t(5).toLong, (t(2).toDouble, t(3).toDouble))).distinct().collect()
 
 val cartesienProduct = locations.cartesian(locations).map(t =>
 Edge(t._1._1, t._2._1, distanceAmongPoints(t._1._2._1, t._1._2._2, t._2._2._1, t._2._2._2)))
 
 The code executes perfectly fine up till here, but when I try to use
 cartesienProduct it gets stuck, i.e.
 
 val count = cartesienProduct.count()
 
 Any help to do this efficiently would be highly appreciated.
 
 
 