Re: Incrementally load big RDD file into Memory
Hi,

Thanks a lot for such a detailed response.

On Wed, Apr 8, 2015 at 8:55 PM, Guillaume Pitel guillaume.pi...@exensa.com wrote:

Hi Muhammad,

There are lots of ways to do it. My company actually develops a text mining solution which embeds a very fast approximate-neighbours solution (a demo with real-time queries on the Wikipedia dataset can be seen at wikinsights.org). For the record, we now prepare a dataset of 4.5 million documents for querying in about 2 or 3 minutes on a 32-core cluster, and queries take less than 10 ms when the dataset is in memory.

But if you just want to precompute everything, don't mind waiting a few tens of minutes (or hours), and don't want to bother with an approximate-neighbour solution, then the best way is probably something like this (see the sketch at the end of this message):

1 - Block your data, i.e. group your items into X large groups. Instead of a dataset of N elements, you now have a dataset of X blocks of N/X elements each.

2 - Do the cartesian product of the blocks. Instead of N*N elements, you now have just X*X blocks, which should take less memory.

3 - For each pair of blocks (blockA, blockB), compute the distance from each element of blockA to each element of blockB, but keep only the top K nearest for each element of blockA. The output is List((elementOfBlockA, listOfKNearestElementsOfBlockBWithTheDistance), ...).

4 - reduceByKey (the key is elementOfBlockA), merging the lists of nearest elements and always keeping the K nearest.

This is an exact version of top K. It is only interesting if K << N/X, but even if K is large it may fit your needs. Remember that you will still compute N*N distances (this is the problem with exact nearest neighbours); the only difference from what you are doing now is that you produce fewer items and duplicate less data. Indeed, if one of your elements takes 100 bytes, the per-element cartesian will produce N*N*100*2 bytes, while the blocked version will produce X*X*100*2*(N/X), i.e. X*N*100*2 bytes.

Guillaume

Hi Guillaume,

Thanks for your reply. Can you please tell me how I can improve this for the top-K nearest points?

P.S. My post was not accepted on the list; that is why I am emailing you here. I would be really grateful for a reply.

Thanks,

On Wed, Apr 8, 2015 at 1:23 PM, Guillaume Pitel guillaume.pi...@exensa.com wrote:

This kind of operation is not scalable, no matter what you do, at least if you _really_ want to do that. However, if what you are looking for is not really to compute all distances (for instance, if you are looking only for the top K nearest points), then it can be highly improved. It all depends on what you want to do eventually.

Guillaume

  val locations = filelines
    .map(line => line.split("\t"))
    .map(t => (t(5).toLong, (t(2).toDouble, t(3).toDouble)))
    .distinct()  // note: the original post had .collect() here, but
                 // cartesian is defined on RDDs, not on local arrays

  val cartesienProduct = locations.cartesian(locations)
    .map(t => Edge(t._1._1, t._2._1,
      distanceAmongPoints(t._1._2._1, t._1._2._2, t._2._2._1, t._2._2._2)))

The code executes perfectly fine up to here, but it gets stuck when I try to use cartesienProduct, i.e.:

  val count = cartesienProduct.count()

Any help to do this efficiently will be highly appreciated.
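A minimal Scala sketch of the blocked top-K recipe above. It is an illustration, not Guillaume's actual implementation; the assumptions not taken from the thread are: points are (Long, (Double, Double)) pairs as in the code above, blocks are assigned by id modulo numBlocks, and plain Euclidean distance stands in for distanceAmongPoints.

  // Blocked exact top-K nearest neighbours (illustrative sketch).
  import org.apache.spark.rdd.RDD

  type Point = (Long, (Double, Double))

  def euclidean(a: (Double, Double), b: (Double, Double)): Double =
    math.sqrt(math.pow(a._1 - b._1, 2) + math.pow(a._2 - b._2, 2))

  def topKNeighbours(points: RDD[Point], numBlocks: Int, k: Int)
      : RDD[(Long, List[(Long, Double)])] = {
    // 1 - block the data: X groups of roughly N/X elements each
    val blocks = points.map(p => ((p._1 % numBlocks).toInt, p)).groupByKey()

    // 2 - cartesian product of the blocks: X*X block pairs instead of N*N rows
    val blockPairs = blocks.cartesian(blocks)

    // 3 - per block pair, keep only the k nearest elements of blockB
    //     for every element of blockA
    val partial = blockPairs.flatMap { case ((_, blockA), (_, blockB)) =>
      blockA.map { case (idA, coordA) =>
        val nearest = blockB
          .collect { case (idB, coordB) if idB != idA =>
            (idB, euclidean(coordA, coordB)) }
          .toList.sortBy(_._2).take(k)
        (idA, nearest)
      }
    }

    // 4 - merge the candidate lists per element, always keeping the k nearest
    partial.reduceByKey((l, r) => (l ++ r).sortBy(_._2).take(k))
  }

Usage would look like topKNeighbours(locations, numBlocks = 1000, k = 10). To put numbers on the memory claim (N and X here are illustrative): with 100-byte elements, N = 1,000,000 and X = 1,000, the per-element cartesian materialises about N*N*200 bytes, roughly 200 TB, while the blocked version materialises about X*N*200 bytes, roughly 200 GB. For large blocks a bounded priority queue would beat the sortBy(...).take(k) above; the full sort just keeps the sketch short.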
RE: Incrementally load big RDD file into Memory
cartesian is an expensive operation. If you have M records in locations, then locations.cartesian(locations) will generate M x M results. If locations is a big RDD, it is hard to do locations.cartesian(locations) efficiently.

Yong

Date: Tue, 7 Apr 2015 10:04:12 -0700
From: mas.ha...@gmail.com
To: user@spark.apache.org
Subject: Incrementally load big RDD file into Memory

  val locations = filelines
    .map(line => line.split("\t"))
    .map(t => (t(5).toLong, (t(2).toDouble, t(3).toDouble)))
    .distinct()

  val cartesienProduct = locations.cartesian(locations)
    .map(t => Edge(t._1._1, t._2._1,
      distanceAmongPoints(t._1._2._1, t._1._2._2, t._2._2._1, t._2._2._2)))

The code executes perfectly fine up to here, but it gets stuck when I try to use cartesienProduct, i.e.:

  val count = cartesienProduct.count()

Any help to do this efficiently will be highly appreciated.
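To make the M x M blow-up concrete, a quick size estimate before attempting the cartesian can save a lot of waiting. A sanity check in the spark-shell might look like the snippet below; the ~200 bytes per Edge is an assumed figure, not something measured in this thread.

  // Back-of-the-envelope check before running the cartesian (illustrative).
  val m = locations.count()          // number of distinct locations, M
  val pairs = m * m                  // rows the cartesian would emit, M x M
  val bytesPerEdge = 200L            // assumed average Edge size, not measured
  val approxTB = pairs.toDouble * bytesPerEdge / 1e12
  println(f"cartesian would emit $pairs%d rows, roughly $approxTB%.1f TB")

For a million locations that is 10^12 rows, on the order of hundreds of terabytes, which is why the blocked top-K approach described earlier in the thread (or an approximate-neighbours method) is the practical route.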