Hi Muhammad,
There are lots of ways to do it. My company actually develops a text
mining solution which embeds a very fast Approximate Neighbours solution
(a demo with real time queries on the wikipedia dataset can be seen at
wikinsights.org). For the record, we now prepare a dataset of 4.5
million documents for querying in about 2 or 3 minutes on a 32 cores
cluster, and the queries take less than 10ms when the dataset is in memory.
But if you just want to precompute everything and don't mind waiting a
few tens of minutes (or hours), and don't want to bother with an
approximate neighbour solution, then the best way is probably something
like this :
1 - block your data (i.e. group your items in X large groups). Instead
of a dataset of N elements, you should now have a dataset of X blocks
containing N/X elements each.
2 - do the cartesian product (instead of N*N elements, you now have
"just" X*X blocks, which should take less memory)
3 - for each pair of blocks (blockA,blockB), perform the computation of
distances for each elements of blockA with each element of blockB, but
keep only the top K best for each element of blockA. Output is
List((elementOfBlockA, listOfKNearestElementsOfBlockBWithTheDistance),..)
4 - reduceByKey (the key is the elementOfBlockA), by merging the
listOfNearestElements and always keeping the K nearest.
This is an exact version of top K. This is only interesting if K << N/X.
But even if K is large, it is possible that it will fit your needs.
Remember that you will still compute N*N distances (this is the problem
with exact nearest neighbours), the only difference with what you're
doing now is that you produces less items and duplicates less data.
Indeed, if one of your elements takes 100bytes, the per element
cartesian will produce N*N*100*2 bytes, while the blocked version will
produce X*X*100*2*N/X, ie X*N*100*2 bytes.
Guillaume
Hi Guillaume,
Thanks for you reply. Can you please tell me how can i improve for
Top-k nearest points.
P.S. My post is not accepted on the list thats why i am sending you
email here.
I would be really grateful to you if you reply it.
Thanks,
On Wed, Apr 8, 2015 at 1:23 PM, Guillaume Pitel
<guillaume.pi...@exensa.com <mailto:guillaume.pi...@exensa.com>> wrote:
This kind of operation is not scalable, not matter what you do, at
least if you _really_ want to do that.
However, if what you're looking for is not to really compute all
distances, (for instance if you're looking only for the top K
nearest points), then it can be highly improved.
It all depends of what you want to do eventually.
Guillaume
val locations = filelines.map(line => line.split("\t")).map(t =>
(t(5).toLong, (t(2).toDouble, t(3).toDouble))).distinct().collect()
val cartesienProduct=locations.cartesian(locations).map(t=>
Edge(t._1._1,t._2._1,distanceAmongPoints(t._1._2._1,t._1._2._2,t._2._2._1,t._2._2._2)))
Code executes perfectly fine uptill here but when i try to use
"cartesienProduct" it got stuck i.e.
val count =cartesienProduct.count()
Any help to efficiently do this will be highly appreciated.
--
View this message in
context:http://apache-spark-user-list.1001560.n3.nabble.com/Incremently-load-big-RDD-file-into-Memory-tp22410.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
---------------------------------------------------------------------
To unsubscribe, e-mail:user-unsubscr...@spark.apache.org
<mailto:user-unsubscr...@spark.apache.org>
For additional commands, e-mail:user-h...@spark.apache.org
<mailto:user-h...@spark.apache.org>
--
eXenSa
*Guillaume PITEL, Président*
+33(0)626 222 431
eXenSa S.A.S. <http://www.exensa.com/>
41, rue Périer - 92120 Montrouge - FRANCE
Tel +33(0)184 163 677 / Fax +33(0)972 283 705
--
Regards,
Muhammad Aamir
/CONFIDENTIALITY:This email is intended solely for the person(s) named
and may be confidential and/or privileged.If you are not the intended
recipient,please delete it,notify me and do not copy,use,or disclose
its content./
--
eXenSa
*Guillaume PITEL, Président*
+33(0)626 222 431
eXenSa S.A.S. <http://www.exensa.com/>
41, rue Périer - 92120 Montrouge - FRANCE
Tel +33(0)184 163 677 / Fax +33(0)972 283 705