Hi Muhammad,

There are lots of ways to do it. My company actually develops a text mining solution which embeds a very fast Approximate Neighbours solution (a demo with real time queries on the wikipedia dataset can be seen at wikinsights.org). For the record, we now prepare a dataset of 4.5 million documents for querying in about 2 or 3 minutes on a 32 cores cluster, and the queries take less than 10ms when the dataset is in memory.

But if you just want to precompute everything and don't mind waiting a few tens of minutes (or hours), and don't want to bother with an approximate neighbour solution, then the best way is probably something like this :

1 - block your data (i.e. group your items in X large groups). Instead of a dataset of N elements, you should now have a dataset of X blocks containing N/X elements each. 2 - do the cartesian product (instead of N*N elements, you now have "just" X*X blocks, which should take less memory) 3 - for each pair of blocks (blockA,blockB), perform the computation of distances for each elements of blockA with each element of blockB, but keep only the top K best for each element of blockA. Output is List((elementOfBlockA, listOfKNearestElementsOfBlockBWithTheDistance),..) 4 - reduceByKey (the key is the elementOfBlockA), by merging the listOfNearestElements and always keeping the K nearest.

This is an exact version of top K. This is only interesting if K << N/X. But even if K is large, it is possible that it will fit your needs. Remember that you will still compute N*N distances (this is the problem with exact nearest neighbours), the only difference with what you're doing now is that you produces less items and duplicates less data. Indeed, if one of your elements takes 100bytes, the per element cartesian will produce N*N*100*2 bytes, while the blocked version will produce X*X*100*2*N/X, ie X*N*100*2 bytes.

Guillaume
Hi Guillaume,

Thanks for you reply. Can you please tell me how can i improve for Top-k nearest points.

P.S. My post is not accepted on the list thats why i am sending you email here.
I would be really grateful to you if you reply it.
Thanks,

On Wed, Apr 8, 2015 at 1:23 PM, Guillaume Pitel <guillaume.pi...@exensa.com <mailto:guillaume.pi...@exensa.com>> wrote:

    This kind of operation is not scalable, not matter what you do, at
    least if you _really_ want to do that.

    However, if what you're looking for is not to really compute all
    distances, (for instance if you're looking only for the top K
    nearest points), then it can be highly improved.

    It all depends of what you want to do eventually.

    Guillaume
    val locations = filelines.map(line => line.split("\t")).map(t =>
    (t(5).toLong, (t(2).toDouble, t(3).toDouble))).distinct().collect()

    val cartesienProduct=locations.cartesian(locations).map(t=>
    
Edge(t._1._1,t._2._1,distanceAmongPoints(t._1._2._1,t._1._2._2,t._2._2._1,t._2._2._2)))

    Code executes perfectly fine uptill here but when i try to use
    "cartesienProduct" it got stuck i.e.

    val count =cartesienProduct.count()

    Any help to efficiently do this will be highly appreciated.



    --
    View this message in 
context:http://apache-spark-user-list.1001560.n3.nabble.com/Incremently-load-big-RDD-file-into-Memory-tp22410.html
    Sent from the Apache Spark User List mailing list archive at Nabble.com.

    ---------------------------------------------------------------------
    To unsubscribe, e-mail:user-unsubscr...@spark.apache.org  
<mailto:user-unsubscr...@spark.apache.org>
    For additional commands, e-mail:user-h...@spark.apache.org  
<mailto:user-h...@spark.apache.org>



-- eXenSa

        
    *Guillaume PITEL, Président*
    +33(0)626 222 431

    eXenSa S.A.S. <http://www.exensa.com/>
    41, rue Périer - 92120 Montrouge - FRANCE
    Tel +33(0)184 163 677 / Fax +33(0)972 283 705




--
Regards,
Muhammad Aamir


/CONFIDENTIALITY:This email is intended solely for the person(s) named and may be confidential and/or privileged.If you are not the intended recipient,please delete it,notify me and do not copy,use,or disclose its content./


--
eXenSa

        
*Guillaume PITEL, Président*
+33(0)626 222 431

eXenSa S.A.S. <http://www.exensa.com/>
41, rue Périer - 92120 Montrouge - FRANCE
Tel +33(0)184 163 677 / Fax +33(0)972 283 705

Reply via email to