Re: Incremently load big RDD file into Memory

Guillaume Pitel Wed, 08 Apr 2015 11:56:00 -0700

Hi Muhammad,

There are lots of ways to do it. My company actually develops a textmining solution which embeds a very fast Approximate Neighbours solution(a demo with real time queries on the wikipedia dataset can be seen atwikinsights.org). For the record, we now prepare a dataset of 4.5million documents for querying in about 2 or 3 minutes on a 32 corescluster, and the queries take less than 10ms when the dataset is in memory.

But if you just want to precompute everything and don't mind waiting afew tens of minutes (or hours), and don't want to bother with anapproximate neighbour solution, then the best way is probably somethinglike this :

1 - block your data (i.e. group your items in X large groups). Insteadof a dataset of N elements, you should now have a dataset of X blockscontaining N/X elements each.2 - do the cartesian product (instead of N*N elements, you now have"just" X*X blocks, which should take less memory)3 - for each pair of blocks (blockA,blockB), perform the computation ofdistances for each elements of blockA with each element of blockB, butkeep only the top K best for each element of blockA. Output isList((elementOfBlockA, listOfKNearestElementsOfBlockBWithTheDistance),..)4 - reduceByKey (the key is the elementOfBlockA), by merging thelistOfNearestElements and always keeping the K nearest.

This is an exact version of top K. This is only interesting if K << N/X.But even if K is large, it is possible that it will fit your needs.Remember that you will still compute N*N distances (this is the problemwith exact nearest neighbours), the only difference with what you'redoing now is that you produces less items and duplicates less data.Indeed, if one of your elements takes 100bytes, the per elementcartesian will produce N*N*100*2 bytes, while the blocked version willproduce X*X*100*2*N/X, ie X*N*100*2 bytes.


Guillaume

Hi Guillaume,

Thanks for you reply. Can you please tell me how can i improve forTop-k nearest points.

P.S. My post is not accepted on the list thats why i am sending youemail here.

I would be really grateful to you if you reply it.
Thanks,

On Wed, Apr 8, 2015 at 1:23 PM, Guillaume Pitel<guillaume.pi...@exensa.com <mailto:guillaume.pi...@exensa.com>> wrote:


    This kind of operation is not scalable, not matter what you do, at
    least if you _really_ want to do that.

    However, if what you're looking for is not to really compute all
    distances, (for instance if you're looking only for the top K
    nearest points), then it can be highly improved.

    It all depends of what you want to do eventually.

    Guillaume

    val locations = filelines.map(line => line.split("\t")).map(t =>
    (t(5).toLong, (t(2).toDouble, t(3).toDouble))).distinct().collect()

    val cartesienProduct=locations.cartesian(locations).map(t=>
    
Edge(t._1._1,t._2._1,distanceAmongPoints(t._1._2._1,t._1._2._2,t._2._2._1,t._2._2._2)))

    Code executes perfectly fine uptill here but when i try to use
    "cartesienProduct" it got stuck i.e.

    val count =cartesienProduct.count()

    Any help to efficiently do this will be highly appreciated.



    --
    View this message in 
context:http://apache-spark-user-list.1001560.n3.nabble.com/Incremently-load-big-RDD-file-into-Memory-tp22410.html
    Sent from the Apache Spark User List mailing list archive at Nabble.com.

    ---------------------------------------------------------------------
    To unsubscribe, e-mail:user-unsubscr...@spark.apache.org  
<mailto:user-unsubscr...@spark.apache.org>
    For additional commands, e-mail:user-h...@spark.apache.org  
<mailto:user-h...@spark.apache.org>

--eXenSa


        
    *Guillaume PITEL, Président*
    +33(0)626 222 431

    eXenSa S.A.S. <http://www.exensa.com/>
    41, rue Périer - 92120 Montrouge - FRANCE
    Tel +33(0)184 163 677 / Fax +33(0)972 283 705




--
Regards,
Muhammad Aamir

/CONFIDENTIALITY:This email is intended solely for the person(s) namedand may be confidential and/or privileged.If you are not the intendedrecipient,please delete it,notify me and do not copy,use,or discloseits content./



--
eXenSa

        
*Guillaume PITEL, Président*
+33(0)626 222 431

eXenSa S.A.S. <http://www.exensa.com/>
41, rue Périer - 92120 Montrouge - FRANCE
Tel +33(0)184 163 677 / Fax +33(0)972 283 705

Re: Incremently load big RDD file into Memory

Reply via email to