It is possible if you use a cartesian product to produce all possible pairs of points for each IP address, followed by 2 stages of map-reduce:
- first, keyed by pair of points, to find the total count for each pair, and
- second, keyed by IP address, to find the pair with the maximum count for each IP address.
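To make the two stages concrete, here is a minimal sketch in plain Python rather than Spark, so it runs without a cluster. The names `haversine_m` and `best_pair_per_ip`, the use of the haversine formula for distance, and the choice of unordered pairs (instead of the full cartesian product, so that each pair is counted once) are assumptions made for illustration. In Spark, the stage 1 counting would typically be a `reduceByKey` on the (ip, pair) key and stage 2 a second `reduceByKey` on the ip alone.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres (assumed constant)


def haversine_m(p, q):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def best_pair_per_ip(records, threshold_m=100.0):
    """records: iterable of (ip, lat, lon). Returns {ip: ((p, q), count)}."""
    # Collect all (lat, lon) points for each IP address.
    points_by_ip = defaultdict(list)
    for ip, lat, lon in records:
        points_by_ip[ip].append((lat, lon))

    # Stage 1: key = (ip, pair of points); value = number of occurrences
    # of that pair among the close-enough pairs for this IP.
    pair_counts = Counter()
    for ip, points in points_by_ip.items():
        for p, q in combinations(points, 2):
            if haversine_m(p, q) <= threshold_m:
                # Sort so that (p, q) and (q, p) share the same key.
                pair_counts[(ip, tuple(sorted((p, q))))] += 1

    # Stage 2: key = ip; keep the pair with the maximum count.
    best = {}
    for (ip, pair), count in pair_counts.items():
        if ip not in best or count > best[ip][1]:
            best[ip] = (pair, count)
    return best
```

Note that the number of pairs grows quadratically with the number of points per IP address, so an IP with very many records will dominate the cost of stage 1.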
Oleg

On 4 June 2014 11:49, lmk <lakshmi.muralikrish...@gmail.com> wrote:

> Hi,
> I am a new spark user. Please let me know how to handle the following
> scenario:
>
> I have a data set with the following fields:
> 1. DeviceId
> 2. latitude
> 3. longitude
> 4. ip address
> 5. Datetime
> 6. Mobile application name
>
> With the above data, I would like to perform the following steps:
> 1. Collect all lat and lon for each ipaddress
>    (ip1,(lat1,lon1),(lat2,lon2))
>    (ip2,(lat3,lon3),(lat4,lon4))
> 2. For each IP,
>    1. Find the distance between each lat and lon coordinate pair and all
>       the other pairs under the same IP
>    2. Select those coordinates whose distances fall under a specific
>       threshold (say 100m)
>    3. Find the coordinate pair with the maximum occurrences
>
> In this case, how can I iterate and compare each coordinate pair with all
> the other pairs?
> Can this be done in a distributed manner, as this data set is going to have
> a few million records?
> Can we do this in map/reduce commands?
>
> Thanks.
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Can-this-be-done-in-map-reduce-technique-in-parallel-tp6905.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.

--
Kind regards,

Oleg