Hi, I am a new Spark user. Please let me know how to handle the following scenario:
I have a data set with the following fields:

1. DeviceId
2. Latitude
3. Longitude
4. IP address
5. Datetime
6. Mobile application name

With the above data, I would like to perform the following steps:

1. Collect all lat and lon values for each IP address:
   (ip1,(lat1,lon1),(lat2,lon2))
   (ip2,(lat3,lon3),(lat4,lon4))
2. For each IP:
   1. Find the distance between each lat/lon coordinate pair and all the other pairs under the same IP.
   2. Select those coordinates whose distances fall under a specific threshold (say 100 m).
   3. Find the coordinate pair with the maximum occurrences.

In this case, how can I iterate and compare each coordinate pair with all the other pairs? Can this be done in a distributed manner, as this data set is going to have a few million records? Can we do this with map/reduce operations?

Thanks.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Can-this-be-done-in-map-reduce-technique-in-parallel-tp6905.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
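To make the per-IP steps concrete, here is a minimal sketch in plain Python of what I have in mind, not a tested Spark job. It assumes that "maximum occurrences" means the coordinate with the most other coordinates within the threshold, and that the haversine formula is an acceptable distance measure. The names `haversine_m`, `densest_point` and `group_by_ip` and the sample records are made up for illustration. In Spark, I would expect the grouping to correspond roughly to `groupByKey` on an RDD of (ip, (lat, lon)) pairs, with the per-IP function applied through `mapValues`, but that mapping is my assumption.

```python
import math
from collections import defaultdict
from itertools import combinations

EARTH_RADIUS_M = 6371000.0  # mean Earth radius in metres


def haversine_m(p, q):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    lat1, lon1 = math.radians(p[0]), math.radians(p[1])
    lat2, lon2 = math.radians(q[0]), math.radians(q[1])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def densest_point(points, threshold_m=100.0):
    """Steps 2.1 to 2.3 for one IP.

    Compares every pair of points once, counts for each point how many
    other readings lie within threshold_m, and returns (point, count)
    for the highest count. Returns None when no pair is close enough.
    Repeated readings of the same coordinate count as neighbours of
    each other (an assumption of this sketch).
    """
    counts = defaultdict(int)
    for p, q in combinations(points, 2):
        if haversine_m(p, q) <= threshold_m:
            counts[p] += 1
            counts[q] += 1
    if not counts:
        return None
    return max(counts.items(), key=lambda kv: kv[1])


def group_by_ip(records):
    """Step 1: collect all (lat, lon) tuples under their IP address."""
    grouped = defaultdict(list)
    for ip, lat, lon in records:
        grouped[ip].append((lat, lon))
    return grouped


# Hypothetical sample records: (ip, lat, lon)
records = [
    ("ip1", 0.0, 0.0),
    ("ip1", 0.0007, 0.0),   # roughly 78 m from each of the other two ip1 points
    ("ip1", 0.0014, 0.0),   # roughly 156 m from the first ip1 point
    ("ip2", 10.0, 20.0),    # a single reading, so there is nothing to compare
]

result = {ip: densest_point(pts) for ip, pts in group_by_ip(records).items()}
```

The pairwise comparison is quadratic in the number of points per IP, so this only seems reasonable if no single IP has a very large number of readings.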