Lakshmi, this is orthogonal to your question, but I'm sharing it in case it's useful.

It sounds like you're trying to determine the home location of a user, or
something similar.

If that's the problem statement, the data pattern suggests a far more
computationally efficient approach: first map each (lat, long) pair into a
geocell of the desired resolution (e.g., 10m or 100m), then count
occurrences per geocell instead of computing pairwise distances. There are
simple libraries that map any (lat, long) pair to a geocell (string) ID
very efficiently.
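
To make that concrete, here is a rough Spark sketch in Scala. The function
name and input shape are my own assumptions, and rounding lat/lon to three
decimal places is only a crude stand-in (cells of roughly 100m near the
equator) for a proper geocell or geohash library:

    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD

    // points: (ip, latitude, longitude) records parsed from the data set.
    // Returns, for each IP, the grid cell with the most observations.
    def mostFrequentCellPerIp(
        points: RDD[(String, Double, Double)]): RDD[(String, (String, Int))] =
      points
        .map { case (ip, lat, lon) =>
          // Round to 3 decimals to snap each point onto a ~100m grid;
          // the rounded "lat,lon" string doubles as the cell ID.
          ((ip, f"$lat%.3f,$lon%.3f"), 1)
        }
        .reduceByKey(_ + _)                                 // points per (ip, cell)
        .map { case ((ip, cell), count) => (ip, (cell, count)) }
        .reduceByKey((a, b) => if (a._2 >= b._2) a else b)  // busiest cell per IP

This replaces the O(n^2) pairwise-distance step with two linear, fully
parallel passes (a count, then a max), both keyed by IP.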

--
Christopher T. Nguyen
Co-founder & CEO, Adatao <http://adatao.com>
linkedin.com/in/ctnguyen



On Wed, Jun 4, 2014 at 3:49 AM, lmk <lakshmi.muralikrish...@gmail.com>
wrote:

> Hi,
> I am a new Spark user. Please let me know how to handle the following
> scenario:
>
> I have a data set with the following fields:
> 1. DeviceId
> 2. Latitude
> 3. Longitude
> 4. IP address
> 5. Datetime
> 6. Mobile application name
>
> With the above data, I would like to perform the following steps:
> 1. Collect all (lat, lon) pairs for each IP address:
>         (ip1,(lat1,lon1),(lat2,lon2))
>         (ip2,(lat3,lon3),(lat4,lon4))
> 2. For each IP,
>         1. Find the distance between each (lat, lon) coordinate pair and
> all the other pairs under the same IP
>         2. Select those coordinates whose distances fall under a specific
> threshold (say 100m)
>         3. Find the coordinate pair with the maximum occurrences
>
> In this case, how can I iterate and compare each coordinate pair with all
> the other pairs?
> Can this be done in a distributed manner, as this data set will have a few
> million records?
> Can we do this in map/reduce commands?
>
> Thanks.
>
>
