On Sep 4, 10:12 am, "Huabin Zheng" <[EMAIL PROTECTED]> wrote:
> Hi all,
> I am encountered with a problem, it looks like this:
>
> There is a log file which records all the IPs that visited a certain web
> site. The log file may be several G bytes, but the computer used to analyze
> it has limited memory, about 1G bytes. I am asked to figure out the Top K
> IPs which visited the web site most most frequently.
> is hash table competent to solve it?
>
> Any other suggestions? Or are there classic algorithms existed to cope with
> it?
Any datastructure are going to consume at least O(n) memory for the
number of IPs that visited the site.
The question will be how many IPs there are. I see a classical
database solution to keep count of each IP's visit.
The other option I'm thinking of would involve a disk based mergesort
type algorithm where you expire LRU entries from RAM to disk as RAM
fills up beyond the available memory limits, and then to use that as
summary type data to be rehashed etc. in repeating fashion. that you
do a sort on and then merge those as you reread them into memory.
>
> thanks
>
> Regards,
> Huabin
>
> --
> Huabin Zheng
> Sensor Networks and Application Research Center, GUCAS
--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups
"google-codejam" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at
http://groups.google.com/group/google-code?hl=en
-~----------~----~----~----~------~----~------~--~---