Use a max heap of all the distinct words, ordered in decreasing order of their frequency.
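A minimal sketch of that idea in Python (the tokenization, function name, and input file are my own illustrative choices, not from this thread): count each word in a hash table, as Dave suggests further down the thread, then negate the counts to turn heapq's min heap into a max heap and pop the ten largest.

```python
import heapq
import re

def top_ten_max_heap(path):
    # Count occurrences of each word in a hash table (dict).
    counts = {}
    with open(path) as f:
        for line in f:
            for word in re.findall(r"[a-z']+", line.lower()):
                counts[word] = counts.get(word, 0) + 1
    # Build a max heap over all n distinct words by negating counts:
    # O(n) to heapify, then O(lg n) per pop for the ten largest.
    heap = [(-count, word) for word, count in counts.items()]
    heapq.heapify(heap)
    top = []
    for _ in range(min(10, len(heap))):
        neg, word = heapq.heappop(heap)
        top.append((word, -neg))
    return top

print(top_ten_max_heap("words.txt"))  # hypothetical input file
```

Heapifying all n distinct words is O(n) and each of the ten pops is O(lg n); the min-heap-of-k variant quoted below does the selection in O(n lg k) instead, which matters only when n is huge.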
On Sat, Oct 23, 2010 at 7:31 AM, Mridul Malpani <[email protected]> wrote:

> Use two data structures: a TRIE and an array of the words along with
> their frequencies.
>
> We can solve it in the following steps:
>
> 1) Scan each word of the large file. Insert the current word into the
> TRIE if it is not present already, and update its frequency. Using the
> TRIE, the time complexity is O(L), where L is the total number of
> characters in the file.
>
> 2) Now you have built up the array of words with their frequencies,
> where the size of the array is n = the number of distinct words in
> the file. Get the k maximum frequencies in O(n lg k); here k = 10.
> (Use a min heap of k elements to do this.)
>
> On Oct 23, 8:19 am, Dave <[email protected]> wrote:
> > @Ligerdave: Hey, the King James version of the Bible is only about
> > 600,000 words. I use the Bible as an example only because it is a
> > fairly large book. Maybe we are talking 10 megabytes to store the
> > whole thing, seeing that there are some long words such as
> > "Maher-shalal-hash-baz," a name that occurs in Isaiah 8:3. Ten
> > megabytes hardly seems "large" when compared to the 4 or 8 gigabytes
> > or more of RAM on many computers. Besides, you don't have to keep
> > all of the text in memory, but only the distinct words and an
> > integer count of the number of occurrences. For the King James
> > Bible, this is less than 5,000 words, so we're talking a pretty
> > minimal amount of memory. A hash table might work fine for this, or
> > build a balanced binary tree of the words. After you have scanned
> > all of the input text and determined the number of occurrences of
> > each word, it is fairly easy to scan the word counts and pick out
> > the ten largest.
> >
> > Dave
> >
> > On Oct 22, 9:24 am, ligerdave <[email protected]> wrote:
> > > For a large file, you would probably want to use an external
> > > sort, kind of like a map-reduce concept. It's actually how the
> > > sort & uniq kind of thing works in unix/linux when you try to
> > > find some "TOP X".
> > >
> > > Again, we are talking about a file that memory might not hold in
> > > its entirety.
> > >
> > > On Oct 21, 9:35 am, "Vinay..." <[email protected]> wrote:
> > > > How do you find the 10 most repeated words in a large file
> > > > containing words in the most efficient way? If it can also be
> > > > done using heapsort, please post your answers.
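A rough sketch of Mridul's two-step TRIE-plus-min-heap approach quoted above (Python; the Node class, tokenization, and function names are hypothetical illustrations, not from the thread):

```python
import heapq
import re

class Node:
    __slots__ = ("children", "count")
    def __init__(self):
        self.children = {}  # next character -> child Node
        self.count = 0      # frequency of the word ending at this node

def top_k_trie(path, k=10):
    root = Node()
    # Step 1: insert every word into the trie, O(L) total over the
    # L characters of the file; the node where a word ends holds its count.
    with open(path) as f:
        for line in f:
            for word in re.findall(r"[a-z']+", line.lower()):
                node = root
                for ch in word:
                    node = node.children.setdefault(ch, Node())
                node.count += 1
    # Step 2: walk the trie, keeping a min heap of the k most frequent
    # words seen so far -- O(n lg k) over the n distinct words.
    heap = []
    stack = [(root, "")]
    while stack:
        node, prefix = stack.pop()
        if node.count:
            item = (node.count, prefix)
            if len(heap) < k:
                heapq.heappush(heap, item)
            elif item > heap[0]:
                heapq.heapreplace(heap, item)
        for ch, child in node.children.items():
            stack.append((child, prefix + ch))
    return sorted(heap, reverse=True)
```

The heap stays at k = 10 entries throughout: a new word either displaces the current minimum or is discarded, which is what gives the O(n lg k) selection bound.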

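And for ligerdave's point about files too large for memory, a sketch of the external-sort route (run size, temp-file handling, and names are illustrative assumptions): spill sorted runs to disk, k-way merge them so duplicates come out adjacent, and count groups much like unix `sort | uniq -c | sort -rn | head`.

```python
import heapq
import itertools
import tempfile

def _write_run(words):
    # Sort one in-memory chunk and spill it to a temporary file.
    words.sort()
    run = tempfile.TemporaryFile(mode="w+")
    run.writelines(w + "\n" for w in words)
    run.seek(0)
    return run

def top_k_external(path, k=10, run_size=1_000_000):
    # Phase 1: split the input into sorted runs small enough for memory.
    runs, chunk = [], []
    with open(path) as f:
        for line in f:
            for word in line.lower().split():
                chunk.append(word)
                if len(chunk) == run_size:
                    runs.append(_write_run(chunk))
                    chunk = []
    if chunk:
        runs.append(_write_run(chunk))
    # Phase 2: merging the sorted runs brings duplicates out adjacent
    # (like unix `sort`), so grouping counts each word exactly once;
    # a min heap of size k keeps only the k most frequent words.
    heap = []
    words = (line.rstrip("\n") for line in heapq.merge(*runs))
    for word, group in itertools.groupby(words):
        item = (sum(1 for _ in group), word)
        if len(heap) < k:
            heapq.heappush(heap, item)
        elif item > heap[0]:
            heapq.heapreplace(heap, item)
    return sorted(heap, reverse=True)
```

Only one run's worth of words plus the k-entry heap ever sits in memory at once, which is the whole point when the file itself won't fit.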