Use a max heap of all the distinct words, ordered in decreasing order of their frequency.
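A minimal sketch of that idea in Python (the tokenization, function name, and input file are my own illustrative choices, not from this thread): count each word in a hash table, as Dave suggests further down the thread, then negate the counts to turn heapq's min heap into a max heap and pop the ten largest.

```python
import heapq
import re

def top_ten_max_heap(path):
    # Count occurrences of each word in a hash table (dict).
    counts = {}
    with open(path) as f:
        for line in f:
            for word in re.findall(r"[a-z']+", line.lower()):
                counts[word] = counts.get(word, 0) + 1
    # Build a max heap over all n distinct words by negating counts:
    # O(n) to heapify, then O(lg n) per pop for the ten largest.
    heap = [(-count, word) for word, count in counts.items()]
    heapq.heapify(heap)
    top = []
    for _ in range(min(10, len(heap))):
        neg, word = heapq.heappop(heap)
        top.append((word, -neg))
    return top

print(top_ten_max_heap("words.txt"))  # hypothetical input file
```

Heapifying all n distinct words is O(n) and each of the ten pops is O(lg n); the min-heap-of-k variant quoted below does the selection in O(n lg k) instead, which matters only when n is huge.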
On Sat, Oct 23, 2010 at 7:31 AM, Mridul Malpani <[email protected]> wrote:

> Use two data structures: a TRIE and an array of the words along with
> their frequencies.
>
> We can solve it in the following steps:
>
> 1) Scan each word of the large file. Insert the current word into the
> TRIE if it is not present already, and update its frequency. Using the
> TRIE, the time complexity is O(L), where L is the total number of
> characters in the file.
>
> 2) Now you have built up the array of words with their frequencies,
> where the size of the array is n = the number of distinct words in
> the file. Get the k maximum frequencies in O(n lg k); here k = 10.
> (Use a min heap of k elements to do this.)
>
> On Oct 23, 8:19 am, Dave <[email protected]> wrote:
> > @Ligerdave: Hey, the King James version of the Bible is only about
> > 600,000 words. I use the Bible as an example only because it is a
> > fairly large book. Maybe we are talking 10 megabytes to store the
> > whole thing, seeing that there are some long words such as
> > "Maher-shalal-hash-baz," a name that occurs in Isaiah 8:3. Ten
> > megabytes hardly seems "large" when compared to the 4 or 8 gigabytes
> > or more of RAM on many computers. Besides, you don't have to keep
> > all of the text in memory, but only the distinct words and an
> > integer count of the number of occurrences. For the King James
> > Bible, this is less than 5,000 words, so we're talking a pretty
> > minimal amount of memory. A hash table might work fine for this, or
> > build a balanced binary tree of the words. After you have scanned
> > all of the input text and determined the number of occurrences of
> > each word, it is fairly easy to scan the word counts and pick out
> > the ten largest.
> >
> > Dave
> >
> > On Oct 22, 9:24 am, ligerdave <[email protected]> wrote:
> > > For a large file, you would probably want to use an external
> > > sort, kind of like a map-reduce concept. It's actually how the
> > > sort & uniq kind of thing works in unix/linux when you try to
> > > find some "TOP X".
> > >
> > > Again, we are talking about a file that memory might not hold in
> > > its entirety.
> > >
> > > On Oct 21, 9:35 am, "Vinay..." <[email protected]> wrote:
> > > > How do you find the 10 most repeated words in a large file
> > > > containing words in the most efficient way? If it can also be
> > > > done using heapsort, please post your answers.
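A rough sketch of Mridul's two-step TRIE-plus-min-heap approach quoted above (Python; the Node class, tokenization, and function names are hypothetical illustrations, not from the thread):

```python
import heapq
import re

class Node:
    __slots__ = ("children", "count")
    def __init__(self):
        self.children = {}  # next character -> child Node
        self.count = 0      # frequency of the word ending at this node

def top_k_trie(path, k=10):
    root = Node()
    # Step 1: insert every word into the trie, O(L) total over the
    # L characters of the file; the node where a word ends holds its count.
    with open(path) as f:
        for line in f:
            for word in re.findall(r"[a-z']+", line.lower()):
                node = root
                for ch in word:
                    node = node.children.setdefault(ch, Node())
                node.count += 1
    # Step 2: walk the trie, keeping a min heap of the k most frequent
    # words seen so far -- O(n lg k) over the n distinct words.
    heap = []
    stack = [(root, "")]
    while stack:
        node, prefix = stack.pop()
        if node.count:
            item = (node.count, prefix)
            if len(heap) < k:
                heapq.heappush(heap, item)
            elif item > heap[0]:
                heapq.heapreplace(heap, item)
        for ch, child in node.children.items():
            stack.append((child, prefix + ch))
    return sorted(heap, reverse=True)
```

The heap stays at k = 10 entries throughout: a new word either displaces the current minimum or is discarded, which is what gives the O(n lg k) selection bound.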

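And for ligerdave's point about files too large for memory, a sketch of the external-sort route (run size, temp-file handling, and names are illustrative assumptions): spill sorted runs to disk, k-way merge them so duplicates come out adjacent, and count groups much like unix `sort | uniq -c | sort -rn | head`.

```python
import heapq
import itertools
import tempfile

def _write_run(words):
    # Sort one in-memory chunk and spill it to a temporary file.
    words.sort()
    run = tempfile.TemporaryFile(mode="w+")
    run.writelines(w + "\n" for w in words)
    run.seek(0)
    return run

def top_k_external(path, k=10, run_size=1_000_000):
    # Phase 1: split the input into sorted runs small enough for memory.
    runs, chunk = [], []
    with open(path) as f:
        for line in f:
            for word in line.lower().split():
                chunk.append(word)
                if len(chunk) == run_size:
                    runs.append(_write_run(chunk))
                    chunk = []
    if chunk:
        runs.append(_write_run(chunk))
    # Phase 2: merging the sorted runs brings duplicates out adjacent
    # (like unix `sort`), so grouping counts each word exactly once;
    # a min heap of size k keeps only the k most frequent words.
    heap = []
    words = (line.rstrip("\n") for line in heapq.merge(*runs))
    for word, group in itertools.groupby(words):
        item = (sum(1 for _ in group), word)
        if len(heap) < k:
            heapq.heappush(heap, item)
        elif item > heap[0]:
            heapq.heapreplace(heap, item)
    return sorted(heap, reverse=True)
```

Only one run's worth of words plus the k-entry heap ever sits in memory at once, which is the whole point when the file itself won't fit.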