@Ligerdave: Hey, the King James version of the Bible is only about
600,000 words. I use the Bible as an example only because it is a
fairly large book. Maybe we are talking 10 megabytes to store the
whole thing, seeing that there are some long words such as "Maher-
shalal-hash-baz," a name that occurs in Isaiah 8:3. Ten megabytes
hardly seems "large," when compared to the 4 or 8 gigabytes or more of
RAM on many computers. Besides, you don't have to keep all of the text
in memory, but only the distinct words and an integer count of the
number of occurrences. For the King James Bible, that is fewer than
5,000 distinct words, so we're talking about a pretty minimal amount of
memory. A hash table would work fine for this, or you could build a
balanced binary tree of the words. After you have scanned all of the
input text and determined the number of occurrences of each word, it is
fairly easy to scan the word counts and pick out the ten largest.
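To make that concrete, here's a rough sketch in Python (my choice of
language, not something from the original thread): a Counter plays the
role of the hash table of word counts, and heapq.nlargest pulls out the
ten most frequent words, which also ties in the heap idea from the
original question.

import heapq
import re
import sys
from collections import Counter

def top_words(path, k=10):
    """Return the k most frequent words in the file at `path`."""
    counts = Counter()  # hash table: word -> number of occurrences
    with open(path, encoding="utf-8") as f:
        for line in f:  # stream the file; only the counts stay in memory
            counts.update(re.findall(r"[A-Za-z'-]+", line.lower()))
    # A size-k heap picks out the k largest counts without fully sorting
    # all of the distinct words (Counter.most_common(k) does the same).
    return heapq.nlargest(k, counts.items(), key=lambda item: item[1])

if __name__ == "__main__":
    for word, count in top_words(sys.argv[1]):
        print("%8d  %s" % (count, word))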

Dave

On Oct 22, 9:24 am, ligerdave <[email protected]> wrote:
> For a large file, you would probably want to use an external sort, kind
> of like a map-reduce concept. That's actually how the sort/uniq style of
> pipeline works in Unix/Linux when you try to find some "top X".
>
> Again, we are talking about a case where memory might not hold the
> entire file.
>
> On Oct 21, 9:35 am, "Vinay..." <[email protected]> wrote:
>
> > How do you find the 10 most repeated words in a large file in the
> > most efficient way? If it can also be done using heapsort, please
> > post your answers.
