Re: [agi] Making sense of Zipf's Law

J Rao Mon, 14 Dec 2015 17:40:33 -0800

Is there a reason we couldn't just measure the frequency using a big corpus?
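For what it's worth, the measurement I have in mind is only a few lines. This is a minimal sketch, not a tested pipeline: the function name `rank_frequencies` and the crude regex tokenizer are my own assumptions, and a real corpus would need better tokenization.

```python
import re
from collections import Counter

def rank_frequencies(text):
    """Return (rank, word, relative_frequency) tuples, most frequent word first."""
    # Crude tokenizer (an assumption): lowercase runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = sum(counts.values())
    # most_common() sorts by descending count, so enumerate() yields the rank.
    return [(rank, word, count / total)
            for rank, (word, count) in enumerate(counts.most_common(), start=1)]

if __name__ == "__main__":
    sample = "the cat saw the dog and the dog saw the cat"
    for rank, word, freq in rank_frequencies(sample):
        print(rank, word, round(freq, 3))
```

Measured this way, the relative frequencies sum to 1.0 by construction, since every token in the corpus is counted exactly once.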


On 12/15/2015 3:33 AM, Steve Richfield wrote:

Hi,
Just to make sure we are starting on the same page, see the Wikipedia article about Zipf's law at:
https://en.wikipedia.org/wiki/Zipf%27s_law
In summary, this provides a formula to convert word ranking into approximate frequency of occurrence, which is VERY useful in identifying least frequently used words to trigger processing, etc.
Whatever formula someone might consider should sum to 1.0 over an infinite list of ranked words, as each word in a text appears SOMEWHERE in a ranking. However in reality, the story is more complex.
Looking at words in Wikipedia, frequency goes as 0.07/N (which does NOT converge for an infinite list of words) out to rank 10,000 or so, and then drops off considerably more rapidly, so that the millionth-ranked word is nearly 2 orders of magnitude less frequent than it would be if the linear (log-log) relationship had continued. Apparently no one has (yet) done the math to fit this to SOMETHING that converges to a total frequency of 1.0.
I just HATE non-converging series.
Note that a simple formula that fits the ENTIRE Wikipedia curve can be had by simply substituting 700/(N^2) for 0.07/N when N>10^4. The two formulas agree at N=10^4.
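As a quick check on the two-piece formula, here is a minimal sketch that adds it up numerically. The constants 0.07, 700 and the 10^4 switch point are the ones quoted above; the names `freq`, `KNEE` and `total_mass` are made up for illustration, and the ranks beyond 10^6 are only bounded by an integral, not summed.

```python
import math

KNEE = 10_000  # rank where the formula switches from 0.07/N to 700/N^2

def freq(n):
    """Two-piece frequency estimate: 0.07/N up to the knee, 700/N^2 beyond it."""
    return 0.07 / n if n <= KNEE else 700.0 / (n * n)

def total_mass(max_rank):
    """Sum of the estimated frequencies for ranks 1..max_rank."""
    return math.fsum(freq(n) for n in range(1, max_rank + 1))

MAX_RANK = 1_000_000
head = total_mass(KNEE)         # contribution of the 0.07/N part
partial = total_mass(MAX_RANK)  # ranks 1..10^6
# Everything past MAX_RANK is at most the integral of 700/x^2 from MAX_RANK on.
tail_bound = 700.0 / MAX_RANK

# How far the millionth-ranked word falls below the continued 0.07/N line.
ratio_at_million = (0.07 / MAX_RANK) / freq(MAX_RANK)

if __name__ == "__main__":
    print("ranks 1..10^4:      ", head)
    print("ranks 1..10^6:      ", partial)
    print("remainder at most:  ", tail_bound)
    print("ratio at rank 10^6: ", ratio_at_million)
```

With these constants the series does converge, but to about 0.75 rather than 1.0, so the two-piece formula as written still needs rescaling before it can be read as a probability distribution.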
OK, so where does the magic 10,000 come from? THAT appears to be our basic vocabulary, beyond which various subgroups add their own specialized vocabularies, explaining the rapid drop-off after 10,000 words. A corpus other than Wikipedia that is an amalgamation of many disparate subjects would doubtless have a very different "curve" out beyond 10,000. It looks to me like whoever settled on a 3,000-word basic vocabulary picked the wrong number - they should have gone for 10,000 words.
This seems to also say a lot about language granularity - how finely we presume the construction of our universe to be. For those who think we are in some sort of simulation, this might say something about the precision of such a simulation, etc.
This seems to also say a lot about how much would be needed by an AI/AGI text "understanding" system - "understanding" somewhere beyond 10^4 words to be broadly useful.
Anyway - I saw some wisdom in these numbers, along with some mathematical shortfalls in the associated formulas that someone needs to turn into equations that sum to 1.0.
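One possible way to get a version that sums to 1.0 is to keep the same two-piece shape and solve for the coefficient. This is only a sketch under that assumption: it takes a/N up to rank 10^4 and a*10^4/N^2 beyond it, and the name `normalized_coefficient` is hypothetical. It is a rescaling exercise, not a fit to the Wikipedia data.

```python
import math

KNEE = 10_000  # rank where the curve changes slope

def normalized_coefficient(knee=KNEE):
    """Coefficient a such that a/N (N <= knee) plus a*knee/N^2 (N > knee) sums to 1."""
    # Head: the harmonic number H_knee = sum of 1/N for N = 1..knee.
    harmonic = math.fsum(1.0 / n for n in range(1, knee + 1))
    # Tail: sum of 1/N^2 for N > knee, computed as zeta(2) minus the partial sum,
    # where zeta(2) = pi^2 / 6.
    inv_sq_tail = math.pi ** 2 / 6 - math.fsum(1.0 / (n * n) for n in range(1, knee + 1))
    return 1.0 / (harmonic + knee * inv_sq_tail)

a = normalized_coefficient()

if __name__ == "__main__":
    print("coefficient a:", a)
    print("tail numerator a*KNEE:", a * KNEE)
```

Under these assumptions the coefficient comes out at roughly 0.093 instead of 0.07, so a normalized curve of this shape would sit noticeably above the observed 0.07/N line. That suggests the shape itself, and not just the constant, would have to change to match the data.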
Thoughts?

/Steve/





-------------------------------------------
AGI
Archives: https://www.listbox.com/member/archive/303/=now
RSS Feed: https://www.listbox.com/member/archive/rss/303/21088071-f452e424
Modify Your Subscription: 
https://www.listbox.com/member/?member_id=21088071&id_secret=21088071-58d57657
Powered by Listbox: http://www.listbox.com
