Re: [agi] Making sense of Zipf's Law

J Rao Mon, 14 Dec 2015 23:34:19 -0800

Right, the system would need some way to deal with words not in its vocabulary (which I assume would always be limited initially). I think the standard practice is to replace all such words with an unknown-word token, or better yet to try to infer each word's meaning from the words around it.
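
A minimal sketch of the unknown-word-token idea, assuming a vocabulary built from the most frequent words of a training text. The token name `<unk>` and the helpers `build_vocab` and `replace_unknown` are hypothetical names chosen for illustration. Inferring meaning from context is not attempted here.

```python
from collections import Counter

UNK = "<unk>"  # hypothetical placeholder token; the name is an assumption


def build_vocab(tokens, max_size):
    """Keep the max_size most frequent tokens as the known vocabulary."""
    counts = Counter(tokens)
    return {word for word, _ in counts.most_common(max_size)}


def replace_unknown(tokens, vocab):
    """Swap every token outside the vocabulary for the UNK placeholder."""
    return [t if t in vocab else UNK for t in tokens]


corpus = "the cat sat on the mat the cat slept".split()
vocab = build_vocab(corpus, max_size=2)  # "the" (3 times) and "cat" (2 times)
print(replace_unknown("the cat chased the dog".split(), vocab))
# prints ['the', 'cat', '<unk>', 'the', '<unk>']
```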


On 12/15/2015 1:19 PM, Steve Richfield wrote:

On Mon, Dec 14, 2015 at 5:39 PM, J Rao <[email protected]> wrote:
    Is there a reason we couldn't just measure the frequency using a
    big corpus?
Yeah. To do that you must process the big corpus, which requires this information in advance to make the processing go faster. In short, this creates a sort of chicken-or-egg problem. Also, the frequencies CHANGE with time as interests wax and wane.
Remember - it takes MANY occurrences of a word to establish its frequency with any accuracy.
Note that the Wikipedia article I mentioned processed the entirety of Wikipedia. You might notice the little dashes at the bottom of the red line. Those come from words that occur just once in all of Wikipedia - probably just spelling errors.
For really rare words, like neologisms, there probably aren't enough occurrences on the entire Internet to establish frequency from observation.
Fortunately, all **I** need to be able to do is compare the frequencies of short lists of words with the frequencies of other short lists of words, which hopefully won't be particularly sensitive to the effects I have been discussing. Even if there is an "error" in such a comparison, it would be between nearly equally occurring lists, so there would be little lost, other than a few milliseconds of computer time.
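
One way the list comparison might look, as a rough sketch only. It assumes a list's frequency is the sum of its words' observed counts divided by the corpus size (the thread does not say how the lists are aggregated), and it assumes Poisson-like counting noise, so that a total of k occurrences carries a relative error of about 1/sqrt(k). The function name `list_frequency` is hypothetical.

```python
import math
from collections import Counter


def list_frequency(words, counts, total):
    """Return (summed relative frequency, rough relative error) for a word list.

    Assumes Poisson-like counting noise: k observed occurrences give a
    relative error of roughly 1/sqrt(k). Unseen words contribute zero.
    """
    k = sum(counts[w] for w in words)
    freq = k / total
    rel_err = 1 / math.sqrt(k) if k else float("inf")
    return freq, rel_err


tokens = "a b a c a b d a".split()  # toy corpus: a=4, b=2, c=1, d=1
counts = Counter(tokens)
total = sum(counts.values())

f1, e1 = list_frequency(["a", "d"], counts, total)
f2, e2 = list_frequency(["b", "c"], counts, total)
print(f1, f2)  # prints 0.625 0.375
```

With counts this small the relative errors are large, which is the point made above about needing many occurrences before a frequency means much.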
/Steve/
===============

    On 12/15/2015 3:33 AM, Steve Richfield wrote:

        Hi,

        Just to make sure we are starting on the same page, see the
        Wikipedia article about Zipf's law at:

        https://en.wikipedia.org/wiki/Zipf's_law
        <https://en.wikipedia.org/wiki/Zipf%27s_law>

        In summary, this provides a formula to convert word ranking
        into approximate frequency of occurrence, which is VERY useful
        in identifying least frequently used words to trigger
        processing, etc.

        Whatever formula someone might consider should sum to 1.0 over
        an infinite list of ranked words, as each word in a text
        appears SOMEWHERE in a ranking. However in reality, the story
        is more complex.

        Looking at words in Wikipedia, frequency goes as 0.07/N (which
        does NOT converge for an infinite list of words) out to 10,000
        or so, and then drops off considerably more rapidly so that
        the millionth-ranked word is nearly 2 orders of magnitude less
        frequent than it would be if the linear relationship had
        continued. Apparently no one has (yet) done the math to fit
        this to SOMETHING that converges to a total frequency of 1.0.

        I just HATE non-converging series.
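
The non-convergence of 0.07/N can be checked numerically. This is only an illustrative sketch: the function name `partial_sum` is made up here, and the closed-form comparison uses the standard approximation of the harmonic number, H(M) ~ ln(M) + 0.5772 (the Euler-Mascheroni constant).

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant


def partial_sum(m, c=0.07):
    """Sum of c/N for N = 1..m, computed directly."""
    return c * sum(1.0 / n for n in range(1, m + 1))


for m in (10**3, 10**4, 10**5, 10**6):
    # Harmonic-number approximation: H(m) ~ ln(m) + gamma
    approx = 0.07 * (math.log(m) + EULER_GAMMA)
    print(m, round(partial_sum(m), 4), round(approx, 4))
```

The partial sums keep growing like 0.07 * ln(M), and the running total already passes 1.0 somewhere between the 100,000th and the millionth ranked word, so a pure 0.07/N law cannot describe a complete ranking.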

        Note that a simple formula that fits the ENTIRE Wikipedia
        curve can be had by simply substituting the formula 700/(N^2)
        for 0.07/N when N>10^4.
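
A sketch of that piecewise formula, taking 0.07/N up to rank 10^4 and 700/N^2 beyond it. The names `zipf_piecewise` and `total_mass` are hypothetical, and the tail past the cutoff is approximated by the integral of 700/x^2, which is an assumption of this sketch rather than part of the formula.

```python
KNEE = 10**4  # rank where the curve is assumed to change slope


def zipf_piecewise(n):
    """Frequency estimate for rank n: 0.07/n up to the knee, 700/n**2 beyond it."""
    return 0.07 / n if n <= KNEE else 700.0 / n**2


def total_mass(cutoff=10**6):
    """Approximate sum of the piecewise formula over all ranks."""
    head = sum(zipf_piecewise(n) for n in range(1, KNEE + 1))
    tail = sum(zipf_piecewise(n) for n in range(KNEE + 1, cutoff + 1))
    # Ranks beyond the cutoff: approximate the remaining sum by the
    # integral of 700/x**2 from cutoff to infinity, which is 700/cutoff.
    tail += 700.0 / cutoff
    return head + tail


print(zipf_piecewise(KNEE))    # both branches meet at 7e-06 here
print(zipf_piecewise(10**6))   # 100 times below 0.07/10**6
print(round(total_mass(), 3))
```

The two branches agree at N = 10^4, and the millionth rank comes out two orders of magnitude below the 0.07/N line, matching the description above. The series does converge, but by this arithmetic it converges to roughly 0.755 rather than 1.0, so the constants would still need rescaling (or refitting) to give a proper distribution.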

        OK, so where does the magic 10,000 come from? THAT appears to
        be our basic vocabulary, beyond which various subgroups add
        their own specialized vocabularies, explaining the rapid
        drop-off after 10,000 words. A corpus other than Wikipedia
        that is an amalgamation of many disparate subjects would
        doubtless have a very different "curve" out beyond 10,000. It
        looks to me like the 3,000 word basic vocabulary picked the
        wrong number - they should have gone for 10,000 words.

        This seems to also say a lot about language granularity - how
        finely we presume the construction of our universe to be. For
        those who think we are in some sort of simulation, this might
        say something about the precision of such a simulation, etc.

        This seems to also say a lot about how much would be needed by
        an AI/AGI text "understanding" system - "understanding"
        somewhere beyond 10^4 words to be broadly useful.

        Anyway - I saw some wisdom in these numbers, along with some
        mathematical shortfalls in the associated formulas that
        someone needs to turn into equations that sum to 1.0.

        Thoughts?

        /Steve/
        *AGI* | Archives
        <https://www.listbox.com/member/archive/303/=now>
        <https://www.listbox.com/member/archive/rss/303/26346070-1cd82ca6>
        | Modify <https://www.listbox.com/member/?&;> Your
        Subscription    [Powered by Listbox] <http://www.listbox.com>









--
Full employment can be had with the stroke of a pen. Simply institute a six hour workday. That will easily create enough new jobs to bring back full employment.





