On Mon, Mar 18, 2013 at 7:05 PM, Steve Richfield
<[email protected]>wrote:

> Logan,
>
> Interesting you noticed the simile between this and recursive
> ascent-descent parsing.
>
> Note that my comments below apply ONLY to NL, and **NOT** to computer
> languages like C. NL has orders of magnitude more variation in structure
> than computer languages, and it is those structural variations, acting in
> recursion, that have been the speed trap for NLP. If I were going to write
> a C compiler, I would NOT use the methods I am describing here.
>
> In a way, the simplest use of this IS a form of recursive ascent parsing,
> only the evaluation of every rule that is missing its least frequently
> used component (which is the VAST majority of rules) is omitted, and hence
> has no cost in machine time.
>
In RADP this is the default behavior anyway: it only looks for things if
they are required.


> The equivalent of the "descent" part is easily handled by having early
> rules set flags within appropriate scopes, to later be tested by
> lower-level rules. This allows unlimited zig-zagging, ascending up and
> descending down as needed to parse almost anything, as in recursive
> ascent-descent.
>
Table-driven and recursive approaches are completely different.
Table-driven parsing simply is not, and never was, scalable, due to its
inherent complexity and context-free nature.


> I suspect that the best implementation of this will involve coding rules
> as though they were to be executed in recursive ascent-descent fashion, but
> instead compiling those rules to actually execute as I describe.
>

That doesn't make sense,
since RADP code is faster by default than table-driven code.



>
> All forms of parsing, including recursive ascent-descent, first require
> that every token be fully characterized - and there is no faster way than
> with the floating point method I described, in or out of hand-coded
> assembly language.
>
>
You don't know RADP; it doesn't require "full characterization", or
really any at all if the language uses spaces between words. Tokenizing
may be necessary for languages that don't use spaces, such as Chinese or
Thai, but only to separate the text into space-delimited words.

Otherwise the words themselves are all that's necessary.
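
To make that concrete, here is a minimal sketch of what tokenizing a
space-delimited language amounts to (the function name is mine, purely
illustrative):

```python
def tokenize(sentence: str) -> list[str]:
    """Split space-delimited text into word tokens. For scripts
    without spaces (e.g. Chinese), a segmentation pass would have
    to insert the word boundaries first; after that, the same
    split applies."""
    return sentence.split()

print(tokenize("the cat sat on the mat"))
# ['the', 'cat', 'sat', 'on', 'the', 'mat']
```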


> Beyond that, there is still another 2-3 orders of magnitude to be gained
> by omitting the evaluation of >99% of all rules that would be evaluated by
> other methods that evaluate all rules that are in their recursive path,
> rather than ONLY the rules that are known to have their least likely
> criteria met.
>

See, this is a major difference: in RADP there is no "table of rules", so
there is no time wasted on one either.
There are simply a variety of functions that process components, each of
which runs only when it is requested and finds its component.
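
A toy sketch of that on-request style follows; every name here is
made up for illustration and is not from any real RADP codebase:

```python
def parse_noun_phrase(tokens, pos):
    """Try to recognise a noun phrase at `pos`.
    Return ((node), new_pos) on success, or None."""
    if pos < len(tokens) and tokens[pos] in {"the", "a"}:
        if pos + 1 < len(tokens):
            return ("NP", tokens[pos + 1]), pos + 2
    return None

def parse_sentence(tokens):
    """Request only the components a sentence needs, in order:
    first a subject noun phrase, then a verb."""
    np = parse_noun_phrase(tokens, 0)   # called only because it is needed here
    if np is None:
        return None
    subj, pos = np
    if pos < len(tokens):
        verb = tokens[pos]              # next requested component: the verb
        return ("S", subj, verb)
    return None

print(parse_sentence("the cat sleeps".split()))
# ('S', ('NP', 'cat'), 'sleeps')
```

There is no table to consult: a component function simply never runs
unless some parent construct asks for it.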


> To continue this discussion, I would have to understand how any
> competitive system that follows a recursive path that involves evaluating
> every rule in its recursive path, could possibly compete with a system that
> can summarily omit evaluating >99% of the rules?
>

Because it doesn't evaluate such rules.
There are no parse trees or abstract doohickeys;
it's plainly and simply how humans would parse things:
read the sentence, figure out the verb, then the subject or object and any
relevant cases to pass to it as arguments.
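
That verb-first procedure can be sketched in a few lines (the verb list
is a stand-in for a real lexicon, and the structure returned is my own
illustrative choice):

```python
# Toy verb lexicon; a real system would look verbs up properly.
KNOWN_VERBS = {"eats", "sees", "likes"}

def parse_verb_first(sentence: str):
    """Find the verb, then treat what precedes it as the subject and
    what follows it as the object, passed as the verb's arguments."""
    words = sentence.split()
    for i, w in enumerate(words):
        if w in KNOWN_VERBS:
            return {
                "verb": w,
                "args": {
                    "subject": " ".join(words[:i]),
                    "object": " ".join(words[i + 1:]),
                },
            }
    return None  # no known verb found

print(parse_verb_first("the dog eats the bone"))
# {'verb': 'eats', 'args': {'subject': 'the dog', 'object': 'the bone'}}
```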



>
> The "trick" to making my system work in the absence of following "a course
> through the rules", is the way the rules are juggled on the queues. This is
> NOT to make things run faster, but to have them work at all when >99% of
> the rules that would normally guide parsing are not even performed.
>
> Again, aside from not performing >99% of the rules, this is VERY similar
> to recursive ascent-descent parsing.
>
> Continuing...
> On Mon, Mar 18, 2013 at 1:59 PM, Logan Streondj <[email protected]>wrote:
>
>> I highly doubt it's any faster
>>
>
> I suspect that you haven't looked at it enough to clearly understand it.
> No, I won't force you to read the patent, but rather I stand ready to
> answer all questions.
>
>
>> or easier
>>
>
> Easier - clearly NOT. Simplicity is clearly NOT a virtue of my approach.
> However, with the right rules compiler, the greater complexity of my
> approach would be hidden from the rules programmers. Note that programming
> the rules for everyday English is a BIG job, probably requiring man-years
> of work by talented linguists, and hence worth the effort to write a good
> compiler to support their efforts.
>

Yet simplicity is essential for it to be scalable and maintainable once it
gets to AGI levels.


>
>
>> to manage than recursive ascent descent parsing coded in register-machine
>> assembly.
>>
>
> When we are talking several orders of magnitude, the ~2:1 improvement by
> hand coding assembly is way down in the "noise". For most projects, the
> best approach seems to be to program them in high-level form, and then randomly
> stop the program a few hundred times to find out where the time is going,
> and finally hand code the ~100 lines that are consuming the majority of the
> time. I have done this on several projects, and every time my guesstimate
> as to where the time was going was WRONG!!! On one commercial compiler, I
> had thrown in a super-simplistic sequential search of the symbol table,
> rather than using other well known methods, only because I was in a big
> hurry when I coded it. However, on the random stops, the sequential search
> was only a couple percent of the total overhead, so I never even bothered
> replacing it. It subsequently served for years in commercial service,
> sequentially scanning its symbol table.
>
> Steve
>


I have a few separate arrays, for cases, types, and sentence indicators,
and they are all scanned sequentially. I doubt it would be possible to
make them any faster by any other scanning method, however.
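
For arrays of that size a linear scan is hard to beat, since the whole
array sits in cache and there is no hashing overhead. A sketch, with
illustrative contents:

```python
# Small lookup table scanned sequentially; contents are examples only.
CASES = ["nominative", "accusative", "dative", "genitive"]

def find_case(name: str) -> int:
    """Return the index of `name` in CASES, or -1 if absent."""
    for i, case in enumerate(CASES):
        if case == name:
            return i
    return -1

print(find_case("dative"))    # 2
print(find_case("vocative"))  # -1
```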


As an aside, I don't believe it's necessary to have a computer understand
all the permutations of grammar, as each human speaker only uses a subset
of several thousand words and grammar styles.



> ====================
>
>> On Mon, Mar 18, 2013 at 4:29 PM, Steve Richfield <
>> [email protected]> wrote:
>>
>>> Get ready for an intelligent Internet, because here it comes...
>>> *
>>> The Problem*
>>>
>>> *
>>> *
>>>
>>> Why haven’t computers been able to understand plain English? There have
>>> been many attempts over the last 4 decades, and on careful examination they
>>> all seem to end the same way. Someone sees the complexity of English as
>>> being well within the reach of a good programmer, writes some working code,
>>> starts entering rules, and then they simultaneously hit two barriers:
>>>
>>>
>>>
>>>    1. While a few simple rules work on test cases, real-world English
>>>    is REALLY complicated, with enough exceptions to every rule that it
>>>    would take thousands of rules to be able to pick apart everyday
>>>    English, and many of the rules are not at all simple. It would take
>>>    years of work to write a “critical mass” of such rules, and lacking
>>>    certain psychiatric conditions, these rules are not at all easy or
>>>    fun to create. Of course, if a developer has a few million dollars,
>>>    they could hire a team of linguists to start writing these rules,
>>>    but...
>>>    2. As developers enter even enough rules for a good demonstration,
>>>    their program starts to run SO slowly that they must back out some
>>>    of their rules just to have their program respond in a timely
>>>    manner. A little research into the combinatorial nature of the
>>>    problem shows that they are at least a couple of orders of magnitude
>>>    short on speed. Hence, they are unmotivated to put together a
>>>    company to create the needed rules, when there is no computer
>>>    capable of processing them.
>>>
>>> So, one by one, NLP developers have published some interesting examples
>>> of things their programs were able to do, without mentioning how long it
>>> took, or that it was hopeless to extend their methods to everyday English.
>>> No one wants to publish that their methods are unscalable, so these past
>>> efforts have simply faded away, without marking the trap waiting for the
>>> next NLP project.
>>>
>>>
>>> I have talked with several people who were writing yet another program
>>> to “understand” English, to try to save them from wasting years of their
>>> lives as have others before them, but they invariably just couldn’t believe
>>> that a modern gigahertz processor could ever be bogged down by seemingly
>>> simple string processing.
>>>
>>>
>>> This is the path that DrEliza.com was on. Looking to rewrite it, better
>>> programming offered a possible order of magnitude improvement in speed,
>>> which was still not enough to achieve the desired performance. However,
>>> instead of walking away from it as past NLP developers had done, I decided
>>> to determine just how fast it was conceivably possible to process NL, to
>>> see if this speed trap was theoretically unavoidable. In this process, I
>>> found a new technique that would probably be ~3-4 orders of magnitude
>>> faster than traditional methods, depending on just what it is compared with.
>>>
>>>
>>> Now, I have a way to avoid this speed trap, so people can start writing
>>> highly scalable NLP code.
>>>
>>>
>>> *The Solution*
>>>
>>> *
>>> *
>>>
>>> The details that any competent AI guru could apply to implement REALLY
>>> fast parsing, operating orders of magnitude faster than prior-art
>>> methods, and to make the Internet intelligent, are now embodied in
>>> U.S. Patent Application 13/836,678, which is attached to this posting.
>>> We are now working out the business details to encourage
>>> people to use this technology. Probably involved will be a users’ group, in
>>> which participation will earn enough credit toward future royalties that
>>> only medium and large sized corporations would end up paying anything.
>>> Also, we are open for joint ventures, e.g. trading a license to use
>>> this technology in return for founders' stock. Earlier thoughts of simply granting
>>> in return for founders’ stock. Earlier thoughts of simply granting
>>> exemptions from royalties, rather than granting credit, had some subtle
>>> legal problems and have been abandoned. If you think you see a better
>>> business approach, one good enough to get YOU involved, then please let me
>>> know.
>>>
>>>
>>> At considerable risk of summarizing 100 pages of legalese into a brief
>>> explanation...
>>>
>>>
>>> *Fast Parsing*
>>>
>>> *
>>> *
>>>
>>> Here it is in a nutshell:
>>>
>>>    - The input is parsed into tokens.
>>>    - The tokens are hashed into double precision floating point (DPFP)
>>>    numbers.
>>>    - A portion of the DPFP numbers are then used to access the English
>>>    lexicon in typical symbol table fashion, e.g. via a circular table, with
>>>    the usual collision handling, etc.
>>>    - During initialization, the first few thousand most commonly used
>>>    words, in order of frequency of use, are preprocessed to seed the
>>>    lexicon.
>>>    - Lexicon entries will contain the string that represents the word
>>>    (which is only needed for output), the DPFP hash for the word (used to
>>>    confirm that the correct entry has been found), and pointers to rules for
>>>    which the entry is the least frequently used word. Words will be
>>>    represented as an ordinal indicating the relative frequency of use, e.g.
>>>    “the” will be represented by 1.
>>>    - Rules are then compiled, during which time the least frequently
>>>    used words in the rules are identified and marked in the lexicon to
>>>    trigger the queuing of those rules.
>>>    - Higher-level rules will be queued as lower-level rules are
>>>    satisfied.
>>>    - As input words are processed, the rules that are triggered will be
>>>    put into appropriate queues. The next rule processed will always come
>>>    from the highest-priority non-empty queue. Higher-level rules will go
>>>    into lower-priority queues.
>>>    - When the last queue is empty, output can then be retrieved from
>>>    variables set by the rules.
>>>
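
The queue discipline in that list can be sketched as follows. This is a
toy illustration only: the hashing choice, the rule names, and the
priority levels are my own assumptions, not taken from the patent. A
rule is queued only when its least frequently used word actually
appears in the input, and rules are always drawn from the
highest-priority non-empty queue:

```python
import hashlib
from collections import deque

def token_hash(word: str) -> float:
    """Hash a token into a double-precision float (one possible scheme)."""
    return float(int(hashlib.md5(word.encode()).hexdigest()[:12], 16))

# Lexicon: hash -> (word, rules for which this word is least frequent).
LEXICON = {}

def seed(word, rules=()):
    LEXICON[token_hash(word)] = (word, list(rules))

# A rule is (priority level, name).  In this toy grammar, "aspirin" is
# the least frequent word of the one rule, so only it triggers queuing;
# common words like "the" trigger nothing.
seed("the")
seed("took")
seed("aspirin", rules=[(0, "medication-mention")])

def parse(text):
    queues = [deque(), deque()]   # queue 0 = highest priority
    fired = []
    for w in text.split():
        entry = LEXICON.get(token_hash(w))
        if entry:
            for level, rule in entry[1]:
                queues[level].append(rule)
    # Always service the highest-priority non-empty queue first.
    while any(queues):
        q = next(q for q in queues if q)
        fired.append(q.popleft())
    return fired

print(parse("the patient took aspirin"))
# ['medication-mention']
```

Note how a sentence containing no rule's least frequent word queues
nothing at all, which is where the claimed savings come from.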
>>> This method removes the usual scope constraints that most NL processing
>>> methods have, so for example, it will be possible to disambiguate
>>> abbreviations and idioms based on words that occur elsewhere in the same
>>> sentence, paragraph, posting, or prior posting by the same user, without
>>> incurring significant additional overhead.
>>>
>>>
>>> In other methods of parsing NL, >99% of all tests fail to find what they
>>> are seeking. In this method a large fraction, approaching half, will find
>>> what they are looking for because they aren’t performed unless the least
>>> likely element is present. Note that any rule accessing the results of a
>>> lower-level rule that hasn’t been evaluated simply assumes that its result
>>> is false, which it would be if it were to be evaluated, because its least
>>> likely necessary element MUST be absent for it not to have previously been
>>> evaluated.
>>>
>>>
>>> Note that this approach is a method of running really fast. It is NOT a
>>> particular parsing methodology. You can invent rules of any kind to parse
>>> text in any way you imagine. This is just a way of putting it all together
>>> to run really fast.
>>>
>>>
>>> Note that this same sort of selective processing based on the appearance
>>> of least likely elements has all sorts of other applications not mentioned
>>> in the patent, e.g. in AGI internals, so it is important to grok how this
>>> approach so greatly speeds things up.
>>>
>>>
>>> *Intelligent Internet*
>>>
>>> *
>>> *
>>>
>>> Fast parsing makes it possible to keep up with everything on the
>>> Internet in real time. My plan is to have an AI synthetic user watch
>>> everything, making it the most active user on the Internet, and comment on
>>> things that it can usefully comment on. There are lots of “little” details
>>> that will have to be worked out to make this a reality, including:
>>>
>>>
>>>
>>>    1. A mechanism of sending emails through human representatives for
>>>    review and on to the ultimate recipients, in a way that the FROM is
>>>    altered to be the human representative, the TO is altered to be the
>>>    ultimate recipient, and all markings indicating this processing are
>>>    removed.
>>>    2. A mechanism for web crawlers to work through human
>>>    representatives’ computers to hide their activity in some sensitive
>>>    domains.
>>>
>>> In addition to implementing a synthetic user, this supports on-the-fly
>>> custom tailoring of ads to address recipients' postings, as well as
>>> traditional expert sites like the present DrEliza.com.
>>>
>>>
>>> *Where this Really Shines*
>>>
>>> *
>>> *
>>>
>>> Clearly the best fit between technology and leverage is in political,
>>> religious, and other contentious issues where everyone has an opinion, but
>>> few opinions have been well thought out. Here, an AI can easily see the
>>> common expressions indicating simple flaws in people’s rantings, and tailor
>>> responses that strike at the very heart of those flaws – for a price of
>>> course. This should be able to grab much of the money now going to
>>> political advertising, because it can touch people’s individual points of
>>> view right as they are expressing them.
>>>
>>>
>>> *The Future*
>>>
>>> *
>>> *
>>>
>>> Other methods of parsing NL will soon be abandoned in the face of having
>>> this method available. Unlike other computer-related tools and technologies
>>> that become obsolete when the next version is released, this is likely to
>>> be around for a while. After all, it took 40 years of NLP stumbling along
>>> for someone to think of this, so how long is it going to take to come up
>>> with something significantly better?
>>>
>>>
>>> *Special Thanks*
>>>
>>> *
>>> *
>>>
>>> Special thanks to the technical reviewers who helped make this possible.
>>> You are hereby released from your NDAs and are free to discuss all you now
>>> know.
>>>
>>>
>>> In case you are interested in more details, or just want to see what
>>> such a patent looks like after the lawyers have finished with it, I have
>>> attached the patent abstract, specification, and drawings to this message.
>>> It is a bit big and may not remain on the server, so I recommend that you
>>> copy it off and save it on your own computer.
>>>
>>>
>>> Steve
>>>
>>>
>>
>



-------------------------------------------
AGI
Archives: https://www.listbox.com/member/archive/303/=now
RSS Feed: https://www.listbox.com/member/archive/rss/303/21088071-f452e424
Modify Your Subscription: 
https://www.listbox.com/member/?member_id=21088071&id_secret=21088071-58d57657
Powered by Listbox: http://www.listbox.com
