Hi James,

My development source code is stored in the subversion repository at 
SourceForge: http://sf.net/projects/texai.  There is no GUI presently because 
I am concentrating on the server-side functions and want text chat as the 
system's primary communication modality.

Recently I created my own lexicon, derived from OpenCyc, WordNet, the CMU 
Pronouncing Dictionary, and a parsed Wiktionary.  I was using MySQL as the 
database back end to contain these propositions, and application performance 
suffered as the KB grew past 20 million propositions (9 GB).  So I conducted 
some experiments that demonstrated that Oracle Berkeley DB Java Edition runs 
very fast for my application if the size of its database is kept below 100,000 
propositions.

While at Cycorp, I became familiar with their technique of physically 
partitioning Cyc in order to excise old portions at a time when system RAM was 
a limiting factor.  So I am accelerating a task that I had planned for later: a 
peer-to-peer network that partitions the KB.  Because certain reasonable 
partitions, such as the context imported from WordNet, exceed 100,000 
propositions, I will use the database sharding (table slicing) technique to 
physically decompose too-large KB partitions.  As a result, my current 
27-million-row MySQL database will be transformed into approximately 10 
partitions and approximately 300 shards.  For now I will just run all the peers 
in the same JVM, as my development computer is an AMD X2 5800 with 4 GB RAM 
running 64-bit Java 6.
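As a rough sketch of the shard routing described above (the class and method names are hypothetical, not from my repository), each proposition's term UUID can be hashed to a shard index, with the shard count chosen so that no shard exceeds the ~100,000-proposition limit:

```java
import java.util.UUID;

/** Illustrative sketch: route a proposition to a shard by hashing its
 *  16-byte term UUID, keeping each shard below ~100,000 rows. */
public class ShardRouter {
    private final int shardCount;

    public ShardRouter(int shardCount) {
        this.shardCount = shardCount;
    }

    /** Maps a UUID to a shard index in [0, shardCount). */
    public int shardFor(UUID termId) {
        // floorMod keeps the index non-negative even for negative hash codes
        return Math.floorMod(termId.hashCode(), shardCount);
    }

    public static void main(String[] args) {
        // 27 million rows at <= 100,000 rows per shard needs at least 270 shards,
        // consistent with the "approximately 300 shards" estimate above.
        int shards = (int) Math.ceil(27_000_000 / 100_000.0);
        ShardRouter router = new ShardRouter(shards);
        int shard = router.shardFor(UUID.randomUUID());
        System.out.println("shards=" + shards + " example shard=" + shard);
    }
}
```

Because the UUIDs are identical in every peer, any peer can compute the same routing locally without a translation table.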

Some other miscellaneous details:
1.  I revised all the KB terms (symbols, variables, numbers, named terms and 
propositions) to be indexed by a 16-byte UUID instead of a 4-byte integer.  I 
had been keeping track of UUIDs for named terms, but now the p2p system will 
run faster because the IDs are the same in all peers and do not require 
translation to a local integer term ID.
2.  Although my Java classes are compatible with J2EE, I am using my own 
lightweight container for more rapid coding and testing.  Likewise, the Java 
Message Service (JMS) will be my peer-to-peer message transport interface, and 
I will code a very simple implementation that works in a single JVM.  When I do 
go ahead and test a remote peer, for example with my old Windows laptop, I'll 
probably use Apache ActiveMQ, which is J2EE/JMS compatible.
3.  I have already integrated the CMU Speech Tools, and imported the CMU 
Pronouncing Dictionary, so that when I get a text dialog system running, it 
will be simple to add speech recognition and speech generation.  I have hacked 
the CMU Sphinx speech recognition engine to enable the dialog system to prune 
the n-best word list incrementally as each phoneme is processed, according to 
discourse context.
4.  Departing from Cyc tradition, I am creating only atomic terms (not 
non-atomic terms) when naming new things, and I am creating only binary 
propositions, even though CycL allows unary, binary, ternary, quaternary and 
quinary (5-argument) propositions.  This is for efficiency, and for 
compatibility with the Semantic Web: OWL is limited to atomic terms and binary 
propositions.
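To make items 1 and 4 concrete, here is a minimal sketch (class and field names are my own invention, not from the Texai source) of a binary-only proposition whose predicate and both arguments are indexed by 16-byte UUIDs, so that peers can exchange propositions without translating term IDs:

```java
import java.nio.ByteBuffer;
import java.util.UUID;

/** Sketch: a KB proposition restricted to binary form, with every term
 *  identified by a 16-byte UUID shared identically across all peers. */
public class BinaryProposition {
    final UUID predicate; // the UUID naming the binary predicate
    final UUID arg1;
    final UUID arg2;

    public BinaryProposition(UUID predicate, UUID arg1, UUID arg2) {
        this.predicate = predicate;
        this.arg1 = arg1;
        this.arg2 = arg2;
    }

    /** Serializes the proposition as three 16-byte UUIDs (48 bytes total). */
    public byte[] toBytes() {
        ByteBuffer buf = ByteBuffer.allocate(48);
        for (UUID id : new UUID[] { predicate, arg1, arg2 }) {
            buf.putLong(id.getMostSignificantBits());
            buf.putLong(id.getLeastSignificantBits());
        }
        return buf.array();
    }

    public static void main(String[] args) {
        BinaryProposition p = new BinaryProposition(
                UUID.randomUUID(), UUID.randomUUID(), UUID.randomUUID());
        System.out.println(p.toBytes().length + " bytes"); // prints "48 bytes"
    }
}
```

The fixed 48-byte record is one reason binary-only propositions are cheap to store and shard; a higher-arity CycL proposition would instead be reified into several such binary rows.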

The next step will be to resume work on the first Construction-Grammar 
constructions, chosen to run my simple test cases and subsequently bootstrap 
the creation of the many remaining constructions.  Some have commented here on 
the failure of parsers to fully understand English text.  Again, while at 
Cycorp, I witnessed the use of many English parsers with Cyc: 1. an in-house 
simple template parser, 2. an in-house recursive phrase-structure parser, 3. an 
in-house head-driven phrase-structure parser, 4. the Stanford parser, 5. the 
Charniak parser, 6. the Link Grammar parser.  The best of these work well for 
simple, unambiguous sentences, and Cyc post-processing can handle some 
disambiguation within context.

I am using Rodney Huddleston's "English Grammar: An Outline" to enumerate the 
required constructions for English.  When I investigated Construction Grammar 
(cf. Croft, Radical Construction Grammar), it appeared not only to solve the 
deep-understanding problem for complex sentences and idioms, but also to fill 
the missing-parser gap in Walter Kintsch's Construction-Integration approach to 
text comprehension.  So by coupling the two, I hope to achieve deep English 
text understanding.  And because my grammar is reversible, the same 
constructions (persisted as KB propositions) will drive text generation.  The 
basic notion behind Construction Grammar is to abandon an elegant, concise 
rule-based grammar and instead use a simple pairing of form and meaning, where 
the forms are numerous, include idioms and special cases, and only incidentally 
may share constituents.  I'll use Huddleston's text to indicate the required 
forms and CycL to represent the meanings.  I plan to bootstrap the grammar 
acquisition by hand-coding a dialog system that is designed to acquire more 
form/meaning pairs.  Huddleston gives example phrases for all his identified 
constructions, and I'll create a test suite from these.
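The form/meaning pairing and its reversibility can be sketched as follows (a toy illustration under my own naming, not the actual Texai representation, and with plain strings standing in for persisted KB propositions):

```java
import java.util.HashMap;
import java.util.Map;

/** Toy sketch: a Construction Grammar lexicon as plain form/meaning pairs.
 *  Because each pair is stored in both directions, the same constructions
 *  serve parsing (form -> meaning) and generation (meaning -> form). */
public class ConstructionLexicon {
    private final Map<String, String> formToMeaning = new HashMap<>();
    private final Map<String, String> meaningToForm = new HashMap<>();

    /** Registers a form/meaning pair in both directions. */
    public void add(String form, String meaning) {
        formToMeaning.put(form, meaning);
        meaningToForm.put(meaning, form);
    }

    public String parse(String form) { return formToMeaning.get(form); }

    public String generate(String meaning) { return meaningToForm.get(meaning); }

    public static void main(String[] args) {
        ConstructionLexicon lexicon = new ConstructionLexicon();
        // An idiom pairs an opaque form directly with its meaning,
        // with no compositional rule needed.
        lexicon.add("X kicked the bucket", "(#$dies X)");
        System.out.println(lexicon.parse("X kicked the bucket")); // prints (#$dies X)
        System.out.println(lexicon.generate("(#$dies X)")); // prints X kicked the bucket
    }
}
```

A real lexicon would match variable slots and constituent structure rather than exact strings, but the symmetry shown here is the point: one stored pairing drives both comprehension and generation.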

-Steve

----- Original Message ----
From: James Ratcliff <[EMAIL PROTECTED]>
To: [email protected]
Sent: Tuesday, April 17, 2007 2:59:35 PM
Subject: Re: [agi] My proposal for an AGI agenda

Do you have any of this on the Net or in a usable form?  Or can you post some 
good screenshots of it?

James Ratcliff

Stephen Reed <[EMAIL PROTECTED]> wrote:
For my own AI research I am using Java.  Apart from its satisfactory speed, I 
like the NetBeans IDE, and most importantly like all the third-party software 
libraries that I can plug in.  Because my stuff is GPL, there really is a wide 
variety of compatible software.   For example, in the last 9 months I built an 
object store to contain the OpenCyc ontology and then added WordNet, the 
lexicon that I parsed from Wiktionary, and the CMU Pronouncing Dictionary.  All 
of this is to support a robust English dialog system that will depend upon a 
reversible construction grammar now under development.   I was able to plug in 
Hibernate and MySQL to host millions of knowledge base propositions.  Once I 
got above 20 million propositions, performance became noticeably slower.  So  I 
am unplugging Hibernate and plugging in Oracle Berkeley DB Java Edition (GPL 
compatible) and hope to regain ideal performance by using a sharded (physically 
partitioned) object store.

For deployment I am using J2EE which is scalable from single box (where I'm at 
now) to cluster to fully distributed.

Regarding self-modifying programs, I prefer that the system intelligently 
compose its source code and then compile it.  I already experimented with a 
java classloader that can replace classes in a JVM on the fly.

I'm building the dialog system so that I can teach the system in English how to 
do things, and thus not worry about the programming.


-----
This list is sponsored by AGIRI: http://www.agiri.org/email