Stephen --

What do you know about Cyc's licensing terms?

Let's say that Novamente reads Cyc and then learns some things from it
... but then does not retain Cyc in its memory, only some
"derivative knowledge" that is in a quite different form...

Is this Novamente system then considered a derivative product of Cyc,
so that if an NM instance is licensed, the customer also has to
license Cyc?

thx
Ben


On 4/18/07, Stephen Reed <[EMAIL PROTECTED]> wrote:

Hi James,

My development source code is stored in the subversion repository at
SourceForge:  http://sf.net/projects/texai .  There is no GUI presently
because I am concentrating on the server-side functions and want text chat
as the system's primary communication modality.

Recently I created my own lexicon, derived from OpenCyc, WordNet,
the CMU Pronouncing Dictionary and a parsed Wiktionary.  I was using
MySQL as the database back end to contain these propositions, and the
application's performance suffered as the KB grew past 20 million propositions
(9 GB).  So I conducted some experiments that demonstrated that Oracle
Berkeley DB Java Edition runs very fast for my application if the size of
its database is kept below 100,000 propositions.

While at Cycorp, I became familiar with their technique of physically
partitioning Cyc in order to excise old portions, at a time when system RAM
was a limiting factor.  So I am accelerating a task that I had planned for later
on - a peer-to-peer network that partitions the KB.  Because certain
reasonable partitions, such as the context imported from WordNet, are over
100,000 propositions in size, I will use the database sharding (table
slicing) technique to physically decompose too-large KB partitions.  As
a result, my current 27 million row MySQL database will be transformed into
approximately 10 partitions and approximately 300 shards.  For now I will
just run all the peers in the same JVM, as my development computer is an AMD
X2 5800 with 4 GB RAM running 64-bit Java 6.
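In sketch form, the routing step behind such sharding can be a deterministic hash of a term's ID onto a shard index. This is only an illustrative reconstruction, not Texai's actual code; the class and method names are invented here.

```java
import java.util.UUID;

// Sketch: route a proposition to one of N shards within a KB partition
// by hashing its subject term's UUID. Illustrative names only.
public class ShardRouter {
    private final int shardCount;

    public ShardRouter(int shardCount) {
        this.shardCount = shardCount;
    }

    // Deterministic shard index in [0, shardCount) for a term's UUID,
    // so every peer computes the same placement with no lookup table.
    public int shardFor(UUID termId) {
        return Math.floorMod(termId.hashCode(), shardCount);
    }

    public static void main(String[] args) {
        ShardRouter router = new ShardRouter(30);
        UUID term = UUID.fromString("123e4567-e89b-12d3-a456-426614174000");
        System.out.println("term " + term + " -> shard " + router.shardFor(term));
    }
}
```

Because the mapping depends only on the UUID, a too-large partition decomposes into shards without any central directory.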

Some other miscellaneous details:
1.  I revised all the KB terms, such as symbols, variables, numbers,
named terms and propositions, to be indexed by a 16-byte UUID instead of a
4-byte integer.  I had been keeping track of UUIDs for named terms, but now
the p2p system will run faster because the IDs are the same in all peers and do
not require translation to a local integer term ID.
2.  Although my Java classes are compatible with J2EE, I am using my own
lightweight container for more rapid coding and testing.  Likewise, the Java
Message Service (JMS) will be my peer-to-peer message transport interface,
and I will code a very simple implementation that works in a single JVM.
When I do go ahead and test a remote peer, for example with my old Windows
laptop, I'll probably use Apache ActiveMQ, which is J2EE/JMS compatible.
3.  I have already integrated the CMU Speech Tools, and imported the CMU
Pronouncing Dictionary, so that when I get a text dialog system running, it
will be simple to add speech recognition and speech generation.  I have
hacked the CMU Sphinx speech recognition engine to enable the dialog system
to prune the n-best word list incrementally as each phoneme is processed,
according to discourse context.
4.  Unlike the Cyc tradition, I am creating only atomic terms, not non-atomic
terms, when naming new things, and I am creating only binary propositions,
even though CycL allows unary, binary, ternary, quaternary and quinary
(5-argument) propositions.  This is for efficiency, and for compatibility
with the Semantic Web - OWL is limited to atomic terms and binary
propositions.
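The standard way to express an n-ary CycL-style proposition using only binary ones is to reify the assertion as a fresh term, much as OWL handles n-ary relations. A minimal sketch, with identifiers invented for illustration rather than taken from Texai:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

// Sketch: reduce a ternary proposition such as
//   (between Pyrenees France Spain)
// to binary propositions by reifying the assertion as a new term.
// All identifiers here are illustrative, not Texai's actual schema.
public class NaryToBinary {

    // One binary proposition: (predicate subject object).
    public static class Binary {
        public final String predicate, subject, object;
        public Binary(String predicate, String subject, String object) {
            this.predicate = predicate;
            this.subject = subject;
            this.object = object;
        }
    }

    // Reify (pred arg1 ... argN) as a fresh UUID-named term plus one
    // binary proposition per argument position.
    public static List<Binary> reify(String pred, String... args) {
        String reified = pred + "-" + UUID.randomUUID();
        List<Binary> out = new ArrayList<>();
        out.add(new Binary("isa", reified, pred + "Assertion"));
        for (int i = 0; i < args.length; i++) {
            out.add(new Binary(pred + "-arg" + (i + 1), reified, args[i]));
        }
        return out;
    }

    public static void main(String[] args) {
        for (Binary b : reify("between", "Pyrenees", "France", "Spain")) {
            System.out.println("(" + b.predicate + " " + b.subject + " " + b.object + ")");
        }
    }
}
```

The trade-off is one extra term and N small propositions per n-ary fact, in exchange for a uniform binary store.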

The next step will be to resume work on the first Construction Grammar
constructions, chosen to run my simple test cases and subsequently bootstrap
the creation of the many remaining constructions.  Some have commented here
on the failure of parsers to fully understand English text.  Again, while at
Cycorp, I witnessed the use of many English parsers with Cyc: 1. an
in-house simple template parser, 2. an in-house recursive phrase-structure
parser, 3. an in-house head-driven phrase-structure parser, 4. the Stanford
parser, 5. the Charniak parser, and 6. the Link Grammar parser.  The best of
these work well for simple unambiguous sentences.  Cyc post-processing can
handle some disambiguation within context.

I am using Rodney Huddleston's "English Grammar: An Outline" to enumerate
the required constructions for English.  When I investigated Construction
Grammar (cf. Croft, Radical Construction Grammar), it appeared not only to
solve the deep-understanding problem for complex sentences and idioms, but
also to fill the missing-parser gap in Walter Kintsch's
Construction-Integration approach to text comprehension.  So by coupling the
two, I hope to achieve deep English text understanding.  And because my
grammar is reversible, the same constructions (persisted as KB propositions)
will drive text generation.  The basic notion behind Construction Grammar
is to abandon an elegant, concise rule-based grammar and instead use a
simple pairing of form and meaning, where the forms are numerous, include
idioms and special cases, and only incidentally may share constituents.
I'll use Huddleston's text to indicate the required forms
and CycL to represent the meanings.  I plan to bootstrap the grammar
acquisition by hand-coding a dialog system that is designed to acquire more
form/meaning pairs.  Huddleston gives example phrases for all his identified
constructions, and I'll create a test suite from these.
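A form/meaning pairing of this kind can be sketched as a small data structure: a surface pattern with slots, paired with a meaning template. This is a toy reconstruction under invented names, not Texai's schema; a real matcher would bind slots by parsing, and generation would run the same pairing in reverse, which is what makes such a grammar reversible.

```java
import java.util.List;
import java.util.Map;

// Sketch: a construction as a pairing of surface form and meaning,
// persistable like any other KB proposition. The pattern language and
// the CycL-ish meaning template are illustrative only.
public class Construction {
    final String name;
    final List<String> formPattern;  // tokens; "?X" marks a slot
    final String meaningTemplate;    // e.g. "(dies ?SUBJ)"

    Construction(String name, List<String> formPattern, String meaningTemplate) {
        this.name = name;
        this.formPattern = formPattern;
        this.meaningTemplate = meaningTemplate;
    }

    // Comprehension direction: substitute slot bindings into the meaning
    // template. An idiom gets its non-compositional meaning directly.
    String comprehend(Map<String, String> bindings) {
        String meaning = meaningTemplate;
        for (Map.Entry<String, String> e : bindings.entrySet()) {
            meaning = meaning.replace("?" + e.getKey(), e.getValue());
        }
        return meaning;
    }

    public static void main(String[] args) {
        Construction idiom = new Construction(
                "kick-the-bucket",
                List.of("?SUBJ", "kicked", "the", "bucket"),
                "(dies ?SUBJ)");
        System.out.println(idiom.comprehend(Map.of("SUBJ", "Pat")));
        // -> (dies Pat)
    }
}
```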

-Steve

----- Original Message ----
From: James Ratcliff <[EMAIL PROTECTED]>
To: [email protected]
Sent: Tuesday, April 17, 2007 2:59:35 PM
Subject: Re: [agi] My proposal for an AGI agenda

Do you have any of this on the Net in usable form?  Or can you post some good
screenshots of it?

James Ratcliff

Stephen Reed <[EMAIL PROTECTED]> wrote:

For my own AI research I am using Java. Apart from its satisfactory speed, I
like the NetBeans IDE, and most importantly I like all the third-party
software libraries that I can plug in. Because my stuff is GPL, there really
is a wide variety of compatible software. For example, in the last 9 months
I built an object store to contain the OpenCyc ontology and then added
WordNet, the lexicon that I parsed from Wiktionary, and the CMU Pronouncing
Dictionary. All of this is to support a robust English dialog system that
will depend upon a reversible construction grammar now under development. I
was able to plug in Hibernate and MySQL to host millions of knowledge base
propositions. Once I got above 20 million propositions, performance became
noticeably slower. So I am unplugging Hibernate and plugging in Oracle
Berkeley DB Java Edition (GPL compatible) and hope to regain ideal
performance by using a sharded (physically partitioned) object store.

For deployment I am using J2EE, which scales from a single box (where I'm
at now) to a cluster to a fully distributed system.

Regarding self-modifying programs, I prefer that the system intelligently
compose its source code and then compile it. I have already experimented with
a Java classloader that can replace classes in a JVM on the fly.
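Replacing a class on the fly typically means loading its fresh bytecode through a brand-new ClassLoader, since the JVM will not redefine a class already loaded by an existing loader (short of the instrumentation API). A minimal sketch of that pattern, with illustrative paths and names rather than his actual code:

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch: hot-swap by constructing a throwaway ClassLoader per reload.
// The class being reloaded must not also be on the parent's classpath,
// or parent-first delegation will return the stale version.
public class ReloadingClassLoader extends ClassLoader {
    private final Path classDir;

    public ReloadingClassLoader(Path classDir, ClassLoader parent) {
        super(parent);
        this.classDir = classDir;
    }

    @Override
    protected Class<?> findClass(String name) throws ClassNotFoundException {
        try {
            // Read the freshly compiled .class file and define it in
            // this (new) loader.
            byte[] bytes = Files.readAllBytes(
                    classDir.resolve(name.replace('.', '/') + ".class"));
            return defineClass(name, bytes, 0, bytes.length);
        } catch (Exception e) {
            throw new ClassNotFoundException(name, e);
        }
    }

    // Each call uses a new loader, so callers always see the latest
    // bytecode on disk.
    public static Object newInstanceOfLatest(Path classDir, String className)
            throws Exception {
        Class<?> cls = new ReloadingClassLoader(classDir, null).loadClass(className);
        return cls.getDeclaredConstructor().newInstance();
    }
}
```

Old class versions become unreachable, and are garbage-collected, once nothing references their loader.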

I'm building the dialog system so that I can teach the system in English how
to do things, and thus not worry about the programming.


 This list is sponsored by AGIRI: http://www.agiri.org/email
To unsubscribe or change your options, please go to:
http://v2.listbox.com/member/?&;
