Stephen -- What do you know about Cyc's licensing terms?
Let's say that Novamente reads Cyc and then learns some things from it, but then does not retain Cyc in its memory, only some "derivative knowledge" that is in quite different form. Is this Novamente system then considered a derivative product of Cyc, so that if an NM instance is licensed, the customer also has to license Cyc?

thx
Ben

On 4/18/07, Stephen Reed <[EMAIL PROTECTED]> wrote:
Hi James,

My development source code is stored in the Subversion repository at SourceForge: http://sf.net/projects/texai . There is no GUI presently because I am concentrating on the server-side functions and want text chat as the system's primary communication modality.

Recently I created my own lexicon that was derived from OpenCyc, WordNet, the CMU Pronouncing Dictionary, and a parsed Wiktionary. I was using MySQL as the database back end to contain these propositions, and application performance suffered as the KB grew beyond 20 million propositions (9 GB). So I conducted some experiments that demonstrated that Oracle Berkeley DB Java Edition runs very fast for my application if the size of its database is kept below 100,000 propositions.

While at Cycorp, I became familiar with their technique of physically partitioning Cyc in order to excise old portions, at a time when system RAM was a limiting factor. So I am accelerating a task that I had planned for later on: a peer-to-peer network that partitions the KB. Because certain reasonable partitions, such as the context imported from WordNet, are over 100,000 propositions in size, I will use the database sharding (table slicing) technique to physically decompose too-large KB partitions. As a result, my current 27 million row MySQL database will be transformed into approximately 10 partitions and approximately 300 shards. For now I will just run all the peers in the same JVM, as my development computer is an AMD X2 5800 with 4 GB RAM running 64-bit Java 6.

Some other miscellaneous details:

1. I revised all the KB terms, such as symbols, variables, numbers, named terms and propositions, to be indexed by a 16-byte UUID instead of a 4-byte integer. I had been keeping track of UUIDs for named terms, but now the p2p system will run faster if the IDs are the same in all peers and do not require translation to a local integer term ID.
2. Although my Java classes are compatible with J2EE, I am using my own lightweight container for more rapid coding and testing. Likewise, the Java Message Service (JMS) will be my peer-to-peer message transport interface, and I will code a very simple implementation that works in a single JVM. When I do go ahead and test a remote peer, for example with my old Windows laptop, I'll probably use Apache ActiveMQ, which is J2EE/JMS compatible.

3. I have already integrated the CMU Speech Tools and imported the CMU Pronouncing Dictionary, so that when I get a text dialog system running, it will be simple to add speech recognition and speech generation. I have hacked the CMU Sphinx speech recognition engine to enable the dialog system to prune the n-best word list incrementally as each phoneme is processed, according to discourse context.

4. Unlike Cyc tradition, I am creating only atomic terms, not non-atomic terms, when naming new things, and I am creating only binary propositions, even though CycL allows for unary, binary, ternary, quaternary and quintary (5-argument) propositions. This is for efficiency, and for compatibility with the Semantic Web: OWL is limited to atomic terms and binary propositions.

The next step will be to resume work on the first Construction Grammar constructions, chosen to run my simple test cases and subsequently bootstrap the creation of the many remaining constructions.

Some have commented here on the failure of parsers to fully understand English text. Again, while at Cycorp I witnessed the use of many English parsers with Cyc: 1. an in-house simple template parser, 2. an in-house recursive phrase structure parser, 3. an in-house head-driven phrase structure parser, 4. the Stanford parser, 5. the Charniak parser, 6. the Link Grammar parser. The best of these work well for simple unambiguous sentences, and Cyc post-processing can handle some disambiguation within context.
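Since every peer indexes terms by the same 16-byte UUID, each peer can compute a term's shard placement locally, with no translation table. Here is a minimal sketch of that routing idea; the class and method names are illustrative assumptions, not the actual Texai code:

```java
import java.util.UUID;

// Sketch of UUID-based shard routing (hypothetical names): every peer
// derives the same shard index from a term's UUID, so propositions can
// be located without translating to a peer-local integer ID.
public class ShardRouter {
    private final int shardCount;

    public ShardRouter(final int shardCount) {
        this.shardCount = shardCount;
    }

    // Mix both halves of the 16-byte UUID, then take a non-negative
    // remainder to pick one of the physical shards.
    public int shardFor(final UUID termId) {
        final long mixed = termId.getMostSignificantBits()
                ^ termId.getLeastSignificantBits();
        return (int) ((mixed % shardCount + shardCount) % shardCount);
    }

    public static void main(final String[] args) {
        // Roughly 300 shards, as in the planned decomposition above.
        final ShardRouter router = new ShardRouter(300);
        final UUID termId =
            UUID.fromString("123e4567-e89b-12d3-a456-426614174000");
        System.out.println("shard: " + router.shardFor(termId));
    }
}
```

Because the routing is a pure function of the UUID, adding a remote peer later (e.g. over ActiveMQ) requires no renumbering of terms.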
I am using Rodney Huddleston's "English Grammar: An Outline" to enumerate the required constructions for English. When I investigated Construction Grammar (cf. Croft, Radical Construction Grammar), it appeared not only to solve the deep-understanding problem for complex sentences and idioms, but also to fill the missing-parser gap in Walter Kintsch's Construction-Integration approach to text comprehension. So by coupling the two, I hope to achieve deep English text understanding. And because my grammar is reversible, the same constructions (persisted as KB propositions) will drive text generation.

The basic notion behind Construction Grammar is to abandon an elegant, concise rule-based grammar and instead use a simple pairing of form and meaning, where the forms are numerous, include idioms and special cases, and only incidentally may share constituents. I'll use Huddleston's text to indicate the required forms and CycL to represent the meanings. I plan to bootstrap the grammar acquisition by hand-coding a dialog system that is designed to acquire more form/meaning pairs. Huddleston gives example phrases for all his identified constructions, and I'll create a test suite from these.

-Steve

----- Original Message ----
From: James Ratcliff <[EMAIL PROTECTED]>
To: [email protected]
Sent: Tuesday, April 17, 2007 2:59:35 PM
Subject: Re: [agi] My proposal for an AGI agenda

Do you have any of this on the Net or in usable form? Or can you post some good screens of it?

James Ratcliff

Stephen Reed <[EMAIL PROTECTED]> wrote:

For my own AI research I am using Java. Apart from its satisfactory speed, I like the NetBeans IDE, and most importantly I like all the third-party software libraries that I can plug in. Because my stuff is GPL, there really is a wide variety of compatible software.
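The form/meaning pairing described in the Construction Grammar discussion above can be made concrete with a toy sketch. The pattern syntax, class names, and the CycL-style meaning template here are illustrative assumptions, not the actual Texai constructions:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy sketch (hypothetical names): a construction is a bare pairing of
// a surface form with a meaning template, rather than a rule in an
// elegant, concise grammar.
public class Construction {
    private final Pattern form;
    private final String meaningTemplate;

    public Construction(final String form, final String meaningTemplate) {
        this.form = Pattern.compile(form);
        this.meaningTemplate = meaningTemplate;
    }

    // Returns the instantiated meaning, or null when the form does not match.
    public String comprehend(final String utterance) {
        final Matcher matcher = form.matcher(utterance);
        if (!matcher.matches()) {
            return null;
        }
        String meaning = meaningTemplate;
        for (int group = 1; group <= matcher.groupCount(); group++) {
            meaning = meaning.replace("$" + group, matcher.group(group));
        }
        return meaning;
    }

    public static void main(final String[] args) {
        // An idiomatic greeting construction paired directly with a
        // CycL-flavored binary proposition as its meaning.
        final Construction greeting =
            new Construction("hello,? (\\w+)", "(greets Speaker $1)");
        System.out.println(greeting.comprehend("hello, Steve"));
    }
}
```

Because the pairing is bidirectional data rather than a one-way rule, the same table of constructions can in principle be walked in reverse for generation, which is the reversibility property mentioned above.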
For example, in the last 9 months I built an object store to contain the OpenCyc ontology and then added WordNet, the lexicon that I parsed from Wiktionary, and the CMU Pronouncing Dictionary. All of this is to support a robust English dialog system that will depend upon a reversible construction grammar now under development. I was able to plug in Hibernate and MySQL to host millions of knowledge base propositions. Once I got above 20 million propositions, performance became noticeably slower. So I am unplugging Hibernate and plugging in Oracle Berkeley DB Java Edition (GPL compatible), and I hope to regain ideal performance by using a sharded (physically partitioned) object store. For deployment I am using J2EE, which scales from a single box (where I'm at now) to a cluster to a fully distributed system.

Regarding self-modifying programs, I prefer that the system intelligently compose its source code and then compile it. I already experimented with a Java classloader that can replace classes in a JVM on the fly. I'm building the dialog system so that I can teach the system in English how to do things, and so not worry about the programming.
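The compose-compile-load approach to self-modification can be sketched as follows. This is an illustrative assumption, not the actual Texai classloader: the system writes Java source to disk, compiles it with the standard JDK compiler API, and loads the result through a fresh classloader so a revised class can replace the previous one without restarting the JVM. It requires a JDK (not just a JRE) at runtime.

```java
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch (hypothetical names): compose source, compile it, and load it
// via a fresh classloader so classes can be replaced on the fly.
public class HotCompile {

    public static Object compileAndInvoke(final String className,
                                          final String source) throws Exception {
        final Path dir = Files.createTempDirectory("hot-compile");
        final Path sourceFile = dir.resolve(className + ".java");
        Files.write(sourceFile, source.getBytes("UTF-8"));

        // Compile with the JDK's standard compiler API.
        final JavaCompiler javac = ToolProvider.getSystemJavaCompiler();
        javac.run(null, null, null, sourceFile.toString(), "-d", dir.toString());

        // A fresh classloader per version lets a newly composed class
        // replace the previous one without a JVM restart.
        final URLClassLoader loader =
            new URLClassLoader(new URL[] {dir.toUri().toURL()});
        try {
            final Class<?> cls = loader.loadClass(className);
            return cls.getMethod("answer").invoke(null);
        } finally {
            loader.close();
        }
    }

    public static void main(final String[] args) throws Exception {
        final String source =
            "public class Skill { public static String answer() { return \"v1\"; } }";
        System.out.println(compileAndInvoke("Skill", source));
    }
}
```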
-----
This list is sponsored by AGIRI: http://www.agiri.org/email
To unsubscribe or change your options, please go to:
http://v2.listbox.com/member/?member_id=231415&user_secret=fabd7936
