I posted a couple of replies to a message from Linus in the thread titled 
"Questions on a Knowledge Representation Standard for AGI - Help me not 
waste my time :-)".  I wanted to move that discussion to a separate thread 
to avoid taking that one too far off its original topic, so I am starting 
this thread here. <https://groups.google.com/forum/#%21topic/opencog/rbSloyVg0VM>

I started expanding a bit on the idea I outlined there about making files 
filled with statements all on a particular topic.  I wanted to have 
something a bit more concrete to serve as an example of the kind of thing I 
want to create.  Using the example of the topic "breakfast" that I 
mentioned in the other thread, I started a text file where I laid out a few 
statements about breakfast, and then for each one I spent a minute or two 
"riffing" on the basic idea of the sentence: building other similar 
sentences and replacing different words or phrases while keeping the same 
overall theme.  I will include the file as it exists now at the end of the 
email.  Bear in mind that this file took only a few minutes to put 
together, so a volunteer spending an hour or two on such a file would be 
able to create hundreds, if not thousands, of similar sentences on a 
subject.

As I understand it, the current thinking in the OpenCog project is to try 
to learn language by doing various statistical analyses of large corpora of 
un-annotated text.  I think something like this corpus would be ideal for 
the initial stages of that analysis.  Although the example file is not 
Wikipedia-scale large, it is much more tightly focused around a single idea 
than a general corpus like Wikipedia is.  Additionally, because there are 
many more repeated words, and the words are all used in a similar context, 
a much higher "weighting" of the patterns observed in the corpus is merited 
than in a typical body of text you would find in Wikipedia or on a web 
page.  Although it will be orders of magnitude smaller than something like 
Wikipedia, the "signal to noise ratio" will be much higher.
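
To make the "many more repeated words" point a bit more concrete, here is a 
rough Python sketch that counts how often each word occurs and reports the 
average number of uses per distinct word.  The function names and the 
choice of this ratio as a stand-in for "signal to noise" are my own 
assumptions for illustration; this is not an analysis that OpenCog 
actually performs.

```python
import re
from collections import Counter


def word_counts(text):
    """Count lowercase word tokens (letters and apostrophes only)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def repetition_ratio(text):
    """Average number of occurrences per distinct word.

    A crude, assumed proxy for how redundant a corpus is: a higher
    value means each word is seen more often.
    """
    counts = word_counts(text)
    if not counts:
        return 0.0
    return sum(counts.values()) / len(counts)


sample = (
    "Breakfast is a small meal.\n"
    "Breakfast is a light meal.\n"
    "Breakfast is a simple meal.\n"
)
print(word_counts(sample).most_common(3))
print(round(repetition_ratio(sample), 2))
```

Running the same two functions over a topic file and over a similarly 
sized chunk of general text would give a quick, if simplistic, comparison 
of how much repetition each one offers.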

Also notice how in the text I have put one sentence on each line, but I 
have left a blank line between each block of sentences with a common 
theme.  If you just ignore these blank lines you get a more challenging 
learning problem than if you take note of them and "bias" the weightings 
assigned when learning from sentences within a block since you know that 
all of them express the same basic information just with different 
phraseology.  This lets you learn from just a couple sentences that "early 
in the morning" and "at the start of the day" likely have very similar, if 
not identical, meanings without having to parse hundreds of uses of these 
phrases.  You could also do something like putting a * at the beginning of 
a sentence that has a meaning opposite to the one before it in some sense.  
Again, you could either parse these and ignore the extra markings, or use 
the markings to influence the weighting of learned meaning.
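
The file layout described above (one sentence per line, a blank line 
between blocks, and an optional leading * for a sentence with an opposite 
meaning) could be read with something like the following Python sketch.  
The function names are hypothetical, the starred sentence in the sample is 
made up for illustration and is not in my file, and treating every pair of 
unstarred sentences in a block as a paraphrase candidate is just one 
possible way to use the block boundaries.

```python
from itertools import combinations


def parse_blocks(text):
    """Split a topic file into blocks of (sentence, is_opposite) pairs.

    Blank lines separate blocks. A leading '*' marks a sentence whose
    meaning is opposite to the one before it.
    """
    blocks, current = [], []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            # A blank line closes the current block, if there is one.
            if current:
                blocks.append(current)
                current = []
            continue
        is_opposite = line.startswith("*")
        sentence = line.lstrip("*").strip()
        current.append((sentence, is_opposite))
    if current:
        blocks.append(current)
    return blocks


def paraphrase_pairs(blocks):
    """Yield pairs of unstarred sentences that share a block."""
    for block in blocks:
        plain = [s for s, is_opposite in block if not is_opposite]
        yield from combinations(plain, 2)


sample = (
    "Breakfast is a meal.\n"
    "\n"
    "Breakfast is a small meal.\n"
    "Breakfast is a light meal.\n"
    "* Breakfast is a large meal.\n"  # made-up example of a starred line
)
blocks = parse_blocks(sample)
for first, second in paraphrase_pairs(blocks):
    print(first, "<->", second)
```

A learner that ignores the markings can simply flatten the blocks back 
into a list of sentences, while one that uses them can give the pairs from 
paraphrase_pairs a higher weighting.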

This is what I mean when I say that this corpus will be a highly redundant 
body of text.  Although by word count it will end up being very large in 
comparison to Wikipedia, it will be based around a comparatively small 
number of ideas/concepts but have a much more varied exploration of the 
language surrounding those concepts than you would find in an everyday body 
of text.

-AndrewBuck

Below is the example file:

Breakfast is a meal.

Breakfast is a small meal.
Breakfast is a light meal.
Breakfast is a simple meal.
Breakfast is usually something that is easy to make.
Breakfast is usually something that is easy to prepare.
Breakfast is usually something that is easy to cook.
Many breakfasts are foods you don't have to cook, like cereal or an energy bar.

Breakfast is eaten in the morning.
Breakfast is a small meal eaten in the morning.
Breakfast is a small meal eaten early in the day.
Breakfast is a small meal eaten at the start of the day.
Breakfast is a small meal eaten after you wake up.
Breakfast is a small meal eaten soon after you wake up.
Breakfast is a small meal eaten just after you wake up.
Many people eat breakfast to start their day.
Many people start their day by eating breakfast.
Eating breakfast helps people wake up.
Eating breakfast helps people get going in the morning.

A common breakfast consists of toast and cereal.
A common breakfast consists of eggs and bacon.
A common breakfast consists of bacon and eggs.
A common breakfast consists of beans on toast.
A common breakfast consists of cereal and orange juice.
A common breakfast consists of pancakes.
A common breakfast consists of waffles.
A common breakfast consists of waffles or pancakes.
People commonly eat toast or cereal for breakfast.

Coffee, milk, or orange juice are common beverages served with breakfast.
People often drink coffee with their breakfast.
People drink coffee with their breakfast to wake them up.