[opencog-dev] Building a corpus for language learning

Andrew Buck Tue, 04 Apr 2017 11:59:38 -0700

I posted a couple of replies to a message from Linus in the thread titled
"Questions on a Knowledge Representation Standard for AGI - Help me not waste
my time :-)". I wanted to move that discussion to a separate thread to avoid
taking that one too far off its original topic, so I am starting this
thread here. <https://groups.google.com/forum/#%21topic/opencog/rbSloyVg0VM>

I started expanding a bit on the idea I outlined there about making files
filled with statements all on a particular topic. I wanted to have
something a bit more concrete to serve as an example of the kind of thing I
want to create. Using the example of the topic "breakfast" that I
mentioned in the other thread I started a text file where I laid out a few
statements about breakfast and then for each one I spent a minute or two
sort of "riffing" on the basic idea of the sentence; building other similar
sentences and replacing different words or phrases but keeping the same
overall theme. I will include the file as it exists now at the end of the
email. Bear in mind that this file took only a few minutes to put
together, so a volunteer spending an hour or two on such a file would be
able to create hundreds, if not thousands, of similar sentences on a
subject.

As I understand it, the current thinking in the OpenCog project is to try
to learn language by doing various statistical analyses of large corpora of
un-annotated text. I think something like this corpus would be ideal for
the initial stages of that analysis. Although the example file is nowhere
near Wikipedia scale, it is much more tightly focused around a single idea
than a general corpus like Wikipedia is. Additionally, because there are
many more repeated words, and the words are all used in a similar context,
a much higher "weighting" of the patterns observed in the corpus is merited
than in a typical body of text you would find in Wikipedia or on a web
page. Although it will be orders of magnitude smaller than something like
Wikipedia, the "signal to noise ratio" will be much higher.
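
To make the "repeated words in a similar context" point a bit more concrete,
here is a rough Python sketch that counts words and adjacent word pairs in
three sentences taken from the example file. This is only an illustration of
the kind of statistic I have in mind, not how OpenCog actually does its
analysis; the function names tokenize and count_words_and_pairs are made up
for this example.

```python
from collections import Counter

# Three sentences copied from the example file at the end of this email.
sentences = [
    "Breakfast is a small meal eaten in the morning.",
    "Breakfast is a small meal eaten early in the day.",
    "Breakfast is a small meal eaten at the start of the day.",
]


def tokenize(sentence):
    """Lowercase a sentence and split it into words, dropping the final period."""
    return sentence.lower().rstrip(".").split()


def count_words_and_pairs(sentences):
    """Count single words and adjacent word pairs across all sentences."""
    words = Counter()
    pairs = Counter()
    for sentence in sentences:
        tokens = tokenize(sentence)
        words.update(tokens)
        # zip(tokens, tokens[1:]) yields each word together with its successor.
        pairs.update(zip(tokens, tokens[1:]))
    return words, pairs


words, pairs = count_words_and_pairs(sentences)
print(words.most_common(5))
print(pairs.most_common(5))
```

Even with only three sentences, pairs such as ("small", "meal") show up in
every one of them, which is the kind of strong repeated pattern I would
expect a tightly focused corpus to produce.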

Also notice how in the text I have put one sentence on each line, but I
have left a blank line between each block of sentences with a common
theme. If you just ignore these blank lines you get a more challenging
learning problem than if you take note of them and "bias" the weightings
assigned when learning from sentences within a block since you know that
all of them express the same basic information just with different
phraseology. This lets you learn from just a couple sentences that "early
in the morning" and "at the start of the day" likely have very similar, if
not identical, meanings without having to parse hundreds of uses of these
phrases. You could also do something like putting a * at the beginning of
a sentence that has a meaning opposite to the one before it in some sense.
Again, you could either parse these and ignore the extra markings, or use
the markings to influence the weighting of learned meaning.
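
As a rough sketch of how the blank lines and the * marker could be used,
the Python below splits a file into blocks and then compares two sentences
from the same block to pull out the phrases that differ. The names
parse_corpus and differing_phrases are made up for this example, the
sentence starting with * is invented just to show the marker (it is not in
my file), and the word-level diff from Python's difflib is only a stand-in
for whatever statistical learning would really be applied.

```python
import difflib

# Sample text in the proposed format. The '*' sentence is invented for
# illustration; the other sentences come from the example file.
SAMPLE = """\
Breakfast is a small meal.
Breakfast is a light meal.
*Breakfast is a large meal.

Breakfast is a small meal eaten early in the day.
Breakfast is a small meal eaten at the start of the day.
"""


def parse_corpus(text):
    """Split corpus text into blocks of (sentence, is_opposite) tuples.

    Blank lines separate blocks. A leading '*' marks a sentence as being
    opposite in meaning to the one before it.
    """
    blocks = []
    current = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current block, if there is one.
            if current:
                blocks.append(current)
                current = []
            continue
        is_opposite = line.startswith("*")
        if is_opposite:
            line = line[1:].strip()
        current.append((line, is_opposite))
    if current:
        blocks.append(current)
    return blocks


def differing_phrases(sentence_a, sentence_b):
    """Return (phrase_a, phrase_b) pairs where two sentences differ."""
    a = sentence_a.rstrip(".").split()
    b = sentence_b.rstrip(".").split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (" ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag == "replace"
    ]


blocks = parse_corpus(SAMPLE)
first, second = blocks[1][0][0], blocks[1][1][0]
print(differing_phrases(first, second))
```

Because both sentences are known to come from the same block, the phrases
that get swapped for each other are good candidates for having similar
meanings. A learner that ignores the blank lines would instead have to find
that pairing from the statistics alone.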

This is what I mean when I say that this corpus will be a highly redundant
body of text. Although by word count it will end up being very large in
comparison to Wikipedia, it will be based around a comparatively small
number of ideas/concepts but have a much more varied exploration of the
language surrounding those concepts than you would find in an everyday body
of text.

-AndrewBuck

Below is the example file:

Breakfast is a meal.

Breakfast is a small meal.
Breakfast is a light meal.
Breakfast is a simple meal.
Breakfast is usually something that is easy to make.
Breakfast is usually something that is easy to prepare.
Breakfast is usually something that is easy to cook.
Many breakfasts are foods you don't have to cook; like cereal, or an energy bar.

Breakfast is eaten in the morning.
Breakfast is a small meal eaten in the morning.
Breakfast is a small meal eaten early in the day.
Breakfast is a small meal eaten at the start of the day.
Breakfast is a small meal eaten after you wake up.
Breakfast is a small meal eaten soon after you wake up.
Breakfast is a small meal eaten just after you wake up.
Many people eat breakfast to start their day.
Many people start their day by eating breakfast.
Eating breakfast helps people wake up.
Eating breakfast helps people get going in the morning.

A common breakfast consists of toast and cereal.
A common breakfast consists of eggs and bacon.
A common breakfast consists of bacon and eggs.
A common breakfast consists of beans on toast.
A common breakfast consists of cereal and orange juice.
A common breakfast consists of pancakes.
A common breakfast consists of waffles.
A common breakfast consists of waffles or pancakes.
People commonly eat toast or cereal for breakfast.

Coffee, milk, or orange juice are common beverages served with breakfast.
People often drink coffee with their breakfast.
People drink coffee with their breakfast to wake them up.
