OK, I guess I still didn't explain enough. Here's a longer explanation.
So, for the case of word boundaries, you have sequences of characters
(hanzi) that are next to each other. For circumpositions, they do NOT have
to be next to each other: there might be many words in between.
This is why I was trying to talk about sheaves: a sheaf tells you how to
handle the connectivity of circumpositions, where there are "holes" in
between the blocks. So again: imagine a single disjunct, all by itself.
This contains only local connections. But two of them together still look
like a section of a sheaf. In diagrams:
Here is a single disjunct:
+Cs+
| |
if ?
Think of it as being anchored by the single germ/gerbe "if" from one
sentence:
+------->WV------->+--MVs-+---CV->+
+--Wd--+-Sp*i+--I--+Osm+ +Cs+-Sp-+--O-+
| | | | | | | | |
LEFT-WALL I.p will.v do.v it if you say.v so
Here is one section of a sheaf whose germ/gerbe is the word "if", but now
has been extended multiple steps farther out, to a greater distance:
+--MVs-+---CV->+
| +Cs+-Sp-+--O--+
| | | | |
? if ? say ?
Note that this section occurs in the sentence above, but it might also
occur in other sentences. This is a very complex section. If you observe
it many, many times, more often than "average" (I think Ben and Shujing
call this "surprising"), then you say "oh, aha! This is an idiom, or an
institutional phrase, or a set phrase, or a lexical chunk!" Because there
is a blank spot between "if ... say", this would be a circumposition.
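To put a number on "more than average": the natural score is the (pointwise) mutual information, the log-ratio of the observed frequency of the pattern to what independence would predict. Here's a tiny Python sketch; the function name and all the counts are invented for illustration, not taken from our pipeline.

```python
from math import log2

def surprisal(pair_count, left_count, right_count, total):
    """Pointwise MI of a co-occurrence, in bits: how much more often
    the pattern was observed than independence predicts.  Large and
    positive means 'surprising', i.e. a candidate set phrase."""
    p_pair = pair_count / total
    p_left = left_count / total
    p_right = right_count / total
    return log2(p_pair / (p_left * p_right))

# Invented counts: "if ... say" seen together in 50 of 10000 clauses;
# "if" occurs in 300 of them, "say" in 400.  Chance predicts about 12.
print(surprisal(50, 300, 400, 10000))  # about 2.06 bits
```

This observed-vs-expected ratio is the same quantity that shows up as the cost= MI numbers in the parses quoted below.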
Segmenting text into morphemes, or, in the case of Chinese, grouping
multiple hanzi into a single word, works **exactly the same way**, except
that, for Chinese, you do not allow circumpositions if you just want to
find single words. But otherwise it is the same algorithm.
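As a caricature of that merge step, here is a greedy sketch in Python. The scores, the min_gain threshold, and the boundary-character scoring are all invented for illustration; in particular, a fixed global MI threshold is exactly what the discussion below shows does NOT work, so treat min_gain as a stand-in for a smarter stopping rule.

```python
def segment(chars, mi, min_gain=2.0):
    """Greedy agglomerative word-building sketch: repeatedly merge the
    adjacent pair with the highest mutual information, until no pair
    scores above min_gain.  No circumpositions allowed: only adjacent
    tokens may merge, which is the single-word case described above."""
    toks = list(chars)
    while len(toks) > 1:
        scores = [mi(toks[i], toks[i + 1]) for i in range(len(toks) - 1)]
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] < min_gain:
            break
        toks[best:best + 2] = [toks[best] + toks[best + 1]]
    return toks

# Toy MI table in the spirit of the parses quoted below (numbers invented);
# a merge is scored by the MI across the character boundary.
table = {("戀", "人"): 3.3, ("女", "孩"): 7.3, ("孩", "子"): 6.9}
mi = lambda a, b: table.get((a[-1], b[0]), 0.0)
print(segment("女孩子", mi))  # ['女孩子'] -- all three hanzi become one word
```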
I was originally hoping that Shujing's pattern miner could do this, but I
don't think it is ready, and I don't have the time to figure it out.
The reason you *still* want to read the papers on word segmentation and
morpho-syntax is that all of the ideas and algorithms in those papers
still apply, except that they now apply to sections of sheaves, instead of
adjacent words, or adjacent letters, or n-grams. The point is that the
sections of the sheaves tell you exactly how to work with complex structure,
instead of having to work with n-grams.
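A toy way to see the "sections instead of n-grams" point: an n-gram counter can only tally contiguous runs, while a pattern with a hole in it can be matched and counted directly. Sketch in Python (illustrative only: here a hole is exactly one word wide, whereas a real section's holes can span many words and carry connector types):

```python
def matches(section, sentence):
    """True if the gapped pattern occurs somewhere in the sentence.
    None marks a hole (exactly one word wide, in this toy version)."""
    n = len(section)
    return any(
        all(p is None or p == sentence[i + j] for j, p in enumerate(section))
        for i in range(len(sentence) - n + 1)
    )

pattern = ("if", None, "say")   # the circumposition "if ... say"
corpus = [
    "I will do it if you say so".split(),
    "if they say nothing we leave".split(),
    "say it if you must".split(),
]
print(sum(matches(pattern, s) for s in corpus))  # 2 of 3 sentences match
```

Counting such hits across a corpus, and comparing against the expected rate, is exactly the "observe it more than average" test described above.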
--linas
On Thu, Jul 13, 2017 at 2:20 AM, Linas Vepstas <[email protected]>
wrote:
>
>
> On Thu, Jul 13, 2017 at 1:53 AM, Ruiting Lian <[email protected]>
> wrote:
>
>>
>>
>>
>>
>> Ah, it's not that I don't like them. I just didn't see how they can be
>> 100% correct. Maybe I didn't explain well. Let me try again:
>>
>> 4 .学 <--> 5 . 社 cost= 1.2513192358165952
>>
>> So if you think this is a perfect word, that means every link with MI
>> higher than 1.25... should be considered as one word.
>>
>
> No, that is not correct. That's not how it works.
>
>
>> Then the followings should be also:
>>
>> 8 . 與 <--> 9 . 實 cost= 1.5154639602692228
>>
>
> Yes, I guess so.
>
>>
>>
>> 23 . 鹼 <--> 24 . ( cost= 2.0228678213402027
>>
>
> Yes, I guess so.
>
>>
>>
>> 2 . 把 <--> 3 . 蕾 cost= 3.301384076256177
>>
>
> Yes I guess so.
>
>>
>>
>> 4 . , <--> 5 . 此 cost= 1.9045829576234112
>>
>
> Yes I guess so.
>
>
>>
>> but none of them is supposed to be a word, and these links are not good
>> (no syntax relation, no semantic relation).
>>
>
> Sure, but you have exactly *one* example of each. It's not "statistics" if
> you have only one example. We would have to collect hundreds of these.
> And I bet that there won't be hundreds, that you won't be able to find that
> many.
>
>
>>
>> Doing the MI between characters in modern Chinese is more like doing the
>> MI between letters in English words, you will get some strong patterns that
>> are frequently used in words, but that's very far away from understanding
>> the syntax relations and semantic relations.
>>
>
> It's also just plain not how it works. No one does it this way, and I am
> certainly not proposing that we should do this. That would be ... dumb. It's
> just not how it works.
>
> There are a number of good papers on how to do word segmentation
> correctly, and how to do morphosyntax correctly. I don't have them here at
> my fingertips; I would have to search for them. I'll try to send
> them tomorrow.
>
> --linas
>
>
>>
>>
>>>
>>>
>>>>
>>>>
>>>>>>
>>>>>> 把 蕾 當 作 戀 人 看 待 , 記 憶 力 比 一 般 人 優 異 。
>>>>>>
>>>>>> 6 . 戀 <--> 17 . 人 cost= 3.346336621332618
>>>>>> 6 . 戀 <--> 7 . 人 cost= 3.346336621332618
>>>>>>
>>>>>>
>>>>>> The second link can indicate "戀(love) 人(people)" (lover/sweetheart)
>>>>>> is one word, but applying the same MI to the first link doesn't make
>>>>>> sense, as the two characters don't have a strong relation in the sentence.
>>>>>>
>>>>>
>>>>> So 6-7 is one word, and 6-17 clearly cannot be one word, since the
>>>>> symbols are not next to each other. So what's the problem? We have one
>>>>> word that is 100% correct, and a hint that maybe other things are wrong.
>>>>>
>>>>> So it sounds like the first parse is perfect, and the second parse
>>>>> might have a problem ... and so? What's the actual problem?
>>>>>
>>>>
>>>> What I meant is, there shouldn't be a link between 6-17, as they don't
>>>> have a syntax relation from the grammar point of view.
>>>>
>>> Ahh!
>>>
>>> Yes, it was obvious in the earlier email that there might be something
>>> wrong here.
>>>
>>>
>>>> If you want to consider the semantic relation in this case, the link
>>>> should be between 7-17.
>>>>
>>>
>>> OK, but so far that would still be linking to the same word, so it's
>>> linking to the wrong morpheme, but in the right word...
>>>
>>> Recall that so far, these are just spanning-tree parses, not disjunct
>>> parses, so they are not going to provide word segmentation, and they are
>>> just a partial view of the syntax. Getting the word segmentation, and going
>>> over to disjuncts, will presumably give better results. So far, it seems
>>> like we are on the correct path for word segmentation, and I assume the
>>> syntactic links might be OK.
>>>
>>> So since everything seems to be more or less correct, why don't you like
>>> it?
>>>
>>> --linas
>>>
>>>
>>>>
>>>>> I think if you look at how Russian is done, maybe these parses will be
>>>>> more clear. For example, standard link-grammar generates
>>>>>
>>>>> +--------------------------------------------------Xp--------------------------------------------------+
>>>>> +--------------------Wd--------------------+                                                            |
>>>>> |                        +-------EIw-------+                                                            |
>>>>> |         +------Jp------+                 |                          +----------Mg----------+          |
>>>>> |         |    +--LLAAQ--+        +-LLGJV--+-----SIm3----+-----Mg-----+           +--LLACG---+          |
>>>>> |         |    |         |        |        |             |            |           |          |          |
>>>>> LEFT-WALL в.jp коридор.= =е.ndmsp послыш.= =ался.vsndpms грохот.ndmsi сапог.ndmpg стражник.= =ов.nlmpg .
>>>>>
>>>>>
>>>>> The LL links are the links between morphemes.
>>>>>
>>>>
>>>> The link between Chinese characters in one word is really not anything
>>>> like morphemes.
>>>>
>>>>
>>>>
>>>>>
>>>>> --linas
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> On Tue, Jul 11, 2017 at 11:39 PM, Linas Vepstas <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> At a minimum, what the word segmentation for these sentences should
>>>>>>>> have been? And what the linkages should have been, at least
>>>>>>>> approximately?
>>>>>>>>
>>>>>>>> --linas
>>>>>>>>
>>>>>>>> On Tue, Jul 11, 2017 at 11:35 PM, Linas Vepstas <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> ah, it would be great if I got a translation. I know that's
>>>>>>>>> tedious, but I don't know of an easier way. Is it junkier than the
>>>>>>>>> English?
>>>>>>>>>
>>>>>>>>> --linas
>>>>>>>>>
>>>>>>>>> On Tue, Jul 11, 2017 at 10:48 PM, Ben Goertzel <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> The MST parses you sent contain quite a lot of gibberish,
>>>>>>>>>> according to
>>>>>>>>>> what Ruiting observed to me yesterday... (she's not at the
>>>>>>>>>> computer
>>>>>>>>>> right now)
>>>>>>>>>>
>>>>>>>>>> On Wed, Jul 12, 2017 at 11:39 AM, Linas Vepstas <
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>> > Yes, that's fine.
>>>>>>>>>> >
>>>>>>>>>> > Did you look at the MST parses I sent you? I would much rather
>>>>>>>>>> use those, if
>>>>>>>>>> > they aren't terrible, because this would save a lot of time and
>>>>>>>>>> effort.
>>>>>>>>>> >
>>>>>>>>>> > --linas
>>>>>>>>>> >
>>>>>>>>>> > On Tue, Jul 11, 2017 at 8:26 AM, Ruiting Lian <
>>>>>>>>>> [email protected]>
>>>>>>>>>> > wrote:
>>>>>>>>>> >>
>>>>>>>>>> >> Hi Linas,
>>>>>>>>>> >>
>>>>>>>>>> >> Can you check if the attached format is OK with you? Or you
>>>>>>>>>> want a cleaner
>>>>>>>>>> >> format?
>>>>>>>>>> >>
>>>>>>>>>> >> The encoding is GBK.
>>>>>>>>>> >>
>>>>>>>>>> >>
>>>>>>>>>> >> The current format is:
>>>>>>>>>> >> ===============
>>>>>>>>>> >> Sentence #2 (14 tokens):
>>>>>>>>>> >> 陈清扬当时二十六岁,就在我插队的地方当医生。
>>>>>>>>>> >> [Text=陈清扬 CharacterOffsetBegin=50 CharacterOffsetEnd=53]
>>>>>>>>>> >> [Text=当时 CharacterOffsetBegin=53 CharacterOffsetEnd=55]
>>>>>>>>>> >> [Text=二十六 CharacterOffsetBegin=55 CharacterOffsetEnd=58]
>>>>>>>>>> >> [Text=岁 CharacterOffsetBegin=58 CharacterOffsetEnd=59]
>>>>>>>>>> >> [Text=, CharacterOffsetBegin=59 CharacterOffsetEnd=60]
>>>>>>>>>> >> [Text=就 CharacterOffsetBegin=60 CharacterOffsetEnd=61]
>>>>>>>>>> >> [Text=在 CharacterOffsetBegin=61 CharacterOffsetEnd=62]
>>>>>>>>>> >> [Text=我 CharacterOffsetBegin=62 CharacterOffsetEnd=63]
>>>>>>>>>> >> [Text=插队 CharacterOffsetBegin=63 CharacterOffsetEnd=65]
>>>>>>>>>> >> [Text=的 CharacterOffsetBegin=65 CharacterOffsetEnd=66]
>>>>>>>>>> >> [Text=地方 CharacterOffsetBegin=66 CharacterOffsetEnd=68]
>>>>>>>>>> >> [Text=当 CharacterOffsetBegin=68 CharacterOffsetEnd=69]
>>>>>>>>>> >> [Text=医生 CharacterOffsetBegin=69 CharacterOffsetEnd=71]
>>>>>>>>>> >> [Text=。 CharacterOffsetBegin=71 CharacterOffsetEnd=72]
>>>>>>>>>> >> =========================
>>>>>>>>>> >>
>>>>>>>>>> >> The simple format can be like English sentences using space to
>>>>>>>>>> separate
>>>>>>>>>> >> the words:
>>>>>>>>>> >>
>>>>>>>>>> >> ===============
>>>>>>>>>> >> Sentence #2 (14 tokens):
>>>>>>>>>> >> 陈清扬 当时 二十六 岁 , 就 在 我 插队 的 地方 当 医生 。
>>>>>>>>>> >>
>>>>>>>>>> >>
>>>>>>>>>> >> ----
>>>>>>>>>> >> Ruiting Lian
>>>>>>>>>> >>
>>>>>>>>>> >>
>>>>>>>>>> >> On Mon, Jul 10, 2017 at 12:13 PM, Linas Vepstas <
>>>>>>>>>> [email protected]>
>>>>>>>>>> >> wrote:
>>>>>>>>>> >>>
>>>>>>>>>> >>> Here's what I have for non-segmented data:
>>>>>>>>>> >>>
>>>>>>>>>> >>> The current Mandarin dataset has enough English in it to
>>>>>>>>>> parse even
>>>>>>>>>> >>> simple English, not well, but not terribly badly, either.
>>>>>>>>>> Basically, the
>>>>>>>>>> >>> wikipedia articles contain scattered English, and this is
>>>>>>>>>> treated just like
>>>>>>>>>> >>> everything else.
>>>>>>>>>> >>>
>>>>>>>>>> >>> The cost is the MI of that pair. The integers are the
>>>>>>>>>> word-ordinals.
>>>>>>>>>> >>>
>>>>>>>>>> >>> (print-mst "this is a test")
>>>>>>>>>> >>> 2 . this <--> 3 . is cost= 12.596325047561628
>>>>>>>>>> >>> 3 . is <--> 4 . a cost= 10.684454066752686
>>>>>>>>>> >>> 1 . ###LEFT-WALL### <--> 3 . is cost= 0.5236859955283215
>>>>>>>>>> >>> 1 . ###LEFT-WALL### <--> 5 . test cost=
>>>>>>>>>> -0.8277888080021967
>>>>>>>>>> >>>
>>>>>>>>>> >>> (print-mst "it is surprsing that this works")
>>>>>>>>>> >>> 5 . that <--> 7 . works cost= 12.824735708193995
>>>>>>>>>> >>> 5 . that <--> 6 . this cost= 11.23043475606367
>>>>>>>>>> >>> 3 . is <--> 5 . that cost= 8.50983391898189
>>>>>>>>>> >>> 2 . it <--> 3 . is cost= 12.720811706832116
>>>>>>>>>> >>> 1 . ###LEFT-WALL### <--> 7 . works cost=
>>>>>>>>>> 0.7545942265555112
>>>>>>>>>> >>>
>>>>>>>>>> >>>
>>>>>>>>>> >>> Some random sentences:
>>>>>>>>>> >>>
>>>>>>>>>> >>> 也 常 在 工 業 上 與 實 驗 室 中 , 用 於 有 機 合 成 中 的 強 鹼 ( 超 強 鹼 ) 。
>>>>>>>>>> >>>
>>>>>>>>>> >>> 長 時 間 都 跟 男 人 接 觸 , 不 擅 長 對 待 女 孩 子 。
>>>>>>>>>> >>>
>>>>>>>>>> >>> 把 蕾 當 作 戀 人 看 待 , 記 憶 力 比 一 般 人 優 異 。
>>>>>>>>>> >>>
>>>>>>>>>> >>> 道 德 学 社 创 始 人 。
>>>>>>>>>> >>>
>>>>>>>>>> >>> 然 而 , 此 混 战 亦 可 坚 持 三 十 六 日 。
>>>>>>>>>> >>>
>>>>>>>>>> >>>
>>>>>>>>>> >>> These are segmented so that it's two hanzi per pair. There is
>>>>>>>>>> NO word
>>>>>>>>>> >>> segmentation!! I am hoping that word segmentation will happen
>>>>>>>>>> >>> "automatically". See below.
>>>>>>>>>> >>>
>>>>>>>>>> >>> I do not yet have software to draw the graphs. You will have
>>>>>>>>>> to do this
>>>>>>>>>> >>> by hand, for now.
>>>>>>>>>> >>>
>>>>>>>>>> >>> The parses:
>>>>>>>>>> >>>
>>>>>>>>>> >>> (print-mst "也 常 在 工 業 上 與 實 驗 室 中 , 用 於 有 機 合 成 中 的 強 鹼 ( 超
>>>>>>>>>> 強 鹼 ) 。")
>>>>>>>>>> >>> 9 . 實 <--> 10 . 驗 cost= 8.67839880455297
>>>>>>>>>> >>> 10 . 驗 <--> 11 . 室 cost= 7.808708535957656
>>>>>>>>>> >>> 5 . 工 <--> 11 . 室 cost= 3.628109329166861
>>>>>>>>>> >>> 5 . 工 <--> 6 . 業 cost= 5.267964316861255
>>>>>>>>>> >>> 8 . 與 <--> 9 . 實 cost= 1.5154639602692228
>>>>>>>>>> >>> 11 . 室 <--> 18 . 合 cost= 1.3112275389754586
>>>>>>>>>> >>> 18 . 合 <--> 19 . 成 cost= 2.8487390406796376
>>>>>>>>>> >>> 2 . 也 <--> 19 . 成 cost= 1.3284690162639947
>>>>>>>>>> >>> 2 . 也 <--> 3 . 常 cost= 2.670074279865112
>>>>>>>>>> >>> 2 . 也 <--> 4 . 在 cost= 1.5999478270616105
>>>>>>>>>> >>> 19 . 成 <--> 27 . 鹼 cost= 1.0648751602654372
>>>>>>>>>> >>> 23 . 鹼 <--> 27 . 鹼 cost= 7.397179300260998
>>>>>>>>>> >>> 22 . 強 <--> 23 . 鹼 cost= 6.6278644991783935
>>>>>>>>>> >>> 26 . 強 <--> 27 . 鹼 cost= 6.6278644991783935
>>>>>>>>>> >>> 25 . 超 <--> 26 . 強 cost= 4.233373422153647
>>>>>>>>>> >>> 21 . 的 <--> 23 . 鹼 cost= 2.189776253921316
>>>>>>>>>> >>> 23 . 鹼 <--> 24 . ( cost= 2.0228678213402027
>>>>>>>>>> >>> 20 . 中 <--> 21 . 的 cost= 0.7882106238577222
>>>>>>>>>> >>> 11 . 室 <--> 13 . , cost= 0.7204578405457767
>>>>>>>>>> >>> 13 . , <--> 15 . 於 cost= 1.3787498338128241
>>>>>>>>>> >>> 14 . 用 <--> 15 . 於 cost= 2.3097280192674408
>>>>>>>>>> >>> 13 . , <--> 16 . 有 cost= 0.8940307468617856
>>>>>>>>>> >>> 16 . 有 <--> 17 . 機 cost= 1.5798315464927484
>>>>>>>>>> >>> 12 . 中 <--> 13 . , cost= 0.702836797299172
>>>>>>>>>> >>> 19 . 成 <--> 29 . 。 cost= 0.5752920624938298
>>>>>>>>>> >>> 28 . ) <--> 29 . 。 cost= 1.1524305408117694
>>>>>>>>>> >>> 6 . 業 <--> 7 . 上 cost= -0.02469597819435876
>>>>>>>>>> >>> 1 . ###LEFT-WALL### <--> 2 . 也 cost= -0.5802694975440783
>>>>>>>>>> >>>
>>>>>>>>>> >>>
>>>>>>>>>> >>> scheme@(guile-user)> (print-mst "長 時 間 都 跟 男 人 接 觸 , 不 擅 長
>>>>>>>>>> 對 待 女
>>>>>>>>>> >>> 孩 子 。")
>>>>>>>>>> >>> 9 . 接 <--> 10 . 觸 cost= 8.386123586043663
>>>>>>>>>> >>> 9 . 接 <--> 16 . 待 cost= 5.500641386527644
>>>>>>>>>> >>> 15 . 對 <--> 16 . 待 cost= 4.527961561831734
>>>>>>>>>> >>> 4 . 間 <--> 9 . 接 cost= 3.011954338244294
>>>>>>>>>> >>> 3 . 時 <--> 4 . 間 cost= 5.53719025855275
>>>>>>>>>> >>> 2 . 長 <--> 4 . 間 cost= 1.4431523444018381
>>>>>>>>>> >>> 2 . 長 <--> 19 . 子 cost= 1.6192113159084691
>>>>>>>>>> >>> 18 . 孩 <--> 19 . 子 cost= 6.867211688834326
>>>>>>>>>> >>> 17 . 女 <--> 18 . 孩 cost= 7.309507215753307
>>>>>>>>>> >>> 13 . 擅 <--> 16 . 待 cost= 1.4396781889010661
>>>>>>>>>> >>> 13 . 擅 <--> 14 . 長 cost= 7.846183142809686
>>>>>>>>>> >>> 12 . 不 <--> 13 . 擅 cost= 4.148490791614584
>>>>>>>>>> >>> 11 . , <--> 13 . 擅 cost= 2.360114646965709
>>>>>>>>>> >>> 4 . 間 <--> 5 . 都 cost= 0.9693867850594948
>>>>>>>>>> >>> 5 . 都 <--> 6 . 跟 cost= 2.3990108227984237
>>>>>>>>>> >>> 6 . 跟 <--> 7 . 男 cost= 1.3528584177181244
>>>>>>>>>> >>> 7 . 男 <--> 8 . 人 cost= 2.705451519003308
>>>>>>>>>> >>> 2 . 長 <--> 20 . 。 cost= 0.9405658895019222
>>>>>>>>>> >>> 1 . ###LEFT-WALL### <--> 2 . 長 cost= -0.11680557768830013
>>>>>>>>>> >>>
>>>>>>>>>> >>>
>>>>>>>>>> >>> scheme@(guile-user)> (print-mst "把 蕾 當 作 戀 人 看 待 , 記 憶 力 比
>>>>>>>>>> 一 般 人 優 異
>>>>>>>>>> >>> 。")
>>>>>>>>>> >>> 11 . 記 <--> 12 . 憶 cost= 9.984319885371686
>>>>>>>>>> >>> 6 . 戀 <--> 12 . 憶 cost= 4.479455917846746
>>>>>>>>>> >>> 6 . 戀 <--> 17 . 人 cost= 3.346336621332618
>>>>>>>>>> >>> 6 . 戀 <--> 7 . 人 cost= 3.346336621332618
>>>>>>>>>> >>> 6 . 戀 <--> 19 . 異 cost= 3.1326057184954585
>>>>>>>>>> >>> 18 . 優 <--> 19 . 異 cost= 7.429036103043035
>>>>>>>>>> >>> 12 . 憶 <--> 13 . 力 cost= 2.9092434467610424
>>>>>>>>>> >>> 7 . 人 <--> 8 . 看 cost= 1.5817340934392554
>>>>>>>>>> >>> 8 . 看 <--> 9 . 待 cost= 6.40562791854984
>>>>>>>>>> >>> 6 . 戀 <--> 16 . 般 cost= 1.532540145438297
>>>>>>>>>> >>> 15 . 一 <--> 16 . 般 cost= 6.011468981031394
>>>>>>>>>> >>> 14 . 比 <--> 16 . 般 cost= 2.133858423252395
>>>>>>>>>> >>> 19 . 異 <--> 20 . 。 cost= 1.062610265554028
>>>>>>>>>> >>> 5 . 作 <--> 20 . 。 cost= 0.9473091730162793
>>>>>>>>>> >>> 4 . 當 <--> 5 . 作 cost= 1.8844566876556712
>>>>>>>>>> >>> 2 . 把 <--> 4 . 當 cost= 1.601964785223224
>>>>>>>>>> >>> 2 . 把 <--> 3 . 蕾 cost= 3.301384076256177
>>>>>>>>>> >>> 1 . ###LEFT-WALL### <--> 4 . 當 cost= 1.4277603296921786
>>>>>>>>>> >>> 8 . 看 <--> 10 . , cost= 0.7854131431579994
>>>>>>>>>> >>>
>>>>>>>>>> >>>
>>>>>>>>>> >>> scheme@(guile-user)> (print-mst "道 德 学 社 创 始 人 。")
>>>>>>>>>> >>> 6 . 创 <--> 7 . 始 cost= 5.70773284687867
>>>>>>>>>> >>> 4 . 学 <--> 6 . 创 cost= 3.061665951119931
>>>>>>>>>> >>> 6 . 创 <--> 8 . 人 cost= 2.1724657173546227
>>>>>>>>>> >>> 4 . 学 <--> 5 . 社 cost= 1.2513192358165952
>>>>>>>>>> >>> 8 . 人 <--> 9 . 。 cost= 0.9996485775966537
>>>>>>>>>> >>> 2 . 道 <--> 9 . 。 cost= 0.566680489393292
>>>>>>>>>> >>> 2 . 道 <--> 3 . 德 cost= 2.4802112908263467
>>>>>>>>>> >>> 1 . ###LEFT-WALL### <--> 2 . 道 cost= -0.4271902589965446
>>>>>>>>>> >>>
>>>>>>>>>> >>>
>>>>>>>>>> >>> scheme@(guile-user)> (print-mst "然 而 , 此 混 战 亦 可 坚 持 三 十 六
>>>>>>>>>> 日 。")
>>>>>>>>>> >>> 10 . 坚 <--> 11 . 持 cost= 8.090136242433331
>>>>>>>>>> >>> 2 . 然 <--> 10 . 坚 cost= 3.986785521916751
>>>>>>>>>> >>> 2 . 然 <--> 3 . 而 cost= 5.260819401185174
>>>>>>>>>> >>> 1 . ###LEFT-WALL### <--> 2 . 然 cost= 1.960750908016161
>>>>>>>>>> >>> 2 . 然 <--> 5 . 此 cost= 1.6401055467853318
>>>>>>>>>> >>> 4 . , <--> 5 . 此 cost= 1.9045829576234112
>>>>>>>>>> >>> 5 . 此 <--> 8 . 亦 cost= 1.6941756040836786
>>>>>>>>>> >>> 8 . 亦 <--> 9 . 可 cost= 3.9818559895630496
>>>>>>>>>> >>> 5 . 此 <--> 7 . 战 cost= 0.7150479036642032
>>>>>>>>>> >>> 6 . 混 <--> 7 . 战 cost= 3.4641015339868524
>>>>>>>>>> >>> 11 . 持 <--> 16 . 。 cost= 0.667356443539326
>>>>>>>>>> >>> 11 . 持 <--> 12 . 三 cost= -0.082652373159668
>>>>>>>>>> >>> 12 . 三 <--> 13 . 十 cost= 4.910027341075681
>>>>>>>>>> >>> 13 . 十 <--> 14 . 六 cost= 6.198613466629329
>>>>>>>>>> >>> 13 . 十 <--> 15 . 日 cost= 1.2911435674834095
>>>>>>>>>> >>> scheme@(guile-user)>
>>>>>>>>>> >>>
>>>>>>>>>> >>>
>>>>>>>>>> >>> On Thu, Jul 6, 2017 at 7:57 PM, Ben Goertzel <
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>> >>>>
>>>>>>>>>> >>>> >> The primary issue I'm still very much struggling with is
>>>>>>>>>> that I am
>>>>>>>>>> >>>> >> not happy
>>>>>>>>>> >>>> >> with the classification of words into word-groups. I've
>>>>>>>>>> run multiple
>>>>>>>>>> >>>> >> experiments, all of which are a tad underwhelming. They're
>>>>>>>>>> not bad,
>>>>>>>>>> >>>> >> they're
>>>>>>>>>> >>>> >> just not yet very good, either.
>>>>>>>>>> >>>> >
>>>>>>>>>> >>>> > Andres Suarez (an intern here), together with me and
>>>>>>>>>> Curtis, is
>>>>>>>>>> >>>> > experimenting with a modification of Adagram (itself an
>>>>>>>>>> extension of
>>>>>>>>>> >>>> > word2vec to handle disambiguation) for this... We'll
>>>>>>>>>> share any
>>>>>>>>>> >>>> > meaningful results we get...
>>>>>>>>>> >>>>
>>>>>>>>>> >>>> but... yeah... it's a hard problem and surely the right
>>>>>>>>>> place to be
>>>>>>>>>> >>>> stuck...
>>>>>>>>>> >>>>
>>>>>>>>>> >>>>
>>>>>>>>>> >>>>
>>>>>>>>>> >>>> --
>>>>>>>>>> >>>> Ben Goertzel, PhD
>>>>>>>>>> >>>> http://goertzel.org
>>>>>>>>>> >>>>
>>>>>>>>>> >>>> "I am God! I am nothing, I'm play, I am freedom, I am life.
>>>>>>>>>> I am the
>>>>>>>>>> >>>> boundary, I am the peak." -- Alexander Scriabin
>>>>>>>>>> >>>
>>>>>>>>>> >>>
>>>>>>>>>> >>
>>>>>>>>>> >
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Ben Goertzel, PhD
>>>>>>>>>> http://goertzel.org
>>>>>>>>>>
>>>>>>>>>> "I am God! I am nothing, I'm play, I am freedom, I am life. I am
>>>>>>>>>> the
>>>>>>>>>> boundary, I am the peak." -- Alexander Scriabin
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
--
You received this message because you are subscribed to the Google Groups
"opencog" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
To post to this group, send email to [email protected].
Visit this group at https://groups.google.com/group/opencog.
To view this discussion on the web visit
https://groups.google.com/d/msgid/opencog/CAHrUA36KYEU-knAAFdie7FMikiXeE7saO9JcbT5dFkOSTpCp4w%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.