All translation loses information. To translate exactly, you have to add
words to the output, whether it's English to Chinese or Chinese to English.
A word expanding into pages can happen between any pair of languages. To
translate a word like "qubit" exactly, I would have to write a book on
quantum mechanics.

On Sun, Dec 27, 2020, 2:36 PM James Bowery <jabow...@gmail.com> wrote:

> The choice of English is, I think obviously, good for the reasons you
> point out.  However, cutting Bob a little slack on his statement:
>
> Even within the Germanic branch of the Indo-European language family, people
> will frequently make reference to the difficulty of translating the subtle but
> critical nuance of, say, the original German of a relatively technical
> philosophical text such as "Being and Time".  The translation ends up being
> "difficult" at best for the English reader (not that Heidegger is a walk in
> the park even for a native speaker).  It wouldn't surprise me to find
> _occasional_ Chinese characters that express ideas, the subtle but critical
> nuances of which require "pages".
>
> The Weltanschauung of various cultures is hard to get across without
> immersion.  So even if there were to exist translation software with
> sufficient sophistication to do them justice, it is likely that such
> translations would occasionally appear as opaque as "Being and Time" if not
> more so.  Multiply the greater cultural distance between China and England by
> the larger "character set" of Chinese relative to English, and I think Bob's
> qualifier "may", as in "may take several pages" (as opposed to "will
> frequently take several pages"), is excusable.
>
> On Sun, Dec 27, 2020 at 12:23 PM Matt Mahoney <mattmahone...@gmail.com>
> wrote:
>
>> I don't believe that a few characters in Chinese translate to a few pages
>> of English. Most translations I have seen are of similar size. A word in
>> Chinese is 1 to 4 symbols (syllables) from a 7000 character alphabet.
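
A back-of-envelope sketch, assuming uniform symbol frequencies (an assumption of mine that overstates the real entropy of both languages), suggests a Chinese word and an English word carry the same order of magnitude of information, which is consistent with translations coming out a similar size:

```python
import math

# Upper bounds only: assumes every symbol is equally likely, which
# overstates the entropy of both Chinese and English running text.
bits_per_hanzi = math.log2(7000)   # ~12.8 bits per character from a ~7000-character set
bits_per_letter = math.log2(26)    # ~4.7 bits per letter of the English alphabet

chinese_word_bits = (1 * bits_per_hanzi, 4 * bits_per_hanzi)  # a word is 1 to 4 characters
english_word_bits = 5 * bits_per_letter                       # assuming ~5 letters per word

print(f"Chinese word: {chinese_word_bits[0]:.1f} to {chinese_word_bits[1]:.1f} bits (uniform bound)")
print(f"English word: {english_word_bits:.1f} bits (uniform bound)")
```

Both land in the tens of bits per word, not a pages-versus-characters gap.
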
>>
>> I used English rather than a multilingual corpus for the Hutter Prize
>> because a computer does not have to know more than one language to pass the
>> Turing test. Chinese has the most native speakers, but English is the most
>> widely spoken when including second languages. English is taught in most
>> primary schools around the world including China.
>>
>> I discussed the choice of language in
>> http://mattmahoney.net/dc/rationale.html
>> but I admit I'm biased toward my native language.
>>
>> On Sun, Dec 27, 2020, 12:30 PM James Bowery <jabow...@gmail.com> wrote:
>>
>>> This email is from a late colleague who developed the zero address
>>> architecture for Burroughs, written to me shortly after the announcement of
>>> The Hutter Prize for Lossless Compression of Human Knowledge.  If anyone
>>> knows how to get in contact with Jun Gu, let me know.  I've followed up all
>>> the leads left in the West and come up with nothing but dead ends in the
>>> PRC and Hong Kong:
>>>
>>> john97j...@aol.com
>>> Aug 31, 2006, 7:22 AM
>>> to me
>>>
>>> Jim,
>>>
>>> Syntopticon is the Britannica encyclopedia of ideas, and the
>>> Syntopticon itself is the (one small volume) codex cross-referencing all
>>> the great ideas of mankind contained in all the many other books of this
>>> reference system. I have an old copy and it is the size of the Britannica
>>> encyclopedia (i.e., an 8-foot-long collection of big books).  The importance of
>>> Syntopticon in the Hutter competition is that expression of ideas in
>>> different natural languages is a very different proposition.  Since an idea
>>> in Chinese is typically represented by one or a few symbols, and that same
>>> idea in English may take several pages, there is a huge difference to start
>>> with in the effectiveness of searches in Chinese or English.  There is also
>>> a difference in the size of the database containing, say, a Chinese version
>>> of Syntopticon or an English version.  And a much greater difference in the
>>> effectiveness of doing automated (programmatic) searches in Chinese or
>>> English.
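
A toy comparison with a made-up sentence pair (my own example, not drawn from Syntopticon or the thread) illustrates how the gap in symbol counts shrinks once the text is stored as bytes:

```python
# Toy example: a short English sentence and a standard Chinese translation.
english = "Knowledge is power."
chinese = "知识就是力量。"

for label, text in (("English", english), ("Chinese", chinese)):
    print(f"{label}: {len(text)} symbols, {len(text.encode('utf-8'))} UTF-8 bytes")
```

Chinese uses roughly a third as many symbols here, but the UTF-8 sizes are nearly equal, so the database-size difference is smaller than the raw symbol counts suggest.
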
>>>
>>> But from the perspective of doing analyses on these databases
>>> (especially if one represents, say, Wikipedia in Chinese), and if I apply my
>>> vector fusion conversion to that Chinese database to translate (and further
>>> compress) those symbols into an analytically tractable form, the analysis
>>> of ideas is a vastly different problem than if one attempts to do it all
>>> in English. Searches are certainly one analytic process of major interest
>>> in natural languages; drawing conclusions, making predictions, and drawing
>>> inferences are other less investigated processes. My PhD graduate student
>>> Rok Sosic's thesis was on software that understands what it is doing
>>> (title: The Many Faces of Introspection, Utah, 1992).  His poem summarizing
>>> his thesis was:
>>>
>>>     The box is a secret, knotty, black
>>>     It's so complex that I've lost track.
>>>     If somehow it's made reflective,
>>>     The box will be much more effective.
>>>
>>> Computerdom does not have a lot of art in inference engines (making
>>> predictions).  The most effective inference engine that I know of is the
>>> software done for Colossus, the Bletchley Park code-breaking "computer" of
>>> WWII.  The Brits still treat that software as classified even though the
>>> hardware has been declassified for years.  So far as I know, nobody outside
>>> of the UK knows
>>> the details of that software.  My point here is that drawing understanding
>>> from natural languages is a relatively small art practiced mostly by
>>> cryptanalysts.  And my further point is that the natural language of
>>> interest (be it English, Chinese, Mayan or ...) has a major influence on
>>> how one (person or program) goes about doing analyses and making inferences.
>>>
>>> From a practical perspective, the Hutter challenge would be much more
>>> tractable, at least for me, if I could do it in Chinese.  My first PhD
>>> student was Jun Gu, who is currently Chief Information Scientist for the PRC.
>>> His thesis was on efficient compression technologies.
>>>
>>> If you wish, you can share these thoughts with whomever you please.
>>>
>>> Bob Johnson
>>> Prof. Emeritus
>>> Computer Science
>>> Univ. of Utah
>>>
