Hi Andi, I'm not sure how to respond to this. Every word is already an int
--  a 32-bit int, which happens to be a pointer to the string of letters in
the word. Maybe you could squeeze this down to 16 bits, but what's the
point?

The association of integers to "things" is called an "index", and there are
many kinds of indexes: vectors (arrays), rb-trees and hash tables being the
most popular.  Pretty much all software that does almost anything at all is
packed to the gills with indexes of every kind. It's pretty fundamental to
the definition of what computing is all about.
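
As a minimal sketch of the kind of index under discussion (Python, since the
thread mentions a "python-like dictionary"): a hash table maps each word to an
integer id, a vector maps the id back to the word, and the text itself becomes
an array of 16-bit ids. The function name `build_index`, the whitespace
tokenization and the sample sentence are assumptions made for illustration;
none of this is taken from OpenCog or Link Grammar.

```python
from array import array

def build_index(text):
    """Map each distinct whitespace-separated token to a small integer id."""
    word_to_id = {}        # hash table: word -> id
    id_to_word = []        # vector: id -> word
    encoded = array("H")   # unsigned short, 16 bits on common platforms
    for word in text.split():
        if word not in word_to_id:
            if len(id_to_word) >= 65536:
                raise OverflowError("more than 64k distinct words")
            word_to_id[word] = len(id_to_word)
            id_to_word.append(word)
        encoded.append(word_to_id[word])
    return word_to_id, id_to_word, encoded

word_to_id, id_to_word, encoded = build_index("the cat saw the dog")
decoded = " ".join(id_to_word[i] for i in encoded)
```

The encoding round-trips, but it only relabels the words; it carries no
information about how they are used.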

Simply having an index of words is not enough to do any kind of textual
analysis at all. Typically, you need to know how often a word occurs, how
often it occurs next to other words, whether it occurs more frequently on
one page than another. You've got to compute all this information, and
more, store it somewhere too, and apply god-knows-what algorithms to it:
LSA or MI or word2vec or whatever.  Replacing a 32-bit pointer to a word
string by a 16-bit int does pretty much precisely nothing to simplify the
complexity of the data analysis problem.
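
To make that concrete, here is a hedged sketch of the simplest version of such
an analysis: word counts, adjacent-pair counts, and a naive pointwise mutual
information estimate for each pair. It assumes "MI" above means mutual
information; the function name `pair_mi` and the estimator are illustrative
choices, not the method used in OpenCog. Note that the work is the same
whether the items are 16-bit ids, 32-bit pointers or the strings themselves.

```python
import math
from collections import Counter

def pair_mi(ids):
    """Count items and adjacent pairs; return naive pointwise MI in bits."""
    unigrams = Counter(ids)
    bigrams = Counter(zip(ids, ids[1:]))
    n_uni = len(ids)
    n_bi = max(len(ids) - 1, 1)
    mi = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi            # probability of the adjacent pair
        p_a = unigrams[a] / n_uni      # probability of the left item
        p_b = unigrams[b] / n_uni      # probability of the right item
        mi[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return unigrams, bigrams, mi

# ids as produced by some word -> integer index, e.g. "the cat saw the dog"
ids = [0, 1, 2, 0, 3]
unigrams, bigrams, mi = pair_mi(ids)
```

All of the counting, storage and statistics remain to be done after the words
have been turned into integers.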

Seriously, think about it. Google invented map-reduce to solve their data
analysis problems. The Apache Foundation shepherds along Hadoop and TinkerPop
and Cassandra to deal with the indexing problem.  Text analysis and big
data are just huge parts of the economy these days.  Quite infamously,
Cambridge Analytica used text analysis to help get Trump elected. We are
not living in the 1960s.

--linas

On Wed, May 10, 2017 at 12:46 PM, Andi <[email protected]> wrote:

> come on linas, seems that you don't understand me on purpose :)
>
> you make a python-like dictionary for every single symbol in your text.
> one symbol (word or sign or space etc.) is represented by one int. if there
> are fewer than 64k different symbols (words), which will be true for most
> books, you can use 16 bits. with this you can put a medium sized book of
> maybe 400 pages directly into the CPU cache and do your operations very
> quickly....
>
> Building such a dictionary on the fly should be very quick and speed
> everything up, and something like a tree of distances will come in handy.
>
> knowing that this is your special domain....
>
> but this is what i am thinking about it today :)
>
> --Andi
>
> On Wednesday, May 10, 2017 at 19:06:45 UTC+2, linas wrote:
>>
>>
>>
>> On Wed, May 10, 2017 at 7:46 AM, Andi <[email protected]> wrote:
>>
>>> Linas, thank you for your precise and profound explanations!
>>>
>>
>> You are welcome! The more who understand this stuff, the better!
>>
>>>
>>> As far as I understand, what is going on here at OpenCog, an Atom is the
>>> most universal thing in the universe - able to represent "all that is the
>>> case" - as Witti would say.
>>>
>>
>> Yeah, I'm not sure where that name comes from. Opencog stole it from
>> textbooks on logic; where it was before that I don't know. It might date to
>> Whitehead and Hilbert.
>>
>>>
>>> Universality is always in contradiction to performance. One can not
>>> balance this.
>>> I think a step to overcome this is to compile certain types of atoms at
>>> run time to something optimized for performance and then recompile the
>>> results back to regular atoms.
>>>
>>
>> Well, we do: some atoms have C++ counterparts. The most complicated of
>> these is the PatternLink, which stores pre-compiled copies of the
>> patterns that it searches for.  That way, when you call it, all the
>> machinery is there, warm and ready to go.
>>
>>> Maybe especially at your main topic - link grammar.
>>> Somewhere I read your complaints, how slow it became when you ported it
>>> to the atomspace.
>>>
>>> My thought about this was that there should be a possibility to
>>> transform a given text corpus to a list of integers, where every int
>>> represents a word or sign, operate on this list and bring back the
>>> results to the atomspace.
>>>
>>
>> Heh. You are on a slippery slope here. **everything** inside a computer
>> is a "list of integers".  The question is always "which list of integers
>> should it be".
>>
>> -- Linas
>>
>>
>>
>>
