Hi Andi,

I'm not sure how to respond to this. Every word is already an int -- a 32-bit int, which happens to be a pointer to the string of letters in the word. Maybe you could squeeze this down to 16 bits, but what's the point?

The association of integers to "things" is called an "index", and there are many kinds of indexes: vectors (arrays), rb-trees and hash tables being the most popular. Pretty much all software that does almost anything at all is packed to the gills with indexes of every kind. It's pretty fundamental to the definition of what computing is all about.

Simply having an index of words is not enough to do any kind of textual analysis at all. Typically, you need to know how often a word occurs, how often it occurs next to other words, whether it occurs more frequently on one page than another. You've got to compute all this information, and more, store it somewhere too, and apply god-knows-what algorithms to it: LSA or MI or word2vec or whatever. Replacing a 32-bit pointer to a word string by a 16-bit int does pretty much precisely nothing to simplify the complexity of the data analysis problem.

Seriously, think about it. Google invented map-reduce to solve their data analysis problems. The Apache Foundation shepherds along hadoop and tinkerpop and cassandra to deal with the indexing problem. Text analysis and big data are just huge parts of the economy these days. Quite infamously, Cambridge Analytica used text analysis to help get Trump elected. We are not living in the 1960s.

--linas

On Wed, May 10, 2017 at 12:46 PM, Andi <[email protected]> wrote:

> come on linas, seems that you don't understand me on purpose :)
>
> you make a python-like dictionary for every single symbol in your text.
> one symbol (word or sign or space etc.) is represented by one int. if there
> are fewer than 64k different symbols (words), which will be true for most
> books, you can take 16 bits. with this you can put a medium-sized book with
> maybe 400 pages directly into the CPU cache and do your operations very
> quickly....
>
> Building such a dictionary should be very quick on the run and speed up
> everything and something like a tree of distances will become handy.
>
> knowing that this is your special domain....
>
> but this is what i am thinking about it today :)
>
> --Andi
>
> On Wednesday, May 10, 2017 at 19:06:45 UTC+2, linas wrote:
>>
>> On Wed, May 10, 2017 at 7:46 AM, Andi <[email protected]> wrote:
>>
>>> Linas, thank you for your precise and profound explanations!
>>
>> You are welcome! The more who understand this stuff, the better!
>>
>>> As far as I understand, what is going on here at OpenCog, an Atom is the
>>> most universal thing in the universe - able to represent "all that is the
>>> case" - as Witti would say.
>>
>> Yeah, I'm not sure where that name comes from. Opencog stole it from
>> textbooks on logic; where it was before that I don't know. It might date to
>> Whitehead and Hilbert.
>>
>>> Universality is always in contradiction to performance. One can not
>>> balance this.
>>> I think a step to overcome this is to compile certain types of atoms at
>>> run time to something optimized for performance and then recompile the
>>> results back to regular atoms.
>>
>> Well, we do: some atoms have C++ counterparts. The most complicated of
>> these is the PatternLink, which stores pre-compiled copies of the
>> patterns that it searches for. That way, when you call it, all the
>> machinery is there, warm and ready to go.
>>
>> > Maybe especially at your main topic - link grammar.
>> > Somewhere I read your complaints, how slow it became when you ported it
>> > to the atomspace.
>>
>> > My thought about this was that there should be a possibility to transform a
>> > given text corpus to a list of integers, where every int represents a word or
>> > sign, operate on this list and bring back the results to the atom space.
>>
>> Heh. You are on a slippery slope here. **everything** inside a computer
>> is a "list of integers". the question is always "which list of integers
>> should it be".
>>
>> -- Linas
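
The two positions in this thread can be put side by side in a short sketch. The Python below is illustrative only and is not code from OpenCog or link-grammar: `encode` and `pair_mi` are hypothetical names, tokens are simply split on whitespace, and "MI" is assumed to mean pointwise mutual information over adjacent word pairs. The first function is the 16-bit dictionary encoding Andi describes; the second is a small taste of the counting work that still has to be done afterwards, whatever the width of the word ID.

```python
import math
from array import array
from collections import Counter


def encode(text):
    """Map each whitespace-separated token to a small integer ID."""
    vocab = {}          # token -> integer ID (the "index")
    ids = array("H")    # unsigned short: 16 bits on common platforms
    for token in text.split():
        if token not in vocab:
            if len(vocab) >= 65536:
                raise ValueError("more than 64k distinct tokens")
            vocab[token] = len(vocab)
        ids.append(vocab[token])
    return vocab, ids


def pair_mi(ids):
    """Pointwise mutual information (log base 2) of each adjacent ID pair."""
    unigrams = Counter(ids)                 # how often each word occurs
    bigrams = Counter(zip(ids, ids[1:]))    # how often word b follows word a
    n_uni = len(ids)
    n_bi = max(len(ids) - 1, 1)
    mi = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        mi[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return mi


text = "the cat sat on the mat the cat ran"
vocab, ids = encode(text)
mi = pair_mi(ids)
print(list(ids))
print(mi[(vocab["the"], vocab["cat"])])
```

The encoding step is a handful of lines and one pass over the text. The unigram and bigram tables are where the memory and time go, and they are keyed by the same IDs whether those are 16-bit integers or pointers.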
