Hi Abu,

Let me respond in reverse order.
On Tue, Jan 7, 2025 at 2:57 PM Abu Naser <[email protected]> wrote:
>
> Thank you very much for your very informative email. Among the topics
> you mentioned, the following two sound interesting:
>
> 1) Pattern discovery
> 2) Hooking up an LLM-based chatbot to a large genomics dataset.
>
> What tools do you have for pattern discovery?
> Regarding the LLM-based chatbot, is it expected that one implement the
> chatbot from scratch?

"From scratch" sounds so pessimistic. A good place to start would be Llama
https://en.wikipedia.org/wiki/Llama_(language_model) -- the models are
freely available, and the code to generate more is GPL'ed. I'm unclear
about what sort of compute resources are needed to deploy it.

So let me hop back to item 1. Here's how I do pattern discovery,
personally, on my own pet project. "There are many, but this one is
mine."

I start with a system that obtains pair-wise correlations between
"things" -- these could be anything, as long as they can be tagged. This
generates high-dimensional sparse vectors. If you have a million
"things", then each vector is N = 1 million-dimensional, since any one
item might be related to any of the other 999,999. The vector is sparse
because most of those relations are zero. (The AtomSpace is highly
optimized for storing sparse vectors.)

These vectors exhibit all the classical properties of vector embeddings:
for example, the classic "king - man + woman = queen" embedding pops up
trivially, without any work.

But pair-wise correlations are boring and old-hat, so my next step is to
create tensors. (I often call them "jigsaws", but people react
negatively to that term. Meanwhile, the word "tensor" has a
sophisticated sheen of respectability to it, even though the tensors of
general relativity, of quantum mechanics and of neural nets are all
exactly jigsaws.) A tensor is, specifically, a segment of a network
graph where some of the connecting edges have been cut, to create a
disconnected graph component.
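The pair-wise correlation step above can be sketched in a few lines of
Python. This is a toy illustration of the idea, not the AtomSpace
implementation; the gene/protein names and the dict-of-dicts sparse
representation are my own assumptions for the example.

```python
from collections import defaultdict

def sparse_vectors(pairs):
    """Build sparse pair-wise correlation vectors from observed pairs.

    Each item gets a dict mapping co-observed items to counts; absent
    keys are implicitly zero, so a million-dimensional vector costs
    only as much memory as its non-zero entries.
    """
    vec = defaultdict(lambda: defaultdict(int))
    for a, b in pairs:
        vec[a][b] += 1
        vec[b][a] += 1
    return vec

# Hypothetical gene/protein co-occurrence observations.
observations = [("geneA", "protX"), ("geneA", "protY"),
                ("geneB", "protX"), ("geneA", "protX")]
v = sparse_vectors(observations)
print(v["geneA"]["protX"])  # 2: observed together twice
print(v["geneA"]["geneB"])  # 0: never observed together
```

Only the non-zero entries are ever stored, so the million-"things" case
costs memory proportional to the number of observed relations, not to N.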
The cut edges are not discarded, but are instead tagged with a type
marker, so that they can be reconnected, if and when desired. This is
the "jigsaw".

So then I look for pair-wise correlations between tensors, to create a
vector of tensors. Lather, rinse, repeat. Well, not quite: in between
there is a clustering step. Most of these (vectors of) tensors look
similar to one another, in that they connect in similar ways. An example
from genomics would be a gene that has a similar function in mammals and
insects, perhaps because it's highly conserved, or whatever.

Judging similarity is done using vector products. Cosine dot products
are a reasonable start, but I like certain information-theoretic,
Bayesian-style products better. They're all in the same ballpark,
though. The clustering step is the "information discovery" or "pattern
mining" part of the process: classifying similar things. I do the
classification step before the second pair-wise step. So: vectorize,
classify, tensorize, classify, repeat.

Tensors can be contracted, so the last step re-assembles the network
connections, this time using the generic, abstracted classes instead of
the specific, concrete exemplars. In genomics, it would be like saying
"these kinds of genes, as a class, interact with those kinds of genes,
as a class, and up/down-regulate or express these kinds of proteins, as
a class". A class may be a class of one. I find that the class sizes
have a square-root-Zipfian distribution. Why? I don't know. I have
measured this for genes and proteins, from a dataset someone gave me,
years ago.

The goal of pairing, tensoring, classifying, and then doing it all again
is to ladder my way up to large-scale, complex structures, built out of
small-scale, itty-bitty structures. For example, say, discovering how
different variants of the Krebs cycle interact with other cycles and
regulatory mechanisms in different species, or something like that.
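The similarity judgment described above can be illustrated with a plain
cosine product on sparse vectors; a minimal sketch, standing in for the
information-theoretic, Bayesian-style products preferred in the text,
and assuming vectors are stored as dicts of their non-zero entries.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors (dicts of non-zero entries)."""
    dot = sum(u[k] * v.get(k, 0) for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Two hypothetical genes that connect to the same proteins in similar
# proportions score near 1.0 and would be clustered into one class.
gene_a = {"protX": 2, "protY": 1}
gene_b = {"protX": 4, "protY": 2}
gene_c = {"protZ": 3}
print(cosine(gene_a, gene_b))  # close to 1.0: same direction
print(cosine(gene_a, gene_c))  # 0.0: no shared connections
```

Swapping in a mutual-information or Bayesian score only changes the
product, not the clustering loop around it -- they're all in the same
ballpark, as the text says.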
(I'm NOT a biology/genetics guy; I'm making up things I imagine might be
interesting to you, for illustration.)

I think it's a cool idea, but very few are enthused by it. There are
many practical problems. Foremost is that I don't have a mouse-n-windows
interface to this. You cannot just click on an item and ask "show me all
structural relationships that this participates in", which is what
people want. Second is that this is implemented as an ad-hoc collection
of software bits, pieces and parts. Some of those pieces are highly
polished, carefully documented, fully debugged. Others are duct tape and
string.

The process is a batch process: press a button, wait a few hours, a day
or a week, get gigabytes of results, and then feed them to the next
stage. And this is where I got tangled and lost. The next stage, the
recursive step, seems to work great, but the batch processing is killing
me. I've got hundreds of ten-gigabyte datasets, each with different
properties, different defects, different results, incompatible with the
last batch, etc. Waiting a week for an experiment to run, only to
realize there was a mistake in the pipeline, or that I should tune some
parameter ... it's a mind-killer. I thought to move away from batch to
stream processing, and then got bogged down.

Then the meta-question: I personally think that this is a great way of
extracting hierarchical structure from complex networks. But there are
people who want to see accuracy scores on standard benchmarks, so that
they can compare to their favorite horse in the horse-race ... Gah. I'm
not racing horses in a horse race; I'm trying to understand how horses
work. For starters, a leg at each corner.
https://m.media-amazon.com/images/I/51deqr5XrYL._SY445_SX342_.jpg

Would this process create anything useful for you, for genomics? I
dunno. Maybe, maybe not.

Back to part two.
The LLM API would be verbal: "find all genes that upregulate 5-alpha
reductase and are implicated in prostate enlargement", or something like
that, instead of windows-n-mouse clicking your way through it.

The next technical challenge is: "how can I attach an LLM -- say, Llama,
to be specific -- to a large dataset of results, and/or to a
machine-learning system that can extract new results and relationships?"
I dunno. It's something I also want to work on. It's high up on my todo
list, due to other conversations outside of this particular one. So I
might be able to help/work on such a task, because it's generically
desired by many people. But everything is up in the air, sorting through
priorities is hard, and I'm just one person with no money and no staff.
The "opencog community" never really gelled, because this stuff is just
too complicated and we don't have benchmark figures for people who are
shopping around for benchmarks.

-- Linas

>
> Kind regards,
>
> Abu
>
> On Tue, 7 Jan 2025 at 17:47, Linas Vepstas <[email protected]> wrote:
>>
>> On Tue, Jan 7, 2025 at 3:46 AM Abu Naser <[email protected]> wrote:
>> >
>> > I am interested in applying AGI in genomics. Is there any tutorial
>> > on how to build models, etc.?
>>
>> OpenCog is not AGI, since that doesn't exist -- although everyone says
>> they are working on it. OpenCog is a system for implementing various
>> aspects of AGI: exploring, experimenting, tinkering.
>>
>> OpenCog has a set of components, ranging from rock-solid, stable,
>> high-performance, to buggy, incomplete, abandoned.
>>
>> At the stable end is the AtomSpace, which is a way of storing anything
>> in any way: vectors, dense networks, sparse networks, graphs, things
>> that flow or change in time, whatever. It has been used for storing
>> genomic and proteomic data, and the reactomes connecting them. I did
>> look at that code: the core storage idea seemed fine. Some of the
>> processing algorithms were poorly designed.
>> I was called in for emergency repairs on one: after a month's worth of
>> work, I got it to run 200x faster. That's right, two hundred times.
>> Unfortunately, by then, the client had lost interest. The moral of the
>> story is that software engineering matters: just because it's
>> whiz-bang AI doesn't mean you can ignore basic design principles. So
>> it goes.
>>
>> That project was mining for small reactome networks: for example,
>> given one gene and one protein, find one other gene, two up/down
>> regulators, and one other (I don't know, I'm not a geneticist) that
>> formed a loop, or a star shape, or something. The issue was that these
>> could sometimes be found in a second or two, and sometimes it would
>> take an hour of data-mining, which was annoying for the geneticists
>> who just wanted the answer but didn't want to wait an hour. Of course,
>> as the reaction network grew from 4 or 5 interactions to 6 or 8, there
>> was a combinatorial explosion.
>>
>> The reason for this was that the system performed an exhaustive
>> search: it literally tried every possible combination, so that even
>> obscure, opaque and thus "novel" combinations would be found. The
>> deep-learning neural nets provide an alternative to exhaustive search.
>> However, no one has hooked up a deep-learning net for genomics into
>> OpenCog, so you will not get lucky there.
>>
>> MOSES (which you had trouble building) is a system for discovering
>> pattern correlations in datasets. One project applied it to find a
>> list of about 100 or 200 genes that correlated with long lifespans.
>> The adapter code that did that was either proprietary, or was lost to
>> the sands of time.
>>
>> I've been working on a tool for pattern discovery. In principle ("in
>> theory") it could be used for genomics data. In practice, this would
>> require adapters, shims and rejiggering.
>>
>> And so what? You use it, you find some patterns, some correlations --
>> and so what?
>> There must be a zillion patterns and correlations in genomic data, so
>> you have to be more focused than that.
>>
>> Some parts of the AI world talk about building "virtual scientists"
>> that can "create hypotheses and test them". OpenCog does not do this.
>>
>> Creating an AI scientist that automatically makes discoveries sounds
>> really cool! An exciting and shiny new future of AI machine
>> scientists! Except for one thing: the mathematicians have already
>> tried this.
>>
>> Math is crisp enough that it is very easy to "create hypotheses and
>> test them". They're called "theorems", and you test them with "theorem
>> provers". It turns out that 99.999% of all theorems are boring (to
>> humans). Yes, it might be true that X+Y=Z, but who cares? So what?
>>
>> I suspect a similar problem applies to genomics. Yes, someday, we
>> might have AI scientists making "profound" discoveries, but the "so
>> what?" question lingers. Unless a discovery is very specific -- "take
>> these pills, eat these foods and exercise regularly, and you will
>> become smarter and have a longer healthspan" -- it is useless, in and
>> of itself.
>>
>> There is a way out. In science, it turns out that making discoveries
>> is hard, but once you have them, you can remember them, so you don't
>> have to re-discover them from scratch. You write them down in
>> textbooks and teach the next generation, who then take those
>> discoveries and recombine them to make new ones. In mathematics, there
>> are things called "oracles": you have a question, and the oracle can
>> answer it instantly. Now, you can't actually build the pure
>> mathematical definition of an oracle, but if you pretend you can, you
>> can make deductions that are otherwise hard.
>>
>> If you can collect all the hard-to-find interrelations in genetics, so
>> that the next time around it's instant and easy, then .. ?
>>
>> Let's amble down that path.
>> The various LLMs -- ChatGPT and the OpenAI stuff, and Gemini from
>> Google -- are question-oracle-like things. You can ask questions and
>> get answers. OpenCog does NOT have one of these, and certainly not one
>> optimized for genomics questions. If you want a natural-language,
>> chatbot interface to your genomics oracle, OpenCog is not the thing,
>> because OpenCog does not have chatbot natural-language interfaces to
>> its tools: the tools are all old-style, "Dr. DOS Prompt", not
>> windows-n-mouse interfaces, and certainly not LLM chatbots. Alas.
>>
>> Could you hook up an LLM-based chatbot to a large dataset of genomics
>> data (using, for example, the OpenCog AtomSpace to hold it, and
>> various tools to data-mine it)? I guess you could. But no one has done
>> this, and it would be a large project -- not something you'd
>> accomplish in a week or two of tinkering.
>>
>> -- linas
>>
>> >
>> > Kind regards,
>> >
>> > Abu
>> >
>> > On Tue, 7 Jan 2025 at 03:55, Linas Vepstas <[email protected]> wrote:
>> >>
>> >> Hi Abu,
>> >>
>> >> I just merged a fix into as-moses which I think will solve the
>> >> build problem you had. Try `git pull` on as-moses and, with luck,
>> >> the problem will be gone.
>> >>
>> >> --linas
>> >>
>> >> On Mon, Jan 6, 2025 at 5:56 PM Linas Vepstas
>> >> <[email protected]> wrote:
>> >> >
>> >> > I can't reproduce this problem, so I will need your help. Try
>> >> > changing bind to std::bind and changing _2 to
>> >> > std::placeholders::_2.
>> >> >
>> >> > If that doesn't fix it, try changing the two std's to boost, so,
>> >> > boost::bind and boost::placeholders.
>> >> >
>> >> > Boost has been the source of ongoing breakage, and the decision
>> >> > to use it was a mistake. So it goes.
>> >> >
>> >> > --linas
>> >> >
>> >> > On Mon, Jan 6, 2025 at 3:49 PM Abu Naser <[email protected]> wrote:
>> >> > >
>> >> > > Hi Linas,
>> >> > >
>> >> > > I have another error while I was installing asmoses:
>> >> > >
>> >> > > /asmoses/opencog/asmoses/reduct/reduct/flat_normal_form.cc:34:36:
>> >> > > error: call of overloaded ‘bind(std::negate<int>, const
>> >> > > boost::arg<2>&)’ is ambiguous
>> >> > >    34 |   bind(std::negate<int>(), _2))) != c.end());
>> >> > >
>> >> > > Please let me know if you have any solution for this issue.
>> >> > >
>> >> > > Kind regards,
>> >> > > Abu
>> >> > >
>> >> > > On Mon, 6 Jan 2025 at 20:06, Abu Naser <[email protected]> wrote:
>> >> > >>
>> >> > >> Thank you, Linas. It works now.
>> >> > >>
>> >> > >> Kind regards,
>> >> > >>
>> >> > >> Abu
>> >> > >>
>> >> > >> On Mon, 6 Jan 2025 at 19:41, Linas Vepstas
>> >> > >> <[email protected]> wrote:
>> >> > >>>
>> >> > >>> Hi Abu,
>> >> > >>>
>> >> > >>> The class concurrent_set is provided by cogutil -- the
>> >> > >>> solution would be to go to cogutil, git pull, rebuild and
>> >> > >>> reinstall. Then the atomspace should build. See here:
>> >> > >>>
>> >> > >>> https://github.com/opencog/cogutil/blob/be54bfcadaf8439f324cf525781b254c87fa0722/opencog/util/concurrent_set.h#L162-L168
>> >> > >>>
>> >> > >>> --linas
>> >> > >>>
>> >> > >>> On Sat, Jan 4, 2025 at 6:11 AM Abu Naser <[email protected]> wrote:
>> >> > >>> >
>> >> > >>> > Hi Everyone,
>> >> > >>> >
>> >> > >>> > The following error is thrown while I was compiling
>> >> > >>> > atomspace on Ubuntu:
>> >> > >>> >
>> >> > >>> > opencog_repos/atomspace/opencog/persist/proxy/WriteBufferProxy.cc:85:14:
>> >> > >>> > error: ‘class concurrent_set<opencog::Handle>’ has no
>> >> > >>> > member named ‘clear’
>> >> > >>> >    85 |   _atom_queue.clear();
>> >> > >>> >
>> >> > >>> > Is there any solution for this error?
>> >> > >>> >
>> >> > >>> > Kind regards,
>> >> > >>> >
>> >> > >>> > Abu
>> >> > >>> >
>> >> > >>> > --
>> >> > >>> > You received this message because you are subscribed to the
>> >> > >>> > Google Groups "opencog" group.
>> >> > >>> > To unsubscribe from this group and stop receiving emails
>> >> > >>> > from it, send an email to [email protected].
>> >> > >>> > To view this discussion visit
>> >> > >>> > https://groups.google.com/d/msgid/opencog/CAMw3wdg6zMZgwF0hwk_ibqHuMyc9EC30qsJQPbRwmqEnexXLNg%40mail.gmail.com.
>> >> > >>>
>> >> > >>> --
>> >> > >>> Patrick: Are they laughing at us?
>> >> > >>> Sponge Bob: No, Patrick, they are laughing next to us.
