Hi Linas,

Good to hear from you.

I have done some googling about LLMs, and I have found that many people are using them to analyse genomic data, for example DNABERT_2 (https://github.com/MAGICS-LAB/DNABERT_2?tab=readme-ov-file), which can easily be used via the Hugging Face transformers library (https://huggingface.co/docs/transformers/en/index). Their approach is the usual one: first train a model, then use it to predict. In our case, where do we get the knowledge to store in the AtomSpace? I can certainly do some reading on their work and figure out how they do it.
Do you have the pattern matching tool set on GitHub? I am a command line person. I would not mind even if it is a bit messy.

I am a biologist by training, but professionally I don't do biology. It would be fun for me to do some biology on the side. My shortcoming is that I am not a good coder.

Hope to hear from you soon,
Abu

On Wed, 8 Jan 2025 at 01:03, Linas Vepstas <[email protected]> wrote:

> Hi Abu,
>
> Let me respond in reverse order.
>
> On Tue, Jan 7, 2025 at 2:57 PM Abu Naser <[email protected]> wrote:
> >
> > Thank you very much for your very informative email. Among topics you
> > mentioned, following two sounds interesting:
> >
> > 1) Pattern discovery
> > 2) Hooking up an LLM-based chatbot to a large genomics data.
> >
> > What tools do you have for pattern discovery?
> > Regarding LLM-based chatbot, is it expected to implement LLM chatbot
> > from the scratch?
>
> "From scratch" sounds so pessimistic. A good place to start would be
> Llama https://en.wikipedia.org/wiki/Llama_(language_model) -- the
> models are freely available, the code to generate more is GPL'ed. I'm
> unclear about what sort of compute resources are needed to deploy.
>
> So let me hop back to item 1. Here's how I do pattern discovery,
> personally, on my own pet project. "There are many, but this one is
> mine". So. I start with a system that obtains pair-wise correlations
> between "things", could be anything, as long as they can be tagged.
> This generates high-dimensional sparse vectors. So, if you have a
> million "things", then there is a N=1 million-dimensional vector,
> since, for any one item, there might be any of 999,999 others it might
> be related to. It is a sparse vector, because most of the other
> relations are zero. (The atomspace is highly optimized for storing
> sparse vectors)
>
> These vectors exhibit all the classical properties of vector
> embeddings: for example, the classic "king - man + woman = queen"
> embedding pops up trivially, without any work.
>
> But pair-wise correlations are boring and old-hat, so my next step is
> to create tensors (I often call them "jigsaws", but people react
> negatively to that term. Meanwhile, the word "tensor" has a
> sophisticated sheen of respectability to it, even though the tensors
> of general relativity, and quantum mechanics, and .. neural nets, are
> all exactly jigsaws.) A tensor is, specifically, a segment of a
> network graph, where some of the connecting edges have been cut, to
> create a disconnected graph component. The cut edges are not
> discarded, but are instead tagged with a type marker, so that they
> could be reconnected, if/when desired. This is the "jigsaw".
>
> So then I look for pairwise correlations between tensors, to create a
> vector of tensors. Lather, rinse, repeat.
>
> Well, not quite: in between is a clustering step: most of these
> (vectors of) tensors look similar to one-another, in that they connect
> in similar ways. An example from genomics would be a gene that has
> similar function in mammals and insects, perhaps because it's highly
> conserved, or whatever. Judging similarity is done using vector
> products. Cosine dot products are a reasonable start; but I like
> certain information-theoretic, Bayesian-style products better. But
> they're all in the same ballpark. The clustering step is part of the
> "information discovery" or "pattern mining" of the process:
> classifying similar things.
>
> I do the classification step before the second pair-wise step. So,
> vectorize, classify, tensorize, classify, repeat. Tensors can be
> contracted, so the last step re-assembles the network connections,
> this time using the generic, abstracted classes, instead of the
> specific, concrete exemplars.
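The first "vectorize, then judge similarity" step described above can be sketched in a few lines of Python. This is only a minimal illustration, assuming plain pair counts and a cosine product; it is not the AtomSpace code, and the function names (`pair_counts`, `cosine`) and the toy gene/protein labels are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def pair_counts(observations):
    """Build sparse vectors: vec[a][b] = number of times a and b co-occur."""
    vec = defaultdict(lambda: defaultdict(int))
    for items in observations:
        # Count each unordered pair once per observation, in both directions.
        for a, b in combinations(sorted(set(items)), 2):
            vec[a][b] += 1
            vec[b][a] += 1
    return vec


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical toy data: each row is a set of "things" seen together.
observations = [
    ["geneA", "protX", "protY"],
    ["geneB", "protX", "protY"],
    ["geneC", "protZ"],
]
vectors = pair_counts(observations)
sim_ab = cosine(vectors["geneA"], vectors["geneB"])  # same partners
sim_ac = cosine(vectors["geneA"], vectors["geneC"])  # no shared partners
print(sim_ab, sim_ac)
```

Only the non-zero entries are stored, which is what makes a million-dimensional vector affordable. Items whose vectors score close to 1 are candidates for the same class in the clustering step; an information-theoretic product could be swapped in for `cosine` without changing the rest.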
>
> In genomics, it would be like saying "these kinds of genes, as a
> class, interact with those kinds of genes, as a class, and up/down
> regulate or express these kinds of proteins, as a class". The class
> may be a class of one. I find the size of the classes have a
> square-root-zipfian distribution. Why? I don't know. I have measured
> this for genes, proteins; someone once gave me a dataset, years ago.
>
> The goal of pairing, tensoring, classifying, and then doing it again
> is to ladder my way up to large-scale, complex structures, built out
> of small-scale itty-bitty structures. For example, say, discovering
> how different variants of a Krebs cycle interact with other cycles and
> regulatory mechanisms in different species, or something like that.
> (I'm NOT a biology/genetics guy, I'm making up things I imagine might
> be interesting to you, for illustration.) I think it's a cool idea,
> but very few are enthused by it.
>
> There are many practical problems. Foremost is that I don't have a
> mouse-n-windows interface to this. You cannot just click on this item
> and ask "show me all structural relationships that this participates
> in", which is what people want.
>
> Second is that this is implemented in an ad-hoc collection of software
> bits-pieces-parts. Some of those pieces are highly polished, carefully
> documented, fully debugged. Others are duct-tape and string. The
> process is a batch process: press a button, wait a few hours, a day or
> a week, get gigabytes of results, and then feed it to the next stage.
> And this is where I got tangled and lost. The next stage, the
> recursive step, seems to work great, but the batch processing is
> killing me. I've got hundreds of ten-gigabyte-sized datasets, each
> with different properties, different defects, different results,
> incompatible with the last batch, etc. Waiting a week for an
> experiment to run, only to realize there was a mistake in the
> pipeline, or that I should tune some parameter .. it's a mind-killer.
> I thought to move away from batch, to stream processing, and then got
> bogged down.
>
> Then the meta-question: I personally think that this is a great way of
> extracting hierarchical structure from complex networks. But there are
> people who want to see accuracy scores on standard benchmarks, so that
> they can compare to their favorite horse in the horse-race ... Gah.
> I'm not racing horses in a horse race; I'm trying to understand how
> horses work. For starters, a leg at each corner.
> https://m.media-amazon.com/images/I/51deqr5XrYL._SY445_SX342_.jpg
>
> Would this process create anything useful for you, for genomics? I
> dunno. Maybe, maybe not.
>
> Back to part two.
>
> The LLM API would be verbal: "find all genes that upregulate 5-alpha
> reductase and are implicated in prostate enlargement" or something
> like that, instead of windows+mouse clicking your way through that.
>
> The next technical challenge is "how can I attach an LLM, say, llama,
> to be specific, to a large dataset of results, and/or to a
> machine-learning system that can extract new results and
> relationships?" I dunno. It's something I also want to work on. It's
> high up on my todo list, due to other conversations outside of this
> particular one. So I might be able to help/work on such a task,
> because it's .. generically desired by many people. But everything is
> up in the air, and sorting through priorities is .. hard, and I'm just
> one person with no money and no staff. The "opencog community" never
> really gelled, because this stuff is just too complicated and we don't
> have benchmark figures for people who are shopping around for
> benchmarks.
>
> -- Linas
>
> >
> > Kind regards,
> >
> > Abu
> >
> > On Tue, 7 Jan 2025 at 17:47, Linas Vepstas <[email protected]> wrote:
> >>
> >> On Tue, Jan 7, 2025 at 3:46 AM Abu Naser <[email protected]> wrote:
> >> >
> >> > I am interested in applying agi in genomics. Is there any tutorial on
> >> > how to build models, etc. ?
> >>
> >> OpenCog is not AGI, since that doesn't exist. Although everyone says
> >> they are working on it. OpenCog is a system for implementing various
> >> aspects of AGI: exploring, experimenting, tinkering.
> >>
> >> OpenCog has a set of components, ranging from rock-solid, stable,
> >> high-performance, to buggy, incomplete, abandoned.
> >>
> >> At the stable end is the AtomSpace, which is a way for storing
> >> anything in any way: vectors, dense networks, sparse networks, graphs,
> >> things that flow or change in time, whatever. It has been used for
> >> storing genomic and proteomic data, and the reactomes connecting them.
> >> I did look at that code: the core storage idea seemed fine. Some of
> >> the processing algorithms were poorly designed. I was called in for
> >> emergency repairs on one: after a month's worth of work, I got it to
> >> run 200x faster. That's right, two-hundred times. Unfortunately, by
> >> then, the client lost interest. The moral of the story is that
> >> software engineering matters: just cause it's whiz-bang AI doesn't mean
> >> you can ignore basic design principles. So it goes.
> >>
> >> That project was mining for small reactome networks: for example,
> >> given one gene and one protein, find one other gene, two up/down
> >> regulators, and one other (I don't know, I'm not a geneticist) that
> >> formed a loop, or a star-shape, or something. The issue was that these
> >> sometimes could be found in a second or two, and sometimes it would
> >> take an hour of data-mining, which was annoying for the geneticists
> >> who just wanted the answer but didn't want to wait an hour. Of course,
> >> as the reaction network moved from 4 or 5 interactions, to 6 or 8,
> >> there was a combinatorial explosion.
> >>
> >> The reason for this was that that system performed an exhaustive
> >> search: it literally tried every possible combination, so that even
> >> obscure, opaque and thus "novel" combinations would be found. The
> >> deep-learning neural nets provide an alternative to exhaustive search.
> >> However, no one has hooked up a deep learning net for genomics into
> >> opencog, so you will not get lucky, there.
> >>
> >> MOSES (that you had trouble building) is a system for discovering
> >> pattern correlations in datasets. One project applied it to find a
> >> list of about 100 or 200 genes that correlated with long lifespans.
> >> The code, the adapter that did that was either proprietary, or was
> >> lost to the sands of time.
> >>
> >> I've been working on a tool for pattern discovery. In principle ("in
> >> theory") it could be used for genomics data. In practice, this would
> >> require adapters, shims and rejiggering.
> >>
> >> And so what? You use it, you can find some patterns, some
> >> correlations, and so what? There must be a zillion patterns and
> >> correlations in genomic data, so you have to be more focused than
> >> that.
> >>
> >> Some parts of the AI world talk about building "virtual scientists"
> >> that can "create hypotheses and test them". OpenCog does not do this.
> >>
> >> Creating an AI scientist that automatically makes discoveries sounds
> >> really cool! An exciting and new shiny future of AI machine
> >> scientists! But for one thing: the mathematicians have already tried
> >> this.
> >>
> >> Math is crisp enough that it is very easy to "create hypotheses and
> >> test them". They're called "theorems",
> >> and you test them with "theorem provers". Turns out that 99.999% of
> >> all theorems are boring (to humans). Yes, it might be true that X+Y=Z,
> >> but who cares? So what?
> >>
> >> I suspect a similar problem applies to genomics. Yes, someday, we
> >> might have AI scientists making "profound" discoveries, but the "so
> >> what?" question lingers. Unless that discovery is very specific: "take
> >> these pills, eat these foods and exercise regularly, you will become
> >> smarter and have a longer healthspan", that discovery is useless, in
> >> and of itself.
> >>
> >> There is a way out. In science, it turns out that making discoveries
> >> is hard, but once you have them, you can remember them, so you don't
> >> have to re-discover from scratch. You write them down in textbooks,
> >> teach the next generation, who then takes those discoveries and
> >> recombines them to make new discoveries. In mathematics, these are
> >> called "oracles": you have a question, the oracle can answer them
> >> instantly. Now, you can't actually build the pure mathematical
> >> definition of an oracle, but if you pretend you can, you can make
> >> deductions that are otherwise hard.
> >>
> >> If you can collect all the hard-to-find interrelations in genetics, so
> >> that the next time around it's instant and easy, then .. ?
> >>
> >> Let's amble down that path. The various LLM's -- ChatGPT, and the OpenAI
> >> stuff and the Gemini from google are question-oracle-like things. You
> >> can ask questions, and get answers. OpenCog does NOT have one of
> >> these, and certainly not one optimized for genomics questions. If
> >> you want a natural language, chatbot interface to your genomics
> >> oracle, OpenCog is not the thing. Because OpenCog does not have
> >> chatbot natural language interfaces to its tools: the tools are all
> >> old-style, "Dr. DOS Prompt", and not windows-n-mouse interfaces, and
> >> certainly not LLM chatbots. Alas.
> >>
> >> Could you hook up an LLM-based chatbot to a large dataset of genomics
> >> data (using, for example, the OpenCog AtomSpace to hold it, and
> >> various tools to data-mine it?) I guess you could. But no one has done
> >> this, and this would be a large project. Not something you'd
> >> accomplish in a week or two of tinkering.
> >>
> >> -- linas
> >>
> >> >
> >> > Kind regards,
> >> >
> >> > Abu
> >> >
> >> > On Tue, 7 Jan 2025 at 03:55, Linas Vepstas <[email protected]> wrote:
> >> >>
> >> >> Hi Abu,
> >> >>
> >> >> I just merged a fix into as-moses which I think will solve the build
> >> >> problem you had. Try `git pull` on as-moses and with luck, the problem
> >> >> will be gone.
> >> >>
> >> >> --linas
> >> >>
> >> >> On Mon, Jan 6, 2025 at 5:56 PM Linas Vepstas <[email protected]> wrote:
> >> >> >
> >> >> > I can't reproduce this problem, so I will need your help. Try changing
> >> >> > bind to std::bind and changing _2 to std::placeholders::_2
> >> >> >
> >> >> > If that doesn't fix it, try changing the two std's to boost, so,
> >> >> > boost::bind and boost::placeholders
> >> >> >
> >> >> > Boost has been the source of ongoing breakage, and the decision to use
> >> >> > it was a mistake. So it goes.
> >> >> >
> >> >> > --linas
> >> >> >
> >> >> > On Mon, Jan 6, 2025 at 3:49 PM Abu Naser <[email protected]> wrote:
> >> >> > >
> >> >> > > Hi Linas,
> >> >> > >
> >> >> > > I have another error while I was installing asmoses:
> >> >> > >
> >> >> > > /asmoses/opencog/asmoses/reduct/reduct/flat_normal_form.cc:34:36: error: call of overloaded ‘bind(std::negate<int>, const boost::arg<2>&)’ is ambiguous
> >> >> > >    34 |                     bind(std::negate<int>(), _2))) != c.end());
> >> >> > >
> >> >> > > Please let me know if you have any solution for this issue.
> >> >> > >
> >> >> > > Kind regards,
> >> >> > > Abu
> >> >> > >
> >> >> > > On Mon, 6 Jan 2025 at 20:06, Abu Naser <[email protected]> wrote:
> >> >> > >>
> >> >> > >> Thank you Linas. It works now.
> >> >> > >>
> >> >> > >> Kind regards,
> >> >> > >>
> >> >> > >> Abu
> >> >> > >>
> >> >> > >> On Mon, 6 Jan 2025 at 19:41, Linas Vepstas <[email protected]> wrote:
> >> >> > >>>
> >> >> > >>> Hi Abu,
> >> >> > >>>
> >> >> > >>> class concurrent_set is provided by cogutils -- the solution would be to go to
> >> >> > >>> cd cogutils, git pull, rebuild and reinstall. Then the atomspace
> >> >> > >>> should build. See here:
> >> >> > >>>
> >> >> > >>> https://github.com/opencog/cogutil/blob/be54bfcadaf8439f324cf525781b254c87fa0722/opencog/util/concurrent_set.h#L162-L168
> >> >> > >>>
> >> >> > >>> --linas
> >> >> > >>>
> >> >> > >>> On Sat, Jan 4, 2025 at 6:11 AM Abu Naser <[email protected]> wrote:
> >> >> > >>> >
> >> >> > >>> > Hi Everyone,
> >> >> > >>> >
> >> >> > >>> > The following error is thrown while I was compiling atomspace on Ubuntu:
> >> >> > >>> >
> >> >> > >>> > opencog_repos/atomspace/opencog/persist/proxy/WriteBufferProxy.cc:85:14: error: ‘class concurrent_set<opencog::Handle>’ has no member named ‘clear’
> >> >> > >>> >    85 |  _atom_queue.clear();
> >> >> > >>> >
> >> >> > >>> > Is there any solution for this error?
> >> >> > >>> >
> >> >> > >>> > Kind regards,
> >> >> > >>> >
> >> >> > >>> > Abu
> >> >> > >>> >
> >> >> > >>> > --
> >> >> > >>> > You received this message because you are subscribed to the Google Groups "opencog" group.
> >> >> > >>> > To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
> >> >> > >>> > To view this discussion visit https://groups.google.com/d/msgid/opencog/CAMw3wdg6zMZgwF0hwk_ibqHuMyc9EC30qsJQPbRwmqEnexXLNg%40mail.gmail.com.
> >> >> > >>>
> >> >> > >>> --
> >> >> > >>> Patrick: Are they laughing at us?
> >> >> > >>> Sponge Bob: No, Patrick, they are laughing next to us.

--
You received this message because you are subscribed to the Google Groups "opencog" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion visit https://groups.google.com/d/msgid/opencog/CAMw3wdhx9BfcpRFSWMCsz%2Bor0F6UAJ9Y3sxmv9g-VgC4aeJruA%40mail.gmail.com.
