Hi Linas,

Good to hear from you.
I have done some googling about LLMs, and found that many people are using
them for analysing genomic data, for example DNABERT-2
(https://github.com/MAGICS-LAB/DNABERT_2?tab=readme-ov-file), which can
easily be used via https://huggingface.co/docs/transformers/en/index.
Their approach is the usual one: first train a model, then use it to
predict. In our case, where do we get the knowledge to store in the
AtomSpace? I can certainly do some reading on their work and figure out
how they do it.

Do you have the pattern-discovery toolset on GitHub? I am a command-line
person, so I would not mind even if it is a bit messy. I am a biologist by
training, but professionally I don't do biology. It would be fun for me to
do some biology on the side of my profession. My shortcoming is that I am
not a good coder.

Hope to hear from you soon,

Abu



On Wed, 8 Jan 2025 at 01:03, Linas Vepstas <[email protected]> wrote:

> Hi Abu,
>
> Let me respond in reverse order.
>
> On Tue, Jan 7, 2025 at 2:57 PM Abu Naser <[email protected]> wrote:
> >
> > Thank you very much for your very informative email. Among the topics
> you mentioned, the following two sound interesting:
> >
> > 1) Pattern discovery
> > 2) Hooking up an LLM-based chatbot to a large genomics dataset.
> >
> > What tools do you have for pattern discovery?
> > Regarding the LLM-based chatbot, is it expected to implement an LLM
> chatbot from scratch?
>
> "From scratch" sounds so pessimistic. A good place to start would be
> Llama https://en.wikipedia.org/wiki/Llama_(language_model) -- the
> models are freely available, the code to generate more is GPL'ed. I'm
> unclear about what sort of compute resources are needed to deploy.
>
> So let me hop back to item 1. Here's how I do pattern discovery,
> personally, on my own pet project. "There are many, but this one is
> mine". So. I start with a system that obtains pair-wise correlations
> between "things", could be anything, as long as they can be tagged.
> This generates high-dimensional sparse vectors. So, if you have a
> million "things", then there is an N = 1 million-dimensional vector,
> since any one item might be related to any of the 999,999 others.
> It is a sparse vector, because most of the other
> relations are zero. (The atomspace is highly optimized for storing
> sparse vectors.)
>
> These vectors exhibit all the classical properties of vector
> embeddings: for example, the classic "king - man + woman = queen"
> embedding pops up trivially, without any work.
>
> But pair-wise correlations are boring and old-hat, so my next step is
> to create tensors (I often call them "jigsaws", but people react
> negatively to that term. Meanwhile, the word "tensor" has a
> sophisticated sheen of respectability to it, even though the tensors
> of general relativity, and quantum mechanics, and .. neural nets, are
> all exactly jigsaws.)  A tensor is, specifically, a segment of a
> network graph, where some of the connecting edges have been cut, to
> create a disconnected graph component. The cut edges are not
> discarded, but are instead tagged with a type marker, so that they
> could be reconnected, if/when desired. This is the "jigsaw".
>
> So then I look for pairwise correlations between tensors, to create a
> vector of tensors. Lather, rinse, repeat.
>
> Well, not quite: in between is a clustering step: most of these
> (vectors of) tensors look similar to one-another, in that they connect
> in similar ways. An example from genomics would be a gene that has
> similar function in mammals and insects, perhaps because it's highly
> conserved, or whatever.  Judging similarity is done using vector
> products. Cosine dot products are a reasonable start, but I like
> certain information-theoretic, Bayesian-style products better. But
> they're all in the same ballpark.  The clustering step is part of the
> "information discovery" or "pattern mining" of the process:
> classifying similar things.
>
> I do the classification step before the second pair-wise step. So,
> vectorize, classify, tensorize, classify, repeat. Tensors can be
> contracted, so the last step re-assembles the network connections,
> this time using the generic, abstracted classes, instead of the
> specific, concrete exemplars.
>
> In genomics, it would be like saying "these kinds of genes, as a
> class, interact with those kinds of genes, as a class, and up/down
> regulate or express these kinds of proteins, as a class". The class
> may be a class of one. I find that the sizes of the classes have a
> square-root-Zipfian distribution. Why? I don't know. I have measured
> this for genes and proteins; someone once gave me a dataset, years ago.
>
> The goal of pairing, tensoring, classifying, and then doing it again
> is to ladder my way up to large-scale, complex structures, built out
> of small-scale itty-bitty structures. For example, say, discovering
> how different variants of a Krebs cycle interact with other cycles and
> regulatory mechanisms in different species, or something like that.
> (I'm NOT a biology/genetics guy, I'm making up things I imagine might
> be interesting to you, for illustration.)  I think it's a cool idea,
> but very few are enthused by it.
>
> There are many practical problems. Foremost is that I don't have a
> mouse-n-windows interface to this. You cannot just click on this item
> and ask "show me all structural relationships that this participates
> in", which is what people want.
>
> Second is that this is implemented in an ad-hoc collection of software
> bits-pieces-parts. Some of those pieces are highly polished, carefully
> documented, fully debugged. Others are duct-tape and string. The
> process is a batch process: press a button, wait a few hours, a day or
> a week, get gigabytes of results, and then feed it to the next stage.
> And this is where I got tangled and lost. The next stage, the
> recursive step, seems to work great, but the batch processing is
> killing me. I've got hundreds of ten-gigabyte-sized datasets, each
> with different properties, different defects, different results,
> incompatible with the last batch, etc. Waiting a week for an
> experiment to run, only to realize there was a mistake in the
> pipeline, or that I should tune some parameter .. it's a mind-killer.
> I thought to move away from batch, to stream processing, and then got
> bogged down.
>
> Then the meta-question: I personally think that this is a great way of
> extracting hierarchical structure from complex networks. But there are
> people who want to see accuracy scores on standard benchmarks, so that
> they can compare to their favorite horse in the horse-race ... Gah.
> I'm not racing horses in a horse race; I'm trying to understand how
> horses work. For starters, a leg at each corner.
> https://m.media-amazon.com/images/I/51deqr5XrYL._SY445_SX342_.jpg
>
> Would this process create anything useful for you, for genomics? I
> dunno. Maybe, maybe not.
>
> Back to part two.
>
> The LLM API would be verbal: "find all genes that upregulate 5-alpha
> reductase and are implicated in prostate enlargement" or something
> like that, instead of windows+mouse clicking your way through that.
>
> The next technical challenge is "how can I attach an LLM, say, llama,
> to be specific, to a large dataset of results, and/or to a
> machine-learning system that can extract new results and
> relationships?"  I dunno. It's something I also want to work on. It's
> high up on my todo list, due to other conversations outside of this
> particular one. So I might be able to help/work on such a task,
> because it's .. generically desired by many people. But everything is
> up in the air, and sorting through priorities is .. hard, and I'm just
> one person with no money and no staff. The "opencog community" never
> really gelled, because this stuff is just too complicated and we don't
> have benchmark figures for people who are shopping around for
> benchmarks.
>
> -- Linas
>
> >
> > Kind regards,
> >
> > Abu
> >
> > On Tue, 7 Jan 2025 at 17:47, Linas Vepstas <[email protected]>
> wrote:
> >>
> >> On Tue, Jan 7, 2025 at 3:46 AM Abu Naser <[email protected]> wrote:
> >> >
> >> > I am interested in applying AGI in genomics. Is there any tutorial on
> how to build models, etc.?
> >>
> >> OpenCog is not AGI, since that doesn't exist. Although everyone says
> >> they are working on it. OpenCog is a system for implementing various
> >> aspects of AGI: exploring, experimenting, tinkering.
> >>
> >> OpenCog has a set of components, ranging from rock-solid, stable,
> >> high-performance, to buggy, incomplete, abandoned.
> >>
> >> At the stable end is the AtomSpace, which is a way for storing
> >> anything in any way: vectors, dense networks, sparse networks, graphs,
> >> things that flow or change in time, whatever. It has been used for
> >> storing genomic and proteomic data, and the reactomes connecting them.
> >> I did look at that code: the core storage idea seemed fine. Some of
> >> the processing algorithms were poorly designed. I was called in for
> >> emergency repairs on one: after a month's worth of work, I got it to
> >> run 200x faster. That's right, two-hundred times. Unfortunately, by
> >> then, the client lost interest. The moral of the story is that
> >> software engineering matters: just because it's whiz-bang AI doesn't mean
> >> you can ignore basic design principles. So it goes.
> >>
> >> That project was mining for small reactome networks: for example,
> >> given one gene and one protein, find one other gene, two up/down
> >> regulators, and one other (I don't know, I'm not a geneticist) that
> >> formed a loop, or a star-shape, or something. The issue was that these
> >> sometimes could be found in a second or two, and sometimes it would
> >> take an hour of data-mining, which was annoying for the geneticists
> >> who just wanted the answer but didn't want to wait an hour. Of course,
> >> as the reaction network moved from 4 or 5 interactions, to 6 or 8,
> >> there was a combinatorial explosion.
> >>
> >> The reason for this was that that system performed an exhaustive
> >> search: it literally tried every possible combination, so that even
> >> obscure, opaque and thus "novel" combinations would be found.  The
> >> deep-learning neural nets provide an alternative to exhaustive search.
> >> However, no one has hooked up a deep learning net for genomics into
> >> opencog, so you will not get lucky, there.
> >>
> >> MOSES (that you had trouble building) is a system for discovering
> >> pattern correlations in datasets. One project applied it to find a
> >> list of about 100 or 200 genes that correlated with long lifespans.
> >> The code, the adapter that did that was either proprietary, or was
> >> lost to the sands of time.
> >>
> >> I've been working on a tool for pattern discovery. In principle ("in
> >> theory") it could be used for genomics data. In practice, this would
> >> require adapters, shims and rejiggering.
> >>
> >> And so what? You use it, you can find some patterns, some
> >> correlations, and so what? There must be a zillion patterns and
> >> correlations in genomic data, so you have to be more focused than
> >> that.
> >>
> >> Some parts of the AI world talk about building "virtual scientists"
> >> that can "create hypotheses and test them". OpenCog does not do this.
> >>
> >> Creating an AI scientist that automatically makes discoveries sounds
> >> really cool! An exciting and new shiny future of AI machine
> >> scientists! But for one thing: the mathematicians have already tried
> >> this.
> >>
> >> Math is crisp enough that it is very easy to "create hypotheses and
> >> test them". They're called "theorems",
> >> and you test them with "theorem provers".  Turns out that 99.999% of
> >> all theorems are boring (to humans). Yes, it might be true that X+Y=Z,
> >> but who cares? So what?
> >>
> >> I suspect a similar problem applies to genomics. Yes, someday, we
> >> might have AI scientists making "profound" discoveries, but the "so
> >> what?" question lingers. Unless that discovery is very specific: "take
> >> these pills, eat these foods and exercise regularly, you will become
> >> smarter and have a longer healthspan", that discovery is useless, in
> >> and of itself.
> >>
> >> There is a way out. In science, it turns out that making discoveries
> >> is hard, but once you have them, you can remember them, so you don't
> >> have to re-discover from scratch. You write them down in textbooks,
> >> teach the next generation, who then takes those discoveries and
> >> recombines them to make new discoveries. In mathematics, these are
> >> called "oracles": you have a question, and the oracle answers it
> >> instantly. Now, you can't actually build the pure mathematical
> >> definition of an oracle, but if you pretend you can, you can make
> >> deductions that are otherwise hard.
> >>
> >> If you can collect all the hard-to-find interrelations in genetics, so
> >> that the next time around it's instant and easy, then .. ?
> >>
> >> Let's amble down that path. The various LLMs -- ChatGPT, and the OpenAI
> >> stuff and the Gemini from google are question-oracle-like things. You
> >> can ask questions, and get answers. OpenCog does NOT have one of
> >> these, and certainly not one optimized for genomics questions.   If
> >> you want a natural language, chatbot interface to your genomics
> >> oracle, OpenCog is not the thing. Because OpenCog does not have
> >> chatbot natural language interfaces to its tools: the tools are all
> >> old-style, "Dr. DOS Prompt", and not  windows-n-mouse interfaces, and
> >> certainly not LLM chatbots. Alas.
> >>
> >> Could you hook up an LLM-based chatbot to a large dataset of genomics
> >> data (using, for example, the OpenCog AtomSpace to hold it, and
> >> various tools to data-mine it?) I guess you could. But no one has done
> >> this, and this would be a large project. Not something you'd
> >> accomplish in a week or two of tinkering.
> >>
> >> -- linas
> >>
> >> >
> >> > Kind regards,
> >> >
> >> > Abu
> >> >
> >> >
> >> >
> >> > On Tue, 7 Jan 2025 at 03:55, Linas Vepstas <[email protected]>
> wrote:
> >> >>
> >> >> Hi Abu,
> >> >>
> >> >> I just merged a fix into as-moses which I think will solve the build
> >> >> problem you had. Try `git pull` on as-moses and with luck, the
> problem
> >> >> will be gone.
> >> >>
> >> >> --linas
> >> >>
> >> >> On Mon, Jan 6, 2025 at 5:56 PM Linas Vepstas <[email protected]>
> wrote:
> >> >> >
> >> >> > I can't reproduce this problem, so I will need your help. Try
> changing
> >> >> > bind to std::bind  and changing _2 to std::placeholders::_2
> >> >> >
> >> >> > If that doesn't fix it, try changing the two std's to boost, so,
> so,
> >> >> > boost::bind and boost::placeholders
> >> >> >
> >> >> > Boost has been the source of ongoing breakage, and the decision to
> use
> >> >> > it was a mistake. So it goes.
> >> >> >
> >> >> > --linas
> >> >> >
> >> >> > On Mon, Jan 6, 2025 at 3:49 PM Abu Naser <[email protected]>
> wrote:
> >> >> > >
> >> >> > > Hi Linas,
> >> >> > >
> >> >> > > I have another error while I was installing asmoses:
> >> >> > >
> /asmoses/opencog/asmoses/reduct/reduct/flat_normal_form.cc:34:36: error:
> call of overloaded ‘bind(std::negate<int>, const boost::arg<2>&)’ is
> ambiguous
> >> >> > >    34 |         bind(std::negate<int>(), _2))) != c.end());
> >> >> > >
> >> >> > > Please let me know if you have any solution for this issue.
> >> >> > >
> >> >> > > Kind regards,
> >> >> > > Abu
> >> >> > >
> >> >> > > On Mon, 6 Jan 2025 at 20:06, Abu Naser <[email protected]>
> wrote:
> >> >> > >>
> >> >> > >> Thank you Linas. It works now.
> >> >> > >>
> >> >> > >> Kind regards,
> >> >> > >>
> >> >> > >> Abu
> >> >> > >>
> >> >> > >> On Mon, 6 Jan 2025 at 19:41, Linas Vepstas <
> [email protected]> wrote:
> >> >> > >>>
> >> >> > >>> Hi Abu,
> >> >> > >>>
> >> >> > >>> class concurrent_set is provided by cogutils -- the solution
> would be to go to
> >> >> > >>> cd cogutils, git pull, rebuild and reinstall.  Then the
> atomspace
> >> >> > >>> should build. See here:
> >> >> > >>>
> >> >> > >>>
> https://github.com/opencog/cogutil/blob/be54bfcadaf8439f324cf525781b254c87fa0722/opencog/util/concurrent_set.h#L162-L168
> >> >> > >>>
> >> >> > >>> --linas
> >> >> > >>>
> >> >> > >>> On Sat, Jan 4, 2025 at 6:11 AM Abu Naser <[email protected]>
> wrote:
> >> >> > >>> >
> >> >> > >>> > Hi Everyone,
> >> >> > >>> >
> >> >> > >>> > The following error is thrown while I was compiling
> atomspace on Ubuntu:
> >> >> > >>> >
> >> >> > >>> >
> opencog_repos/atomspace/opencog/persist/proxy/WriteBufferProxy.cc:85:14:
> error: ‘class concurrent_set<opencog::Handle>’ has no member named ‘clear’
> >> >> > >>> >    85 |  _atom_queue.clear();
> >> >> > >>> >
> >> >> > >>> >
> >> >> > >>> > Is there any solution for this error?
> >> >> > >>> >
> >> >> > >>> >
> >> >> > >>> > Kind regards,
> >> >> > >>> >
> >> >> > >>> > Abu
> >> >> > >>> >
> >> >> > >>> > --
> >> >> > >>> > You received this message because you are subscribed to the
> Google Groups "opencog" group.
> >> >> > >>> > To unsubscribe from this group and stop receiving emails
> from it, send an email to [email protected].
> >> >> > >>> > To view this discussion visit
> https://groups.google.com/d/msgid/opencog/CAMw3wdg6zMZgwF0hwk_ibqHuMyc9EC30qsJQPbRwmqEnexXLNg%40mail.gmail.com
> .
> >> >> > >>>
> >> >> > >>>
> >> >> > >>>
> >> >> > >>> --
> >> >> > >>> Patrick: Are they laughing at us?
> >> >> > >>> Sponge Bob: No, Patrick, they are laughing next to us.
> >> >> > >>>
> >> >> >
> >> >> >
> >> >> >
> >> >
> >>
> >>
> >>
> >
>
>
>
>
