Re: [opencog-dev] Indexing in the AtomSpace

Abdulrahman Semrie Tue, 01 Sep 2020 09:13:13 -0700


>
> So, without knowing the type, but only knowing the string name? My knee-jerk 
> reaction is you're doing something wrong, if you feel you need to do that. 
> You've mis-designed some data representation, somehow
>


I don’t understand how this can be due to a “mis-design” issue. It’s fairly 
common to index graph dbs using the name of the nodes (or other properties on 
the vertices/edges) and retrieve a node based on its name only. e.g - 
http://s3.thinkaurelius.com/docs/titan/1.0.0/indexes.html

I think there are three approaches to add indexing to the atomspace, each with 
its own pros and cons.

The first potential solution , as you and Kasim have suggested, is to use 
MemberLinks to “index” the atoms. This has the benefit that it doesn’t require 
any additional work to add to the current atomspace. But it has the following 
disadvantages:

a. It could result in doubling the number of atoms in atomspace - i.e more RAM 
usage. It is infeasible for large atomspaces

b. As you have explained earlier, using MemberLink means more graph search -> 
more cpu usage

c. This solution requires the user of the atomspace to manage the index. The 
user has to take care of adding/deleting MemberLinks when a node is 
added/deleted.

d. We can’t add index atoms by their values using this approach. For example, 
if we want to retrieve all atoms that have values with the key key1, it is not 
possible with this approach

The second solution is adding support for indices that are managed by the 
atomspace similar to that of TypeIndex 
(https://github.com/opencog/atomspace/blob/master/opencog/atomspace/TypeIndex.h)
 and having a way for users to define custom indices on atoms. This has the 
benefit that the user of the atomspace doesn’t have to manually handle 
insertion and removal of atoms for the indices. It also allows indexing atoms 
by their Values. The cons with this solution, in addition to the extra work 
required, is as we add more indices inserting and deleting atoms will be slower.

The third solution is using external search engines/dbs (such as ElasticSearch 
or Apache Solr) to store the indices of the atomspaces. This moves managing the 
indices from the atomspace to the dbs and will improve the search time without 
having slower write speeds. But this requires to have some interface code to 
connect the atomspace with external index store.

Combination 2 & 3 is what I have seen being mostly used in other databases. For 
example, in the variant annotation work I did couple of months ago, I used 
mongodb to store the genomic data which indexed data by their id and used 
elasticsearch to store the location and gene indices of the variants.

—

Regards,

Abdulrahman Semrie

> On Thursday, Aug 27, 2020 at 9:43 PM, Linas Vepstas <[email protected] 
> (mailto:[email protected])> wrote:
>
>
> On Thu, Aug 27, 2020 at 12:55 PM Abdulrahman Semrie <[email protected] 
> (mailto:[email protected])> wrote:
> > > A second is to create a UniProtNode and use that; queries are then simple 
> > > because you just ask for all UniprotNodes.
> >
> > We are already using this approach. We have added new, data-source specific 
> > types to the atomspace and we use those types in pattern matching query.
> >
> > > A third (recommended) way is to write (MemberLink (Node "Uniprot: 1234") 
> > > (Concept "the-set-of-all-uniprots"))
> >
> > can you please explain why this approach is recommended compared to the 
> > second one? Doesn't using this approach add many links that can be avoided 
> > by having a specific type?
>
> The second approach is problematic, because it requires a human being to 
> create, edit and publish an "atom_types.script" file to github ... and then 
> download, compile, install. That's a hassle.
>
> It is possible to dynamically add new Atom types at run-time, but the API's 
> for this are incomplete; this is not encouraged, because it raises a variety 
> of other technical issues (which I can review, but is a bit off-topic).
>
> The third approach (of using MemberLinks and/or EvaluationLinks) is 
> "recommended" because it can be done at run-time, and has a giant raft of 
> neat features, bells and whistles for doing all kinds of stuff. It's very 
> flexible.
>
> The disadvantage of using MemberLinks is slightly larger RAM usage: about 100 
> bytes for the link itself, and maybe another 50 bytes for the incoming-set 
> index. (roughly; it's hard to measure because these bytes are not all in one 
> place; some are in c++ std::set internal nodes, etc.) This implies that one 
> million MemberLinks require 150MB extra RAM which ... well, that seems 
> reasonable, these days. Defining a UniProtNode does save you this RAM.
>
> Another disadvantage of using MemberLinks during search is that it requires 
> one extra graph-walk step during pattern matching -- i.e. walking from (Node 
> "Uniprot: 1234") to the (Concept "the-set-of-all-uniprots")) via the 
> MemberLink. This walk does take many thousands of cycles and lord-knows how 
> many cpu-cache misses. Based on my measurements from January with the actual 
> bio-agi/mozi datasets in use at that time, I had the impression that this 
> extra step accounts for 5% to 25% extra CPU, depending on the query. 
> Something like that. And since the queries you are doing are extremely 
> common, frequent and cpu-intensive, it makes sense to hand-optimize and 
> performance-tune, i.e. to invent a new UniProtNode.
>
> But both of these considerations are very highly specific to the agi-bio/mozi 
> project, where you are willing to put in the human labor costs to optimize 
> for performance. More generally, one must recognize that human labor is 
> expensive, and CPU-cycles/RAM is cheap, so the generic answer has to be "buy 
> more RAM, buy more CPU" because that is cheaper than paying humans. That's 
> why the third way is the recommended way.
>
> >
> > > . unless you mean "can I ask if (Node "uniprot: 1234") exists, without 
> > > accidentally creating it if it does not?"
> >
> > More like "can I ask if any node with name "uniprot:1234" exists? If so, 
> > can you return that node."
> >
> > > you can do this from the C++, scheme and python API's, but you cannot do 
> > > this in Atomese.
> >
> > If I know the type and the name, yes I can do this from the C++, scheme and 
> > python - I'm actually doing this in the C++ code for the rpc server. But in 
> > the case I'm describing, I only know the name and not the type. And to 
> > create a Handle to retrieve the atom, I need both the type and the name.
>
> So, without knowing the type, but only knowing the string name? My knee-jerk 
> reaction is you're doing something wrong, if you feel you need to do that. 
> You've mis-designed some data representation, somehow.
>
> But if you really, really want to ... you can get a list of all atom types, 
> and then ask the atomspace, one type at a time, if it has a node with that 
> name.
>
> Another possibility is to "create an index" via the third (recommended) way:
>
> (MemberLink (ProteinNode "unprot: 1234") (Concept "things named 1234"))
> (MemberLink (GoNode "structure: 1234") (Concept "things named 1234"))
> (MemberLink (PathwayNode "pathway: 1234") (Concept "things named 1234"))
>
> And then there's RegexNode .... which is theoretically ugly, because it is a 
> crutch, a work-around to a mis-designed data representation. The theoretical 
> ugliness in a nut-shell: if you really want to use strings, then don't use 
> the atomspace, use perl. That' is what perl is for. Regex's are finite state 
> machines, and the comp-sci industry has well-developed theories of how 
> finite-state machines work, and what you can do with them. More generally, 
> there is a well-defined theory of string manipulation and string-rewriting... 
> adding this to the atomspace is .... its ... well, it's like trying to 
> surgically attach that machinery ... it's like, .. I dunno, building a car 
> with three wheels powered by a gasoline engine and one wheel powered by an 
> electric motor. You can do it, but why?
>
> --linas
>
> >
> > On Thursday, August 27, 2020 at 8:33:29 PM UTC+3 linas wrote:
> > > I just provided three different solutions to that task... -- linas
> > >
> > > On Thu, Aug 27, 2020 at 11:14 AM Ben Goertzel <[email protected]> wrote:
> > > >
> > > > I think perhaps what Xabush wants is to be able to query
> > > >
> > > > " Find me all Atoms whose name string contains the substring "ABDPDQ". "
> > > >
> > > > even if he doesn't know what types these Atoms may be ?
> > > >
> > > > ben
> > > > On Thu, Aug 27, 2020 at 9:09 AM Linas Vepstas <[email protected]> 
> > > > wrote:
> > > > > This statement I find confusing: "I can’t write a pattern matching 
> > > > > query to retrieve an atom using its id/name" There is one and only 
> > > > > one such atom, ever, by definition... There is nothing to query; if 
> > > > > you know the name, you know the atom.
> > > > >
> > > > > There was talk previously about "substring matching", for example, 
> > > > > you have atoms named "Uniprot: 1234" and "Uniprot: 5678" and you want 
> > > > > to find all atoms that start with the eight characters "Uniprot:". 
> > > > > There are (at least) three solutions for this. One is to create a 
> > > > > RegexNode, but this is ugly from a theoretical standpoint. A second 
> > > > > is to create a UniProtNode and use that; queries are then simple 
> > > > > because you just ask for all UniprotNodes. A third (recommended) way 
> > > > > is to write (MemberLink (Node "Uniprot: 1234") (Concept 
> > > > > "the-set-of-all-uniprots"))
> > > > >
> > > > > This third way is recommended because, in a sense, the atomspace is 
> > > > > nothing but one giant network of interconnected partial indexes. 
> > > > > There is an index from (Node "Uniprot: 1234") to everything that 
> > > > > makes use of it -- its called "the incoming set" and it is a real 
> > > > > index - a c++ std::set if I recall. Same for (Concept 
> > > > > "the-set-of-all-uniprots") and what the pattern matcher "actually 
> > > > > does" is to stitch together these partial indexes into a whole, and 
> > > > > then prune away the irrelevant parts.
> > > > >
> > > > > -- Linas
> > > > >
> > > > > ... unless you mean "can I ask if (Node "uniprot: 1234") exists, 
> > > > > without accidentally creating it if it does not?" ... you can do this 
> > > > > from the C++, scheme and python API's, but you cannot do this in 
> > > > > Atomese.
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Thu, Aug 27, 2020 at 4:07 AM Abdulrahman Semrie 
> > > > > <[email protected]> wrote:
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > TL;DR: you can already do that. It's already supported.
> > > > > > >
> > > > > >
> > > > > >
> > > > > > It’s partially supported. As you’ve described, we can cache the 
> > > > > > result of a pattern matching query and it is already supported. 
> > > > > > However, since I can’t write a pattern matching query to retrieve 
> > > > > > an atom using its id/name from the atomspace, there is no way to 
> > > > > > cache/index. If there was some ExistsLink that inherits from 
> > > > > > QueryLink where you can use to retrieve an atom by its name if it 
> > > > > > exists or return a false truth value, then what you’ve described 
> > > > > > can be done.
> > > > > >
> > > > > >
> > > > > > —
> > > > > >
> > > > > > Regards,
> > > > > >
> > > > > > Abdulrahman Semrie
> > > > > >
> > > > > > > On Thursday, Aug 27, 2020 at 2:46 AM, Linas Vepstas 
> > > > > > > <[email protected]> wrote:
> > > > > > > TL;DR: you can already do that. It's already supported.
> > > > > > >
> > > > > > > Please follow me on this train of thought.
> > > > > > >
> > > > > > > 1) What is an "index"? Well, its a pre-defined cache of all atoms 
> > > > > > > of some shape or pattern.
> > > > > > >
> > > > > > > 2) How can one specify an index? Well, if its a pattern, then a 
> > > > > > > pattern query can be used.
> > > > > > >
> > > > > > > 3) Where should the index be stored, or kept? Well, it can be 
> > > > > > > stored or kept with the pattern that defines the shape of the 
> > > > > > > index.
> > > > > > >
> > > > > > > Before I move on to the next thought, let me point out that 1-2-3 
> > > > > > > can be directly solved today. Define a pattern, e.g. a query 
> > > > > > > link. Run it. Store the results on the query, as a value. You can 
> > > > > > > "do this yourself", today, its easy, but it becomes even easier 
> > > > > > > if you are willing to read the docs for `cog-execute-cache!` 
> > > > > > > (appended below)
> > > > > > >
> > > > > > > 4) How should the index be updated? Ah, well, that is actually 
> > > > > > > the tricky question, the hard question, the place where all of 
> > > > > > > the interesting technology debates and thinking are centered. One 
> > > > > > > strategy is to update the index every single time an Atom is 
> > > > > > > added to/removed from the atomspace. But recomputing the index 
> > > > > > > every time is wildly inefficient, burning through vast quantities 
> > > > > > > of CPU time. What else can one do? Well, maybe recompute on 
> > > > > > > demand. Or recompute every few minutes. Or maybe once a night. 
> > > > > > > (aka "eventually consistent") Maybe store a time-stamp on the 
> > > > > > > index, to tell you how old it is. Or maybe have an append-only 
> > > > > > > log of atomspace changes... I can propose many different kinds of 
> > > > > > > solutions. They all have space and time-overhead, and/or assorted 
> > > > > > > usability issues. Which of these best suits your needs, I have 
> > > > > > > trouble guessing, so you would have to explain what the problem 
> > > > > > > is (if any).
> > > > > > >
> > > > > > > --linas
> > > > > > >
> > > > > > > Here's the docs:
> > > > > > > cog-execute-cache! EXEC KEY [METADATA [FRESH]]
> > > > > > >
> > > > > > > Execute or return cached execution results. This is a caching 
> > > > > > > version
> > > > > > > of the `cog-execute!` call.
> > > > > > >
> > > > > > > If the optional FRESH boolean flag is #f, then if there is a Value
> > > > > > > stored at KEY on EXEC, return that Value. The default value of 
> > > > > > > FRESH
> > > > > > > is #f, so the default behavior is always to return the cached 
> > > > > > > value.
> > > > > > > If the optional FRESH boolean flag is #t, or if there is no Value
> > > > > > > stored at KEY, then the `cog-execute!` function is called on EXEC,
> > > > > > > and the result is stored at KEY.
> > > > > > >
> > > > > > > The METADATA Atom is optional. If it is specified, then metadata
> > > > > > > about the execution is placed on EXEC at the key METADATA.
> > > > > > > Currently, this is just a timestamp of when this execution was
> > > > > > > performed. The format of the meta-data is subject to change; this
> > > > > > > is currently an experimental feature, driven by user requirements.
> > > > > > >
> > > > > > > At this time, execution is synchronous. It may be worthwhile to 
> > > > > > > have
> > > > > > > an asynchronous version of this call, where the execution is 
> > > > > > > performed
> > > > > > > at some other time. This has not been done yet.
> > > > > > >
> > > > > > > On Wed, Aug 26, 2020 at 7:41 AM Abdulrahman Semrie 
> > > > > > > <[email protected]> wrote:
> > > > > > > >
> > > > > > > > In the current atomspace, atoms are indexed by their type, i.e 
> > > > > > > > given a type we can retrieve all the atoms that have that type. 
> > > > > > > > But there is no other away of adding custom indices in the 
> > > > > > > > atomspace. For example, if we want to index nodes by their 
> > > > > > > > name, there is no way of doing this.
> > > > > > > >
> > > > > > > > As discussed in this issue 
> > > > > > > > (https://github.com/MOZI-AI/annotation-scheme/issues/192), we 
> > > > > > > > plan to expand the annotation-service, which uses the AtomSpace 
> > > > > > > > to store genomics data, to support the annotation of more types 
> > > > > > > > in addition to genes. Currently, when I user submits a list of 
> > > > > > > > ids to the service, it is assumed that these ids/symbols 
> > > > > > > > represent `GeneNode`s. But in the case where the input can be a 
> > > > > > > > protein, a drug molecule, pathway or a gene, there is no direct 
> > > > > > > > way of retrieving what type of the atom with the given name is 
> > > > > > > > unless we iterate through all atoms searching for that 
> > > > > > > > particular id. This isn't be a good approach from performance 
> > > > > > > > standpoint. But if we had a custom index - e.g `name_index`, on 
> > > > > > > > the ids/names of the atoms, it will be easier to search the 
> > > > > > > > atoms by name and identify the type that the atom belongs to.
> > > > > > > >
> > > > > > > > Hence, if there is a way to add custom indices to the 
> > > > > > > > atomspace, it will greatly simplify some searches. Or maybe 
> > > > > > > > there is a way to do what I described above without the need 
> > > > > > > > for an index. If so, please share it.
> > > > > > > > --
> > > > > > > > You received this message because you are subscribed to the 
> > > > > > > > Google Groups "opencog" group.
> > > > > > > > To unsubscribe from this group and stop receiving emails from 
> > > > > > > > it, send an email to [email protected].
> > > > > > > > To view this discussion on the web visit 
> > > > > > > > https://groups.google.com/d/msgid/opencog/27892502-0dfb-4042-a805-30a1520f6250n%40googlegroups.com
> > > > > > > >  
> > > > > > > > (https://groups.google.com/d/msgid/opencog/27892502-0dfb-4042-a805-30a1520f6250n%40googlegroups.com?utm_medium=email&utm_source=footer).
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Verbogeny is one of the pleasurettes of a creatific thinkerizer.
> > > > > > > --Peter da Silva
> > > > > > >
> > > > > > > --
> > > > > > > You received this message because you are subscribed to a topic 
> > > > > > > in the Google Groups "opencog" group.
> > > > > > > To unsubscribe from this topic, visit 
> > > > > > > https://groups.google.com/d/topic/opencog/5uE2lw6b-5E/unsubscribe.
> > > > > > > To unsubscribe from this group and all its topics, send an email 
> > > > > > > to [email protected].
> > > > > > > To view this discussion on the web visit 
> > > > > > > https://groups.google.com/d/msgid/opencog/CAHrUA34qoTA90pcSC3GwXsGy8xpK5yn-1U7k%2Ba10nuDTWcrBLQ%40mail.gmail.com
> > > > > > >  
> > > > > > > (https://groups.google.com/d/msgid/opencog/CAHrUA34qoTA90pcSC3GwXsGy8xpK5yn-1U7k%2Ba10nuDTWcrBLQ%40mail.gmail.com?utm_medium=email&utm_source=footer).
> > > > > > --
> > > > > > You received this message because you are subscribed to the Google 
> > > > > > Groups "opencog" group.
> > > > > > To unsubscribe from this group and stop receiving emails from it, 
> > > > > > send an email to [email protected].
> > > > > > To view this discussion on the web visit 
> > > > > > https://groups.google.com/d/msgid/opencog/2a5214b7-c083-40c0-801d-0a3595783046%40Canary
> > > > > >  
> > > > > > (https://groups.google.com/d/msgid/opencog/2a5214b7-c083-40c0-801d-0a3595783046%40Canary?utm_medium=email&utm_source=footer).
> > > > >
> > > > >
> > > > > --
> > > > > Verbogeny is one of the pleasurettes of a creatific thinkerizer.
> > > > > --Peter da Silva
> > > > >
> > > > > --
> > > > > You received this message because you are subscribed to the Google 
> > > > > Groups "opencog" group.
> > > > > To unsubscribe from this group and stop receiving emails from it, 
> > > > > send an email to [email protected].
> > > > > To view this discussion on the web visit 
> > > > > https://groups.google.com/d/msgid/opencog/CAHrUA37N%3Dbjr7QDQzS-uUpcwaSP%3D44QEYfkmUXQC9mrVEZATEQ%40mail.gmail.com
> > > > >  
> > > > > (https://groups.google.com/d/msgid/opencog/CAHrUA37N%3Dbjr7QDQzS-uUpcwaSP%3D44QEYfkmUXQC9mrVEZATEQ%40mail.gmail.com?utm_medium=email&utm_source=footer).
> > > >
> > > >
> > > > --
> > > > Ben Goertzel, PhD
> > > > http://goertzel.org
> > > >
> > > > “The only people for me are the mad ones, the ones who are mad to live, 
> > > > mad to talk, mad to be saved, desirous of everything at the same time, 
> > > > the ones who never yawn or say a commonplace thing, but burn, burn, 
> > > > burn like fabulous yellow roman candles exploding like spiders across 
> > > > the stars.” -- Jack Kerouac --
> > > > You received this message because you are subscribed to the Google 
> > > > Groups "opencog" group.
> > > > To unsubscribe from this group and stop receiving emails from it, send 
> > > > an email to [email protected].
> > > > To view this discussion on the web visit 
> > > > https://groups.google.com/d/msgid/opencog/CACYTDBeqdq0vixYq1M0kceBqyywkAvQMPsMOd51X-0V5Oagr2Q%40mail.gmail.com
> > > >  
> > > > (https://groups.google.com/d/msgid/opencog/CACYTDBeqdq0vixYq1M0kceBqyywkAvQMPsMOd51X-0V5Oagr2Q%40mail.gmail.com?utm_medium=email&utm_source=footer).
> > >
> > >
> > > --
> > > Verbogeny is one of the pleasurettes of a creatific thinkerizer.
> > > --Peter da Silva
> > >
> > --
> > You received this message because you are subscribed to the Google Groups 
> > "opencog" group.
> > To unsubscribe from this group and stop receiving emails from it, send an 
> > email to [email protected] 
> > (mailto:[email protected]).
> > To view this discussion on the web visit 
> > https://groups.google.com/d/msgid/opencog/8e6d763a-9b4d-4a68-810e-d6f16e80e118n%40googlegroups.com
> >  
> > (https://groups.google.com/d/msgid/opencog/8e6d763a-9b4d-4a68-810e-d6f16e80e118n%40googlegroups.com?utm_medium=email&utm_source=footer).
>
>
> --
> Verbogeny is one of the pleasurettes of a creatific thinkerizer.
> --Peter da Silva
>
> --
> You received this message because you are subscribed to a topic in the Google 
> Groups "opencog" group.
> To unsubscribe from this topic, visit 
> https://groups.google.com/d/topic/opencog/5uE2lw6b-5E/unsubscribe.
> To unsubscribe from this group and all its topics, send an email to 
> [email protected] 
> (mailto:[email protected]).
> To view this discussion on the web visit 
> https://groups.google.com/d/msgid/opencog/CAHrUA35Z%3DH1oSVFZ%3D-WTTATf4U9jhmfhMMAF6jNO1daTrbDXJg%40mail.gmail.com
>  
> (https://groups.google.com/d/msgid/opencog/CAHrUA35Z%3DH1oSVFZ%3D-WTTATf4U9jhmfhMMAF6jNO1daTrbDXJg%40mail.gmail.com?utm_medium=email&utm_source=footer).

-- 
You received this message because you are subscribed to the Google Groups 
"opencog" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/opencog/e4efdf67-88d3-4163-9e98-78363fc6ed0a%40Canary.

Re: [opencog-dev] Indexing in the AtomSpace

Reply via email to