Re: Atomspace RAM & CPU usage ... was Re: [opencog-dev] Indexing in the AtomSpace

Abdulrahman Semrie Tue, 08 Sep 2020 13:32:32 -0700

Great write up Linas!

I'm fuzzy on the formula for the size of vertex and edge tables on page 6. 
It'd be great if you added an explanation to make it more clear.


With regards to indexing, the benefit of using graph dbs for partial 
indices is clear. But I have one question  with regards to the current 
Atomspace design. In your example, you represent the departments as 
"privileged vertices" and connect them to their respective employee 
vertices. In the current AtomSpace, there is a TypeIndex which is 
represented using a hash table (std::unordered_multimap to be exact). Why 
not represent the types using vertices and connect every other atom to the 
type vertice it belongs to? Like you suggested above, something like 
(MemberLink (Concept "Uniprot:12233") (Concept "ProteinNode")). This will 
lead to some type vertices being "Supernodes" in that a single vertex will 
be connected to many vertices, perhaps millions of vertices. This will 
result in a performance issue with naive graph db representations because 
the outgoing set of the type vertices will be very large. Titandb solves 
this by having the concept of unidirectional edge where only the 
destination vertex is aware of its connection to the supernode. But looking 
at the hypergraph tables in the document, this problem is already solved. 
So why not use this approach for the TypeIndex?

Re: using MemberLinks as a way of indexing by name

Where do you think all of that RAM usage is going? Where do you think 
indexes are kept? The MemberLink maintains indexes in the incoming/outgoing 
sets, those are just c++ std::set and std::vector, respectively. If you 
create some other index, you are just moving around where the RAM is being 
used.  You're talking about shifting around the internal representation; 
you are not proposing anything that will actually decrease RAM usage. 

Correct me if I'm wrong but won't using std::unorder_multimap<string, 
Handle> will have less RAM usage than creating new ConceptNode and 
MemberLinks for indexing?
Also how about adding an api in the atomspace so that we can use external 
index stores like ElasticSearch/Apache Solr? This is especially useful if 
we want to do full-text search on atom names.  I can help with this 
integration if you think this idea is worthwhile.


On Friday, September 4, 2020 at 8:23:00 AM UTC+3 linas wrote:

> My apologies for the incessant email-bombing. Not sure how else to 
> communicate. Here is version 0.2 of the doc. I think (I hope) that the 
> pages 4-5-6 clearly demonstrate how hypergraphs  offer a RAM-usage 
> performance over ordinary graph stores. If you think you see flawed 
> arguments there, please let me know.  The new pages 8-9 on indexing then 
> explain how indexing on a conventional SQL/noSQL DB just happens to look 
> identical to a hypergraph. Bet no one ever told you that, before!
>
> The key point: for SQL/noSQL DB's, the indexes are always hidden away from 
> the user, and cannot be directly accessed, walked, manipulated; they are 
> internal-use-only by the DB.When you use a graph DB, you get direct access 
> to "indexes" as user-visible and user-controllable objects.   If you use a 
> hypergraph instead of a graph, then you also get a (large) space savings.
>
> To be written: how pattern-matching compares to tinkerpop/gremlin. The 
> theory of recursive algorithms (which is what pattern matching actually is, 
> under the covers). Specifying recursive algorithms with context free 
> grammars. Finally, parsing with context-free grammars. The 
> connectors/sheaves thing is the cornerstone for parsing  with context-free 
> grammars. The "secret sauce" here is (I think) that this can be done with 
> RAM/CPU consumption as good as hypergraph DB's. So the goal of writing this 
> is to spell out in painful detail all these steps.
>
> The source material for this is in 
> https://github.com/linas/atomspace/blob/sheaf-ram-cpu/opencog/sheaf/docs/ 
> which I'll merge when a bit farther along.
>
> --linas
>
> p.s. I am cc'ing the link-grammar mailing list because (1) all this sheaf 
> stuff is inspired by link-grammar, (2) perhaps I can convince Amir to offer 
> critical feedback. Amir, I suspect that if you read this, you might think, 
> "oh that Linas, he's kooky and is playing fast-and-loose with the facts" 
> but I think that spotting weaknesses and outright errors is best done early 
> & is something you're good at.
>
> On Thu, Sep 3, 2020 at 2:55 PM Linas Vepstas <[email protected]> wrote:
>
>> As threatened, I started writing up the ram/cpu usage for storing 
>> sheaves, hypergraphs and ordinary graphs. Unfortunately, it's like boiling 
>> the ocean, and is spiralling out of control.  Rather than presenting an 
>> overbearingly long text, here is draft version 0.1, it is seven pages long; 
>> you already know the first five pages; I draw your attention to page 6, 
>> where I compare RAM usage for hypergraph stores to ordinary graph stores, 
>> and then specialize it to the special case of genomics/proteomics datasets, 
>> which have that square-root-Zipf profile (which is also what wikipedia 
>> article connectivity has ... the square-root-Zipf is not biology-specific, 
>> its generic.)
>>
>> The conclusion is that graph stores are bloated and heavy and 
>> inefficient, compared to hypergraph stores. If you think I made a mistake 
>> somewhere, please call it out.   I often make mistakes or outright 
>> blunders. The point is, again, this stuff has real implications for 
>> usability and computability, e.g. for Ben's latest Hyperion ideas, and so 
>> it would be nice to get the facts and the numbers correct, so that it 
>> doesn't look like we're blowing fumes. 
>>
>> -- Linas
>>
>> On Wed, Sep 2, 2020 at 1:01 PM Linas Vepstas <[email protected]> wrote:
>>
>>>
>>>
>>> On Tue, Sep 1, 2020 at 11:13 AM Abdulrahman Semrie <[email protected]> 
>>> wrote:
>>>
>>>>
>>>>
>>>> So, without knowing the type, but only knowing the string name? My 
>>>> knee-jerk reaction is you're doing something wrong, if you feel you need 
>>>> to 
>>>> do that.  You've mis-designed some data representation, somehow
>>>>
>>>> I don’t understand how this can be due to a “mis-design” issue. It’s 
>>>> fairly common to index graph dbs using the name of the nodes (or other 
>>>> properties on the vertices/edges) and retrieve a node based on its name 
>>>> only. e.g - http://s3.thinkaurelius.com/docs/titan/1.0.0/indexes.html
>>>>
>>> I don't understand what you don't understand :-) If you need to search 
>>> something by name, why don't you put it all into the same name-space? Just 
>>> make everything a ConceptNode, then you're done.  The whole point of having 
>>> namespaces, is that they're namespaces, but now you want to not have 
>>> namespaces. -- You've got to kind of figure out whether you want to have 
>>> them and use them, or whether you don't want to have them. So I'm back to 
>>> the initial point: what are you actually trying to do?
>>>
>>> I think there are three approaches to add indexing to the atomspace, 
>>>> each with its own pros and cons. 
>>>>
>>>>    1. 
>>>>    
>>>>    The first potential solution , as you and Kasim have suggested, is 
>>>>    to use MemberLinks to “index” the atoms. This has the benefit that 
>>>>    it doesn’t require any additional work to add to the current atomspace. 
>>>> But 
>>>>    it has the following disadvantages:
>>>>    
>>>>    a. It could result in doubling the number of atoms in atomspace - 
>>>>    i.e more RAM usage. It is infeasible for large atomspaces
>>>>    
>>>> Why do you think it doubles? Where do you think all of that RAM usage 
>>> is going? Where do you think indexes are kept?  The MemberLink maintains 
>>> indexes in the incoming/outgoing sets, those are just c++ std::set and 
>>> std::vector, respectively. If you create some other index, you are just 
>>> moving around where the RAM is being used.  You're talking about shifting 
>>> around the internal representation; you are not proposing anything that 
>>> will actually decrease RAM usage. 
>>>
>>> Take a look at RAM and disk usage in one of these commercial databases. 
>>> Create a table. Put a milllion things into it. Place an index on the table; 
>>> RAM usage will double. Place two indexes on it. RAM usage will triple. It's 
>>> quite linear, and it's a worth-while experiment to perform, if you are 
>>> interested in databases.  It will provide some hints into how these systems 
>>> are actually designed, under the covers.
>>>
>>> (And not just RAM, disk, too, when stored to disk. Disk usage doubles, 
>>> triples ... as you add indexes.)
>>>
>>>
>>>>    1. 
>>>>    
>>>>    b. As you have explained earlier, using MemberLink means more graph 
>>>>    search -> more cpu usage
>>>>    
>>>>    c. This solution requires the user of the atomspace to manage the 
>>>>    index. The user has to take care of adding/deleting MemberLinks 
>>>>    when a node is added/deleted.
>>>>    
>>>> Nothing in life is free. It is not the "user" who is adding these "by 
>>> hand", it is a block of code. That block of code requires the expenditure 
>>> of CPU cycles. You can re-arrange the location where those CPU cycles are 
>>> spent, but you can't remove it.  
>>>
>>> Adding/deleting MemberLinks is trivial, so the complaint is unfounded.  
>>>
>>> Only you know what the structure of your data needs to be.  You have to 
>>> manage the structure of your data.
>>>
>>> As a general rule, you cannot force the AtomSpace to guess what you 
>>> wanted to have.  Trying to automate indexes is particularly damaging: for 
>>> every atom insertion/deletion, the "automatic" code would have to trigger 
>>> the indexer, analyze the kind of atom being inserted, and then make a 
>>> decision as to which indexes need to be updated.
>>>
>>> Again: take a commercial database, pick one, any one, and place a 
>>> million items into it. Then measure insert/delete performance.  Then place 
>>> an index on it; repeat. Put two indexes on it, repeat. As before, you will 
>>> discover that CPU time will double, triple, quadruple for each new index 
>>> that you add. It is very linear.  It's O(N) where N is the number of 
>>> indexes.
>>>
>>> The AtomSpace is very unique -- it offers you indexes that run at O(1) 
>>> instead of O(N).  This is because the atomspace design has a trick that few 
>>> other (maybe no other) databases offer: it has fractional indexes. aka 
>>> incoming/outgoing sets.  When you add a MemberLink, you update only a 
>>> handful of very small, per-atom indexes (three, to be precise: for (Member 
>>> A B) you update incoming-A, incoming-B and outgoing Member) Because they 
>>> are tiny, they fit on one CPU cache-line. Because they are tiny, there is 
>>> no tree-rebalancing. Because they are tiny, there are no hash-buckets.  
>>>
>>> But wait, there's more -- the most important point is that if you add 
>>> some other atom -- say (Concept "foo") -- then exactly ZERO index updates 
>>> are needed! Compare this to the commercial databases: if you have a 
>>> commercial database with N=5 indexes, and K=1000000 entries, you have to 
>>> update N indexes of size K, when you add "foo" to some table. CPU time will 
>>> be typically be O(N log K) on those commercial databases, vs O(1) for the 
>>> AtomSpace.
>>>
>>> Seriously, this is a big deal. The AtomSpace literally does something 
>>> that other SQL/noSQL databases literally cannot do. This is something that 
>>> "graph databases" can kind-of do, because (in principle) graph databases 
>>> can also maintain fractional indexes (that is what makes them graphs)  
>>> Whether they actually do this, I don't know. 
>>>
>>> This is one reason I'm working so hard on the "sheaf" theory -- the 
>>> sheaving shows you precisely how to manage RAM and CPU for the connectivity 
>>> of a graph.   I guess I now have to write yet another PDF in this series, 
>>> talking about indexes. The sheaves aren't just some willy-nilly 
>>> abstraction: they have real-life implications for RAM and CPU.
>>>
>>> This is just email, so if you are uncertain of what I write or don't 
>>> believe me, you can actually perform these experiments yourself, make the 
>>> measurements. Prove me wrong -- I'm not always right. There's always room 
>>> for improvement.  But I'm certain that, in this case, I'm right, because 
>>> I've, uhh, "been around the block a few times". Or, uhh, "been there, done 
>>> that". 
>>>
>>> Having well-done, well-documented measurements would be nice: we could 
>>> publish a white-paper comparing atomspace performance to performance of 
>>> database X. I'm fairly confident we'll come out looking pretty good.
>>>
>>>
>>>>    1. 
>>>>    
>>>>    d. We can’t add index atoms by their values using this approach. 
>>>>    For example, if we want to retrieve all atoms that have values with the 
>>>> key 
>>>>    key1, it is not possible with this approach
>>>>    2. 
>>>>    
>>>>    The second solution is adding support for indices that are managed 
>>>>    by the atomspace similar to that of TypeIndex 
>>>>    
>>>> <https://github.com/opencog/atomspace/blob/master/opencog/atomspace/TypeIndex.h>
>>>>  
>>>>    and having a way for users to define custom indices on atoms. This has 
>>>> the 
>>>>    benefit that the user of the atomspace doesn’t have to manually handle 
>>>>    insertion and removal of atoms for the indices. It also allows indexing 
>>>>    atoms by their Values. The cons with this solution, in addition to the 
>>>>    extra work required, is as we add more indices inserting and deleting 
>>>> atoms 
>>>>    will be slower.
>>>>    3. 
>>>>    
>>>>    The third solution is using external search engines/dbs (such as 
>>>>    ElasticSearch or Apache Solr) to store the indices of the atomspaces. 
>>>> This 
>>>>    moves managing the indices from the atomspace to the dbs and will 
>>>> improve 
>>>>    the search time without having slower write speeds. But this requires 
>>>> to 
>>>>    have some interface code to connect the atomspace with external index 
>>>> store.
>>>>    
>>>> Combination 2 & 3 is what I have seen being mostly used in other 
>>>> databases. For example, in the variant annotation work I did couple of 
>>>> months ago, I used mongodb to store the genomic data which indexed data by 
>>>> their id and used elasticsearch to store the location and gene indices of 
>>>> the variants. 
>>>>
>>>
>>> I've nothing more to say here. This is why "other databases" suck. 
>>> Seriously -- you should try to actually use some of them, before you tell 
>>> me how great they are. Spend a few weeks, spend a month. You'll be back. 
>>> And if you don't come back, .. well, hey, then clearly, the Atomspace was 
>>> not the solution for whatever problem you are trying to solve.  The 
>>> AtomSpace does not solve every problem. But the ones it does solve, it 
>>> solves better than anything else out there.
>>>
>>> --linas
>>>
>>> —
>>>>
>>>> Regards,
>>>>
>>>> Abdulrahman Semrie
>>>> <https://canarymail.io>
>>>>
>>>> On Thursday, Aug 27, 2020 at 9:43 PM, Linas Vepstas <
>>>> [email protected]> wrote:
>>>>
>>>>
>>>> On Thu, Aug 27, 2020 at 12:55 PM Abdulrahman Semrie <[email protected]> 
>>>> wrote:
>>>>
>>>>> > A second is to create a UniProtNode and use that; queries are then 
>>>>> simple because you just ask for all UniprotNodes.
>>>>>
>>>>> We are already using this approach. We have added new, data-source 
>>>>> specific types to the atomspace and we use those types in pattern 
>>>>> matching 
>>>>> query.
>>>>>
>>>>> > A third (recommended) way is to write  (MemberLink (Node "Uniprot: 
>>>>> 1234") (Concept "the-set-of-all-uniprots")) 
>>>>>
>>>>> can you please explain why this approach is recommended compared to 
>>>>> the second one? Doesn't using this approach add many links that can be 
>>>>> avoided by having a specific type?
>>>>>
>>>>
>>>> The second approach is problematic, because it requires a human being 
>>>> to create, edit and publish an "atom_types.script" file to github ... and 
>>>> then download, compile, install. That's a hassle. 
>>>>
>>>> It is possible to dynamically add new Atom types at run-time, but the 
>>>> API's for this are incomplete; this is not encouraged, because it raises a 
>>>> variety of other technical issues (which I can review, but is a bit 
>>>> off-topic).
>>>>
>>>> The third approach (of using MemberLinks and/or EvaluationLinks) is 
>>>> "recommended"  because it can be done at run-time, and has a giant raft of 
>>>> neat features, bells and whistles for doing all kinds of stuff. It's very 
>>>> flexible.
>>>>
>>>> The disadvantage of using MemberLinks is slightly larger RAM usage: 
>>>> about 100 bytes for the link itself, and maybe another 50 bytes for the 
>>>> incoming-set index. (roughly; it's hard to measure because these bytes are 
>>>> not all in one place; some are in c++ std::set internal  nodes, etc.)  
>>>> This 
>>>> implies that one million MemberLinks require 150MB extra RAM which ... 
>>>> well, that seems reasonable, these days.  Defining a UniProtNode does save 
>>>> you this RAM.
>>>>
>>>> Another disadvantage of using MemberLinks during search is that it 
>>>> requires one extra graph-walk step during pattern matching -- i.e. walking 
>>>> from (Node "Uniprot: 1234") to the (Concept "the-set-of-all-uniprots")) 
>>>> via 
>>>> the MemberLink. This walk does take many thousands of cycles and 
>>>> lord-knows 
>>>> how many cpu-cache misses.   Based on my measurements from January with 
>>>> the 
>>>> actual bio-agi/mozi datasets in use at that time, I had the impression 
>>>> that 
>>>> this extra step accounts for 5% to 25% extra CPU, depending on the query.  
>>>> Something like that. And since the queries you are doing are extremely 
>>>> common, frequent and cpu-intensive, it makes sense to hand-optimize and 
>>>> performance-tune, i.e. to invent a new UniProtNode. 
>>>>
>>>> But both of these considerations are very highly specific to the 
>>>> agi-bio/mozi project, where you are willing to put in the human labor 
>>>> costs 
>>>> to optimize for performance. More generally, one must recognize that human 
>>>> labor is expensive, and CPU-cycles/RAM is cheap, so the generic answer has 
>>>> to be "buy more RAM, buy more CPU" because that is cheaper than paying 
>>>> humans.  That's why the third way is the recommended way.
>>>>
>>>>
>>>>> > . unless you mean "can I ask if (Node "uniprot: 1234") exists, 
>>>>> without accidentally creating it if it does not?" 
>>>>>
>>>>> More like "can I ask if any node with name "uniprot:1234" exists? If 
>>>>> so, can you return that node."
>>>>>
>>>>> > you can do this from the C++, scheme and python API's, but you 
>>>>> cannot do this in Atomese.
>>>>>
>>>>> If I know the type and the name, yes I can do this from the C++, 
>>>>> scheme and python - I'm actually doing this in the C++ code for the rpc 
>>>>> server. But in the case I'm describing, I only know the name and not the 
>>>>> type. And to create a Handle to retrieve the atom, I need both the type 
>>>>> and 
>>>>> the name.
>>>>>
>>>>
>>>> So, without knowing the type, but only knowing the string name? My 
>>>> knee-jerk reaction is you're doing something wrong, if you feel you need 
>>>> to 
>>>> do that.  You've mis-designed some data representation, somehow. 
>>>>
>>>> But if you really, really want to ... you can get a list of all atom 
>>>> types, and then ask the atomspace, one type at a time, if it has a node 
>>>> with that name.
>>>>
>>>> Another possibility is to "create an index" via the third (recommended) 
>>>> way:
>>>>
>>>> (MemberLink (ProteinNode "unprot: 1234") (Concept "things named 1234"))
>>>> (MemberLink (GoNode "structure: 1234") (Concept "things named 1234"))
>>>> (MemberLink (PathwayNode "pathway: 1234") (Concept "things named 1234"))
>>>>
>>>> And then there's RegexNode .... which is theoretically ugly, because it 
>>>> is a crutch, a work-around to a mis-designed data representation. The 
>>>> theoretical ugliness in a nut-shell: if you really want to use strings, 
>>>> then don't use the atomspace, use perl. That' is what perl is for. Regex's 
>>>> are finite state machines, and the comp-sci industry has well-developed 
>>>> theories of how finite-state machines work, and what you can do with them. 
>>>> More generally, there is a well-defined theory of string manipulation and 
>>>> string-rewriting... adding this to the atomspace is .... its ...  well, 
>>>> it's like trying to surgically attach that machinery ... it's like, .. I 
>>>> dunno, building a car with three wheels powered by a gasoline engine and 
>>>> one wheel powered by an electric motor. You can do it, but why? 
>>>>
>>>> --linas
>>>>   
>>>>
>>>>>
>>>>> On Thursday, August 27, 2020 at 8:33:29 PM UTC+3 linas wrote:
>>>>>
>>>>>> I just provided three different solutions to that task... -- linas
>>>>>>
>>>>>> On Thu, Aug 27, 2020 at 11:14 AM Ben Goertzel <[email protected]> 
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>> I think perhaps what Xabush wants is to be able to query
>>>>>>>
>>>>>>> " Find me all Atoms whose name string contains the substring 
>>>>>>> "ABDPDQ".  "
>>>>>>>
>>>>>>> even if he doesn't know what types these Atoms may be ?
>>>>>>>
>>>>>>> ben
>>>>>>>
>>>>>>> On Thu, Aug 27, 2020 at 9:09 AM Linas Vepstas <[email protected]> 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> This statement I find confusing: "I can’t write a pattern matching 
>>>>>>>> query to retrieve an atom using its id/name" There is one and only one 
>>>>>>>> such 
>>>>>>>> atom, ever, by definition... There is nothing to query; if you know 
>>>>>>>> the 
>>>>>>>> name, you know the atom. 
>>>>>>>>
>>>>>>>> There was talk previously about "substring matching", for example, 
>>>>>>>> you have atoms  named "Uniprot: 1234" and "Uniprot: 5678" and you want 
>>>>>>>> to 
>>>>>>>> find all atoms that start with the eight characters "Uniprot:". There 
>>>>>>>> are 
>>>>>>>> (at least) three solutions for this. One is to create a RegexNode, but 
>>>>>>>> this 
>>>>>>>> is ugly from a theoretical standpoint. A second is to create a 
>>>>>>>> UniProtNode 
>>>>>>>> and use that; queries are then simple because you just ask for all 
>>>>>>>> UniprotNodes.  A third (recommended) way is to write  (MemberLink 
>>>>>>>> (Node 
>>>>>>>> "Uniprot: 1234") (Concept "the-set-of-all-uniprots")) 
>>>>>>>>
>>>>>>>> This third way is recommended because, in a sense, the atomspace is 
>>>>>>>> nothing but one giant network of interconnected partial indexes. There 
>>>>>>>> is 
>>>>>>>> an index from (Node "Uniprot: 1234") to everything that makes use of 
>>>>>>>> it -- 
>>>>>>>> its called "the incoming set" and it is a real index - a c++ std::set  
>>>>>>>> if I 
>>>>>>>> recall. Same for (Concept "the-set-of-all-uniprots") and what the 
>>>>>>>> pattern 
>>>>>>>> matcher "actually does" is to stitch together these partial indexes 
>>>>>>>> into a 
>>>>>>>> whole, and then prune away the irrelevant parts.
>>>>>>>>
>>>>>>>> -- Linas
>>>>>>>>
>>>>>>>> ... unless you mean "can I ask if (Node "uniprot: 1234") exists, 
>>>>>>>> without accidentally creating it if it does not?" ... you can do this 
>>>>>>>> from 
>>>>>>>> the C++, scheme and python API's, but you cannot do this in Atomese.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Aug 27, 2020 at 4:07 AM Abdulrahman Semrie <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> TL;DR: you can already do that.  It's already supported.
>>>>>>>>>
>>>>>>>>> It’s partially supported. As you’ve described, we can cache the 
>>>>>>>>> result of a pattern matching query and it is already supported. 
>>>>>>>>> However, 
>>>>>>>>> since I can’t write a pattern matching query to retrieve an atom 
>>>>>>>>> using its 
>>>>>>>>> id/name from the atomspace, there is no way to cache/index. If there 
>>>>>>>>> was 
>>>>>>>>> some ExistsLink that inherits from QueryLink where you can use to 
>>>>>>>>> retrieve an atom by its name if it exists or return a false truth 
>>>>>>>>> value, 
>>>>>>>>> then what you’ve described can be done. 
>>>>>>>>>
>>>>>>>>> —
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>>
>>>>>>>>> Abdulrahman Semrie
>>>>>>>>> <https://canarymail.io>
>>>>>>>>>
>>>>>>>>> On Thursday, Aug 27, 2020 at 2:46 AM, Linas Vepstas <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>> TL;DR: you can already do that.  It's already supported.
>>>>>>>>>
>>>>>>>>> Please follow me on this train of thought.
>>>>>>>>>
>>>>>>>>> 1) What is an "index"? Well, its a pre-defined cache of all atoms 
>>>>>>>>> of some shape or pattern.
>>>>>>>>>
>>>>>>>>> 2) How can one specify an index?  Well, if its a pattern, then a 
>>>>>>>>> pattern query can be used.
>>>>>>>>>
>>>>>>>>> 3) Where should the index be stored, or kept? Well, it can be 
>>>>>>>>> stored or kept with the pattern that defines the shape of the index.
>>>>>>>>>
>>>>>>>>> Before I move on to the next thought, let me point out that 1-2-3 
>>>>>>>>> can be directly solved today. Define a pattern, e.g. a query link. 
>>>>>>>>> Run it. 
>>>>>>>>> Store the results on the query, as a value. You can "do this 
>>>>>>>>> yourself", 
>>>>>>>>> today, its easy, but it becomes even easier if you are willing to 
>>>>>>>>> read the 
>>>>>>>>> docs for `cog-execute-cache!` (appended below)
>>>>>>>>>
>>>>>>>>> 4) How should the index be updated? Ah, well, that is actually the 
>>>>>>>>> tricky question, the hard question, the place where all of the 
>>>>>>>>> interesting 
>>>>>>>>> technology debates and thinking are centered.  One strategy is to 
>>>>>>>>> update 
>>>>>>>>> the index every single time an Atom is added to/removed from the 
>>>>>>>>> atomspace. 
>>>>>>>>> But recomputing the index every time is wildly inefficient, burning 
>>>>>>>>> through 
>>>>>>>>> vast quantities of CPU time. What else can one do? Well, maybe 
>>>>>>>>> recompute on 
>>>>>>>>> demand. Or recompute every few minutes. Or maybe once a night. (aka 
>>>>>>>>> "eventually consistent")  Maybe store a time-stamp on the index, to 
>>>>>>>>> tell 
>>>>>>>>> you how old it is. Or maybe have an append-only log of atomspace 
>>>>>>>>> changes... 
>>>>>>>>> I can propose many different kinds of solutions. They all have space 
>>>>>>>>> and 
>>>>>>>>> time-overhead, and/or assorted usability issues. Which of these best 
>>>>>>>>> suits 
>>>>>>>>> your needs, I have trouble guessing, so you would have to explain 
>>>>>>>>> what the 
>>>>>>>>> problem is (if any).
>>>>>>>>>
>>>>>>>>> --linas
>>>>>>>>>
>>>>>>>>> Here's the docs:
>>>>>>>>>  cog-execute-cache! EXEC KEY [METADATA [FRESH]]
>>>>>>>>>
>>>>>>>>>    Execute or return cached execution results. This is a caching 
>>>>>>>>> version
>>>>>>>>>    of the `cog-execute!` call.
>>>>>>>>>
>>>>>>>>>    If the optional FRESH boolean flag is #f, then if there is a 
>>>>>>>>> Value
>>>>>>>>>    stored at KEY on EXEC, return that Value. The default value of 
>>>>>>>>> FRESH
>>>>>>>>>    is #f, so the default behavior is always to return the cached 
>>>>>>>>> value.
>>>>>>>>>    If the optional FRESH boolean flag is #t, or if there is no 
>>>>>>>>> Value
>>>>>>>>>    stored at KEY, then the `cog-execute!` function is called on 
>>>>>>>>> EXEC,
>>>>>>>>>    and the result is stored at KEY.
>>>>>>>>>
>>>>>>>>>    The METADATA Atom is optional.  If it is specified, then 
>>>>>>>>> metadata
>>>>>>>>>    about the execution is placed on EXEC at the key METADATA.
>>>>>>>>>    Currently, this is just a timestamp of when this execution was
>>>>>>>>>    performed. The format of the meta-data is subject to change; 
>>>>>>>>> this
>>>>>>>>>    is currently an experimental feature, driven by user 
>>>>>>>>> requirements.
>>>>>>>>>
>>>>>>>>>    At this time, execution is synchronous. It may be worthwhile to 
>>>>>>>>> have
>>>>>>>>>    an asynchronous version of this call, where the execution is 
>>>>>>>>> performed
>>>>>>>>>    at some other time. This has not been done yet.
>>>>>>>>>
>>>>>>>>> On Wed, Aug 26, 2020 at 7:41 AM Abdulrahman Semrie <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> In the current atomspace, atoms are indexed by their type, i.e 
>>>>>>>>>> given a type we can retrieve all the atoms that have that type. But 
>>>>>>>>>> there 
>>>>>>>>>> is no other away of adding custom indices in the atomspace. For 
>>>>>>>>>> example, if 
>>>>>>>>>> we want to index nodes by their name, there is no way of doing this. 
>>>>>>>>>>
>>>>>>>>>> As discussed in this issue 
>>>>>>>>>> <https://github.com/MOZI-AI/annotation-scheme/issues/192>, we 
>>>>>>>>>> plan to expand the annotation-service, which uses the AtomSpace to 
>>>>>>>>>> store 
>>>>>>>>>> genomics data, to support the annotation of more types in addition 
>>>>>>>>>> to 
>>>>>>>>>> genes. Currently, when I user submits a list of ids to the service, 
>>>>>>>>>> it is 
>>>>>>>>>> assumed that these ids/symbols represent `GeneNode`s. But in the 
>>>>>>>>>> case where 
>>>>>>>>>> the input can be a protein, a drug molecule, pathway or a gene, 
>>>>>>>>>> there is no 
>>>>>>>>>> direct way of retrieving what type of the atom with the given name 
>>>>>>>>>> is 
>>>>>>>>>> unless we iterate through all atoms searching for that particular 
>>>>>>>>>> id. This 
>>>>>>>>>> isn't be a good approach from performance standpoint. But if we had 
>>>>>>>>>> a 
>>>>>>>>>> custom index - e.g `name_index`, on the ids/names of the atoms, it 
>>>>>>>>>> will be 
>>>>>>>>>> easier to search the atoms by name and identify the type that the 
>>>>>>>>>> atom 
>>>>>>>>>> belongs to. 
>>>>>>>>>>
>>>>>>>>>> Hence, if there is a way to add custom indices to the atomspace, 
>>>>>>>>>> it will greatly simplify some searches. Or maybe there is a way to 
>>>>>>>>>> do what 
>>>>>>>>>> I described above without the need for an index. If so, please share 
>>>>>>>>>> it.
>>>>>>>>>>
>>>>>>>>>> -- 
>>>>>>>>>> You received this message because you are subscribed to the 
>>>>>>>>>> Google Groups "opencog" group.
>>>>>>>>>> To unsubscribe from this group and stop receiving emails from it, 
>>>>>>>>>> send an email to [email protected].
>>>>>>>>>> To view this discussion on the web visit 
>>>>>>>>>> https://groups.google.com/d/msgid/opencog/27892502-0dfb-4042-a805-30a1520f6250n%40googlegroups.com
>>>>>>>>>>  
>>>>>>>>>> <https://groups.google.com/d/msgid/opencog/27892502-0dfb-4042-a805-30a1520f6250n%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>>>>> .
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> -- 
>>>>>>>>> Verbogeny is one of the pleasurettes of a creatific thinkerizer.
>>>>>>>>>         --Peter da Silva
>>>>>>>>>
>>>>>>>>> -- 
>>>>>>>>> You received this message because you are subscribed to a topic in 
>>>>>>>>> the Google Groups "opencog" group.
>>>>>>>>> To unsubscribe from this topic, visit 
>>>>>>>>> https://groups.google.com/d/topic/opencog/5uE2lw6b-5E/unsubscribe.
>>>>>>>>> To unsubscribe from this group and all its topics, send an email 
>>>>>>>>> to [email protected].
>>>>>>>>> To view this discussion on the web visit 
>>>>>>>>> https://groups.google.com/d/msgid/opencog/CAHrUA34qoTA90pcSC3GwXsGy8xpK5yn-1U7k%2Ba10nuDTWcrBLQ%40mail.gmail.com
>>>>>>>>>  
>>>>>>>>> <https://groups.google.com/d/msgid/opencog/CAHrUA34qoTA90pcSC3GwXsGy8xpK5yn-1U7k%2Ba10nuDTWcrBLQ%40mail.gmail.com?utm_medium=email&utm_source=footer>
>>>>>>>>> .
>>>>>>>>>
>>>>>>>>> -- 
>>>>>>>>> You received this message because you are subscribed to the Google 
>>>>>>>>> Groups "opencog" group.
>>>>>>>>> To unsubscribe from this group and stop receiving emails from it, 
>>>>>>>>> send an email to [email protected].
>>>>>>>>> To view this discussion on the web visit 
>>>>>>>>> https://groups.google.com/d/msgid/opencog/2a5214b7-c083-40c0-801d-0a3595783046%40Canary
>>>>>>>>>  
>>>>>>>>> <https://groups.google.com/d/msgid/opencog/2a5214b7-c083-40c0-801d-0a3595783046%40Canary?utm_medium=email&utm_source=footer>
>>>>>>>>> .
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> -- 
>>>>>>>> Verbogeny is one of the pleasurettes of a creatific thinkerizer.
>>>>>>>>         --Peter da Silva
>>>>>>>>
>>>>>>>> -- 
>>>>>>>> You received this message because you are subscribed to the Google 
>>>>>>>> Groups "opencog" group.
>>>>>>>> To unsubscribe from this group and stop receiving emails from it, 
>>>>>>>> send an email to [email protected].
>>>>>>>> To view this discussion on the web visit 
>>>>>>>> https://groups.google.com/d/msgid/opencog/CAHrUA37N%3Dbjr7QDQzS-uUpcwaSP%3D44QEYfkmUXQC9mrVEZATEQ%40mail.gmail.com
>>>>>>>>  
>>>>>>>> <https://groups.google.com/d/msgid/opencog/CAHrUA37N%3Dbjr7QDQzS-uUpcwaSP%3D44QEYfkmUXQC9mrVEZATEQ%40mail.gmail.com?utm_medium=email&utm_source=footer>
>>>>>>>> .
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> -- 
>>>>>>> Ben Goertzel, PhD
>>>>>>> http://goertzel.org
>>>>>>>
>>>>>>> “The only people for me are the mad ones, the ones who are mad to 
>>>>>>> live, mad to talk, mad to be saved, desirous of everything at the same 
>>>>>>> time, the ones who never yawn or say a commonplace thing, but burn, 
>>>>>>> burn, 
>>>>>>> burn like fabulous yellow roman candles exploding like spiders across 
>>>>>>> the 
>>>>>>> stars.” -- Jack Kerouac
>>>>>>>
>>>>>>> -- 
>>>>>>> You received this message because you are subscribed to the Google 
>>>>>>> Groups "opencog" group.
>>>>>>> To unsubscribe from this group and stop receiving emails from it, 
>>>>>>> send an email to [email protected].
>>>>>>>
>>>>>> To view this discussion on the web visit 
>>>>>>> https://groups.google.com/d/msgid/opencog/CACYTDBeqdq0vixYq1M0kceBqyywkAvQMPsMOd51X-0V5Oagr2Q%40mail.gmail.com
>>>>>>>  
>>>>>>> <https://groups.google.com/d/msgid/opencog/CACYTDBeqdq0vixYq1M0kceBqyywkAvQMPsMOd51X-0V5Oagr2Q%40mail.gmail.com?utm_medium=email&utm_source=footer>
>>>>>>> .
>>>>>>>
>>>>>>
>>>>>>
>>>>>> -- 
>>>>>> Verbogeny is one of the pleasurettes of a creatific thinkerizer.
>>>>>>         --Peter da Silva
>>>>>>
>>>>>> -- 
>>>>> You received this message because you are subscribed to the Google 
>>>>> Groups "opencog" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>>> an email to [email protected].
>>>>> To view this discussion on the web visit 
>>>>> https://groups.google.com/d/msgid/opencog/8e6d763a-9b4d-4a68-810e-d6f16e80e118n%40googlegroups.com
>>>>>  
>>>>> <https://groups.google.com/d/msgid/opencog/8e6d763a-9b4d-4a68-810e-d6f16e80e118n%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>> .
>>>>>
>>>>
>>>>
>>>> -- 
>>>> Verbogeny is one of the pleasurettes of a creatific thinkerizer.
>>>>         --Peter da Silva
>>>>
>>>> -- 
>>>> You received this message because you are subscribed to a topic in the 
>>>> Google Groups "opencog" group.
>>>> To unsubscribe from this topic, visit 
>>>> https://groups.google.com/d/topic/opencog/5uE2lw6b-5E/unsubscribe.
>>>> To unsubscribe from this group and all its topics, send an email to 
>>>> [email protected].
>>>> To view this discussion on the web visit 
>>>> https://groups.google.com/d/msgid/opencog/CAHrUA35Z%3DH1oSVFZ%3D-WTTATf4U9jhmfhMMAF6jNO1daTrbDXJg%40mail.gmail.com
>>>>  
>>>> <https://groups.google.com/d/msgid/opencog/CAHrUA35Z%3DH1oSVFZ%3D-WTTATf4U9jhmfhMMAF6jNO1daTrbDXJg%40mail.gmail.com?utm_medium=email&utm_source=footer>
>>>> .
>>>>
>>>> -- 
>>>> You received this message because you are subscribed to the Google 
>>>> Groups "opencog" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>> an email to [email protected].
>>>> To view this discussion on the web visit 
>>>> https://groups.google.com/d/msgid/opencog/e4efdf67-88d3-4163-9e98-78363fc6ed0a%40Canary
>>>>  
>>>> <https://groups.google.com/d/msgid/opencog/e4efdf67-88d3-4163-9e98-78363fc6ed0a%40Canary?utm_medium=email&utm_source=footer>
>>>> .
>>>>
>>>
>>>
>>> -- 
>>> Verbogeny is one of the pleasurettes of a creatific thinkerizer.
>>>         --Peter da Silva
>>>
>>>
>>
>> -- 
>> Verbogeny is one of the pleasurettes of a creatific thinkerizer.
>>         --Peter da Silva
>>
>>
>
> -- 
> Verbogeny is one of the pleasurettes of a creatific thinkerizer.
>         --Peter da Silva
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"opencog" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/opencog/43cd0ac1-3764-4197-8b19-3818cda4fb63n%40googlegroups.com.

Re: Atomspace RAM & CPU usage ... was Re: [opencog-dev] Indexing in the AtomSpace

Reply via email to