Re: [opencog-dev] Indexing in the AtomSpace

Linas Vepstas Thu, 27 Aug 2020 11:43:31 -0700

On Thu, Aug 27, 2020 at 12:55 PM Abdulrahman Semrie <[email protected]>
wrote:


> > A second is to create a UniProtNode and use that; queries are then
> simple because you just ask for all UniProtNodes.
>
> We are already using this approach. We have added new, data-source
> specific types to the atomspace and we use those types in pattern matching
> query.
>
> > A third (recommended) way is to write  (MemberLink (Node "Uniprot:
> 1234") (Concept "the-set-of-all-uniprots"))
>
> can you please explain why this approach is recommended compared to the
> second one? Doesn't using this approach add many links that can be avoided
> by having a specific type?
>

The second approach is problematic, because it requires a human being to
create, edit and publish an "atom_types.script" file to GitHub ... and then
download, compile, install. That's a hassle.

It is possible to dynamically add new Atom types at run-time, but the APIs
for this are incomplete; this is not encouraged, because it raises a
variety of other technical issues (which I can review, but that is a bit
off-topic).

The third approach (of using MemberLinks and/or EvaluationLinks) is
"recommended"  because it can be done at run-time, and has a giant raft of
neat features, bells and whistles for doing all kinds of stuff. It's very
flexible.

The disadvantage of using MemberLinks is slightly larger RAM usage: about
100 bytes for the link itself, and maybe another 50 bytes for the
incoming-set index. (Roughly; it's hard to measure because these bytes are
not all in one place; some are in C++ std::set internal nodes, etc.)  This
implies that one million MemberLinks require about 150MB of extra RAM,
which ... well, that seems reasonable, these days.  Defining a UniProtNode
does save you this RAM.
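
To make that arithmetic concrete, here is a minimal sketch in Python. The
100-byte and 50-byte figures are the rough estimates quoted above, not
measured constants, and the function name is made up for illustration.

```python
# Back-of-the-envelope estimate using the rough per-link figures quoted above.
LINK_BYTES = 100           # approximate size of one MemberLink (an estimate)
INCOMING_SET_BYTES = 50    # approximate incoming-set overhead per link (an estimate)


def member_link_overhead_mb(num_links: int) -> float:
    """Estimated extra RAM, in decimal megabytes, for num_links MemberLinks."""
    return num_links * (LINK_BYTES + INCOMING_SET_BYTES) / 1_000_000


for n in (100_000, 1_000_000, 10_000_000):
    print(f"{n:>10,} links -> ~{member_link_overhead_mb(n):,.0f} MB")
```

The estimate scales linearly, so the overhead only starts to matter when
the number of MemberLinks runs into the tens of millions.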

Another disadvantage of using MemberLinks during search is that it requires
one extra graph-walk step during pattern matching -- i.e. walking from
(Node "Uniprot: 1234") to the (Concept "the-set-of-all-uniprots") via the
MemberLink. This walk does take many thousands of cycles and lord-knows how
many cpu-cache misses.  Based on my measurements from January with the
actual agi-bio/mozi datasets in use at that time, I had the impression that
this extra step accounts for 5% to 25% extra CPU, depending on the query.
Something like that. And since the queries you are doing are extremely
common, frequent and cpu-intensive, it makes sense to hand-optimize and
performance-tune, i.e. to invent a new UniProtNode.
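
The difference between the two lookups can be sketched with a toy model in
Python. This is not the AtomSpace API: atoms are plain tuples, and the
`by_type` and `incoming` dictionaries are hypothetical stand-ins for the
type index and the incoming set. It only illustrates where the extra hop
through the MemberLink comes from.

```python
from collections import defaultdict

# Toy stand-in for the AtomSpace: atoms are plain tuples, not real Atoms.
by_type = defaultdict(set)    # type index: atom type -> atoms of that type
incoming = defaultdict(set)   # incoming set: atom -> links that contain it


def add_node(atom_type, name):
    atom = (atom_type, name)
    by_type[atom_type].add(atom)
    return atom


def add_link(link_type, *outgoing):
    link = (link_type, *outgoing)
    by_type[link_type].add(link)
    for atom in outgoing:
        incoming[atom].add(link)
    return link


# Second approach: a dedicated type, answered straight from the type index.
for name in ("Uniprot: 1234", "Uniprot: 5678"):
    add_node("UniProtNode", name)
typed = {name for (_, name) in by_type["UniProtNode"]}

# Third approach: generic Nodes plus MemberLinks. The query starts at the
# Concept and walks through each MemberLink to reach the member node.
all_uniprots = add_node("Concept", "the-set-of-all-uniprots")
for name in ("Uniprot: 1234", "Uniprot: 5678"):
    add_link("MemberLink", add_node("Node", name), all_uniprots)
walked = {
    link[1][1]  # name of the member node
    for link in incoming[all_uniprots]
    if link[0] == "MemberLink" and link[2] == all_uniprots
}

print(sorted(typed))
print(sorted(walked))
```

Both queries return the same names; the second one simply touches one more
object per result, which is the extra step described above.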

But both of these considerations are highly specific to the
agi-bio/mozi project, where you are willing to put in the human labor costs
to optimize for performance. More generally, one must recognize that human
labor is expensive, and CPU cycles and RAM are cheap, so the generic answer
has to be "buy more RAM, buy more CPU" because that is cheaper than paying
humans.  That's why the third way is the recommended way.


> > ... unless you mean "can I ask if (Node "uniprot: 1234") exists, without
> accidentally creating it if it does not?"
>
> More like "can I ask if any node with name "uniprot:1234" exists? If so,
> can you return that node."
>
> > you can do this from the C++, scheme and python APIs, but you cannot do
> this in Atomese.
>
> If I know the type and the name, yes I can do this from the C++, scheme
> and python - I'm actually doing this in the C++ code for the rpc server.
> But in the case I'm describing, I only know the name and not the type. And
> to create a Handle to retrieve the atom, I need both the type and the name.
>

So, without knowing the type, but only knowing the string name? My
knee-jerk reaction is you're doing something wrong, if you feel you need to
do that.  You've mis-designed some data representation, somehow.

But if you really, really want to ... you can get a list of all atom types,
and then ask the atomspace, one type at a time, if it has a node with that
name.
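
As a rough illustration of that brute-force lookup, here is a toy Python
model. It does not call the AtomSpace; the `nodes` set and the type names
in it are made-up sample data, and `find_by_name` is a hypothetical helper.

```python
# Toy model: every node is identified by its (type, name) pair.
nodes = {
    ("GeneNode", "1234"),
    ("ProteinNode", "1234"),
    ("PathwayNode", "5678"),
}

# The list of all known node types, gathered once up front.
node_types = sorted({atom_type for atom_type, _ in nodes})


def find_by_name(name):
    """Ask, one type at a time, whether a node with this name exists."""
    return [(t, name) for t in node_types if (t, name) in nodes]


print(find_by_name("1234"))
print(find_by_name("9999"))
```

The cost is one existence check per type, so it grows with the number of
types rather than with the number of atoms.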

Another possibility is to "create an index" via the third (recommended) way:

(MemberLink (ProteinNode "uniprot: 1234") (Concept "things named 1234"))
(MemberLink (GoNode "structure: 1234") (Concept "things named 1234"))
(MemberLink (PathwayNode "pathway: 1234") (Concept "things named 1234"))
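
The effect of those three links can be sketched with a toy Python model.
Again, this is not the AtomSpace API: the `members` dictionary is a
hypothetical stand-in for the incoming set of each Concept, and
`member_link` and `lookup` are made-up helper names.

```python
from collections import defaultdict

# Toy incoming-set index: Concept name -> member atoms linked to it.
members = defaultdict(set)


def member_link(atom, concept_name):
    """Record the equivalent of (MemberLink atom (Concept concept_name))."""
    members[concept_name].add(atom)


member_link(("ProteinNode", "uniprot: 1234"), "things named 1234")
member_link(("GoNode", "structure: 1234"), "things named 1234")
member_link(("PathwayNode", "pathway: 1234"), "things named 1234")


def lookup(bare_name):
    """Return every (type, name) atom filed under the bare identifier."""
    # .get() avoids creating an empty entry for names that were never indexed.
    return sorted(members.get(f"things named {bare_name}", set()))


for atom_type, name in lookup("1234"):
    print(atom_type, name)
```

A single lookup on the bare identifier returns the atoms of every type
filed under it, so the caller learns the type without knowing it up front.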

And then there's RegexNode .... which is theoretically ugly, because it is
a crutch, a work-around to a mis-designed data representation. The
theoretical ugliness in a nut-shell: if you really want to use strings,
then don't use the atomspace, use perl. That is what perl is for. Regexes
are finite state machines, and the comp-sci industry has well-developed
theories of how finite-state machines work, and what you can do with them.
More generally, there is a well-defined theory of string manipulation and
string-rewriting... adding this to the atomspace is .... it's ...  well,
it's like trying to surgically attach that machinery ... it's like, .. I
dunno, building a car with three wheels powered by a gasoline engine and
one wheel powered by an electric motor. You can do it, but why?

--linas


>
> On Thursday, August 27, 2020 at 8:33:29 PM UTC+3 linas wrote:
>
>> I just provided three different solutions to that task... -- linas
>>
>> On Thu, Aug 27, 2020 at 11:14 AM Ben Goertzel <[email protected]> wrote:
>>
>>>
>>> I think perhaps what Xabush wants is to be able to query
>>>
>>> " Find me all Atoms whose name string contains the substring "ABDPDQ".  "
>>>
>>> even if he doesn't know what types these Atoms may be ?
>>>
>>> ben
>>>
>>> On Thu, Aug 27, 2020 at 9:09 AM Linas Vepstas <[email protected]>
>>> wrote:
>>>
>>>> This statement I find confusing: "I can’t write a pattern matching
>>>> query to retrieve an atom using its id/name" There is one and only one such
>>>> atom, ever, by definition... There is nothing to query; if you know the
>>>> name, you know the atom.
>>>>
>>>> There was talk previously about "substring matching", for example, you
>>>> have atoms  named "Uniprot: 1234" and "Uniprot: 5678" and you want to find
>>>> all atoms that start with the eight characters "Uniprot:". There are (at
>>>> least) three solutions for this. One is to create a RegexNode, but this is
>>>> ugly from a theoretical standpoint. A second is to create a UniProtNode and
>>>> use that; queries are then simple because you just ask for all
>>>> UniProtNodes.  A third (recommended) way is to write  (MemberLink (Node
>>>> "Uniprot: 1234") (Concept "the-set-of-all-uniprots"))
>>>>
>>>> This third way is recommended because, in a sense, the atomspace is
>>>> nothing but one giant network of interconnected partial indexes. There is
>>>> an index from (Node "Uniprot: 1234") to everything that makes use of it --
>>>> it's called "the incoming set" and it is a real index - a C++ std::set, if I
>>>> recall. Same for (Concept "the-set-of-all-uniprots"). What the pattern
>>>> matcher "actually does" is to stitch together these partial indexes into a
>>>> whole, and then prune away the irrelevant parts.
>>>>
>>>> -- Linas
>>>>
>>>> ... unless you mean "can I ask if (Node "uniprot: 1234") exists,
>>>> without accidentally creating it if it does not?" ... you can do this from
>>>> the C++, scheme and python APIs, but you cannot do this in Atomese.
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, Aug 27, 2020 at 4:07 AM Abdulrahman Semrie <[email protected]>
>>>> wrote:
>>>>
>>>>>
>>>>>
>>>>> > TL;DR: you can already do that.  It's already supported.
>>>>>
>>>>> It’s partially supported. As you’ve described, we can cache the result
>>>>> of a pattern matching query and it is already supported. However, since I
>>>>> can’t write a pattern matching query to retrieve an atom using its id/name
>>>>> from the atomspace, there is no way to cache/index. If there were some
>>>>> ExistsLink, inheriting from QueryLink, that could be used to retrieve
>>>>> an atom by its name if it exists, or return a false truth value
>>>>> otherwise, then what you’ve described could be done.
>>>>>
>>>>> —
>>>>>
>>>>> Regards,
>>>>>
>>>>> Abdulrahman Semrie
>>>>> <https://canarymail.io>
>>>>>
>>>>> On Thursday, Aug 27, 2020 at 2:46 AM, Linas Vepstas <
>>>>> [email protected]> wrote:
>>>>> TL;DR: you can already do that.  It's already supported.
>>>>>
>>>>> Please follow me on this train of thought.
>>>>>
>>>>> 1) What is an "index"? Well, it's a pre-defined cache of all atoms of
>>>>> some shape or pattern.
>>>>>
>>>>> 2) How can one specify an index?  Well, if it's a pattern, then a
>>>>> pattern query can be used.
>>>>>
>>>>> 3) Where should the index be stored, or kept? Well, it can be stored
>>>>> or kept with the pattern that defines the shape of the index.
>>>>>
>>>>> Before I move on to the next thought, let me point out that 1-2-3 can
>>>>> be directly solved today. Define a pattern, e.g. a query link. Run it.
>>>>> Store the results on the query, as a value. You can "do this yourself",
>>>>> today, it's easy, but it becomes even easier if you are willing to read the
>>>>> docs for `cog-execute-cache!` (appended below)
>>>>>
>>>>> 4) How should the index be updated? Ah, well, that is actually the
>>>>> tricky question, the hard question, the place where all of the interesting
>>>>> technology debates and thinking are centered.  One strategy is to update
>>>>> the index every single time an Atom is added to/removed from the
>>>>> atomspace. But recomputing the index every time is wildly inefficient,
>>>>> burning through vast quantities of CPU time. What else can one do? Well,
>>>>> maybe recompute on demand. Or recompute every few minutes. Or maybe once
>>>>> a night. (aka "eventually consistent")  Maybe store a time-stamp on the
>>>>> index, to tell you how old it is. Or maybe have an append-only log of
>>>>> atomspace changes... I can propose many different kinds of solutions.
>>>>> They all have space and time-overhead, and/or assorted usability issues.
>>>>> Which of these best suits your needs, I have trouble guessing, so you
>>>>> would have to explain what the problem is (if any).
>>>>>
>>>>> --linas
>>>>>
>>>>> Here's the docs:
>>>>>  cog-execute-cache! EXEC KEY [METADATA [FRESH]]
>>>>>
>>>>>    Execute or return cached execution results. This is a caching
>>>>>    version of the `cog-execute!` call.
>>>>>
>>>>>    If the optional FRESH boolean flag is #f, then if there is a Value
>>>>>    stored at KEY on EXEC, return that Value. The default value of FRESH
>>>>>    is #f, so the default behavior is always to return the cached value.
>>>>>    If the optional FRESH boolean flag is #t, or if there is no Value
>>>>>    stored at KEY, then the `cog-execute!` function is called on EXEC,
>>>>>    and the result is stored at KEY.
>>>>>
>>>>>    The METADATA Atom is optional.  If it is specified, then metadata
>>>>>    about the execution is placed on EXEC at the key METADATA.
>>>>>    Currently, this is just a timestamp of when this execution was
>>>>>    performed. The format of the meta-data is subject to change; this
>>>>>    is currently an experimental feature, driven by user requirements.
>>>>>
>>>>>    At this time, execution is synchronous. It may be worthwhile to have
>>>>>    an asynchronous version of this call, where the execution is
>>>>>    performed at some other time. This has not been done yet.
>>>>>
>>>>> On Wed, Aug 26, 2020 at 7:41 AM Abdulrahman Semrie <[email protected]>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>> In the current atomspace, atoms are indexed by their type, i.e. given
>>>>>> a type we can retrieve all the atoms that have that type. But there is no
>>>>>> other way of adding custom indices in the atomspace. For example, if we
>>>>>> want to index nodes by their name, there is no way of doing this.
>>>>>>
>>>>>> As discussed in this issue
>>>>>> <https://github.com/MOZI-AI/annotation-scheme/issues/192>, we plan
>>>>>> to expand the annotation-service, which uses the AtomSpace to store
>>>>>> genomics data, to support the annotation of more types in addition to
>>>>>> genes. Currently, when a user submits a list of ids to the service, it is
>>>>>> assumed that these ids/symbols represent `GeneNode`s. But in the case
>>>>>> where the input can be a protein, a drug molecule, a pathway or a gene,
>>>>>> there is no direct way of retrieving the type of the atom with the given
>>>>>> name unless we iterate through all atoms searching for that particular id.
>>>>>> This isn't a good approach from a performance standpoint. But if we had a
>>>>>> custom index, e.g. `name_index`, on the ids/names of the atoms, it would
>>>>>> be easier to search the atoms by name and identify the type that the atom
>>>>>> belongs to.
>>>>>>
>>>>>> Hence, if there is a way to add custom indices to the atomspace, it
>>>>>> will greatly simplify some searches. Or maybe there is a way to do what I
>>>>>> described above without the need for an index. If so, please share it.
>>>>>>
>>>>>> --
>>>>>> You received this message because you are subscribed to the Google
>>>>>> Groups "opencog" group.
>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>> send an email to [email protected].
>>>>>> To view this discussion on the web visit
>>>>>> https://groups.google.com/d/msgid/opencog/27892502-0dfb-4042-a805-30a1520f6250n%40googlegroups.com
>>>>>> <https://groups.google.com/d/msgid/opencog/27892502-0dfb-4042-a805-30a1520f6250n%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>> .
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Verbogeny is one of the pleasurettes of a creatific thinkerizer.
>>>>>         --Peter da Silva
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Verbogeny is one of the pleasurettes of a creatific thinkerizer.
>>>>         --Peter da Silva
>>>>
>>>>
>>>
>>>
>>> --
>>> Ben Goertzel, PhD
>>> http://goertzel.org
>>>
>>> “The only people for me are the mad ones, the ones who are mad to live,
>>> mad to talk, mad to be saved, desirous of everything at the same time, the
>>> ones who never yawn or say a commonplace thing, but burn, burn, burn like
>>> fabulous yellow roman candles exploding like spiders across the stars.” --
>>> Jack Kerouac
>>>
>>>
>>
>>
>> --
>> Verbogeny is one of the pleasurettes of a creatific thinkerizer.
>>         --Peter da Silva
>>
>


-- 
Verbogeny is one of the pleasurettes of a creatific thinkerizer.
        --Peter da Silva

-- 
You received this message because you are subscribed to the Google Groups 
"opencog" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/opencog/CAHrUA35Z%3DH1oSVFZ%3D-WTTATf4U9jhmfhMMAF6jNO1daTrbDXJg%40mail.gmail.com.
