Re: SRFI development in the age of git

Linas Vepstas Tue, 14 Jul 2020 15:55:49 -0700

On Mon, Jul 13, 2020 at 7:38 PM John Cowan <[email protected]> wrote:

>
> Just cherry-picking a few points...
>
> On Mon, Jul 13, 2020 at 5:40 PM Linas Vepstas <[email protected]>
> wrote:
>
>> Compare to, for example SQL -- it blows the doors off syntax-case in
>> usability and power.
>>
>
> Well, no; syntax-case allows arbitrary Scheme, so it is Turing-complete.
> SQL is not, unless the implementation of CTEs allows arbitrary nesting.
> SQL is also extremely rigid, deficient, and un-orthogonal compared to a
> true relational algebra implementation like Tutorial D.
>
> See also Linda, in which you broadcast arbitrary tuples (could be trees,
> too) into Lindaspace and then anyone can query the space with pattern
> matching, returning the first matching tuple with or without atomically
> removing it.
>


Yes, SQL is deficient, which is why graph query languages exist, and why
the atomspace got created. To keep things concrete, here's a
bio-grid/reactome/chebi data annotation package:
https://github.com/MOZI-AI/annotation-scheme -- it's currently being used
for covid research.

Typical datasets contain roughly 10 million s-expressions, e.g.
a million of these biogrid entries:
(Evaluation (Predicate "interacts_with") (List (Gene "FLNC") (Gene "MAP2K4")))
(Evaluation (Predicate "has_entrez_id") (List (Gene "MAP2K4") (Concept "entrez:6416")))

and several million of these chebi entries:
(Member (Molecule "ChEBI:16977") (Concept "SMP0000055"))
(Evaluation (Predicate "has_name") (List (Molecule "ChEBI:16977") (Concept "(2S)-2-aminopropanoic acid")))
etc.

Basically, they are small, very low-complexity patterns; there are just a
lot of them.

Two heavy-hitter queries include what I call "the triangle": given gene A,
find genes B and C such that A interacts with B, B interacts with C, and C
interacts with A. (They've intentionally confused upregulation with
downregulation for some reason I don't understand.) Another is what I call
the "pentagon": genes A and B interact, they express proteins P and Q, which
are in the same reactome R.
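
To make the two query shapes concrete, here is a minimal sketch in plain Python. It is not how the atomspace or the annotation package evaluates them. It assumes the relations have already been loaded into in-memory dictionaries of sets, and that "interacts_with" is treated as undirected. The function names, the "expresses" and "in_pathway" tables, and every identifier in the sample data other than FLNC and MAP2K4 are made up for illustration.

```python
from collections import defaultdict


def build_adjacency(interactions):
    """Index gene-gene interaction pairs as gene -> set of partners.

    Assumption: interactions are undirected, so each pair is stored both ways.
    """
    adj = defaultdict(set)
    for a, b in interactions:
        if a != b:  # ignore self-interactions
            adj[a].add(b)
            adj[b].add(a)
    return adj


def triangles(adj, a):
    """Triangle: all (B, C) with A-B, B-C and C-A interacting.

    Each unordered pair is reported once, with B < C.
    """
    found = []
    neighbours = adj.get(a, set())
    for b in neighbours:
        # C has to be a partner of both A and B.
        for c in neighbours & adj.get(b, set()):
            if b < c:
                found.append((b, c))
    return sorted(found)


def pentagons(adj, expresses, in_pathway, a):
    """Pentagon: all (B, P, Q, R) where A-B interact, A expresses P,
    B expresses Q, and P and Q are both members of pathway R.
    """
    found = []
    for b in adj.get(a, set()):
        for p in expresses.get(a, set()):
            for q in expresses.get(b, set()):
                shared = in_pathway.get(p, set()) & in_pathway.get(q, set())
                for r in shared:
                    found.append((b, p, q, r))
    return sorted(found)


# Hypothetical sample data; only FLNC and MAP2K4 come from the examples above.
adj = build_adjacency([
    ("FLNC", "MAP2K4"),
    ("MAP2K4", "GENE_X"),
    ("GENE_X", "FLNC"),
    ("FLNC", "GENE_Y"),
])
expresses = {"FLNC": {"protein-P"}, "MAP2K4": {"protein-Q"}}
in_pathway = {"protein-P": {"pathway-R"}, "protein-Q": {"pathway-R"}}

print(triangles(adj, "FLNC"))
print(pentagons(adj, expresses, in_pathway, "FLNC"))
```

Both queries are just joins over small relations: the triangle intersects two neighbour sets, and the pentagon intersects two pathway-membership sets. Whether this naive approach is usable at 10 million relations is a separate question that the sketch does not answer.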

The triangle queries currently take maybe an hour(?) on a five-year-old
compute node; the pentagon queries take maybe 6 hours(?) (I've forgotten.)
So, as a point of practical application: can I load 10 million relations
into Tutorial D or into Linda, and run the triangle/pentagon pattern
matches? (I don't see how to use either syntax-case or srfi-200 to perform
these queries. Or rather, I haven't thought it worth devoting time to
figuring out how to do this, as they don't seem appropriate for this
problem.)

(I admit I've never heard of Tutorial D or Linda before; I'll take a look.)

--linas
