On Mon, Jul 13, 2020 at 1:10 PM John Cowan <[email protected]> wrote:

>
>   Here's my favorite remarks about git, from the Plan 9 git filesystem's
> README:
>
> The most obvious difference is that Git's index is a bit boneheaded,
>
>
I suppose that is a valid comment.  If I may wander wildly off-topic ..
both bitcoin and git showed up at the same time, and both were built on the
same idea of append-only logs. git explicitly allowed multiple forks;
bitcoin explicitly forbade them. Both made design mistakes, as they were
early adopters. git wasn't sufficiently file-system-like-ish (which is
plan9's strong point). A shame, since file-backup, restore,
corruption-protection from viruses, accidental deletion, etc. are a
"thing". For example, keeping the unix `/etc` in git is a life-saver for
system admins.  No matter; seems that guix and nix might have a superior
solution, anyway, from what I can tell.

bitcoin made two mistakes: not being file-system-ish enough, and not
defining a sufficiently generic compute platform (solved by ethereum;
basically, bitcoin knows how to add and subtract, which is all that's
needed for accounting, while ethereum knows how to multiply and divide
and branch and tail-recurse). The not-being-a-filesystem choice is forced
by the anti-forking properties of bitcoin/ethereum, since the blockchain
grows rapidly and so
all commits must be tiny. By contrast, git allows not only large commits,
but any kind of code, c++, scheme, whatever. But never defines an execution
context for that code. In git, it's up to the user to download, compile,
install. It's not automated like in ethereum.

git fails to provide a fully-automated network filesystem. You have to
manually git-push, git-pull to get network updates. There's no
peer-discovery protocol (which is what started this email chain), and
resolving file conflicts is problematic during `git merge`. Also, git fails
to provide a directory search mechanism: if you don't know where the
content is, you can't search for it (as also complained about in this email
chain). github sort-of-ish solves this, but it's proprietary. Compare this
to IPFS, which is full-network-automated, and does provide content search.
Unfortunately, IPFS doesn't have the append-only log structure of git. It
also uses DHTs, which are problematic, as they completely fail to deal
with locality. Enter scuttlebutt .. which provides a "true" decentralized
append-only log. The scuttlebutt core is fully generic (which is why
git-on-scuttlebutt is possible). However, 90% of scuttlebutt development
focus is on social media, not filesystems. Think of it as a kind-of
git-for-social-media posts. Or maybe a
web-browser-displaying-git-content-in-a-pretty-way.

The lack of proper indexes in git is severe, as is the lack of
content-based search. Once you get into search, you wander down the
rabbit-hole of query languages, pattern matching and pattern-mining. So
scheme (and most functional programming languages) has pattern-matchers:
case-lambda, for example, but also define-syntax and syntax-case, or
racket's racket/match <https://docs.racket-lang.org/reference/match.html>
... but in a certain sense, these are pathetically weak and limited in
ability and performance. Compare to, for example, SQL -- it blows the
doors off syntax-case in usability and power. Never mind pattern-mining.
And then we have the issue of term-rewriting. So, for example, scheme's
hygienic macros do term-rewriting, but they do it only once, when your
scheme code first starts up. There is no ability to perform runtime
expansion of hygienic macros. Macro languages are not run-time -- again,
compare/contrast to SQL select/update.

Well, but SQL is obviously deficient -- it's record-based, and if you
compare it to syntax-case and define-syntax, you will note that trees,
aka s-expressions, are what we really wanted. Pulling on that thread
gives you graph query languages, e.g. GraphQL for javascript... which is
nice because json is kind-of-ish like typed s-expressions. Yes, scheme is
explicitly untyped, but don't knock types; they're really nice. The
racket people are onto something, something that ain't javascript or caml
or haskell.

So, although graph query languages are vast improvements over plain-jane
pattern matchers, one can do better still... which brings me to what I work
on... the AtomSpace and Atomese (sorry, I hate camel-case, but that's
history now.) It's a graph database -- you can store arbitrary typed
s-expressions. It's a pattern matcher, but far more sophisticated than
GraphQL. It's a programming language, but is more like assembly or
byte-code, or an intermediate-language: low-level, not for humans, but for
other algos. It could've/should've been more javascript-like, but that's a
historical mistake. Maybe still fixable. It's vaguely prolog-like, and so
you could say minikanren-like, but it has a stronger runtime, and
generalizes truth values beyond true/false, so e.g. for Bayesian
probability or neural-nets. It's .. well, it's a graph database, it's
weakly distributed; there's some ongoing work there, but, as mentioned,
DHTs are terrible at ensuring data locality. So if I have, e.g.,
(case-lambda ((foo) ...) ((bar baz) ...)), I would rather that (foo) and
(bar baz) be on the same network node, rather than on opposite sides of
the planet. But Kademlia puts them on opposite sides of the planet
(because their hashes are completely different), and I haven't been able
to crack that nut.
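To make the locality problem concrete, here's a minimal sketch in
Python (sha1 stands in for whatever hash a real DHT uses):

```python
# A minimal sketch of the DHT locality problem. In a Kademlia-style DHT,
# an object is stored on the node whose ID is nearest (in XOR metric) to
# the object's content hash; sha1 here stands in for whatever hash a
# real DHT uses. Two s-expressions that belong together get completely
# unrelated keys.
import hashlib

def dht_key(sexpr: str) -> int:
    """160-bit content address for an s-expression."""
    return int.from_bytes(hashlib.sha1(sexpr.encode()).digest(), "big")

a = dht_key("(foo)")
b = dht_key("(bar baz)")

# The XOR distance between the two keys spans nearly the full 160-bit
# keyspace: the two expressions land on far-apart nodes.
print((a ^ b).bit_length())
```

Any locality-preserving scheme would need the keys of related
expressions to share a long prefix -- exactly what a cryptographic hash
is designed to destroy.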

What does this have to do with zipf? Well, all patterns have a grammar.
This is chomsky's theorem from the 1960's or something. If you know the
grammar, you can parse text to see if it validly obeys the grammar.
Alternately, you can generate syntactically-valid random text from the
grammar.  But what if you don't know the grammar? Well, that's what
machine-learning and deep-learning are supposed to be for, except that
machine-learning could only learn very simple grammars (e.g.
decision-tree forests) until it hit the wall. So deep-learning overcame
that wall, but it bypasses syntax by claiming it's not important, or that
it is a black box, or whatever the pundits claim these days. So I'm
trying to do pattern mining correctly.
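To illustrate the generate-from-grammar direction, here's a toy sketch
(the grammar itself is a made-up example, not one from my experiments):

```python
# A toy illustration of generating syntactically-valid random text from
# a grammar: nonterminals expand via a randomly chosen production,
# terminals are emitted as-is. The grammar is a made-up example.
import random

GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["the", "N"]],
    "VP": [["V", "NP"], ["V"]],
    "N":  [["cat"], ["dog"]],
    "V":  [["sees"], ["sleeps"]],
}

def generate(symbol="S"):
    if symbol not in GRAMMAR:                    # terminal: emit it
        return [symbol]
    production = random.choice(GRAMMAR[symbol])  # pick a production
    return [word for sym in production for word in generate(sym)]

print(" ".join(generate()))   # e.g. "the dog sees the cat"
```

Parsing runs the same grammar in the other direction: check whether a
given word sequence can be derived from S.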

But to do that, I need to validate that the syntax that is learned matches
the syntax used to generate the sample corpus. So I have to generate a
random syntax, generate a random text corpus from it, pattern match it,
deduce the syntax, and check precision/recall accuracy. Most networks in
nature (natural language, genomics, proteomics) seem to be
zipf-distributed. And so I need a zipf random generator.  So here we are...
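A minimal zipf random generator can be sketched like so (inverse
transform over the rank CDF; an illustration, not the generator I
actually use -- the exponent s is adjustable):

```python
# A minimal zipf random generator (an illustration): ranks 1..n are
# drawn with probability proportional to 1/k^s, via inverse transform
# on the precomputed CDF. The exponent s is adjustable, e.g. s=1 for
# classic zipf, s=0.5 for the genomic case.
import bisect
import random

def zipf_sampler(n: int, s: float = 1.0):
    weights = [1.0 / k**s for k in range(1, n + 1)]
    total = sum(weights)
    cdf, acc = [], 0.0
    for w in weights:
        acc += w / total
        cdf.append(acc)
    cdf[-1] = 1.0                    # guard against float round-off
    def sample():                    # returns a rank in 1..n
        return bisect.bisect_left(cdf, random.random()) + 1
    return sample

random.seed(1)
draw = zipf_sampler(1000, s=1.0)
counts = {}
for _ in range(10_000):
    r = draw()
    counts[r] = counts.get(r, 0) + 1
print(max(counts, key=counts.get))   # rank 1 dominates
```

The O(log n) bisect per draw is fine for corpus-generation purposes;
precomputing the CDF once is the only setup cost.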

FWIW genomes seem to be zipfian with exponent of 1/2 .. I have been unable
to find any explanation why it's 1/2 and not 1. ...!?

Sorry for long email. It's useful for me to articulate from time to time as
to what this is all about...

--linas
