I agree about gumshoes, Glen, but I think maybe the line in this that draws my 
interest is a sense that there is an enormous gap in scale that requires some 
other kind of design to fill it.  It would be of a piece with things that 
bugged me in past transitions.

I think for me it started when public-key cryptography first came out 
and Phil Zimmermann made the assertion “the Web of Trust will replace central 
authenticators”.  (No.)  I spent years trying to make a kind of sum-over-paths 
formulation of how confidence degrades in a web of trust without some kind of 
organizational algorithm, but how parallelism could be used to draw a more 
robust confidence from path combinatorics than from single central entities.  
(My thinking at the time was completely naive, but I won’t digress further 
here.)  That veered off into every possible conceptual obscurity, including 
what one means by “identity”, how practices like good key security should be 
taken to relate to judgment in public-key endorsements, how to use real world 
proof-of-stake to make costly signals, and yada yada yada.  When PageRank came 
out, and I tried to make a pitch to Pierre Omidyar on design of insurance 
schemes surrounding proof-of-value-at-risk and some kind of network-feedback, 
in a conversation on the SFI patio on a sunny lovely afternoon, Pierre was kind 
enough not to say flatly “You don’t understand _anything_ about how companies 
work”, a subject on which he was himself still steep on the learning curve at 
the time.  But 
that never came to anything, except that the conceptual questions interested 
Martin Shubik enough that they began our collaboration (within which that was 
never a research question).
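
Since I mentioned it, the naive version of that sum-over-paths idea can be 
stated in a few lines.  This is only a toy sketch, not the old formulation: 
confidence is taken to decay multiplicatively along each endorsement chain, 
and parallel chains are combined with a noisy-OR, under an independence 
assumption that real webs of trust badly violate.

```python
# Toy sketch: confidence decays multiplicatively along an endorsement
# chain, and parallel, assumed-independent chains combine via noisy-OR.

def path_confidence(edge_confidences):
    """Confidence along one endorsement chain: product of its edges."""
    c = 1.0
    for e in edge_confidences:
        c *= e
    return c

def combined_confidence(paths):
    """Noisy-OR combination over parallel (assumed independent) chains."""
    miss = 1.0
    for p in paths:
        miss *= 1.0 - path_confidence(p)
    return 1.0 - miss

# A single long chain of strong links loses confidence quickly ...
chain = path_confidence([0.9, 0.9, 0.9, 0.9])                     # ~0.656
# ... while several mediocre parallel chains can beat it.
mesh = combined_confidence([[0.7, 0.7], [0.7, 0.7], [0.7, 0.7]])  # ~0.867
```

The independence assumption is exactly what fails in practice (paths share 
endorsers, so they are correlated), which is part of why the problem needed 
some organizational algorithm rather than raw path-counting.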

When the arXiv first came out and said “I will replace journals”, I thought 
that, once again, there is a bottleneck of human time and attention, and that 
simply flooding everyone with everything all the time, while it will open new 
opportunities that gatekeepers keep closed off, will cause random sampling 
error to replace systemic bias as the main cause of really suboptimal 
solutions.  It’s not that the need to take responsibility for understanding 
something oneself can ever be got around — that is a fundamental constraint — 
only that the concept of a “fiduciary” proposes that a good vetting and 
recommendation design can help you do the least badly possible at balancing 
awareness and understanding, given the techniques of the time and your 
own preferences for how to split the two.  We are still stuck with journals 
because, bad as they are, we haven’t really designed and established 
alternatives that are enough better to displace the journals’ role.  F1000 and 
others were efforts in this direction, but they are distantly on the margins.

Then for languages, I fought the linguists for years, who wanted to extract 
single, microscopic, very-strong features of language that they could analyze 
to death, but which remain totally silent about almost-all questions of 
interest, because the strong signatures are few and the questions many.  I 
wanted probability methods to get distributional evidence about weak and 
distributed, but numerous and reinforcing, patterns of concordance in language. 
 That was the easiest idea, because we already know how to do it and it was 
just a matter of fighting a reactionary culture.  I think just the change of 
generations is already well on the way toward winning that battle, quite 
independently of any tiny contribution (if any at all) that we made.

For this one (the genomic forensics), I feel like we know who many of the 
organized actors are.  Governments will try to “control the narrative”, to the 
extent that they perceive doing so to be in their interest, and to the extent 
that the norms and institutions of the society give them cover in doing it.  
Governments that both are authoritarian and that depend on promulgating an 
ideology are probably the most committed to doing this comprehensively.  There 
are mid-level skirmishes, like between the US and Wikileaks, but I think 
because of the new horizons in big computing together with being able to seal 
off borders, China is pioneering a new frontier in this, which could be a “more 
is different” moment.  I can’t think of a counterpart to them anywhere else in 
the world just now.  The Russian model is quite different (I like things that 
Masha Gessen and Gary Kasparov say about that approach, granting that each of 
them has a POV); I have wondered how much confidence to attach to public health 
data coming out of Vietnam, which is by many measures kind of an okay 
functioning society, but in which any building or billboard made of durable 
materials is still plastered with official slogans and propaganda.  (That is 
not a case about which I know much of anything, so my cautions there are 
nearly empty.)

Against these actors, we have other big actors, like intelligence agencies, and 
that is probably okay to produce some balance of power.  But they are all 
monoliths.

The few cases where we have interesting data for the viral question, from Yuri 
Deigin’s collaborators and now Jesse, are tiny data points acquired at large 
personal time and effort, guided by insight about particular questions.  I do 
feel like early sequence data are particularly high-value, because with what we 
currently can estimate about mutation rates, we could plug sequence-diversity 
data into back-of-the-envelope epidemiological models and try to get a sense of 
how much circulation there was in any community at any time, and try to back 
out timelines for founder infections, sort of like the LANL group did for HIV 
in the 1990s (?).  (That was Bette Korber, Tanmoy Bhattacharya, Alan Perelson, 
and their cohort, plus I am sure other groups that I don’t know.)
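
A toy version of that back-of-the-envelope arithmetic, assuming a star-shaped 
expansion from a single founder and a simple molecular clock; the rate and 
genome size in the example are rough illustrative values, not estimates for 
any particular outbreak:

```python
# Under a star-shaped expansion from one founder, with clock rate mu
# (substitutions per site per year), two random genomes are expected to
# differ at about 2 * mu * L * t sites after t years.  Inverting that
# gives a crude timeline estimate from observed sequence diversity.

def years_since_founder(mean_pairwise_diffs, mu_per_site_year, genome_len):
    """Invert E[diffs] ~= 2 * mu * L * t to estimate t in years."""
    return mean_pairwise_diffs / (2.0 * mu_per_site_year * genome_len)

# Illustrative numbers only: genomes differing at ~3 sites on average,
# mu ~ 1e-3 subs/site/year, genome ~ 3e4 nt.
t = years_since_founder(3.0, 1e-3, 3.0e4)
print(t)  # 0.05 years, i.e. a founder a few weeks back
```

Anything more serious would have to handle non-star genealogies, rate 
uncertainty, and sampling bias, which is where the real epidemiological 
modeling comes in.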

Yet we must be swimming in genome data of an incidental nature, like stray 
reads that end up in repositories where one would look only by accident.  I 
continue to wonder if there is some “design” of a sieve that could automate or 
crowdsource some of these questions, so there could be a “public option” to go 
alongside the NSA/CIA vs. Governments dyad.  Viral genomics seems to be a 
problem whose structure is well-matched to distributed, public-data 
surveillance.
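
A minimal sketch of what one mesh of such a sieve could look like: flag stray 
reads that share k-mers with a reference genome of interest.  Everything here 
is hypothetical and simplified; real tools do this same kind of matching with 
indexed data structures at repository scale.

```python
# Hypothetical sieve cell: screen stray sequencing reads against a
# target genome by counting shared k-mers, flagging reads that match.

def kmers(seq, k=21):
    """All length-k substrings of a sequence, as a set."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def screen_reads(reads, reference, k=21, min_hits=2):
    """Return reads sharing at least min_hits k-mers with the reference."""
    ref_kmers = kmers(reference, k)
    flagged = []
    for read in reads:
        hits = sum(1 for km in kmers(read, k) if km in ref_kmers)
        if hits >= min_hits:
            flagged.append(read)
    return flagged
```

The point is only that the inner loop is embarrassingly parallel and needs no 
central authority, which is what makes the “public option” framing seem 
structurally plausible.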

I worry that, as more of these discoveries come out, the government intrusion, 
micromanagement, and punitiveness toward academic and institute researchers in 
China is going to become just miserable.  Even if it was a wild outbreak, the 
essentially adversarial stance the CCP takes toward the rest of the world would 
cause them to suppress information, because they don’t trust the rest of the 
world not to draw motivated conclusions for the sake of working against them 
(and that is not an unreasonable fear; it’s where Miranda rights come from).  
So to the degree that an accurate and reasonably confident story could be put 
together for this problem, perhaps it would reduce the time this particular 
pain will be drawn out.  I would also like to think it could contribute to a 
sense that there are limits to what countries can expect to get away with.

Anyway, sorry, rambling.

Eric


> On Jun 24, 2021, at 11:07 PM, uǝlƃ ☤>$ <[email protected]> wrote:
> 
> It's a wonderful example of careful science. I only have 1 criticism. "There 
> is no plausible scientific reason for the deletion: the sequences are 
> perfectly concordant with the samples described in Wang et al.(2020a,b), 
> there are no corrections to the paper, the paper states human subjects 
> approval was obtained, and the sequencing shows no evidence of plasmid or 
> sample-to-sample contamination."
> 
> There's never *any* scientific reason to delete anything. So, the 1st clause 
> in the sentence is *merely* an attempt to rouse the rabble. 8^D Otherwise 
> known as "trolling". But buried under all the excellent, and excellently 
> hygienic, sentences in the paper, it makes that trawl more poignant and well 
> done.
> 
> Writ large, though, the phrase "systematic forensis" seems like a paradox. 
> The approach I take, inspired by systems engineering, is to *log* absolutely 
> everything, under version control, persistently. Rather than being a part of 
> systematic forensis, it *facilitates* forensis. In light of our conversation 
> on the myth of the objective, forensics imputes causality into a mesh of 
> events ... hunts down *the* criminal, *the* offending "$ shed -u" command. 
> Nothing brings that to the public forum quite like the gumshoe's 
> pavement-pounding response to her *hunch*.
> 
> It doesn't sound quite right to talk of systematic forensics. It sounds more 
> right to say systematic bookkeeping for the sake of more publicizing to the 
> forum.
> 
> On 6/23/21 9:42 PM, David Eric Smith wrote:
>> Speaking of big data forensics (which no-one was):
>> https://www.biorxiv.org/content/10.1101/2021.06.18.449051v1.full.pdf
>> 
>> [...]
>> I post because (apart from general interest), in the last paragraph of his 
>> introduction, he makes a call for data forensics to be done more 
>> systematically.
> 
> 
> -- 
> ☤>$ uǝlƃ
> 
> - .... . -..-. . -. -.. -..-. .. ... -..-. .... . .-. .
> FRIAM Applied Complexity Group listserv
> Zoom Fridays 9:30a-12p Mtn GMT-6  bit.ly/virtualfriam
> un/subscribe http://redfish.com/mailman/listinfo/friam_redfish.com
> FRIAM-COMIC http://friam-comic.blogspot.com/
> archives: http://friam.471366.n2.nabble.com/

