Hello,
On Fri, Jul 05 2019, Peter Palfrader wrote:
> [moving to the list]
>> I know you said that 'Public mailinglists are the right point of
>> contact.', so I hope you'll forgive me for contacting you directly,
>> but I didn't get a reply to my previous email and I would still
>> like access to the Debian archive data if possible (see original
>> message below).
> [email protected] is still the best place :)
>> Thanks for replying.
>>
>> I'm a security researcher
>> [..]
>> I'm creating an application to identify versions of common
>> libraries for security purposes (to identify binaries that have
>> associated CVEs). The aim of this project is to identify binaries
>> that need to be patched / updated, so they can't be exploited.
>>
>> I would want to make a large number of requests initially to
>> populate the database -- I would want to download every package
>> file for about the 100 most popular packages for every version
>> going back about 10-15 years. After that I would want to make
>> minimal requests on a daily basis to check for new versions or new
>> files for the latest version of each package.
>>
>> And yes, I am aware that many people will download the source code
>> and compile it themselves.
> There are two parts to the snapshot thing, each with its own
> resource constraints.
>
> (a) One is everything that goes to the database, which is pretty
>     much every request except those covered in (b). Things have
>     gotten somewhat better since we moved the DB for the secondary
>     snapshot instance to a new host, but it's probably still not
>     happy to be hammered.
>
>     Things that hit the database are links like
>       https://snapshot.debian.org/package/postgrey/
>       https://snapshot.debian.org/archive/debian/20160816T043010Z/pool/main/p/postgrey/
>       https://snapshot.debian.org/archive/debian/20160816T043010Z/pool/main/p/postgrey/postgrey_1.34-1.1.dsc
>       https://snapshot.debian.org/archive/debian/?year=2009&month=11
>       https://snapshot.debian.org/mr/...
>     etc.
>
>     These requests are bound by database latency, and also by the
>     number of concurrent requests to the DBMS. Further, since the
>     pooling class in use is not exactly great, once a certain number
>     of requests are in flight, things just fall over and everybody
>     starts getting 503s. Don't overload the DB :)
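To make that constraint concrete, a client hitting these DB-backed
endpoints should serialise its requests and back off when it sees a
503. A minimal Python sketch of that; the pause and backoff values are
my own guesses, not documented limits:

```python
import time
import urllib.error
import urllib.request

BASE = "https://snapshot.debian.org"

def metadata_url(path):
    """Build a URL for one of the DB-backed metadata endpoints."""
    return BASE + path

def backoff_delays(initial=5.0, factor=2.0, retries=4):
    """Seconds to wait before each retry after a 503.
    These values are guesses, not documented limits."""
    return [initial * factor ** i for i in range(retries)]

def polite_get(path, pause=2.0):
    """Fetch one metadata page; pause between requests and back off
    on 503 so the database is not hammered."""
    url = metadata_url(path)
    for delay in backoff_delays() + [None]:
        try:
            with urllib.request.urlopen(url) as resp:
                body = resp.read()
            time.sleep(pause)  # spread successive requests out
            return body
        except urllib.error.HTTPError as exc:
            if exc.code == 503 and delay is not None:
                time.sleep(delay)  # server overloaded: retry slowly
                continue
            raise
```

So e.g. polite_get("/mr/package/postgrey/") fetches one metadata page
and then waits before the caller may issue the next one.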
This is very good to know; in a previous email to this list I
mentioned that this should be noted in the API documentation, and
perhaps a link to your email would be desirable in addition to that.
> (b) The only requests that do not hit the DB are requests to
>       https://snapshot.debian.org/file/<sha1sum of file>
>
>     Those are cheap(ish). They are static files and apache fetches
>     them directly from disk (NFS, but still). I wouldn't worry too
>     much about making a lot of them. Maybe not concurrently, but
>     fetching them fast and sustained shouldn't cause too many
>     issues. If things fail, retry slowly?
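The "sequential and sustained rather than concurrent" advice for the
static file store could look like the sketch below. Since the URL path
is the SHA-1 of the content, each download can also be verified
locally without any extra request:

```python
import hashlib
import urllib.request

def file_url(sha1_hex):
    """Content-addressed file store; served statically, no DB hit."""
    return "https://snapshot.debian.org/file/" + sha1_hex

def verify_sha1(data, expected_hex):
    """The path is the SHA-1 of the content, so a download can be
    checked locally after fetching."""
    return hashlib.sha1(data).hexdigest() == expected_hex

def fetch_files(hashes):
    """Download files one at a time: sequential, not concurrent."""
    for sha1_hex in hashes:
        with urllib.request.urlopen(file_url(sha1_hex)) as resp:
            data = resp.read()
        if verify_sha1(data, sha1_hex):
            yield sha1_hex, data
```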
> Looking at
> https://salsa.debian.org/snapshot-team/snapshot/raw/master/API,
> you'll probably need to make some requests to learn which files to
> download. Please do those one at a time, and spread them out? I
> don't know what a reasonable rate is that still lets you get what
> you need. How many requests do you think it'll need?
>
> I guess there will be requests to /mr/package/<package>/ for "the
> 100 most popular packages", so that's reasonably small. And then
> maybe /mr/package/<package>/<version>/allfiles to learn all about
> the files? So that'd be once per package per version. Any guess how
> many that'd be? 10k? 100k? How many versions does the average
> "popular package" have? A request every few seconds should still
> get you what you need in a reasonable time?
>
> Once you have that info, it should all just be file downloads?
Mostly; there is also /mr/file/<hash>/info to learn the file name,
size and first-seen date. From my tests, IIRC, 3-4 requests are
needed to have full information to actually download a file or to
decide if it is to be downloaded.
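That walk -- versions per package, then allfiles per version, then
per-file info -- can be sketched as below; the JSON field names
("result", "version") are my reading of the API document and should be
double-checked:

```python
import json

BASE = "https://snapshot.debian.org"

def versions_url(package):
    # one request per package
    return f"{BASE}/mr/package/{package}/"

def allfiles_url(package, version):
    # one request per (package, version) pair
    return f"{BASE}/mr/package/{package}/{version}/allfiles"

def fileinfo_url(sha1_hex):
    # one request per file hash: name, size, first-seen date
    return f"{BASE}/mr/file/{sha1_hex}/info"

def parse_versions(body):
    """Extract the version strings from a /mr/package/<pkg>/ response.
    Assumes a {"result": [{"version": ...}, ...]} shape."""
    doc = json.loads(body)
    return [entry["version"] for entry in doc["result"]]
```

As a rough bound: 100 packages with ~100 versions each is about 10k
allfiles requests; at one request every 3 seconds that is roughly 8-9
hours of metadata crawling before the file downloads even start.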
>> Since I work for a large company, I do have access to resources,
>> including funding to cover expenses (for hardware, travel,
>> services etc.), if that helps. I hope that my indirect reference
>> to money doesn't seem inappropriate, I was just thinking there
>> might be a way to access the data that would incur some costs.
> I'm not sure that throwing money at the problem currently would
> help much. It's mainly a manpower issue as few (if any) other
> people ever look after snapshot and I don't really have any time
> for it either.
Not necessarily wrong either, and roughly in line with what I also
mentioned in a previous email:

    Another option, which may not be feasible, would be to make the
    db available for download and give people the ability to process
    that on their own; is a db dump (without the packages) huge?
If this were practicable, even if the dump only happened *somewhat*
often (once a week? once a month? depends on the data), it'd allow
people/organisations to, e.g., locally replicate the API service (DB
included) and only hit snapshot.debian.org for file downloads if
absolutely necessary and not already cached.
That would enable some use-cases and also allow people without access
to snapshot.debian.org to contribute to improving the service, e.g.
by modifying the software contacting the database without having the
whole archive locally.
If this were not practicable and someone needed to run this kind of
analysis, it could be possible to dump the database to hard drives
that are sent around by postal service / personally; sadly this would
require some manual intervention and therefore can't happen too
often, since all volunteers' time is limited.
Both of these options, and probably others I'm not seeing, can only
be done with some resources, so I don't think the mention of resource
availability is out of place.

All of this, and what I mentioned in my previous email, is because
the data in snapshot.debian.org has huge potential, but it also
currently has a very high barrier to entry.
--
Evilham