Hello,
On Fri, Jul 05 2019, Peter Palfrader wrote:
> [moving to the list]
>> I know you said that 'Public mailinglists are the right point of
>> contact.', so I hope you'll forgive me for contacting you directly,
>> but I didn't get a reply to my previous email and I would still
>> like access to the Debian archive data if possible (see original
>> message below).
> [email protected] is still the best place :)
>> Thanks for replying.
>>
>> I'm a security researcher
>> [..]
>> I'm creating an application to identify versions of common
>> libraries for security purposes (to identify binaries that have
>> associated CVEs). The aim of this project is to identify binaries
>> that need to be patched / updated, so they can't be exploited.
>>
>> I would want to make a large number of requests initially to
>> populate the database -- I would want to download every package
>> file for about the 100 most popular packages for every version
>> going back about 10-15 years. After that I would want to make
>> minimal requests on a daily basis to check for new versions or new
>> files for the latest version of each package.
>>
>> And yes, I am aware that many people will download the source code
>> and compile it themselves.
> There are two parts to the snapshot thing, each with its own
> resource constraints.
>
> (a) One is everything that goes to the database, which is pretty
>     much every request except those covered in (b). Things have
>     gotten somewhat better since we moved the DB for the secondary
>     snapshot instance to a new host, but it's probably still not
>     happy to be hammered.
>
>     Things that hit the database are links like
>       https://snapshot.debian.org/package/postgrey/
>       https://snapshot.debian.org/archive/debian/20160816T043010Z/pool/main/p/postgrey/
>       https://snapshot.debian.org/archive/debian/20160816T043010Z/pool/main/p/postgrey/postgrey_1.34-1.1.dsc
>       https://snapshot.debian.org/archive/debian/?year=2009&month=11
>       https://snapshot.debian.org/mr/...
>     etc.
>
>     These requests are bound by database latency, and also by the
>     number of concurrent requests to the DBMS. Further, since the
>     pooling class in use is not exactly great, once a certain number
>     of requests are in flight, things just fall over and everybody
>     starts getting 503s. Don't overload the DB :)
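To make that constraint concrete, a client hitting these DB-backed
endpoints should serialise its requests and back off when it sees a
503. A minimal Python sketch of that; the pause and backoff values are
my own guesses, not documented limits:

```python
import time
import urllib.error
import urllib.request

BASE = "https://snapshot.debian.org"

def metadata_url(path):
    """Build a URL for one of the DB-backed metadata endpoints."""
    return BASE + path

def backoff_delays(initial=5.0, factor=2.0, retries=4):
    """Seconds to wait before each retry after a 503.
    These values are guesses, not documented limits."""
    return [initial * factor ** i for i in range(retries)]

def polite_get(path, pause=2.0):
    """Fetch one metadata page; pause between requests and back off
    on 503 so the database is not hammered."""
    url = metadata_url(path)
    for delay in backoff_delays() + [None]:
        try:
            with urllib.request.urlopen(url) as resp:
                body = resp.read()
            time.sleep(pause)  # spread successive requests out
            return body
        except urllib.error.HTTPError as exc:
            if exc.code == 503 and delay is not None:
                time.sleep(delay)  # server overloaded: retry slowly
                continue
            raise
```

So e.g. polite_get("/mr/package/postgrey/") fetches one metadata page
and then waits before the caller may issue the next one.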
This is very good to know; in a previous email to this list I
mentioned that this should be noted in the API documentation, and
perhaps a link to your email would be desirable in addition to that.
> (b) The only requests that do not hit the DB are requests to
>       https://snapshot.debian.org/file/<sha1sum of file>
>
>     Those are cheap(ish). They are static files and apache fetches
>     them directly from disk (NFS, but still). I wouldn't worry too
>     much about making a lot of them. Maybe not concurrently, but
>     fetching them fast and sustained shouldn't cause too many
>     issues. If things fail, retry slowly?
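The "sequential and sustained rather than concurrent" advice for the
static file store could look like the sketch below. Since the URL path
is the SHA-1 of the content, each download can also be verified
locally without any extra request:

```python
import hashlib
import urllib.request

def file_url(sha1_hex):
    """Content-addressed file store; served statically, no DB hit."""
    return "https://snapshot.debian.org/file/" + sha1_hex

def verify_sha1(data, expected_hex):
    """The path is the SHA-1 of the content, so a download can be
    checked locally after fetching."""
    return hashlib.sha1(data).hexdigest() == expected_hex

def fetch_files(hashes):
    """Download files one at a time: sequential, not concurrent."""
    for sha1_hex in hashes:
        with urllib.request.urlopen(file_url(sha1_hex)) as resp:
            data = resp.read()
        if verify_sha1(data, sha1_hex):
            yield sha1_hex, data
```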
> Looking at
> https://salsa.debian.org/snapshot-team/snapshot/raw/master/API,
> you'll probably need to make some requests to learn which files to
> download. Please do those one at a time, and spread them out? I
> don't know what a reasonable rate is that still lets you get what
> you need. How many requests do you think it'll need?
>
> I guess there will be requests to /mr/package/<package>/ for "the
> 100 most popular packages", so that's reasonably small. And then
> maybe /mr/package/<package>/<version>/allfiles to learn all about
> the files? So that'd be once per package per version. Any guess how
> many that'd be? 10k? 100k? How many versions does the average
> "popular package" have? A request every few seconds should still
> get you what you need in a reasonable time?
>
> Once you have that info, it should all just be file downloads?
Mostly; there is also /mr/file/<hash>/info to learn the file name,
size and first-seen date. From my tests, IIRC, 3-4 requests are
needed to have full information to actually download a file or to
decide if it is to be downloaded.
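That walk -- versions per package, then allfiles per version, then
per-file info -- can be sketched as below; the JSON field names
("result", "version") are my reading of the API document and should be
double-checked:

```python
import json

BASE = "https://snapshot.debian.org"

def versions_url(package):
    # one request per package
    return f"{BASE}/mr/package/{package}/"

def allfiles_url(package, version):
    # one request per (package, version) pair
    return f"{BASE}/mr/package/{package}/{version}/allfiles"

def fileinfo_url(sha1_hex):
    # one request per file hash: name, size, first-seen date
    return f"{BASE}/mr/file/{sha1_hex}/info"

def parse_versions(body):
    """Extract the version strings from a /mr/package/<pkg>/ response.
    Assumes a {"result": [{"version": ...}, ...]} shape."""
    doc = json.loads(body)
    return [entry["version"] for entry in doc["result"]]
```

As a rough bound: 100 packages with ~100 versions each is about 10k
allfiles requests; at one request every 3 seconds that is roughly 8-9
hours of metadata crawling before the file downloads even start.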
>> Since I work for a large company, I do have access to resources,
>> including funding to cover expenses (for hardware, travel,
>> services etc.), if that helps. I hope that my indirect reference
>> to money doesn't seem inappropriate, I was just thinking there
>> might be a way to access the data that would incur some costs.
> I'm not sure that throwing money at the problem currently would
> help much. It's mainly a manpower issue as few (if any) other
> people ever look after snapshot and I don't really have any time
> for it either.
Not necessarily wrong either, and roughly in line with what I also
mentioned in a previous email:

    Another option, which may not be feasible, would be to make the
    db available for download and give people the ability to process
    that on their own; is a db dump (without the packages) huge?
If this were practicable, even if the dump only happened *somewhat*
often (once a week? once a month? depends on the data), it'd allow
people/organisations to, e.g., locally replicate the API service (DB
included) and only hit snapshot.debian.org for file downloads if
absolutely necessary and not already cached.
That would enable some use-cases and also allow people without access
to snapshot.debian.org to contribute to improving the service, e.g.
by modifying the software contacting the database without having the
whole archive locally.
If this were not practicable and someone needed to run this kind of
analysis, it could be possible to dump the database to hard drives
that are sent around by postal service / personally; sadly this would
require some manual intervention and therefore can't happen too
often, since all volunteers' time is limited.
Both of these options, and probably others I'm not seeing, can only
be done with some resources, so I don't think the mention of resource
availability is out of place.

All of this, and what I mentioned in my previous email, is because
the data in snapshot.debian.org has huge potential, but it also
currently has a very high barrier to entry.
--
Evilham