Bug#1020217: S3-backed snapshot implementation
On 06/05/24 at 14:40 +0200, Simon Josefsson wrote:
> Lucas Nussbaum writes:
>
> > - I got the OK to host an S3-backed snapshot mirror using the Debian AWS
> >   account (see thread in #1020217)
>
> Is this s3 bucket public, or will it be?

It's my plan to make it public, yes.

> I have been worried about the state of snapshot and I am mirroring its
> data into local Git LFS. Since snapshot.debian.org doesn't support
> rsync and doesn't make the postgres database dumps available (so that I
> can identify SHA1 objects and speed up downloads), I am using HTML web
> scraping to find out what files exist on snapshot.d.o.

If you are a DD, you could:

ssh lw08.debian.org psql service=snapshot-guest -c '\\dt'
                List of relations
 Schema |        Name         | Type  |  Owner
--------+---------------------+-------+----------
 public | archive             | table | snapshot
 public | binpkg              | table | snapshot
 public | config              | table | snapshot
 public | directory           | table | snapshot
 public | farm_journal        | table | snapshot
 public | file                | table | snapshot
 public | file_binpkg_mapping | table | snapshot
 public | file_srcpkg_mapping | table | snapshot
 public | indexed_mirrorrun   | table | snapshot
 public | mirrorrun           | table | snapshot
 public | node                | table | snapshot
 public | removal_affects     | table | snapshot
 public | removal_log         | table | snapshot
 public | srcpkg              | table | snapshot
 public | symlink             | table | snapshot
(15 rows)

The 'file' table is the one that lists all known hashes.

Lucas
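A list of known hashes is what makes it possible to skip objects a mirror already holds. The following is only a minimal sketch of that idea, not code from snapshot or from Simon's tooling: it assumes the hashes are hex SHA-1 strings that have already been exported somehow (the columns of the 'file' table are not shown in this thread, so the export step is left out), and `sha1_of_file` and `missing_hashes` are hypothetical helper names.

```python
import hashlib
from pathlib import Path


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-1 digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def missing_hashes(known_hashes, local_dir):
    """Return the known hashes that match no file under local_dir.

    known_hashes: iterable of hex SHA-1 strings (assumed export format).
    local_dir: directory holding the files mirrored so far.
    """
    local = {sha1_of_file(p) for p in Path(local_dir).rglob("*") if p.is_file()}
    return sorted(set(known_hashes) - local)
```

Only the hashes returned by `missing_hashes` would then need to be fetched, instead of discovering files by scraping.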
Bug#1020217: S3-backed snapshot implementation
Lucas Nussbaum writes:
> - I got the OK to host an S3-backed snapshot mirror using the Debian AWS
>   account (see thread in #1020217)

Is this s3 bucket public, or will it be?

I have been worried about the state of snapshot and I am mirroring its
data into local Git LFS. Since snapshot.debian.org doesn't support
rsync and doesn't make the postgres database dumps available (so that I
can identify SHA1 objects and speed up downloads), I am using HTML web
scraping to find out what files exist on snapshot.d.o.

My goal has been to put all the Git LFS objects in a publicly-accessible
S3 bucket too. While imports were running I didn't work on the bucket
side, and I suspect my download will take months to complete at current
speeds. I publish Git LFS versions of archive.debian.org, ftp.debian.org
and ftp.ports.debian.org already, though, so perhaps I could start on
the bucket publishing part for them and see about adding an incremental
snapshot.d.o copy while it is still working.

/Simon
Bug#1020217: S3-backed snapshot implementation
Hi,

I made some minor progress on this, and I thought I'd report back (I'll
try to attend the meeting tomorrow, but I'm not sure I'll manage).

# What I did:

- I got the OK to host an S3-backed snapshot mirror using the Debian AWS
  account (see thread in #1020217)
- I got access to the account, and set up a VM with a Debian mirror.
- I could run the file-backed snapshot importer on it
- I modified the snapshot importer code to make it import to S3
  (basically it means creating an S3Backend class that inherits from
  StockageBackend), and tested it by importing the debian-security
  archive.

# What I plan to work on:

- Set up a real development environment. I plan to use Vagrant, which is
  not a perfect solution for many reasons, but anyway the provisioning
  scripts will likely be re-usable with something else.
- Change the web frontend to allow using S3.
- Improve (parallelize) the importer code, specifically the sha1-hashing
  (to process multiple files in parallel, one per core) and the file
  copying/uploading-to-S3 (this is especially important for S3 because,
  to achieve good throughput, you need many transfers in parallel).

# Open questions

## What to do with this?

Assuming all this works and we can have an S3-backed snapshot service,
there's the question of what to do with it. We have several options I
think:

### A. s3-snapshot as a mirror of snapshot.debian.org

The imports would continue to be done on snapshot.debian.org, but
everything would be mirrored on a regular basis to S3. That would allow
faster access to the data, but would not help with the performance of
imports.

### B. dual-stack snapshot.debian.org

The importer on snapshot.debian.org would import both to local storage
and to s3. The web app could proxy requests to both. That would allow
more resilience, but does not help with the performance of imports (on
the contrary).

### C. s3-snapshot as a fork of snapshot.debian.org

After an initial import of snapshot.debian.org data, s3-snapshot would
live its own independent life. The main downside is that both databases
will become out of sync (not the same mirror runs; they might each miss
some packages, but not the same ones).

### D. do both at the same time

Do C, but also make sure that every file that ever gets stored in
snapshot.debian.org gets imported into the bucket used for s3-snapshot,
to be able to expose a full read-only mirror of the snapshot.debian.org
DB.

### E. Nice experiment, but let's forget about it

(That should be mentioned as well)

In any case, it probably makes sense to keep at least two different
instances of the snapshot service (and data), preferably on different
implementations, to make sure that we don't lose everything in case of a
catastrophic incident.

I plan to aim for C as a first step.

## How to do an initial import of snapshot.debian.org data?

That's more a technical question. The PostgreSQL DB should not be a
problem, as it's quite small (~ 20 GB). For the data itself, it could
probably be uploaded directly from local storage on the snapshot.d.o
hosts to an S3 bucket. I could upload from an EC2 VM to S3 at about 10
Gbps (limited by the bandwidth of local storage). I don't know about the
performance (storage, network) of snapshot.debian.org, but that probably
means that an import into S3 is doable in a couple of weeks in the worst
case.

In any case, that's a question to keep in mind, but that does not need
to be resolved now.

Lucas
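The planned parallel sha1-hashing could look roughly like the sketch below. It is not the importer's actual code: `sha1_of_file` and `hash_files` are hypothetical names, and a thread pool is an assumption made here for simplicity (hashlib releases the GIL while hashing large buffers, so threads can still keep several cores busy). The same pool pattern, with a larger worker count, is one way to run many S3 transfers at once.

```python
import hashlib
import os
from concurrent.futures import ThreadPoolExecutor


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-1 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_files(paths, max_workers=None):
    """Hash many files concurrently and return {path: hex digest}.

    max_workers defaults to the number of cores (one file per core).
    """
    paths = list(paths)
    workers = max_workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so zip() pairs paths correctly.
        digests = list(pool.map(sha1_of_file, paths))
    return dict(zip(paths, digests))
```

Calling `hash_files(list_of_paths)` returns a dictionary that an importer could then use to decide which objects still need to be stored.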
Bug#1020217: S3-backed snapshot implementation on AWS?
On Sun, Sep 24, 2023 at 04:09:31PM -0700, Noah Meyerhans wrote:
> > > Could we use the Debian AWS account to host that service?
> >
> > I would assume that a service like snapshot would be within the scope
> > for our AWS usage. Noah?
>
> It makes sense and I will look into it. Let's not start anything until
> we hear definitive confirmation.

OK, let's do it.

noah
Bug#1020217: S3-backed snapshot implementation on AWS?
Hi,

On 24/09/23 at 16:09 -0700, Noah Meyerhans wrote:
> On Fri, Sep 22, 2023 at 05:12:21PM +0200, Bastian Blank wrote:
> > > Could we use the Debian AWS account to host that service?
> >
> > I would assume that a service like snapshot would be within the scope
> > for our AWS usage. Noah?
>
> It makes sense and I will look into it. Let's not start anything until
> we hear definitive confirmation. Do we have a sense of how much
> outgoing traffic the current snapshot service generates?

From #debian-admin:
lucas: https://munin.debian.org/debian.org/sallinen.debian.org/ip_193_62_202_27.html and https://munin.debian.org/debian.org/sallinen.debian.org/ip_2001_630_206_4000_1a1a_0_c13e_ca1b.html I think, so average of 35Mbit/sec over the last week.

> > However we need to talk about that "one […] VM", because this sounds
> > like you intend to use AWS as VM hosting, which it is not.
> >
> > Please think about this in form of services and there should be at least
> > two:
> > - the ingestor, which can only exist once and writes, and
> > - the web frontend, which should be able to exist several times and only
> >   reads.
> >
> > So you want to plan with running multiple web frontends with load
> > balancers and maybe even cloudfront.
>
> I agree that it would be best to design something more cloud-oriented.
> However, if there's an existing infrastructure that can be moved as a
> "lift & shift" into AWS now, with architectural refactoring happening
> later, that's an OK place to start.

Yes, that would be the plan I think: start with moving to AWS and
replacing the filesystem-backed storage backend with an S3-backed one.
Then look at other aspects.

Lucas
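To put the 35 Mbit/sec average into volume terms, here is a small back-of-the-envelope conversion. It is only a sketch: it assumes the average is sustained over the whole period and is mostly outgoing traffic, and `average_rate_to_volume_tb` is a hypothetical helper name.

```python
def average_rate_to_volume_tb(mbit_per_s, days):
    """Convert a sustained average rate in Mbit/s into TB (10**12 bytes)."""
    bytes_per_second = mbit_per_s * 1_000_000 / 8
    return bytes_per_second * days * 86_400 / 1e12


weekly = average_rate_to_volume_tb(35, 7)
monthly = average_rate_to_volume_tb(35, 30)
print(f"{weekly:.2f} TB/week, {monthly:.2f} TB/30 days")
# prints "2.65 TB/week, 11.34 TB/30 days"
```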
Bug#1020217: S3-backed snapshot implementation on AWS?
On Fri, Sep 22, 2023 at 05:12:21PM +0200, Bastian Blank wrote:
> > Could we use the Debian AWS account to host that service?
>
> I would assume that a service like snapshot would be within the scope
> for our AWS usage. Noah?

It makes sense and I will look into it. Let's not start anything until
we hear definitive confirmation. Do we have a sense of how much
outgoing traffic the current snapshot service generates?

> > It would
> > require one fairly powerful VM, and a large S3 bucket (approximately
> > 150-200 TB).
>
> 200 TB should be no problem.

Agreed.

> However we need to talk about that "one […] VM", because this sounds
> like you intend to use AWS as VM hosting, which it is not.
>
> Please think about this in form of services and there should be at least
> two:
> - the ingestor, which can only exist once and writes, and
> - the web frontend, which should be able to exist several times and only
>   reads.
>
> So you want to plan with running multiple web frontends with load
> balancers and maybe even cloudfront.

I agree that it would be best to design something more cloud-oriented.
However, if there's an existing infrastructure that can be moved as a
"lift & shift" into AWS now, with architectural refactoring happening
later, that's an OK place to start.

noah
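For a sense of scale on the 200 TB figure, the sketch below estimates how long moving that much data would take at a given sustained rate. The 10 Gbps case uses the EC2-to-S3 upload rate reported elsewhere in this thread; the 1 Gbps case is purely an assumed slower rate for comparison, and `transfer_days` is a hypothetical helper name.

```python
def transfer_days(size_tb, rate_gbps):
    """Days needed to move size_tb (10**12 bytes) at a sustained rate_gbps."""
    bits = size_tb * 1e12 * 8
    seconds = bits / (rate_gbps * 1e9)
    return seconds / 86_400


for rate in (10, 1):
    print(f"{rate} Gbps: {transfer_days(200, rate):.1f} days")
# prints "10 Gbps: 1.9 days" then "1 Gbps: 18.5 days"
```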
Bug#1020217: S3-backed snapshot implementation on AWS?
Hi Lucas

On Fri, Sep 22, 2023 at 08:42:10AM +0200, Lucas Nussbaum wrote:
> Could we use the Debian AWS account to host that service?

I would assume that a service like snapshot would be within the scope
for our AWS usage. Noah?

> It would
> require one fairly powerful VM, and a large S3 bucket (approximately
> 150-200 TB).

200 TB should be no problem.

However we need to talk about that "one […] VM", because this sounds
like you intend to use AWS as VM hosting, which it is not.

Please think about this in form of services and there should be at least
two:
- the ingestor, which can only exist once and writes, and
- the web frontend, which should be able to exist several times and only
  reads.

So you want to plan with running multiple web frontends with load
balancers and maybe even cloudfront.

Regards,
Bastian

-- 
I object to intellect without discipline; I object to power without
constructive purpose.
		-- Spock, "The Squire of Gothos", stardate 2124.5
Bug#1020217: S3-backed snapshot implementation on AWS?
Hi Bastian,

I'm playing with the idea of an S3-backed snapshot.d.o implementation
(see #1020217).

Could we use the Debian AWS account to host that service? It would
require one fairly powerful VM, and a large S3 bucket (approximately
150-200 TB).

Best,

Lucas