Bug#1020217: S3-backed snapshot implementation

2024-05-06 Thread Lucas Nussbaum
On 06/05/24 at 14:40 +0200, Simon Josefsson wrote:
> Lucas Nussbaum  writes:
> 
> > - I got the OK to host a S3-backed snapshot mirror using the Debian AWS
> >   account (see thread in #1020217)
> 
> Is this s3 bucket public, or will it be?

It's my plan to make it public, yes

> I have been worried about the state of snapshot and I am mirroring its
> data into local Git LFS.  Since snapshot.debian.org doesn't support
> rsync and doesn't make the postgres database dumps available (so that I
> can identify SHA1 objects and speed up downloads), I am using HTML web
> scraping to find out what files exist on snapshot.d.o.

If you are a DD, you could:

ssh lw08.debian.org psql service=snapshot-guest -c '\\dt'
               List of relations
 Schema |        Name         | Type  |  Owner
--------+---------------------+-------+----------
 public | archive             | table | snapshot
 public | binpkg              | table | snapshot
 public | config              | table | snapshot
 public | directory           | table | snapshot
 public | farm_journal        | table | snapshot
 public | file                | table | snapshot
 public | file_binpkg_mapping | table | snapshot
 public | file_srcpkg_mapping | table | snapshot
 public | indexed_mirrorrun   | table | snapshot
 public | mirrorrun           | table | snapshot
 public | node                | table | snapshot
 public | removal_affects     | table | snapshot
 public | removal_log         | table | snapshot
 public | srcpkg              | table | snapshot
 public | symlink             | table | snapshot
(15 rows)

The 'file' table is the one that lists all known hashes.
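
As a rough sketch of how the hash list could be pulled out in bulk, the snippet below only builds the ssh command line around the same `psql service=snapshot-guest` call shown above. The column name `hash` and the helper name `build_hash_dump_command` are assumptions for illustration; the real column names should be checked first (for example with `\d file`).

```python
import shlex


def build_hash_dump_command(host="lw08.debian.org", service="snapshot-guest",
                            column="hash", table="file"):
    """Return the argv for an ssh call that prints one column of a table.

    Assumption: the `file` table has a column named `hash`.
    """
    sql = f"SELECT {column} FROM {table}"
    # -A: unaligned output, -t: rows only (no header or footer).
    remote = f"psql service={shlex.quote(service)} -At -c {shlex.quote(sql)}"
    return ["ssh", host, remote]
```

The returned list can be passed to `subprocess.run(...)` with stdout redirected to a file; nothing is executed by the helper itself.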

Lucas



Bug#1020217: S3-backed snapshot implementation

2024-05-06 Thread Simon Josefsson
Lucas Nussbaum  writes:

> - I got the OK to host a S3-backed snapshot mirror using the Debian AWS
>   account (see thread in #1020217)

Is this s3 bucket public, or will it be?

I have been worried about the state of snapshot and I am mirroring its
data into local Git LFS.  Since snapshot.debian.org doesn't support
rsync and doesn't make the postgres database dumps available (so that I
can identify SHA1 objects and speed up downloads), I am using HTML web
scraping to find out what files exist on snapshot.d.o.

My goal has been to put all the Git LFS objects in a publicly-accessible
S3 bucket too.  While imports were running I didn't work on the bucket
side, and I suspect my download will take months to complete at current
speeds.  I publish Git LFS versions of archive.debian.org,
ftp.debian.org and ftp.ports.debian.org already, though, so perhaps I
could start on the bucket publishing part for them and see about adding
an incremental snapshot.d.o copy while it is still working.

/Simon




Bug#1020217: S3-backed snapshot implementation

2024-05-05 Thread Lucas Nussbaum
Hi,

I made some minor progress on this, and I thought I'd report back (I'll
try to attend the meeting tomorrow, but I'm not sure I'll manage).


# What I did:

- I got the OK to host an S3-backed snapshot mirror using the Debian AWS
  account (see thread in #1020217)
- I got access to the account, and set up a VM with a Debian mirror.
- I could run the file-backed snapshot importer on it
- I modified the snapshot importer code to make it import to S3
  (basically it means creating an S3Backend class that inherits from
  StockageBackend), and tested it by importing the debian-security
  archive.
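
To illustrate the shape of such a backend, here is a minimal Python sketch. It is not the importer's actual code: the base class name `StorageBackend`, the `store` method, and the `ab/cd/<sha1>` key layout are all assumptions. The `client` argument is expected to behave like a boto3 S3 client, of which only `upload_file(Filename, Bucket, Key)` is used.

```python
import hashlib
from abc import ABC, abstractmethod


class StorageBackend(ABC):
    """Hypothetical base class: stores a file under its SHA-1 digest."""

    @abstractmethod
    def store(self, local_path):
        """Store the file and return its hex digest."""


def sha1_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not need to fit in memory."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


class S3Backend(StorageBackend):
    def __init__(self, client, bucket):
        # client: any object with boto3's upload_file(Filename, Bucket, Key)
        self.client = client
        self.bucket = bucket

    @staticmethod
    def key_for(digest):
        # Assumed layout: two levels of prefix directories, then the digest.
        return f"{digest[:2]}/{digest[2:4]}/{digest}"

    def store(self, local_path):
        digest = sha1_of(local_path)
        self.client.upload_file(local_path, self.bucket, self.key_for(digest))
        return digest
```

With boto3 installed, `S3Backend(boto3.client("s3"), "some-bucket")` would be one way to wire it up; passing the client in also makes it easy to substitute a fake for testing.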


# What I plan to work on:

- Set up a real development environment. I plan to use Vagrant, which is
  not a perfect solution for many reasons, but anyway the provisioning
  scripts will likely be re-usable with something else.
- Change the web frontend to allow using S3.
- Improve (parallelize) the importer code, specifically the sha1-hashing
  (to process multiple files in parallel, one per core) and the file
  copying/uploading-to-S3 (this is especially important for S3 because,
  to achieve good throughput, you need many transfers in parallel).
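
The parallelization idea could look roughly like the Python sketch below. It is only an illustration under stated assumptions: the function names are hypothetical, `upload` is any callable taking `(path, digest)`, and the default of 32 concurrent uploads is an arbitrary starting point. Threads are used for hashing because `hashlib` releases the GIL while hashing large buffers.

```python
import hashlib
import os
from concurrent.futures import ThreadPoolExecutor


def sha1_file(path, chunk_size=1 << 20):
    """Return the SHA-1 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def hash_all(paths, workers=None):
    """Hash files concurrently, one worker per core by default."""
    paths = list(paths)
    workers = workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(paths, pool.map(sha1_file, paths)))


def upload_all(digests, upload, workers=32):
    """Call upload(path, digest) for many files at once.

    Returns a dict mapping each failed path to the exception it raised.
    """
    failures = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(upload, p, d): p for p, d in digests.items()}
        for fut, path in futures.items():
            exc = fut.exception()  # blocks until this upload has finished
            if exc is not None:
                failures[path] = exc
    return failures
```

Uploads are network-bound, so the upload pool can be much larger than the number of cores; the hashing pool is sized to the CPU count.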


# Open questions

## What to do with this?

Assuming all this works and we can have an S3-backed snapshot
service, there's the question of what to do with it.
We have several options I think:

### A. s3-snapshot as a mirror of snapshot.debian.org

The imports would continue to be done on snapshot.debian.org, but
everything would be mirrored on a regular basis to S3.
That would allow faster access to the data, but would not help with the
performance of imports.

### B. dual-stack snapshot.debian.org

The importer on snapshot.debian.org would import both to local storage
and to S3. The web app could proxy requests to both.
That would allow more resilience, but does not help with the performance
of imports (on the contrary).

### C. s3-snapshot as a fork of snapshot.debian.org

After an initial import of snapshot.debian.org data, s3-snapshot would
live its own independent life.
The main downside is that the two databases will drift out of sync
(not the same mirror runs; they might each miss some packages, but not
the same ones).

### D. do both at the same time

Do C, but also make sure that every file that ever gets stored in
snapshot.debian.org gets imported in the bucket used for s3-snapshot, to
be able to expose a full read-only mirror of the snapshot.debian.org DB.

### E. Nice experiment, but let's forget about it

(That should be mentioned as well)



In any case, it probably makes sense to keep at least two different
instances of the snapshot service (and data), preferably on different
implementations, to make sure that we don't lose everything in case of
a catastrophic incident.

I plan to aim for C as a first step.


## How to do an initial import of snapshot.debian.org data?

That's more of a technical question. The PostgreSQL DB should not be a
problem, as it's quite small (~ 20 GB). For the data itself, it could
probably be uploaded directly from local storage on the snapshot.d.o
hosts to an S3 bucket. I could upload from an EC2 VM to S3 at about 10
Gbps (limited by the bandwidth of local storage). I don't know about the
performance (storage, network) of snapshot.debian.org, but that probably
means that an import into S3 is doable in a couple of weeks in the worst
case.  In any case, that's a question to keep in mind, but that does not
need to be resolved now.
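
For a back-of-the-envelope check of that estimate, the small Python helper below converts a data size and a sustained rate into days. The 200 TB figure is the upper bound mentioned earlier in this bug and 10 Gbps is the rate measured above; the 1 Gbps case is purely an assumed slower rate for comparison, and `transfer_days` is a hypothetical name.

```python
def transfer_days(size_tb, rate_gbps):
    """Days needed to move size_tb terabytes at a sustained rate_gbps."""
    bits = size_tb * 1e12 * 8             # decimal terabytes to bits
    seconds = bits / (rate_gbps * 1e9)    # gigabits per second to bits per second
    return seconds / 86400


print(round(transfer_days(200, 10), 2))   # 1.85 days at 10 Gbps
print(round(transfer_days(200, 1), 1))    # 18.5 days at an assumed 1 Gbps
```

So even if the snapshot.d.o hosts can only sustain about a tenth of the rate seen on EC2, the copy stays in the range of a few weeks.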

Lucas



Bug#1020217: S3-backed snapshot implementation on AWS?

2023-10-04 Thread Noah Meyerhans
On Sun, Sep 24, 2023 at 04:09:31PM -0700, Noah Meyerhans wrote:
> > > Could we use the Debian AWS account to host that service?
> > 
> > I would assume that a service like snapshot would be within the scope
> > for our AWS usage.  Noah?
> 
> It makes sense and I will look into it.  Let's not start anything until
> we hear definitive confirmation.

OK, let's do it.

noah



Bug#1020217: S3-backed snapshot implementation on AWS?

2023-09-25 Thread Lucas Nussbaum
Hi,

On 24/09/23 at 16:09 -0700, Noah Meyerhans wrote:
> On Fri, Sep 22, 2023 at 05:12:21PM +0200, Bastian Blank wrote:
> > > Could we use the Debian AWS account to host that service?
> > 
> > I would assume that a service like snapshot would be within the scope
> > for our AWS usage.  Noah?
> 
> It makes sense and I will look into it.  Let's not start anything until
> we hear definitive confirmation.  Do we have a sense of how much
> outgoing traffic the current snapshot service generates?

From #debian-admin:

 lucas:
https://munin.debian.org/debian.org/sallinen.debian.org/ip_193_62_202_27.html
and
https://munin.debian.org/debian.org/sallinen.debian.org/ip_2001_630_206_4000_1a1a_0_c13e_ca1b.html
I think, so average of 35Mbit/sec over the last week.
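
To put that figure in terms of volume, here is a quick Python conversion from an average rate to data transferred over a period. Only the 35 Mbit/s average comes from the graphs above; the 30-day period and the helper name `egress_tb` are assumptions for illustration.

```python
def egress_tb(avg_mbit_per_s, days=30):
    """Terabytes (decimal) transferred at a constant average rate."""
    bytes_per_s = avg_mbit_per_s * 1e6 / 8
    return bytes_per_s * 86400 * days / 1e12


print(round(egress_tb(35), 2))  # 11.34 TB over 30 days
```

That is roughly 11 TB of outgoing traffic per month at the current average.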

> > However we need to talk about that "one […] VM", because this sounds
> > like you intend to use AWS as VM hosting, which it is not.
> > 
> > Please think about this in form of services and there should be at least
> > two:
> > - the ingestor, which can only exist once and writes, and
> > - the web frontend, which should be able to exist several times and only
> >   reads.
> > 
> > So you want to plan with running the multiple web frontends with load
> > balancers and maybe even cloudfront.
> 
> I agree that it would be best to design something more cloud-oriented.
> However, if there's an existing infrastructure that can be moved as a
> "lift & shift" into AWS now, with architectural refactoring happening
> later, that's an OK place to start.

Yes, that would be the plan I think: start with moving to AWS and
replacing the filesystem-backed storage backend with an S3-backed one.
Then look at other aspects.

Lucas



Bug#1020217: S3-backed snapshot implementation on AWS?

2023-09-24 Thread Noah Meyerhans
On Fri, Sep 22, 2023 at 05:12:21PM +0200, Bastian Blank wrote:
> > Could we use the Debian AWS account to host that service?
> 
> I would assume that a service like snapshot would be within the scope
> for our AWS usage.  Noah?

It makes sense and I will look into it.  Let's not start anything until
we hear definitive confirmation.  Do we have a sense of how much
outgoing traffic the current snapshot service generates?

> >   It would
> > require one fairly powerful VM, and a large S3 bucket (approximately
> > 150-200 TB).
> 
> 200 TB should be no problem.

Agreed.

> However we need to talk about that "one […] VM", because this sounds
> like you intend to use AWS as VM hosting, which it is not.
> 
> Please think about this in form of services and there should be at least
> two:
> - the ingestor, which can only exist once and writes, and
> - the web frontend, which should be able to exist several times and only
>   reads.
> 
> So you want to plan with running the multiple web frontends with load
> balancers and maybe even cloudfront.

I agree that it would be best to design something more cloud-oriented.
However, if there's an existing infrastructure that can be moved as a
"lift & shift" into AWS now, with architectural refactoring happening
later, that's an OK place to start.

noah



Bug#1020217: S3-backed snapshot implementation on AWS?

2023-09-22 Thread Bastian Blank
Hi Lucas

On Fri, Sep 22, 2023 at 08:42:10AM +0200, Lucas Nussbaum wrote:
> Could we use the Debian AWS account to host that service?

I would assume that a service like snapshot would be within the scope
for our AWS usage.  Noah?

>   It would
> require one fairly powerful VM, and a large S3 bucket (approximately
> 150-200 TB).

200 TB should be no problem.

However we need to talk about that "one […] VM", because this sounds
like you intend to use AWS as VM hosting, which it is not.

Please think about this in form of services and there should be at least
two:
- the ingestor, which can only exist once and writes, and
- the web frontend, which should be able to exist several times and only
  reads.

So you want to plan with running the multiple web frontends with load
balancers and maybe even cloudfront.
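
As a sketch of why the read side scales out easily: a stateless frontend only needs to turn a requested hash into an object location and redirect to it, so any number of copies can run behind a load balancer. The Python below is an illustration only; the bucket name, region, key layout, and function name are all made up, and only the virtual-hosted-style S3 URL form is standard.

```python
import re

SHA1_RE = re.compile(r"[0-9a-f]{40}")


def object_url(digest, bucket="example-snapshot-bucket", region="us-east-1"):
    """Map a SHA-1 hex digest to the URL a frontend could redirect to.

    Assumptions: bucket name, region and the ab/cd/<sha1> key layout.
    """
    if not SHA1_RE.fullmatch(digest):
        # Reject anything that is not a plain digest (e.g. path tricks).
        raise ValueError("not a SHA-1 hex digest")
    key = f"{digest[:2]}/{digest[2:4]}/{digest}"
    return f"https://{bucket}.s3.{region}.amazonaws.com/{key}"
```

The function holds no state and never writes, which is what lets the frontend exist several times while the ingestor remains the single writer.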

Regards,
Bastian

-- 
I object to intellect without discipline;  I object to power without
constructive purpose.
-- Spock, "The Squire of Gothos", stardate 2124.5



Bug#1020217: S3-backed snapshot implementation on AWS?

2023-09-22 Thread Lucas Nussbaum
Hi Bastian,

I'm playing with the idea of an S3-backed snapshot.d.o implementation
(see #1020217).

Could we use the Debian AWS account to host that service? It would
require one fairly powerful VM, and a large S3 bucket (approximately
150-200 TB).

Best,

Lucas