For the context:

Z-library’s take down left users in shock, anger
 <https://maktoobmedia.com/author/vaishnavi-rastogi/>
https://maktoobmedia.com/2022/11/07/z-librarys-take-down-left-users-in-shock-anger/

—

Putting 5,998,794 books on IPFS
http://annas-blog.org/putting-5,998,794-books-on-ipfs.html

Z-Library has been taken down, and its founders arrested. For the uninitiated, 
a quick recap: Z-Library was a massive “shadow library” 
<https://en.wikipedia.org/wiki/Shadow_library> of books, similar to Sci-Hub or 
Library Genesis. They had taken the concept of a shadow library to the next 
level, with a great user interface, bulk uploading and deduplication systems, 
and all sorts of other features. They were thriving on donations, and were 
therefore able to hire a professional team to keep improving the site.
Until it all came crashing down two weeks ago. Their domains were seized by the 
FBI, and the (alleged) founders were arrested in Argentina. The site continues 
to run on Tor (presumably maintained by their employees), but no one knows how 
sustainable that is. It was a sad day for the free flow of information,
knowledge, and culture. Антон Напольский and Валерия Ермакова — we stand with 
you. Much love to you and your families, and thank you for what you have done 
for the world.

Just a few months ago, we released our second backup 
<http://annas-blog.org/blog-3x-new-books.html> of Z-Library — for about 31TB in 
total. This turned out to be timely. We had also already started working on a
search aggregator for shadow libraries: “Anna’s Archive” (not linking here, but 
you can Google it). With Z-Library down, we scrambled to get this running as 
soon as possible, and we did a soft-launch shortly thereafter. Now we’re trying 
to figure out what is next. This seems like the right time to step up and help shape
the next chapter of shadow libraries.

One such thing is to put the books up on IPFS 
<https://en.wikipedia.org/wiki/InterPlanetary_File_System>. Some of the Library 
Genesis mirrors already did this <https://freeread.org/ipfs/> a few years
ago for their books, and it makes access to their collection more resilient.
After all, they don’t have to host any files themselves over HTTP anymore, but 
can instead link to one of the many IPFS Gateways, which will happily proxy the 
books from one of the many volunteer-run machines (this is the big advantage 
IPFS has over BitTorrent <https://en.wikipedia.org/wiki/BitTorrent>). These 
machines can be hidden behind VPNs, or run on seedboxes paid for using crypto, 
similar to torrents. You can even get other people’s machines to host the data, 
by paying for that service using Filecoin.
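
To make that concrete: anyone can fetch a file over plain HTTPS from a public
gateway, given only its CID, without running any IPFS software. A rough sketch
(the gateway choice is arbitrary, the CID is the one for Z-Library ID 1 from
the CSV further down, and whether it resolves depends on someone actually
providing the data):

import urllib.request

# A public gateway resolves the CID over the IPFS network and proxies the
# bytes back over plain HTTPS, so the client needs no IPFS software at all.
gateway = "https://ipfs.io"  # arbitrary choice; any public gateway works
cid = "bafk2bzacedrabzierer44yu5bm7faovf5s4z2vpa3ry2cx6bjrhbjenpxifio"

with urllib.request.urlopen(f"{gateway}/ipfs/{cid}") as response:
    data = response.read()

with open("book-1", "wb") as f:  # output filename is just an example
    f.write(data)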

However, putting dozens of terabytes of data on IPFS is no joke. We haven’t 
fully succeeded in this project yet, so today we’ll share where we’ve gotten so 
far. If you have experience pushing the limits of IPFS (or other systems, for 
that matter), and want to help our cause, please reach out on Reddit or Twitter.

File organization

When we released our first backup 
<http://annas-blog.org/blog-introducing.html>, we used torrents that contained 
tons of individual files. This turns out not to be great for two reasons:
1. Torrent clients struggle with this many files (especially when trying to
display them in a UI).
2. Magnetic hard drives and filesystems struggle as well: you can get a lot of
fragmentation and seeking back and forth.

For our second release, we learned from this, and packaged the files in large 
“.tar” files. This solves these problems, but creates a new one: how do we now 
serve individual files on IPFS? We could simply extract the tar files, but then 
if you want to both seed the torrents, and seed the IPFS files, you need twice 
as much space: 62TB instead of 31TB (which was already pushing it).

Luckily, there is a good solution for this: mounting the tar files using 
ratarmount <https://github.com/mxmlnkn/ratarmount>. This creates a virtual 
filesystem using FUSE. Typically we run it like this:

sudo ratarmount --fuse "allow_other" zlib2-data/*.tar zlib2/
In order to figure out which file is located where, ratarmount creates index 
files which it places next to the tar files. It takes some time to do this when 
you run it for the first time, so at some point we will share these index files 
on our torrent page, for your convenience.
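
Once mounted, the books inside the tars behave like ordinary read-only files,
so IPFS (or anything else) can read them in place. A quick sanity check, as a
sketch (the paths are the ones from the mount command above):

import os

mount_point = "zlib2"  # the ratarmount mount point from the command above
entries = sorted(os.listdir(mount_point))
print(len(entries), "top-level entries, e.g.", entries[:3])

# Reads are served by seeking into the right tar member via ratarmount's
# index, without extracting anything to disk.
sample = os.path.join(mount_point, entries[0])
if os.path.isfile(sample):
    print(sample, "is", os.path.getsize(sample), "bytes")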

Root CIDs

The second problem we ran into was performance issues with IPFS. The most
noticeable of these is the “advertising” or “providing” phase, where your IPFS
node tells the rest of the IPFS network what data you have. A single file
typically gets split up into 256KiB chunks, each of which gets an identifier,
called a “Content Identifier”, or “CID”. The file itself also gets a CID, which 
refers to a list of the child CIDs. All in all, a single file can easily have 
several, if not hundreds of these CIDs — and we have millions of files. All of 
these CIDs have to be advertised on the network!
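
To get a feel for the scale, here is a back-of-the-envelope estimate (the
average file size is an assumption derived from the totals above, not a
measured number):

# Rough numbers for the default 256KiB chunker.
collection_bytes = 31e12     # ~31TB across the second backup
num_files = 5_998_794        # books in the collection
chunk_size = 256 * 1024      # default IPFS chunk size

avg_file_bytes = collection_bytes / num_files    # roughly 5MB per book
chunks_per_file = avg_file_bytes / chunk_size    # ~20 chunk CIDs
total_cids = num_files * (chunks_per_file + 1)   # +1 root CID per file

print(round(chunks_per_file), "chunk CIDs plus a root CID for an average file")
print(round(total_cids / 1e6), "million CIDs to advertise in total")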

We first thought that we could solve this by using a particular feature of the 
“providing” algorithm: only advertising the root CIDs of directories. The idea 
was that we could take the different directories that our files were already 
organized in, and advertise just the CID of that directory, and then address 
them using:

/ipfs/<directory CID>/<filename>
Initially this seemed to work, but we ran into issues requesting more than one 
or a few files at once. It took us several days to debug this, but we
eventually seem to have found the root cause, and filed a bug report
<https://github.com/ipfs/kubo/issues/9416>. Sadly, this looks like a deep, 
fundamental issue, which we cannot easily work around. So we’ll have to deal 
with lots of CIDs, at least for now.

Sharding

One mitigation is to use a larger chunk size. Instead of 256KiB, we can use 
1MiB (the current maximum), by using --chunker=size-1048576 on add. Another
thing that helps is using the AcceleratedDHTClient, which batches multiple
advertising calls to the same node. Still, various operations can take a long 
time, from “providing”, to just getting some stats on the repo.
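
Even so, the numbers stay large. Redoing the earlier estimate with 1MiB chunks
(same assumed average file size) still leaves tens of millions of CIDs:

# Same back-of-the-envelope numbers as before, but with 1MiB chunks.
avg_file_bytes = 31e12 / 5_998_794                             # rough average book size
total_cids = 5_998_794 * (avg_file_bytes / (1024 * 1024) + 1)  # chunk CIDs + root CIDs
print(round(total_cids / 1e6), "million CIDs, even with 1MiB chunks")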

This is why we’ve been playing with sharding the data across multiple IPFS 
nodes, even on the same machine. We started with 32 nodes, but the per-node
overhead seemed to get quite big, especially in terms of memory usage.
But providing became quite fast: about 5 minutes per node, where each node had 
about 1 million CIDs to advertise. We are now playing with different numbers, 
to see what is optimal. Unfortunately IPFS doesn’t let you easily merge or 
split nodes, so this is quite time-consuming.

This is what our docker-compose.yml looks like, for example, with a single node 
(other nodes omitted for brevity):

x-ipfs: &default-ipfs
  image: ipfs/kubo:v0.16.0
  restart: unless-stopped
  environment:
    - IPFS_PATH=/data/ipfs
    - IPFS_PROFILE=server
  command: daemon --migrate=true --agent-version-suffix=docker --routing=dhtclient

services:
  ipfs-zlib2-0:
    <<: *default-ipfs
    ports:
      - "4011:4011/tcp"
      - "4011:4011/udp"
    volumes:
      - "./container-init.d/:/container-init.d"
      - "./ipfs-dirs/ipfs-zlib2-0:/data/ipfs"
      - "./zlib2/pilimi-zlib2-0-14679999-extra/:/data/files/pilimi-zlib2-0-14679999-extra/"
      - "./zlib2/pilimi-zlib2-14680000-14999999/:/data/files/pilimi-zlib2-14680000-14999999/"
      - "./zlib2/pilimi-zlib2-15000000-15679999/:/data/files/pilimi-zlib2-15000000-15679999/"
      - "./zlib2/pilimi-zlib2-15680000-16179999/:/data/files/pilimi-zlib2-15680000-16179999/"
      # etc.
In the container-init.d/ folder referenced there, we have a single shell
script with the following content:

#!/bin/sh
ipfs config --json Experimental.FilestoreEnabled true
ipfs config --json Experimental.AcceleratedDHTClient true
We also manually changed the config for each node to use a unique IP address.
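
One way to do that kind of per-node tweak, assuming the standard kubo config
layout (the config is plain JSON inside each node's IPFS_PATH), is to edit the
addresses directly while the node is stopped. This is a sketch only; the path
and the 4011 port come from the compose file above, and the announce IP is a
placeholder:

import json

# Sketch: give one node its own swarm port / announce address by editing the
# kubo config JSON directly (with the node stopped).
config_path = "ipfs-dirs/ipfs-zlib2-0/config"

with open(config_path) as f:
    config = json.load(f)

config["Addresses"]["Swarm"] = [
    "/ip4/0.0.0.0/tcp/4011",
    "/ip4/0.0.0.0/udp/4011/quic",
]
config["Addresses"]["Announce"] = ["/ip4/203.0.113.10/tcp/4011"]  # placeholder IP

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)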

Processing CIDs

Once you have a bunch of nodes running, you can add data to them. In the
example configuration above, we would run:

docker-compose exec ipfs-zlib2-0 ipfs add --progress=false --nocopy --recursive --hash=blake2b-256 --chunker=size-1048576 /data/files > ipfs-zlib2-0.log
This logs the filenames and CIDs to ipfs-zlib2-0.log. Now we can scoop up all 
the different log files into a CSV, using a little Python script:

import glob

# Each line of `ipfs add` output looks like:
#   added <CID> files/<collection directory>/<zlibrary id>
# Keep only those lines, and write "<zlibrary id>,<CID>" rows to the CSV.
def process_line(line, csv):
  components = line.split()
  if len(components) == 3 and components[0] == "added":
    file_components = components[2].split("/")
    if len(file_components) == 3 and file_components[0] == "files":
      csv.write(file_components[2] + "," + components[1] + "\n")

with open("ipfs.csv", "w") as csv:
  for file in glob.glob("*.log"):
    print("Processing", file)
    with open(file) as f:
      for line in f:
        process_line(line, csv)
Because the filenames are simply the Z-Library IDs, the CSV looks something 
like this:

1,bafk2bzacedrabzierer44yu5bm7faovf5s4z2vpa3ry2cx6bjrhbjenpxifio
2,bafk2bzaceckyxepao7qbhlohijcqgzt4d2lfcgecetfjd6fhzvuprqgwgnygs
3,bafk2bzacec3yohzdu5rfebtrhyyvqifib5rxadtu35vvcca5a3j6yaeds3yfy
4,bafk2bzaceacs3a4t6kfbjjpkgx562qeqzhkbslpdk7hmv5qozarqn2jid5sfg
5,bafk2bzaceac2kybzpe6esch3auugpi2zoo2yodm5bx7ddwfluomt2qd3n6kbg
6,bafk2bzacealxowh6nddsktetuixn2swkydjuehsw6chk2qyke4x2pxltp7slw
Most systems support reading CSV. For example, in MySQL you could write:

CREATE TABLE zlib_ipfs (
  zlibrary_id INT NOT NULL,
  ipfs_cid CHAR(62) NOT NULL,
  PRIMARY KEY(zlibrary_id)
);
LOAD DATA INFILE '/var/lib/mysql/ipfs.csv'
  INTO TABLE zlib_ipfs
  FIELDS TERMINATED BY ',';
This data should be exactly the same for everyone, as long as you run ipfs add 
with the same parameters as we did. For your convenience, we will also release 
our CSV at some point, so you can link to our files on IPFS without doing all 
the hashing yourself.
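
Once you have that CSV (ours or your own), linking to a book on IPFS is just a
lookup plus a gateway URL. A small sketch (the gateway and the example ID are
arbitrary):

import csv

# Build a Z-Library ID -> CID map from the CSV produced above.
with open("ipfs.csv", newline="") as f:
    cid_by_id = {int(zlib_id): cid for zlib_id, cid in csv.reader(f)}

gateway = "https://ipfs.io"  # or your own node's gateway
zlib_id = 5                  # arbitrary example ID from the sample rows above
print(f"{gateway}/ipfs/{cid_by_id[zlib_id]}")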

Remote file storage

One thing you learn quickly when hosting ~controversial~ content is that it’s
quite useful to have long-term “backend” servers, which you don’t expose on the
public internet, and public-facing “frontend” servers, which are more at risk
of being taken down. For serving websites, the “frontend” server can be a
simple proxy (HTTP proxy like Varnish, VPN node like Wireguard, etc). But with 
IPFS, the better solution might be to actually run IPFS on the frontend server 
directly. This has several advantages:

- Traffic speed and latency are better without a proxy.
- You can get a storage backend server with lots of hard drives and weak
  CPU/memory, and the inverse for the frontend server.
- You can shard across multiple physical IPFS servers, without having to move
  tons of data around all the time.
For this, we use remote mounted filesystems. The easiest way to set that up 
seemed to be rclone:

# File server:
rclone -vP serve sftp --addr :1234 --user hello --pass hello ./zlib1
# IPFS machine:
sudo rclone mount -v --sftp-host *redacted* --sftp-port 1234 --sftp-user hello \
  --sftp-pass `rclone obscure hello` --sftp-set-modtime=false --read-only \
  --vfs-cache-mode full --attr-timeout 100000h --dir-cache-time 100000h \
  --vfs-cache-max-age 100000h --vfs-cache-max-size 300G --no-modtime \
  --transfers 6 --cache-dir ./zlib1cache --allow-other :sftp:/zlib1 ./zlib1
We’re not sure if this is the best way to do this, so if you have tips for how 
to most efficiently set up a remote immutable file system with good local 
caching, let us know.

Final thoughts

We’re still figuring all of this out, and don’t have it all running quite yet, 
so if you have experience with this, please contact us. We’re also interested 
in learning from people who have set up IPFS Collaborative Clusters 
<https://ipfscluster.io/documentation/collaborative/setup/>, so more people can 
easily participate in hosting these books. We’re also always looking for 
volunteers to run IPFS and torrent nodes, help build new projects, and so on 
(we noticed that lots of technical talent, who particularly care about the
free flow of information, just left a certain social media company.. hi!).

If you believe in preserving humanity’s knowledge and culture, please consider 
supporting us. I have personally been working on this full time, mostly 
self-funded, plus a couple of large generous donations. But to make this work 
sustainable, we would probably need to set up a sort of “shadow Patreon”. In 
the meantime, please consider donating through one of these crypto addresses:

BTC/BCH: 15ruLg4LeREntByp7Xyzhf5hu2qGn8ta2o 
<bitcoin:15ruLg4LeREntByp7Xyzhf5hu2qGn8ta2o>
ETH: 0x4a47880518eD21937e7d44251bd87054c1be022E 
<ethereum:0x4a47880518eD21937e7d44251bd87054c1be022E>
XMR: 
445v3zW24nBbdJDAUeRG4aWmGBwqL3ctHE9DuV42d2K7KbaWeUjn13N3f9MNnfSKpFUCkiQ9RoJ1U66CG7HPhBSDQdSdi7t
 
<monero:445v3zW24nBbdJDAUeRG4aWmGBwqL3ctHE9DuV42d2K7KbaWeUjn13N3f9MNnfSKpFUCkiQ9RoJ1U66CG7HPhBSDQdSdi7t>
SOL: HDMUSnfFYiKNc9r2ktJ1rsmQhS8kJitKjRZtVGMVy1DP 
<solana:HDMUSnfFYiKNc9r2ktJ1rsmQhS8kJitKjRZtVGMVy1DP>
For large donations, it might be good to contact us directly.
Thanks so much!

- Anna and the Pirate Library Mirror team (Twitter 
<https://twitter.com/AnnaArchivist>, Reddit 
<https://www.reddit.com/user/AnnaArchivist>)       

