For the context:

Z-library’s take down left users in shock, anger
 <https://maktoobmedia.com/author/vaishnavi-rastogi/>
https://maktoobmedia.com/2022/11/07/z-librarys-take-down-left-users-in-shock-anger/

—

Putting 5,998,794 books on IPFS
http://annas-blog.org/putting-5,998,794-books-on-ipfs.html

Z-Library has been taken down, and its founders arrested. For the uninitiated, 
a quick recap: Z-Library was a massive “shadow library” 
<https://en.wikipedia.org/wiki/Shadow_library> of books, similar to Sci-Hub or 
Library Genesis. They had taken the concept of a shadow library to the next 
level, with a great user interface, bulk uploading and deduplication systems, 
and all sorts of other features. They were thriving on donations, and were 
therefore able to hire a professional team to keep improving the site.
Until it all came crashing down two weeks ago. Their domains were seized by the 
FBI, and the (alleged) founders were arrested in Argentina. The site continues 
to run on Tor (presumably maintained by their employees), but no one knows how 
sustainable that is. It was a sad day for the free flow of information,
knowledge, and culture. Антон Напольский and Валерия Ермакова — we stand with 
you. Much love to you and your families, and thank you for what you have done 
for the world.

Just a few months ago, we released our second backup 
<http://annas-blog.org/blog-3x-new-books.html> of Z-Library — for about 31TB in 
total. This turned out to be timely. We had also already started working on a
search aggregator for shadow libraries: “Anna’s Archive” (not linking here, but 
you can Google it). With Z-Library down, we scrambled to get this running as 
soon as possible, and we did a soft-launch shortly thereafter. Now we’re trying 
to figure out what is next. This seems like the right time to step up and help shape
the next chapter of shadow libraries.

One such thing is to put the books up on IPFS 
<https://en.wikipedia.org/wiki/InterPlanetary_File_System>. Some of the Library 
Genesis mirrors already did this <https://freeread.org/ipfs/> a few years
ago for their books, and it makes access to their collection more resilient.
After all, they don’t have to host any files themselves over HTTP anymore, but 
can instead link to one of the many IPFS Gateways, which will happily proxy the 
books from one of the many volunteer-run machines (this is the big advantage 
IPFS has over BitTorrent <https://en.wikipedia.org/wiki/BitTorrent>). These 
machines can be hidden behind VPNs, or run on seedboxes paid for using crypto, 
similar to torrents. You can even get other people’s machines to host the data, 
by paying for that service using Filecoin.
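
To make that concrete: anyone can fetch a file over plain HTTPS from a public
gateway, given only its CID, without running any IPFS software. A rough sketch
(the gateway choice is arbitrary, the CID is the one for Z-Library ID 1 from
the CSV further down, and whether it resolves depends on someone actually
providing the data):

import urllib.request

# A public gateway resolves the CID over the IPFS network and proxies the
# bytes back over plain HTTPS, so the client needs no IPFS software at all.
gateway = "https://ipfs.io"  # arbitrary choice; any public gateway works
cid = "bafk2bzacedrabzierer44yu5bm7faovf5s4z2vpa3ry2cx6bjrhbjenpxifio"

with urllib.request.urlopen(f"{gateway}/ipfs/{cid}") as response:
    data = response.read()

with open("book-1", "wb") as f:  # output filename is just an example
    f.write(data)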

However, putting dozens of terabytes of data on IPFS is no joke. We haven’t 
fully succeeded in this project yet, so today we’ll share where we’ve gotten so 
far. If you have experience pushing the limits of IPFS (or other systems, for 
that matter), and want to help our cause, please reach out on Reddit or Twitter.

File organization

When we released our first backup 
<http://annas-blog.org/blog-introducing.html>, we used torrents that contained 
tons of individual files. This turns out not to be great for two reasons:
1. Torrent clients struggle with this many files (especially when trying to
display them in a UI).
2. Magnetic hard drives and filesystems struggle as well: you can get a lot of
fragmentation and seeking back and forth.

For our second release, we learned from this, and packaged the files in large 
“.tar” files. This solves these problems, but creates a new one: how do we now 
serve individual files on IPFS? We could simply extract the tar files, but then 
if you want to both seed the torrents, and seed the IPFS files, you need twice 
as much space: 62TB instead of 31TB (which was already pushing it).

Luckily, there is a good solution for this: mounting the tar files using 
ratarmount <https://github.com/mxmlnkn/ratarmount>. This creates a virtual 
filesystem using FUSE. Typically we run it like this:

sudo ratarmount --fuse "allow_other" zlib2-data/*.tar zlib2/
In order to figure out which file is located where, ratarmount creates index 
files which it places next to the tar files. It takes some time to do this when 
you run it for the first time, so at some point we will share these index files 
on our torrent page, for your convenience.
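
Once mounted, the books inside the tars behave like ordinary read-only files,
so IPFS (or anything else) can read them in place. A quick sanity check, as a
sketch (the paths are the ones from the mount command above):

import os

mount_point = "zlib2"  # the ratarmount mount point from the command above
entries = sorted(os.listdir(mount_point))
print(len(entries), "top-level entries, e.g.", entries[:3])

# Reads are served by seeking into the right tar member via ratarmount's
# index, without extracting anything to disk.
sample = os.path.join(mount_point, entries[0])
if os.path.isfile(sample):
    print(sample, "is", os.path.getsize(sample), "bytes")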

Root CIDs

The second problem we ran into was performance issues with IPFS. The most
noticeable of these is the “advertising” or “providing” phase, where your IPFS
node tells the rest of the IPFS network what data you have. A single file
typically gets split up into 256KiB chunks, each of which gets an identifier,
called a “Content Identifier”, or “CID”. The file itself also gets a CID, which 
refers to a list of the child CIDs. All in all, a single file can easily have 
several, if not hundreds of these CIDs — and we have millions of files. All of 
these CIDs have to be advertised on the network!
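
To get a feel for the scale, here is a back-of-the-envelope estimate (the
average file size is an assumption derived from the totals above, not a
measured number):

# Rough numbers for the default 256KiB chunker.
collection_bytes = 31e12     # ~31TB across the second backup
num_files = 5_998_794        # books in the collection
chunk_size = 256 * 1024      # default IPFS chunk size

avg_file_bytes = collection_bytes / num_files    # roughly 5MB per book
chunks_per_file = avg_file_bytes / chunk_size    # ~20 chunk CIDs
total_cids = num_files * (chunks_per_file + 1)   # +1 root CID per file

print(round(chunks_per_file), "chunk CIDs plus a root CID for an average file")
print(round(total_cids / 1e6), "million CIDs to advertise in total")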

We first thought that we could solve this by using a particular feature of the 
“providing” algorithm: only advertising the root CIDs of directories. The idea 
was that we could take the different directories that our files were already 
organized in, and advertise just the CID of that directory, and then address 
them using:

/ipfs/<directory CID>/<filename>
Initially this seemed to work, but we ran into issues requesting more than one 
or a few files at once. It took us several days to debug this, but we
eventually seem to have found the root cause, and filed a bug report
<https://github.com/ipfs/kubo/issues/9416>. Sadly, this looks like a deep, 
fundamental issue, which we cannot easily work around. So we’ll have to deal 
with lots of CIDs, at least for now.

Sharding

One mitigation is to use a larger chunk size. Instead of 256KiB, we can use 
1MiB (the current maximum), by using --chunker=size-1048576 on add. Another
thing that helps is using the AcceleratedDHTClient, which batches multiple
advertising calls to the same node. Still, various operations can take a long 
time, from “providing”, to just getting some stats on the repo.
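
Even so, the numbers stay large. Redoing the earlier estimate with 1MiB chunks
(same assumed average file size) still leaves tens of millions of CIDs:

# Same back-of-the-envelope numbers as before, but with 1MiB chunks.
avg_file_bytes = 31e12 / 5_998_794                             # rough average book size
total_cids = 5_998_794 * (avg_file_bytes / (1024 * 1024) + 1)  # chunk CIDs + root CIDs
print(round(total_cids / 1e6), "million CIDs, even with 1MiB chunks")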

This is why we’ve been playing with sharding the data across multiple IPFS 
nodes, even on the same machine. We started with 32 nodes, but the per-node
overhead seemed to get quite big, especially in terms of memory usage.
But providing became quite fast: about 5 minutes per node, where each node had 
about 1 million CIDs to advertise. We are now playing with different numbers, 
to see what is optimal. Unfortunately IPFS doesn’t let you easily merge or 
split nodes, so this is quite time-consuming.

This is what our docker-compose.yml looks like, for example, with a single node 
(other nodes omitted for brevity):

x-ipfs: &default-ipfs
  image: ipfs/kubo:v0.16.0
  restart: unless-stopped
  environment:
    - IPFS_PATH=/data/ipfs
    - IPFS_PROFILE=server
  command: daemon --migrate=true --agent-version-suffix=docker --routing=dhtclient

services:
  ipfs-zlib2-0:
    <<: *default-ipfs
    ports:
      - "4011:4011/tcp"
      - "4011:4011/udp"
    volumes:
      - "./container-init.d/:/container-init.d"
      - "./ipfs-dirs/ipfs-zlib2-0:/data/ipfs"
      - "./zlib2/pilimi-zlib2-0-14679999-extra/:/data/files/pilimi-zlib2-0-14679999-extra/"
      - "./zlib2/pilimi-zlib2-14680000-14999999/:/data/files/pilimi-zlib2-14680000-14999999/"
      - "./zlib2/pilimi-zlib2-15000000-15679999/:/data/files/pilimi-zlib2-15000000-15679999/"
      - "./zlib2/pilimi-zlib2-15680000-16179999/:/data/files/pilimi-zlib2-15680000-16179999/"
      # etc.
In the container-init.d/ folder referenced there, we have a single shell
script with the following content:

#!/bin/sh
ipfs config --json Experimental.FilestoreEnabled true
ipfs config --json Experimental.AcceleratedDHTClient true
We also manually changed the config for each node to use a unique IP address.
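
One way to do that kind of per-node tweak, assuming the standard kubo config
layout (the config is plain JSON inside each node's IPFS_PATH), is to edit the
addresses directly while the node is stopped. This is a sketch only; the path
and the 4011 port come from the compose file above, and the announce IP is a
placeholder:

import json

# Sketch: give one node its own swarm port / announce address by editing the
# kubo config JSON directly (with the node stopped).
config_path = "ipfs-dirs/ipfs-zlib2-0/config"

with open(config_path) as f:
    config = json.load(f)

config["Addresses"]["Swarm"] = [
    "/ip4/0.0.0.0/tcp/4011",
    "/ip4/0.0.0.0/udp/4011/quic",
]
config["Addresses"]["Announce"] = ["/ip4/203.0.113.10/tcp/4011"]  # placeholder IP

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)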

Processing CIDs

Once you have a bunch of nodes running, you can add data to them. In the
example configuration above, we would run:

docker-compose exec ipfs-zlib2-0 ipfs add --progress=false --nocopy --recursive --hash=blake2b-256 --chunker=size-1048576 /data/files > ipfs-zlib2-0.log
This logs the filenames and CIDs to ipfs-zlib2-0.log. Now we can scoop up all 
the different log files into a CSV, using a little Python script:

import glob

# Each line of `ipfs add` output looks like:
#   added <CID> files/<collection directory>/<zlibrary id>
# Keep only those lines, and write "<zlibrary id>,<CID>" rows to the CSV.
def process_line(line, csv):
  components = line.split()
  if len(components) == 3 and components[0] == "added":
    file_components = components[2].split("/")
    if len(file_components) == 3 and file_components[0] == "files":
      csv.write(file_components[2] + "," + components[1] + "\n")

with open("ipfs.csv", "w") as csv:
  for file in glob.glob("*.log"):
    print("Processing", file)
    with open(file) as f:
      for line in f:
        process_line(line, csv)
Because the filenames are simply the Z-Library IDs, the CSV looks something 
like this:

1,bafk2bzacedrabzierer44yu5bm7faovf5s4z2vpa3ry2cx6bjrhbjenpxifio
2,bafk2bzaceckyxepao7qbhlohijcqgzt4d2lfcgecetfjd6fhzvuprqgwgnygs
3,bafk2bzacec3yohzdu5rfebtrhyyvqifib5rxadtu35vvcca5a3j6yaeds3yfy
4,bafk2bzaceacs3a4t6kfbjjpkgx562qeqzhkbslpdk7hmv5qozarqn2jid5sfg
5,bafk2bzaceac2kybzpe6esch3auugpi2zoo2yodm5bx7ddwfluomt2qd3n6kbg
6,bafk2bzacealxowh6nddsktetuixn2swkydjuehsw6chk2qyke4x2pxltp7slw
Most systems support reading CSV. For example, in MySQL you could write:

CREATE TABLE zlib_ipfs (
  zlibrary_id INT NOT NULL,
  ipfs_cid CHAR(62) NOT NULL,
  PRIMARY KEY(zlibrary_id)
);
LOAD DATA INFILE '/var/lib/mysql/ipfs.csv'
  INTO TABLE zlib_ipfs
  FIELDS TERMINATED BY ',';
This data should be exactly the same for everyone, as long as you run ipfs add 
with the same parameters as we did. For your convenience, we will also release 
our CSV at some point, so you can link to our files on IPFS without doing all 
the hashing yourself.
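
Once you have that CSV (ours or your own), linking to a book on IPFS is just a
lookup plus a gateway URL. A small sketch (the gateway and the example ID are
arbitrary):

import csv

# Build a Z-Library ID -> CID map from the CSV produced above.
with open("ipfs.csv", newline="") as f:
    cid_by_id = {int(zlib_id): cid for zlib_id, cid in csv.reader(f)}

gateway = "https://ipfs.io"  # or your own node's gateway
zlib_id = 5                  # arbitrary example ID from the sample rows above
print(f"{gateway}/ipfs/{cid_by_id[zlib_id]}")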

Remote file storage

One thing you learn quickly when hosting ~controversial~ content is that it’s
quite useful to have long-term “backend” servers, which you don’t expose on the
public internet, and public-facing “frontend” servers, which are more at risk
of being taken down. For serving websites, the “frontend” server can be a
simple proxy (HTTP proxy like Varnish, VPN node like Wireguard, etc). But with 
IPFS, the better solution might be to actually run IPFS on the frontend server 
directly. This has several advantages:

- Traffic speed and latency are better without a proxy.
- You can get a storage backend server with lots of hard drives and weak
  CPU/memory, and the inverse for the frontend server.
- You can shard across multiple physical IPFS servers, without having to move
  tons of data around all the time.
For this, we use remote mounted filesystems. The easiest way to set that up 
seemed to be rclone:

# File server:
rclone -vP serve sftp --addr :1234 --user hello --pass hello ./zlib1
# IPFS machine:
sudo rclone mount -v --sftp-host *redacted* --sftp-port 1234 --sftp-user hello \
  --sftp-pass `rclone obscure hello` --sftp-set-modtime=false --read-only \
  --vfs-cache-mode full --attr-timeout 100000h --dir-cache-time 100000h \
  --vfs-cache-max-age 100000h --vfs-cache-max-size 300G --no-modtime \
  --transfers 6 --cache-dir ./zlib1cache --allow-other :sftp:/zlib1 ./zlib1
We’re not sure if this is the best way to do this, so if you have tips for how 
to most efficiently set up a remote immutable file system with good local 
caching, let us know.

Final thoughts

We’re still figuring all of this out, and don’t have it all running quite yet, 
so if you have experience with this, please contact us. We’re also interested 
in learning from people who have set up IPFS Collaborative Clusters 
<https://ipfscluster.io/documentation/collaborative/setup/>, so more people can 
easily participate in hosting these books. We’re also always looking for 
volunteers to run IPFS and torrent nodes, help build new projects, and so on 
(we noticed that lots of technical talent, who particularly care about the
free flow of information, just left a certain social media company.. hi!).

If you believe in preserving humanity’s knowledge and culture, please consider 
supporting us. I have personally been working on this full time, mostly 
self-funded, plus a couple of large generous donations. But to make this work 
sustainable, we would probably need to set up a sort of “shadow Patreon”. In 
the meantime, please consider donating through one of these crypto addresses:

BTC/BCH: 15ruLg4LeREntByp7Xyzhf5hu2qGn8ta2o 
<bitcoin:15ruLg4LeREntByp7Xyzhf5hu2qGn8ta2o>
ETH: 0x4a47880518eD21937e7d44251bd87054c1be022E 
<ethereum:0x4a47880518eD21937e7d44251bd87054c1be022E>
XMR: 
445v3zW24nBbdJDAUeRG4aWmGBwqL3ctHE9DuV42d2K7KbaWeUjn13N3f9MNnfSKpFUCkiQ9RoJ1U66CG7HPhBSDQdSdi7t
 
<monero:445v3zW24nBbdJDAUeRG4aWmGBwqL3ctHE9DuV42d2K7KbaWeUjn13N3f9MNnfSKpFUCkiQ9RoJ1U66CG7HPhBSDQdSdi7t>
SOL: HDMUSnfFYiKNc9r2ktJ1rsmQhS8kJitKjRZtVGMVy1DP 
<solana:HDMUSnfFYiKNc9r2ktJ1rsmQhS8kJitKjRZtVGMVy1DP>
For large donations, it might be good to contact us directly.
Thanks so much!

- Anna and the Pirate Library Mirror team (Twitter 
<https://twitter.com/AnnaArchivist>, Reddit 
<https://www.reddit.com/user/AnnaArchivist>)       

