Dear piler users,
I had the pleasure (so to speak) of participating in a
piler migration project from hostA to hostB. Both hosts
are in the same datacenter; the network bandwidth is unknown
to me, but we may assume it's gigabit.
There were 2+ TB of data to migrate, millions of files to copy.
SFTP was chosen as the copying method, and I can tell you
that copying the /var/piler/store dirs and files took
several days (not 2-3, rather many more).
So my conclusion is that migrating a piler archive to another
host via sftp (or even rsync, I believe) is painful because
of the huge number of small files.
I've been thinking about how to make such future migrations
both easier and faster. I think a migration would be less
painful if the data in the /var/piler/store/00/... dirs were
not spread across lots of files.
One possible solution is to use sqlite3 files, read on.
You know that each top level dir in /var/piler/store/00 holds
~12 days of data. After that it doesn't change; piler starts
writing new emails to the next directory (nowadays it's 5a0).
So what if we could move all files in 59f to 59f.sdb, all files
in the 59e dir to 59e.sdb, etc.?
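As an illustration, each per-directory sdb could be as simple as a single table mapping the original relative file name to its content. This is only a sketch; the table and column names are my assumptions, not an existing piler format:

```python
import sqlite3

# Hypothetical layout of a consolidated dir, e.g. 59f.sdb:
# one row per stored file, keyed by its original relative name.
conn = sqlite3.connect("59f.sdb")
conn.execute(
    "CREATE TABLE IF NOT EXISTS files ("
    " name TEXT PRIMARY KEY,"
    " data BLOB NOT NULL)"
)
conn.commit()
conn.close()
```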
Then after 4 years you may end up with 4 * 365 / 12 =~ 120 big
sdb files, plus the latest top level dir with its many small
files (though far fewer than the files in the 120 consolidated
dirs combined). So the sdb files would be big, but copying 120
large files to another host is much easier than copying
15 million smaller ones.
OK, now the question is how to move data into these sdb files.
The plan is to create a utility that iterates through the top
level dirs mentioned above, writes the file contents to sqlite3
db files, and finally removes the .m and .a* files that were
copied successfully.
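Such a utility could look roughly like this. It's a minimal Python sketch assuming the hypothetical one-table sdb layout described above; the real tool would need proper error handling and would presumably be written in C like the rest of piler:

```python
import fnmatch
import os
import sqlite3

def consolidate(top_dir: str, sdb_path: str) -> None:
    """Copy every .m and .a* file under top_dir into an sqlite3 db,
    then remove the originals that were stored successfully.
    Sketch only: the 'files' table and its columns are assumptions."""
    conn = sqlite3.connect(sdb_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files "
        "(name TEXT PRIMARY KEY, data BLOB NOT NULL)"
    )
    stored = []
    for root, _dirs, names in os.walk(top_dir):
        for name in names:
            if not (fnmatch.fnmatch(name, "*.m")
                    or fnmatch.fnmatch(name, "*.a*")):
                continue
            path = os.path.join(root, name)
            # Key each row by the path relative to the top level dir.
            rel = os.path.relpath(path, top_dir)
            with open(path, "rb") as f:
                conn.execute(
                    "INSERT OR REPLACE INTO files (name, data) VALUES (?, ?)",
                    (rel, f.read()),
                )
            stored.append(path)
    conn.commit()
    conn.close()
    # Remove originals only after the transaction has been committed,
    # so a crash mid-run leaves the filesystem copies intact.
    for path in stored:
        os.remove(path)
```

One design point worth noting: deleting the originals only after the commit means an interrupted run can simply be restarted.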
The only performance penalty (after writing the sdb files) that
comes to my mind is that pilerget must first open the sdb file,
and if it's not present (either because someone is not interested
in this [optional] consolidation, or because it's the latest top
level dir which has not been consolidated yet), get the file
from the filesystem instead.
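That fallback logic could be sketched like this, again assuming the hypothetical 'files' table. The real pilerget is C; this just shows the intended lookup order:

```python
import os
import sqlite3

def get_stored_file(store_root: str, top_dir: str, rel_name: str) -> bytes:
    """Fetch a stored file: try the consolidated <top_dir>.sdb first,
    and fall back to the plain filesystem path if the sdb does not
    exist or lacks the entry. All names here are illustrative."""
    sdb_path = os.path.join(store_root, top_dir + ".sdb")
    if os.path.exists(sdb_path):
        conn = sqlite3.connect(sdb_path)
        row = conn.execute(
            "SELECT data FROM files WHERE name = ?", (rel_name,)
        ).fetchone()
        conn.close()
        if row is not None:
            return row[0]
    # Not consolidated (or entry missing): read from the filesystem.
    with open(os.path.join(store_root, top_dir, rel_name), "rb") as f:
        return f.read()
```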
Another possible solution is to put all email data blobs into
mysql. There would be a nice table, e.g. maildata, with 2-3
columns, the last column being a huge blob. In this case,
instead of having 2-3 TB of data (in the case I mentioned)
in several million files, you would have a very large mysql
table with a varying-size blob in each row.
I'm not sure if it's a good idea. In this case, instead of
using mysqldump, it would be much easier to stop mysqld and
copy the raw db files to the other host.
Before going down either path I'd like to hear your comments
and ideas on the topic.
Note: using sqlite data files is intended to be optional;
nobody would be forced to take this step at all. However,
I believe it would make migrating piler much easier than
it is today.
Janos
PS: Perhaps I'm introducing some bias with the following info,
which some of you may already know:
mailarchiva uses 1024 (or so) zip files to hold its encrypted files.