Dear piler users,

I had the pleasure (so to speak) of participating in a
piler migration project from hostA to hostB. Both hosts
are in the same datacenter; the network bandwidth is unknown
to me, but we may assume it's Gbit.

There were 2+ TB of data to migrate, with millions of files to copy.

Sftp was chosen as the copying method, and I can tell you
that copying the /var/piler/store dirs and files took
several days (not 2-3, rather many more).

So my conclusion is that migrating a piler archive to another
host with sftp (or even rsync, I believe) is painful because
of the huge number of small files.

I've been thinking about how to make such future migrations
both easier and faster. I think such a migration would be less
painful if the data in the /var/piler/store/00/... dirs were not
scattered across lots of small files.

One possible solution is to use sqlite3 files, read on.

You know that each top level dir in /var/piler/store/00 holds
~12 days of data. After that it doesn't change; piler starts writing
new emails to the next directory (nowadays it's 5a0). So what if we
could move all files in 59f to 59f.sdb, all files in the 59e dir to
59e.sdb, and so on?

Then after 4 years you may end up with 4 * 365 / 12 =~ 120 big sdb
files, plus the latest top level dir with its lots of small files
(though far fewer files compared to the files in the 120 consolidated
top level dirs).

So the sdb files would be big, but copying ~120 large files to another
host is far easier than copying 15 million smaller files.

OK, now the question is how to move the data into these sdb files.
The plan is to create a utility that iterates through the top level
dirs mentioned above and writes the file contents to sqlite3 db files.
Finally, it removes the .m and .a* files that were successfully copied.
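To make the idea concrete, here is a minimal sketch of what such a utility could look like. The sdb layout (a single "blobs" table keyed by relative file name) and the .sdb naming are my assumptions for illustration, not anything piler has committed to:

```python
import os
import sqlite3

def consolidate(topdir):
    """Move every file under a top level dir (e.g. .../store/00/59f)
    into a sibling sqlite3 file (59f.sdb), then remove the originals."""
    db = sqlite3.connect(topdir + ".sdb")
    # assumed schema: one row per file, keyed by path relative to topdir
    db.execute("CREATE TABLE IF NOT EXISTS blobs"
               " (name TEXT PRIMARY KEY, data BLOB)")

    copied = []
    for root, _dirs, files in os.walk(topdir):
        for name in files:
            path = os.path.join(root, name)
            rel = os.path.relpath(path, topdir)
            with open(path, "rb") as f:
                db.execute("INSERT OR REPLACE INTO blobs VALUES (?, ?)",
                           (rel, f.read()))
            copied.append(path)

    db.commit()
    db.close()

    # remove the originals only after the sdb file is safely on disk
    # (empty subdirectories are left behind; cleaning them up is trivial)
    for path in copied:
        os.remove(path)
```

A real implementation would of course need to handle errors, fsync before deleting, and probably only touch dirs that piler has finished writing to.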

The only performance penalty I can think of (after writing the sdb
files) is that pilerget must first open the sdb file, and if it's not
present (either because someone is not interested in this [optional]
consolidation, or because this is the latest top level dir which has
not been consolidated yet), fetch the file from the filesystem.
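The lookup fallback could be as simple as the sketch below. The path layout, the "blobs" table, and the function name are my assumptions for illustration, not piler's actual internals:

```python
import os
import sqlite3

def read_message(storedir, topdir, relpath):
    """Try the consolidated <topdir>.sdb first; if the sdb (or the row)
    is missing, fall back to reading the plain file from the filesystem."""
    sdb = os.path.join(storedir, topdir + ".sdb")
    if os.path.exists(sdb):
        conn = sqlite3.connect(sdb)
        row = conn.execute("SELECT data FROM blobs WHERE name = ?",
                           (relpath,)).fetchone()
        conn.close()
        if row:
            return row[0]
    # not consolidated yet (or someone opted out of consolidation)
    with open(os.path.join(storedir, topdir, relpath), "rb") as f:
        return f.read()
```

Since consolidated top level dirs never change, pilerget could even cache the "does the sdb exist" check per dir.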


Another possible solution is to put all email data blobs into mysql.
There would be a table, e.g. maildata, with 2-3 columns, the last one
being a huge blob. In this case, instead of having 2-3 TB of data
(in the case I mentioned) spread across several million files, you
would have a very large mysql table with a varying-size blob in each row.
I'm not sure if it's a good idea. In this case, instead of using
mysqldump, it would be much easier to stop mysqld and copy the raw db
files to the other host.
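For what it's worth, such a table might look like the DDL sketched below. The column names and types are purely my guess at what "2-3 columns and a huge blob" could mean; nothing here is a committed design:

```python
# Hypothetical schema for the maildata idea -- column names and types
# are assumptions for illustration only.
MAILDATA_DDL = """
CREATE TABLE maildata (
    id       BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    piler_id CHAR(36) NOT NULL,   -- message identifier (assumed key)
    data     LONGBLOB NOT NULL,   -- the stored message blob
    UNIQUE KEY (piler_id)
) ENGINE=InnoDB
"""
```

Note that LONGBLOB allows up to 4 GB per row, so even unusually large messages would fit; the practical concerns are rather InnoDB file size, backup time, and max_allowed_packet.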


Before moving down either path I'd like to hear your comments and ideas on the topic.

Note: using sqlite data files is intended to be optional; no one would
be forced to take this step at all. However, I believe it would make
migrating piler much easier than it is today.


Janos

PS: Perhaps I'm introducing some bias with the following info, which some of you may already know:
mailarchiva uses 1024 (or so) zip files to hold its encrypted files.
