Re: [Dovecot] Questions about single intance storage

2011-12-13 Thread Joseba Torre

On 04/12/11 21:16, Terry Carmen wrote:

So I was thinking that there probably could be some tool that during a
user's backup it would write the attachments among the user's other
files, so it would be easy to find all of the files needed for a
restore. This would of course mean that backups can take a lot more
space, because there's no SIS. Perhaps there could be some other



I see.

Instead of writing the links directly to the filesystem, why not keep a
links list (not a linked list 8-)) file in each directory that contains
the information for the links that should be there (source, dest,
attributes), then add an inotify hook in Dovecot to create/update/delete
the hard links in the directory so they match the links list?

The links list would only need to be opened when there's a change and
could remain closed (and backup-able) at all other times, and restoring
a links list would immediately trigger the inotify hook and regenerate
all the required links.

Terry


Sorry for joining this thread late, but this is a very important issue 
for us.


Terry's solution feels great: just a small modification of mdbox, 
adding a (text?) file with the list of attachment files, which is updated 
every time an attachment is added or deleted.


With that, it seems quite easy to modify our mailbox recovery script to 
something like:

- recover the mailbox as now
- recover every attachment file that the list file points to.

Another option: a new doveadm option that could generate this list, so that 
we recover the mailbox, generate the list, and then recover the attachments.


Also: no change needed to the backup process itself, and that's good news.



Re: [Dovecot] Questions about single intance storage

2011-12-13 Thread Timo Sirainen
On 13.12.2011, at 10.50, Joseba Torre wrote:

 Terry's solution feels great: just a small modification of mdbox, adding 
 a (text?) file with the list of attachment files, which is updated every time 
 an attachment is added or deleted.

I'd rather not implement that. It makes dbox more fragile and less efficient.

 With that, it seems quite easy to modify our mailbox recovery script to 
 something like:
 - recover the mailbox as now
 - recover every attachment file that the list file points to.
 
 Another option: a new doveadm option that could generate this list, so that 
 we recover the mailbox, generate the list, and then recover the attachments.

That would be possible. You could actually already do it with v2.1's doveadm 
dump, which outputs dbox file's metadata.




Re: [Dovecot] Questions about single intance storage

2011-12-07 Thread Yann Dupont

On 05/12/2011 02:45, Timo Sirainen wrote:

On 5.12.2011, at 3.03, Stan Hoeppner wrote:


To cope with catastrophic failure, create a special Dovecot
administrator only mailbox (real/virtual/whatever) that contains all
of the SiS files, a special Dovecot index.


I'm not thinking about a catastrophe. For that a regular full filesystem 
backup+restore would work mostly okay (a snapshot would be perfect, without 
snapshot some extra work would be needed). The problem is that people want to 
recover only one specific user's mails from some older backup, because they 
accidentally deleted the mails.. This needs to be somewhat easy to implement 
with SIS, but it isn't.




Another problem I'm thinking of: I'd like to use SIS on our 
production servers, but right now I think I can't.


We have lots of users (+5000 teachers/engineers) on our first setup, 
+7 students on our 2nd setup.


The user base is in LDAP and changes on a daily basis. When a user leaves 
the university, he has the right to use his mailbox for a certain time, and 
then we close the account.


Right now, we archive and then delete the mailbox directories (we don't 
use a special Dovecot mechanism: we migrated from another system not long 
ago and we had special scripts for that).


If we use SIS, what happens to the attachments? The usage count will 
never go to 0, and the attachments will stay there forever.


In that situation, I think we have no means to correct the attachment 
usage count?


--
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : yann.dup...@univ-nantes.fr


Re: [Dovecot] Questions about single intance storage

2011-12-07 Thread Charles Marcus

On 2011-12-07 5:25 AM, Yann Dupont yann.dup...@univ-nantes.fr wrote:


Right now, we archive and then delete the mailbox directories (we don't
use a special Dovecot mechanism: we migrated from another system not long
ago and we had special scripts for that).

If we use SIS, what happens to the attachments? The usage count will
never go to 0, and the attachments will stay there forever.


?

The attachment count for any messages that were *only* in those deleted 
mailbox directories would go to zero after you delete them, and then the 
attachments would be deleted. Dovecot wouldn't know about any that were 
archived outside of Dovecot's knowledge.


--

Best regards,

Charles


Re: [Dovecot] Questions about single intance storage

2011-12-07 Thread Timo Sirainen
On 7.12.2011, at 12.25, Yann Dupont wrote:

 Another problem I'm thinking of: I'd like to use SIS on our production 
 servers, but right now I think I can't.
 
 We have lots of users (+5000 teachers/engineers) on our first setup, +7 
 students on our 2nd setup.
 
 The user base is in LDAP and changes on a daily basis. When a user leaves the 
 university, he has the right to use his mailbox for a certain time, and then 
 we close the account.
 
 Right now, we archive and then delete the mailbox directories (we don't use 
 a special Dovecot mechanism: we migrated from another system not long ago 
 and we had special scripts for that).
 
 If we use SIS, what happens to the attachments? The usage count will never 
 go to 0, and the attachments will stay there forever.
 
 In that situation, I think we have no means to correct the attachment usage 
 count?

You'll need to change the deletion script then. Run:

doveadm expunge -u user mailbox '*' all

before doing rm -rf for the user's mails. And in the archiving step you should 
do it with dsync with mail_attachment_dir disabled in the destination storage, 
so that the attachments get written to the archive directly instead of only 
referencing SIS.
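
The ordering above (expunge first, remove the directory only afterwards) can be sketched in Python as follows. This is a minimal sketch, not part of Dovecot: the helper names build_expunge_argv and delete_user_mail are hypothetical, and the only doveadm invocation used is the expunge command quoted above. If the expunge fails, the sketch leaves the mailbox in place so that the run can be retried.

```python
import shutil
import subprocess
from pathlib import Path


def build_expunge_argv(user):
    # Same command as quoted above; passing a list avoids shell globbing of '*'.
    return ["doveadm", "expunge", "-u", user, "mailbox", "*", "all"]


def delete_user_mail(user, mail_dir, run=subprocess.run, remove=shutil.rmtree):
    """Expunge first; remove the directory only if the expunge succeeded.

    `run` and `remove` are injectable so the ordering can be tested
    without a Dovecot installation (hypothetical helper, not a Dovecot API).
    """
    result = run(build_expunge_argv(user))
    if result.returncode != 0:
        # Keep the mailbox so the expunge can be retried later; deleting it
        # now would leave the attachment link counts uncorrected.
        return False
    remove(Path(mail_dir))
    return True
```

A deletion script would call delete_user_mail("user", "/path/to/mail") per closed account and log every False return for a later retry.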



Re: [Dovecot] Questions about single intance storage

2011-12-07 Thread Yann Dupont

On 07/12/2011 15:15, Timo Sirainen wrote:

On 7.12.2011, at 12.25, Yann Dupont wrote:



doveadm expunge -u user mailbox '*' all

before doing rm -rf for the user's mails. And in the archiving step you should 
do it with dsync with mail_attachment_dir disabled in the destination storage, 
so that the attachments get written to the archive directly instead of only 
referencing SIS.


Yes, I understand, it will work. But in case of any error (even our 
fault: premature end of script, for example) you can still end up with 
attachments lost forever on the filesystem.


Right, it SHOULD not happen, and it probably won't represent a big 
volume. But still, it could happen under specific circumstances. In that 
case, I don't see any simple way to detect that kind of file.


Do you see how a script could detect such orphaned links?

For the archiving, good idea to use dsync, thanks for your answer.

--
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : yann.dup...@univ-nantes.fr


Re: [Dovecot] Questions about single intance storage

2011-12-07 Thread Timo Sirainen
On Wed, 2011-12-07 at 17:02 +0100, Yann Dupont wrote:
  before doing rm -rf for the user's mails. And in the archiving step you 
  should do it with dsync with mail_attachment_dir disabled in the 
  destination storage, so that the attachments get written to the archive 
  directly instead of only referencing SIS.
 
 Yes, I understand, it will work. But in case of any error (even our 
 fault: premature end of script, for example) you can still end up with 
 attachments lost forever on the filesystem.
 
 Right, it SHOULD not happen, and it probably won't represent a big 
 volume. But still, it could happen under specific circumstances. In that 
 case, I don't see any simple way to detect that kind of file.
 
 Do you see how a script could detect such orphaned links?

It wouldn't be simple. The only safe way would be to:

1. Scan through all the attachment HASH-GUID names and save them. This
scanning step could already detect some orphaned attachments, where the
hashes/HASH file exists with nlink=1 (i.e. HASH-GUID* files have been
deleted, but the HASH itself hasn't been for some reason).

2. Read through the contents of all users' dboxes and get a list of all
referenced attachment HASH-GUIDs.

3. Delete all attachments that exist in list 1, but not in list 2.

I guess there should be a doveadm sis rescan command that does this.
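
Steps 1 and 3 above can be sketched in Python as follows. This is only a sketch under assumptions: it treats any directory named "hashes" as holding the HASH files and every other file as a HASH-GUID link, which is inferred from the description above rather than checked against Dovecot's on-disk layout. Step 2 (reading the dboxes) is not implemented; its result is passed in as a set of names. The function names are hypothetical, and the sketch only reports candidates and deletes nothing.

```python
import os


def scan_attachment_dir(root):
    """Step 1: collect HASH-GUID names, plus HASH files whose link count is 1."""
    names = set()
    lone_hashes = set()
    for dirpath, _dirs, files in os.walk(root):
        # Assumption: HASH files live in directories named "hashes".
        in_hashes = os.path.basename(dirpath) == "hashes"
        for fname in files:
            path = os.path.join(dirpath, fname)
            if in_hashes:
                # nlink == 1 means no HASH-GUID link points at this file any more.
                if os.lstat(path).st_nlink == 1:
                    lone_hashes.add(path)
            else:
                names.add(os.path.relpath(path, root))
    return names, lone_hashes


def find_orphans(on_disk, referenced):
    """Step 3: names that exist in list 1 (on disk) but not in list 2 (referenced)."""
    return sorted(on_disk - referenced)
```

Running both lists through find_orphans before deleting anything gives a dry-run report, which seems prudent given that a mail delivered between step 1 and step 2 could otherwise look orphaned.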



Re: [Dovecot] Questions about single intance storage

2011-12-04 Thread Maria Arrea
Hello Timo.

 If we can not safely restore from backup a user's mailbox with SiS enabled, we 
can not enable SiS. Any plan to include this backup recovery tool in dovecot 
2.0.X or 2.1?

 Regards
 Maria

  5º We use bacula to save indexes and mdboxes, and we recover mailboxes using 
  doveadm import when a user makes a fatal mistake wiping all her Inbox. If 
  we enable SiS I am not really sure how we can safely restore a user's INBOX 
  if that user has SiS attachments.

  Hm. Yes, that is problematic. Even if you knew what SIS files were used, 
  there's no simple way to restore those with proper refcounts. I think what 
  really should be done is writing a tool that can create/restore backups, 
  possibly de-SISing the attachments.


Re: [Dovecot] Questions about single intance storage

2011-12-04 Thread Timo Sirainen
On 4.12.2011, at 16.10, Maria Arrea wrote:

 If we can not safely restore from backup a user's mailbox with SiS enabled, 
 we can not enable SiS. Any plan to include this backup recovery tool in 
 dovecot 2.0.X or 2.1?

I'd first have to design it. And before designing it I'd need to look into how 
backup software usually works. If anyone has any ideas about this, I'd 
like to hear them.



Re: [Dovecot] Questions about single intance storage

2011-12-04 Thread Terry Carmen
If we can not safely restore from backup a user's mailbox with SiS  
enabled, we can not enable SiS. Any plan to include this backup  
recovery tool in dovecot 2.0.X or 2.1?


I'd first have to design it. And before designing it I'd need to  
look into how backup software usually works. If anyone has any  
ideas about this, I'd like to hear them.


BackupPC uses rsync by default for *nix boxes.

No idea what SiS is, but I'm guessing you're running into the same  
problem as backing up any other open file with changing internal data  
that may be inconsistent.


This is exactly why it's difficult (and pointless) to back up an open  
MySQL database or a SQL Server database. The snapshot of what's in  
memory doesn't always match what's on disk.


The only ways I know around this are to periodically create a backup  
copy that *is* consistent and restorable, plus a utility to restore the  
backup back to the live storage format, or to create a method for the  
software to flush its buffers to disk and then disconnect from the data  
file while the backup process is running.


The first option takes ~2x the storage space, while the second option  
makes the user's data inaccessible during the backup.


My apologies if I'm misunderstanding the problem and have been  
rambling for no purpose. 8-)


Terry








Re: [Dovecot] Questions about single intance storage

2011-12-04 Thread Timo Sirainen
On 4.12.2011, at 19.41, Terry Carmen wrote:

 If we can not safely restore from backup a user's mailbox with SiS enabled, 
 we can not enable SiS. Any plan to include this backup recovery tool in 
 dovecot 2.0.X or 2.1?
 
 I'd first have to design it. And before designing it I'd need to look into 
 how backup software usually works. If anyone has any ideas about this, 
 I'd like to hear them.
 
 BackupPC uses rsync by default for *nix boxes.
 
 No idea what SiS is, but I'm guessing you're running into the same problem as 
 backing up any other open file with changing internal data that may be 
 inconsistent.

Inconsistency is an issue, but it's not the biggest problem. It would be 
possible to write a tool that scans through all mails and makes sure everything 
is consistent after a restore. dbox mostly does this automatically already, but 
SIS would need a separate program to ensure its consistency.

Anyway, SIS is single instance attachment storage. Let's say you send 
one 100 MB PDF to 10 people: it's stored only once on disk under 
/attachments/aa/bb/aabccddeeff-etc. Each person would have their own unique 
link under /attachments/, but all of them would be hard linked to a common file.

So the problem is mainly about restoring a single user's mails. The mail files 
are simple to restore, but then you need to figure out which attachments to 
restore. There's no simple way to know which attachment files belong to which 
users, so you need to scan through the mail files and see what attachments are 
referred to.

Also backing up the attachment links could be problematic if the backup system 
doesn't support hard links. Each attachment always has at least 2 links, so if 
the backup doesn't realize that, it at minimum doubles the space used by 
attachments.
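
The space effect described above can be estimated with a short Python sketch. It walks a directory tree and compares what a backup that copies every name would store against what a hard-link-aware backup would store, by counting each (device, inode) pair only once. The function name backup_sizes is hypothetical, and the sketch says nothing about how any particular backup program behaves.

```python
import os


def backup_sizes(root):
    """Return (naive_bytes, link_aware_bytes) for all files under root."""
    naive = 0
    per_inode = {}
    for dirpath, _dirs, files in os.walk(root):
        for fname in files:
            st = os.lstat(os.path.join(dirpath, fname))
            # A backup without hard link support stores the data once per name.
            naive += st.st_size
            # A hard-link-aware backup stores the data once per inode.
            per_inode[(st.st_dev, st.st_ino)] = st.st_size
    return naive, sum(per_inode.values())
```

For an attachment with the minimum of 2 links, the first number is twice the second; for an attachment linked by ten recipients it is eleven times the second.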

So I was thinking that there probably could be some tool that, during a user's 
backup, would write the attachments among the user's other files, so it would 
be easy to find all of the files needed for a restore. This would of course 
mean that backups can take a lot more space, because there's no SIS. Perhaps 
there could be some other alternatives .. like maybe not storing the 
attachments directly to backups, but add symlinks to them so they can be used 
to figure out what to restore. Or maybe the backing up wouldn't need a special 
tool, but the restoring tool could just read through the dbox files to see what 
attachments are also needed and write a list of them somewhere so they can be 
taken from backups as well.

I'm not really sure what is the best way.

Re: [Dovecot] Questions about single intance storage

2011-12-04 Thread Timo Sirainen
On 5.12.2011, at 0.07, Lorens Kockum wrote:

 Timo Sirainen wrote:
 And before designing it I'd need to look into how backup
 software usually works. If anyone has any ideas about this,
 I'd like to hear them.
 
 Simple or even moderately efficient backup programs like rsync
 copy all the files.

I'm mainly wondering if it's common for backup programs to support using a 
separate program to generate the backups. For example if there was a 
dovecot-backup binary that just dumps all (or new-since-last-backup) of the 
users' mails into stdout, which the backup program can use. Or perhaps in that 
case there wouldn't really be much of anything for the backup to do except to 
write it to tape..

 Also backing up the attachment links could be problematic if
 the backup system doesn't support hard links. Each attachment
 always has at least 2 links, so if the backup doesn't realize
 that, it at minimum doubles the space used by attachments.
 
 rsync recognizes hard links with option -H, but at a very
 noticeable performance cost when dealing with millions of
 files. If the aa/bb/aabccddeeff-etc is unique across the whole
 mailstore, it would be easy to replace the hard link with a
 symlink, as you said:

SIS was designed to work with hard links. They couldn't be replaced with 
symlinks without a redesign (which would be less efficient in normal operation).

 maybe not storing the attachments directly to backups, but add
 symlinks to them so they can be used to figure out what to
 restore. Or maybe the backing up wouldn't need a special tool,
 but the restoring tool could just read through the dbox files
 to see what attachments are also needed and write a list of
 them somewhere so they can be taken from backups as well.
 
 In the second way, you would have a separate hierarchy for
 multiple-recipient attachments, or would the attachment be
 really stored in the box of a recipient chosen at random?

I meant that SIS would work exactly like it works now, with hard links and 
everything, but on top of that it would also create symlinks to the used files 
simply to make it easier to find what files are used. The annoying thing about 
that is that in error situations the symlinks can get out of sync with the 
reality.

 Just some random thoughts: professionally, I use
 Zimbra. Messages are stored in Maildir-equivalents. The time
 it takes to backup is a quite severe constraint on the backup
 technique. For example, compressing the backup files takes
 too long, so the zip files are not compressed. Instead, the
 individual mails are stored compressed on disk. Each backup
 zips up the mails in a few big backup files.

You mean you first create uncompressed zip files (why not just tar?) of all the 
mails on the filesystem and the backup software then backs up those zip files?

 An improvement
 could be to sort mails into backup zip files so that once a
 zip file is made, it stays the same. After all, if a mail is not
 deleted a month after it is read, then it will probably stay
 in the same state forever, or at least until the user starts a
 keep-me-under-quota cleaning-up spree. During this time, backing
 up that big zip file can just be a check to see if it is already
 OK in the backup, which is much quicker. I have no idea if this
 could be applied to Dovecot, but who knows.


Dovecot's mdbox files already contain multiple messages in each file, so it 
should be a lot more efficient to do backups on those. And each message in an 
mdbox file can be compressed if the zlib plugin is enabled. So I think that 
sounds quite a lot like what you propose.

Re: [Dovecot] Questions about single intance storage

2011-12-03 Thread Timo Sirainen
On 3.12.2011, at 22.30, Maria Arrea wrote:

 We are using dovecot 2.0.16 with mdbox+zlib. We are now testing SiS (Single 
 Instance Storage) and I have 5 questions:
 
 1º Is it possible to dedup existing mdboxes?

You can dsync the mailbox elsewhere and then replace the original with the new 
copy.

 2º Are attachments compressed with zlib if mdboxes already use zlib?

Currently attachments don't support zlib at all.

 3º I have plenty of CPU to spare; should I use a low value of 
 mail_attachment_min_size, like 16KB?

It wastes disk seeks since it now has to read mail from 2 (or more) places in 
the filesystem, so probably not a good idea. In any case SiS most likely 
increases your disk IOPS usage.

 4º Can I undo SiS if I have problems?

dsync will help the other way around too.

 5º We use bacula to save indexes and mdboxes, and we recover mailboxes using 
 doveadm import when a user makes a fatal mistake wiping all her Inbox. If 
 we enable SiS I am not really sure how we can safely restore a user's INBOX 
 if that user has SiS attachments.

Hm. Yes, that is problematic. Even if you knew what SIS files were used, 
there's no simple way to restore those with proper refcounts. I think what 
really should be done is writing a tool that can create/restore backups, 
possibly de-SISing the attachments.