[openchange][devel] openchangemapidump and OpenChange Backup Tools

Julien Kerihuel Mon, 10 Sep 2007 02:47:21 -0700

Hi List,

Following the recent discussion about OpenChange Backup Tools, I've
started to write a draft for the openchangemapidump tool. The objective
with this first attempt is to gather as much information as we need in
order to design the first reliable implementation. I should be able to
push preliminary test code soon.


1. Introduction
================

        1a. What is openchangemapidump?
                It is a tool designed to dump a user's mailbox store at
                object level from Exchange using MAPI. The tool needs to
                be designed so that data can be restored, inspected,
                migrated, browsed or searched easily.

        1b. Links to underlying ideas and concepts:
                http://wiki.openchange.org/index.php/OpenChangeForMAPIStoreBackup

        1c. New links about MAPI ID uniqueness and why we should use
        PR_SOURCE_KEY rather than PR_ENTRYID:
                http://www.tech-archive.net/Archive/Development/microsoft.public.win32.programmer.messaging/2006-10/msg00090.html
                http://support.microsoft.com/default.aspx?scid=kb;en-us;230749
                http://support.microsoft.com/kb/231160
                

2. Design and Architecture
==========================

        2a. Storage:
                * MAPI object hierarchy (container, items)
                        Accessing a particular MAPI object is mostly
                        about opening containers until we reach the
                        desired element. If we intend to provide an easy
                        MAPI store inspector/walker/restore tool, we
                        should preserve this hierarchy. This is the main
                        reason why we won't use MySQL for object storage
                        but an LDAP-like database (here, ldb).
                
                * LDB limitations and SQL backend for raw content
                        Nevertheless, LDB is not designed to handle
                        large data blobs. This means we shouldn't use
                        ldb to store large email content (PR_BODY,
                        PR_HTML, PR_RTF_COMPRESSED, attachments) but an
                        SQL database (sqlite) instead.

                * Preliminary conclusions: 
                        We will divide the backend storage into two
                        parts:
                                1. LDB database: stores the object
                                hierarchy with common properties, their
                                values, and links to SQL database
                                entries for large data blobs.
                                2. SQL database: stores raw object data.
                        This solution should let each layer serve a
                        specific purpose:
                                - Walking the MAPI store tree and
                                searching entries (LDB)
                                - Offline content searches and large
                                data blob storage (SQL)
                        I will publish the database model (ldb + sql) on
                        the Wiki.
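
As an illustration only, the two-part storage described above might
look like the following Python sketch. A plain dict stands in for the
LDB hierarchy (no ldb bindings are used), sqlite holds the raw data,
and the 1 KiB threshold, the table layout, the DN strings and every
function name are assumptions rather than part of the draft.

```python
import sqlite3

# Assumption: byte values larger than this go to SQL. Illustrative only.
BLOB_THRESHOLD = 1024


def open_blob_store(path=":memory:"):
    """Create the SQL side: one table holding raw property data."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS blobs ("
        "id INTEGER PRIMARY KEY, data BLOB NOT NULL)"
    )
    return db


def store_object(hierarchy, blobs, dn, props):
    """Record one object under its DN.

    Small values are kept inline in the hierarchy record. Large byte
    values are written to SQL and replaced by a ("blob", rowid) link.
    """
    record = {}
    for name, value in props.items():
        if isinstance(value, bytes) and len(value) > BLOB_THRESHOLD:
            cur = blobs.execute(
                "INSERT INTO blobs (data) VALUES (?)", (value,)
            )
            record[name] = ("blob", cur.lastrowid)
        else:
            record[name] = value
    hierarchy[dn] = record
    return record


def load_property(hierarchy, blobs, dn, name):
    """Return a property value, following the SQL link when present."""
    value = hierarchy[dn][name]
    if isinstance(value, tuple) and value[0] == "blob":
        row = blobs.execute(
            "SELECT data FROM blobs WHERE id = ?", (value[1],)
        ).fetchone()
        return row[0]
    return value
```

Walking and searching only touch the hierarchy records; the SQL table
is read only when a linked property is actually requested.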

        2b. Namespace for LDB records
                * MAPI EntryID uniqueness
                        This is the *only* parameter Microsoft
                        guarantees to be unique. As discussed on the
                        Wiki and documented on Microsoft's website,
                        unique doesn't mean permanent. We can
                        nevertheless rely on PR_SOURCE_KEY (a 22-byte
                        SBinary struct), whose first 16 bytes are the
                        MAPI store GUID and whose last 6 bytes are the
                        unique object UID. We should then be able to
                        build PR_ENTRYID from the PR_SOURCE_KEY of the
                        message and of its containing folder.
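
A minimal sketch of the PR_SOURCE_KEY layout described above (16 bytes
of store GUID followed by 6 bytes of object UID). The function names
and the printable record-name format are hypothetical; the draft does
not define a naming scheme.

```python
def split_source_key(source_key):
    """Split a 22-byte PR_SOURCE_KEY into (store GUID, object UID)."""
    if len(source_key) != 22:
        raise ValueError("expected 22 bytes, got %d" % len(source_key))
    return source_key[:16], source_key[16:]


def record_name(source_key):
    """Build a printable name from both parts (hypothetical format)."""
    guid, uid = split_source_key(source_key)
    return "%s-%s" % (guid.hex(), uid.hex())
```

Both parts are kept as raw bytes on purpose: the sketch makes no claim
about the byte order of the GUID fields.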
        
        2c. Checksum algorithm rather than custom property
                Since MAPI IDs (PR_ENTRYID or PR_SOURCE_KEY) are not
                permanent over the object's lifetime, we can't trust
                them when doing updates. One solution used by some
                other Exchange backup tools seems to be adding a custom
                property (a backup tool object UUID) and storing it on
                the Exchange server. We would rather use a checksum
                algorithm over specific properties which guarantee
                message uniqueness. This data will only be stored on
                the client side. The algorithm and the properties
                involved still need to be defined, but the
                openchangemapidump draft, once pushed to SVN, should
                provide an environment to test possible
                implementations.
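
One possible shape for such a checksum, as a sketch only. Neither the
algorithm nor the property set is defined yet, so SHA-256 and the three
property names below are placeholders chosen for illustration.

```python
import hashlib

# Placeholder selection: the actual property set is still undefined.
CHECKSUM_PROPS = (
    "PR_INTERNET_MESSAGE_ID",
    "PR_SUBJECT",
    "PR_CLIENT_SUBMIT_TIME",
)


def _to_bytes(value):
    """Normalise a property value to bytes before hashing."""
    if isinstance(value, bytes):
        return value
    return str(value).encode("utf-8")


def object_checksum(props, names=CHECKSUM_PROPS):
    """Hash the chosen properties in a fixed order.

    Missing properties count as empty. MAPI IDs are deliberately left
    out, so the result does not change when an ID changes.
    """
    digest = hashlib.sha256()
    for name in names:
        data = _to_bytes(props.get(name, b""))
        # Length-prefix each field so ("ab", "c") and ("a", "bc") differ.
        digest.update(len(data).to_bytes(4, "big"))
        digest.update(data)
    return digest.hexdigest()
```

The checksum is computed and kept on the client side only, which
matches the intent of not writing a custom property to the server.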
        
        2d. Snapshot backup
                openchangemapidump should first provide a snapshot
                backup of the Exchange mailbox store at a given point in
                time. The issue with large mailboxes is: how can we
                prevent modifications from occurring in a folder that
                has already been backed up while the process is still
                running? A possible solution, which needs to be
                investigated, would be to use notifications: monitor
                changes at folder level and update the database if
                changes occurred.
        
3. Limitations and Possible ways to fix them
============================================

        3a. Backup process speed needs improvement
                For the moment, the backup process is really slow,
                mostly because other things had priority at this stage.
                I'm nevertheless confident we can improve speed. The
                ideas below need to be tested, but they may be worth
                exploring:
                
                - Rather than using GetPropsAll, we could use
                GetPropList, filter the properties down to the smallest
                set we need to restore the object or compute checksums
                correctly, then call GetProps. Idea: two calls
                returning minimal content rather than a single one
                returning content we don't need.
        
                - We can delay fetching large data blobs (large content
                + attachments) to the last part of the process:
                        1. Create the object-level hierarchy
                        2. Filter items from the ldb database and access
                        them using their PR_ENTRYID.
                This method provides an easy way to track how much of
                the process is completed (using PR_MESSAGE_SIZE, for
                example) and possibly to add a callback.
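
The two ideas above can be sketched together as follows. This is not
libmapi code: FakeMessage, get_prop_list and get_props are hypothetical
stand-ins that only model the shape of GetPropList and GetProps, a dict
keyed by entry ID models opening an item by PR_ENTRYID, and the WANTED
and BLOBS sets are assumptions.

```python
# Every name here is a hypothetical stand-in; none of this is libmapi API.
WANTED = {"PR_ENTRYID", "PR_SUBJECT", "PR_MESSAGE_SIZE"}
BLOBS = ("PR_BODY", "PR_RTF_COMPRESSED")


class FakeMessage:
    """In-memory stand-in for a message living on the server."""

    def __init__(self, props):
        self._props = dict(props)

    def get_prop_list(self):
        # Models GetPropList: property names only, no values.
        return list(self._props)

    def get_props(self, names):
        # Models GetProps: values for the requested names only.
        return {name: self._props[name] for name in names}


def fetch_minimal(message, wanted=WANTED):
    """List the properties, then fetch only the ones we need."""
    names = [n for n in message.get_prop_list() if n in wanted]
    return message.get_props(names)


def backup(store, on_progress=None):
    """Run both passes over store, a dict of entry ID -> message."""
    # Pass 1: build the hierarchy records from small properties only.
    records = [fetch_minimal(message) for message in store.values()]
    total = sum(r.get("PR_MESSAGE_SIZE", 0) for r in records)
    done = 0
    # Pass 2: reopen each item by its entry ID and fetch the large blobs.
    for record in records:
        message = store[record["PR_ENTRYID"]]
        names = [n for n in message.get_prop_list() if n in BLOBS]
        record.update(message.get_props(names))
        done += record.get("PR_MESSAGE_SIZE", 0)
        if on_progress is not None:
            on_progress(done, total)
    return records
```

Because the total is known after pass 1, the callback can report
progress as a fraction of bytes rather than a count of items.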
         
        3b. Reliable update process:
                This really needs to be discussed:
                        - We can split the backup process into 3 stages:
                                1. Check for each container whether the
                                number of items (PR_CONTENT_COUNT) or
                                the total size has changed.
                        
                                2. Check if objects are still accessible
                                through their PR_ENTRYID.
                         
                                3. Update/Modify if necessary.
                        
                While some may say it is fuzzy, I think this approach
                can be improved to the point where we can consider it
                reliable.
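
The three stages above can be sketched as a pure comparison between the
saved snapshot and the live store. The data layout (folder name mapped
to count, size and a set of entry IDs) and the function name are
assumptions made for the example.

```python
def plan_update(saved, live):
    """Compare the saved snapshot with the live store.

    Both arguments map a folder name to a dict with the keys "count"
    (int), "size" (int) and "items" (set of entry IDs). Returns, for
    each folder that changed, the entry IDs to add and to remove.
    """
    plan = {}
    empty = {"count": 0, "size": 0, "items": set()}
    for folder, now in live.items():
        before = saved.get(folder, empty)
        # Stage 1: skip folders whose item count and size are unchanged.
        if (now["count"], now["size"]) == (before["count"], before["size"]):
            continue
        # Stage 2: saved items no longer reachable through their ID.
        missing = before["items"] - now["items"]
        # Stage 3: what has to be added to or dropped from the backup.
        new = now["items"] - before["items"]
        plan[folder] = {"add": sorted(new), "remove": sorted(missing)}
    return plan
```

Note that stage 1 in this sketch misses a folder where one item was
replaced by another of identical size, since neither the count nor the
total size changes. Folders deleted on the server are not handled.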

4. What does the code currently do?
===================================

        So far, I've written a very basic sample implementation:
        
        - It recursively browses the mailbox, starting at the Top
        Information Store and entering subfolders (some kind of hacked
        openchangeclient -m) until it reaches items.
        - The code creates and populates an LDB database, providing an
        LDAP-like hierarchy, and dumps each container and item with all
        its properties into the database (even content such as
        PR_RTF_COMPRESSED etc., but no attachments).
        - Finally, the current code provides a trivial database update
        (only new items, no checksum calculation).
        
        Before I push the code to SVN, I need to:
                - clean up the code and fix numerous memory leaks
                - add storage support for multi-valued properties and
                generate an LDIF schema file to handle isSingleValued
                - add some skeleton files for further use (checksum
                algorithm etc.)
                - add a sample sqlite backend implementation for content
                storage

Cheers,

Julien.

-- 
Julien Kerihuel
[EMAIL PROTECTED]
OpenChange Project Manager

GPG Fingerprint: 0B55 783D A781 6329 108A  B609 7EF6 FE11 A35F 1F79
