Test results for [RFC PATCH v10 00/16] Online(inband) data deduplication

2014-04-14 Thread Konstantinos Skarlatos
Hello, here are the results from my testing of the latest btrfs dedup patches. TL;DR: I rsynced 10 separate copies of a 3.8GB folder of 138 RAW photographs (23-36MiB each) onto a btrfs volume with dedup enabled. On the first try, the copy was very slow, and a sync after that took over 10

Re: [RFC PATCH v10 00/16] Online(inband) data deduplication

2014-04-11 Thread Martin Steigerwald
, Liu Bo wrote: Hello, This is the 10th attempt for in-band data dedupe, based on Linux _3.14_ kernel. Data deduplication is a specialized data compression technique for eliminating duplicate copies of repeating data.[1] This patch set is also related to Content based storage

Re: [RFC PATCH v10 00/16] Online(inband) data deduplication

2014-04-11 Thread Liu Bo
dedupe, based on Linux _3.14_ kernel. Data deduplication is a specialized data compression technique for eliminating duplicate copies of repeating data.[1] This patch set is also related to Content based storage in project ideas[2], it introduces inband data deduplication

Re: [RFC PATCH v10 00/16] Online(inband) data deduplication

2014-04-11 Thread Michael
Hi Liu, thanks for your work. Each test copies a 2GB file from sdf (btrfs) to sde (btrfs with dedup, 4k block size). Before every test I recreate the filesystem. On the second write all is good. Test 1: nodesize = leafsize = 4k, write overhead ~ x1.5. Test 2: nodesize = leafsize = 16k, write overhead ~ x19

Re: [RFC PATCH v10 00/16] Online(inband) data deduplication

2014-04-10 Thread Konstantinos Skarlatos
On 10/4/2014 6:48 AM, Liu Bo wrote: Hello, This is the 10th attempt for in-band data dedupe, based on Linux _3.14_ kernel. Data deduplication is a specialized data compression technique for eliminating duplicate copies of repeating data.[1] This patch set is also related to Content based storage

Re: [RFC PATCH v10 00/16] Online(inband) data deduplication

2014-04-10 Thread Liu Bo
On Thu, Apr 10, 2014 at 12:08:17PM +0300, Konstantinos Skarlatos wrote: On 10/4/2014 6:48 AM, Liu Bo wrote: Hello, This is the 10th attempt for in-band data dedupe, based on Linux _3.14_ kernel. Data deduplication is a specialized data compression technique for eliminating duplicate copies

Re: [RFC PATCH v10 00/16] Online(inband) data deduplication

2014-04-10 Thread Liu Bo
, This is the 10th attempt for in-band data dedupe, based on Linux _3.14_ kernel. Data deduplication is a specialized data compression technique for eliminating duplicate copies of repeating data.[1] This patch set is also related to Content based storage in project ideas[2], it introduces inband data

[RFC PATCH v9 00/16] Online(inband) data deduplication

2014-04-09 Thread Liu Bo
Hello, This is the 9th attempt for in-band data dedupe. Data deduplication is a specialized data compression technique for eliminating duplicate copies of repeating data.[1] This patch set is also related to Content based storage in project ideas[2], it introduces inband data deduplication

[RFC PATCH v10 00/16] Online(inband) data deduplication

2014-04-09 Thread Liu Bo
Hello, This is the 10th attempt for in-band data dedupe, based on Linux _3.14_ kernel. Data deduplication is a specialized data compression technique for eliminating duplicate copies of repeating data.[1] This patch set is also related to Content based storage in project ideas[2], it introduces

Re: [RFC PATCH v8 00/14] Online(inband) data deduplication

2014-02-26 Thread Jannis Achstetter
Jannis Achstetter jannis_achstetter at web.de writes: I tried your btrfs deduplication patches today (on top of 3.13.2-gentoo) and it seems that the deduplication works great (when copying the same or similar data to the file system, the used size reported by df -h grows less than the data

Re: [RFC PATCH v8 00/14] Online(inband) data deduplication

2014-02-26 Thread Liu Bo
Hi Jannis, On Wed, Feb 26, 2014 at 08:20:01PM +, Jannis Achstetter wrote: Jannis Achstetter jannis_achstetter at web.de writes: I tried your btrfs deduplication patches today (on top of 3.13.2-gentoo) and it seems that the deduplication works great (when copying the same or similar

Re: [RFC PATCH v8 00/14] Online(inband) data deduplication

2014-02-25 Thread Jannis Achstetter
Hello Liu, hello list, Liu Bo bo.li.liu at oracle.com writes: Here is the New Year patch bomb I tried your btrfs deduplication patches today (on top of 3.13.2-gentoo) and it seems that the deduplication works great (when copying the same or similar data to the file system, the used size

Re: [RFC PATCH v8 00/14] Online(inband) data deduplication

2014-02-25 Thread Jannis Achstetter
Jannis Achstetter jannis_achstetter at web.de writes: Hello Liu, hello list, Liu Bo bo.li.liu at oracle.com writes: Here is the New Year patch bomb Some more info I forgot: I set the dedup block size to 128k but I forgot it the first time: btrfs dedup enable /mnt/steamdir btrfs

Re: [RFC PATCH v8 00/14] Online(inband) data deduplication

2014-01-02 Thread Konstantinos Skarlatos
, Liu Bo wrote: Hello, Here is the New Year patch bomb :-) Data deduplication is a specialized data compression technique for eliminating duplicate copies of repeating data.[1] This patch set is also related to Content based storage in project ideas[2], it introduces inband data deduplication

[RFC PATCH v8 00/14] Online(inband) data deduplication

2013-12-30 Thread Liu Bo
Hello, Here is the New Year patch bomb :-) Data deduplication is a specialized data compression technique for eliminating duplicate copies of repeating data.[1] This patch set is also related to Content based storage in project ideas[2], it introduces inband data deduplication for btrfs

Re: [RFC PATCH v7 00/13] Online(inband) data deduplication

2013-10-22 Thread Aurelien Jarno
Hi, On Mon, Oct 14, 2013 at 12:59:42PM +0800, Liu Bo wrote: Data deduplication is a specialized data compression technique for eliminating duplicate copies of repeating data.[1] This patch set is also related to Content based storage in project ideas[2]. PATCH 1 is a hang fix

Re: [RFC PATCH v7 00/13] Online(inband) data deduplication

2013-10-22 Thread Liu Bo
On Tue, Oct 22, 2013 at 08:55:24PM +0200, Aurelien Jarno wrote: Hi, On Mon, Oct 14, 2013 at 12:59:42PM +0800, Liu Bo wrote: Data deduplication is a specialized data compression technique for eliminating duplicate copies of repeating data.[1] This patch set is also related

Re: [RFC PATCH v7 00/13] Online(inband) data deduplication

2013-10-22 Thread Liu Bo
(Cced: David) On Wed, Oct 23, 2013 at 10:26:17AM +0800, Liu Bo wrote: On Tue, Oct 22, 2013 at 08:55:24PM +0200, Aurelien Jarno wrote: Hi, On Mon, Oct 14, 2013 at 12:59:42PM +0800, Liu Bo wrote: Data deduplication is a specialized data compression technique for eliminating

[RFC PATCH v7 00/13] Online(inband) data deduplication

2013-10-13 Thread Liu Bo
Data deduplication is a specialized data compression technique for eliminating duplicate copies of repeating data.[1] This patch set is also related to Content based storage in project ideas[2]. PATCH 1 is a hang fix with deduplication on, but it's also useful without dedup in practical use

Re: [RFC PATCH v6 5/5] Btrfs: online data deduplication

2013-09-09 Thread Liu Bo
Hi Dave, On Mon, Sep 02, 2013 at 06:19:42PM +0200, David Sterba wrote: I wanted to only comment on the ioctl and interface-to-userspace bits, but found more things to comment on in the kernel code. Sorry for the late reply (I'm on vacation these days). On Thu, Aug 08, 2013 at 04:35:45PM

Re: [RFC PATCH v6 5/5] Btrfs: online data deduplication

2013-09-02 Thread David Sterba
I wanted to only comment on the ioctl and interface-to-userspace bits, but found more things to comment on in the kernel code. On Thu, Aug 08, 2013 at 04:35:45PM +0800, Liu Bo wrote:
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -94,6 +95,9 @@ struct btrfs_ordered_sum;
/* for storing balance

[RFC PATCH v6 0/5] Online data deduplication

2013-08-08 Thread Liu Bo
Data deduplication is a specialized data compression technique for eliminating duplicate copies of repeating data.[1] This patch set is also related to Content based storage in project ideas[2]. PATCH 1 is a hang fix with deduplication on, but it's also useful without dedup in practical use

Re: [RFC PATCH v5 5/5] Btrfs: online data deduplication

2013-08-01 Thread Liu Bo
On Wed, Jul 31, 2013 at 03:50:50PM -0700, Zach Brown wrote:
+#define BTRFS_DEDUP_HASH_SIZE 32	/* 256bit = 32 * 8bit */
+#define BTRFS_DEDUP_HASH_LEN 4
+
+struct btrfs_dedup_hash_item {
+	/* FIXME: put a hash type field here */
+
+	__le64 hash[BTRFS_DEDUP_HASH_LEN];
+}

Re: [RFC PATCH v5 0/5] Online data deduplication

2013-08-01 Thread Liu Bo
On Wed, Jul 31, 2013 at 05:20:27PM -0400, Josef Bacik wrote: On Wed, Jul 31, 2013 at 11:37:40PM +0800, Liu Bo wrote: Data deduplication is a specialized data compression technique for eliminating duplicate copies of repeating data.[1] This patch set is also related to Content based

Re: [RFC PATCH v5 5/5] Btrfs: online data deduplication

2013-08-01 Thread Zach Brown
So do you mean that our whole hash value will be (key.objectid + bytes) because key.objectid is a part of the hash value? I think so, if I understood your question. The idea is to not store the bytes of the hash that make up the objectid more than once so the tree items are smaller. For example:
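
A sketch of the layout Zach describes (illustrative userspace C, not the patch's actual code; type and function names are invented):

#include <stdint.h>
#include <string.h>

/*
 * Split a 256-bit dedup hash so no byte is stored twice: the first
 * 8 bytes serve as the btrfs key objectid, and only the remaining
 * 24 bytes go into the tree item body.
 */
struct dedup_item_sketch {
	uint64_t objectid;  /* hash bytes 0..7, lives in the btrfs_key */
	uint8_t  rest[24];  /* hash bytes 8..31, lives in the item */
};

static void split_hash(const uint8_t hash[32], struct dedup_item_sketch *it)
{
	memcpy(&it->objectid, hash, sizeof(it->objectid));
	memcpy(it->rest, hash + 8, sizeof(it->rest));
}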

[RFC PATCH v5 0/5] Online data deduplication

2013-07-31 Thread Liu Bo
Data deduplication is a specialized data compression technique for eliminating duplicate copies of repeating data.[1] This patch set is also related to Content based storage in project ideas[2]. PATCH 1 is a hang fix with deduplication on, but it's also useful without dedup in practical use

Re: [RFC PATCH v5 0/5] Online data deduplication

2013-07-31 Thread Josef Bacik
On Wed, Jul 31, 2013 at 11:37:40PM +0800, Liu Bo wrote: Data deduplication is a specialized data compression technique for eliminating duplicate copies of repeating data.[1] This patch set is also related to Content based storage in project ideas[2]. PATCH 1 is a hang fix

Re: [RFC PATCH v5 5/5] Btrfs: online data deduplication

2013-07-31 Thread Zach Brown
+#define BTRFS_DEDUP_HASH_SIZE 32	/* 256bit = 32 * 8bit */
+#define BTRFS_DEDUP_HASH_LEN 4
+
+struct btrfs_dedup_hash_item {
+	/* FIXME: put a hash type field here */
+
+	__le64 hash[BTRFS_DEDUP_HASH_LEN];
+} __attribute__ ((__packed__));
The handling of hashes in this patch
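
As a sanity check of the constants quoted above (an illustrative userspace equivalent, with __le64 replaced by uint64_t):

#include <stdint.h>

#define BTRFS_DEDUP_HASH_SIZE 32	/* 256-bit hash */
#define BTRFS_DEDUP_HASH_LEN 4		/* stored as 4 x 64-bit words */

struct dedup_hash_item {
	uint64_t hash[BTRFS_DEDUP_HASH_LEN];
} __attribute__ ((__packed__));

/* 4 x 8 bytes = 32 bytes, exactly one 256-bit hash per item */
_Static_assert(sizeof(struct dedup_hash_item) == BTRFS_DEDUP_HASH_SIZE,
	       "hash item must hold the full 256-bit hash");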

Fwd: Online data deduplication

2013-07-31 Thread Hemanth Kumar
#2 FAILED at 542. 1 out of 2 hunks FAILED -- saving rejects to file include/uapi/linux/btrfs.h.rej On Tue, Jul 30, 2013 at 7:37 AM, Liu Bo bo.li@oracle.com wrote: On Mon, Jul 29, 2013 at 09:05:42PM +0530, Hemanth Kumar wrote: Hello, I am willing to perform QA on online data deduplication

Online data deduplication

2013-07-29 Thread Hemanth Kumar
Hello, I am willing to perform QA on online data deduplication. From where can I download the patches? -- Thanks, Hemanth Kumar H C

Re: Online data deduplication

2013-07-29 Thread Liu Bo
On Mon, Jul 29, 2013 at 09:05:42PM +0530, Hemanth Kumar wrote: Hello, I am willing to perform QA on online data deduplication. From where can I download the patches? Hi Hemanth Kumar H C, I really appreciate this :) Right now I'm planning the v5 patch set, which will come out probably

[RFC PATCH V4 0/2] Online data deduplication

2013-05-14 Thread Liu Bo
Data deduplication is a specialized data compression technique for eliminating duplicate copies of repeating data.[1] This patch set is also related to Content based storage in project ideas[2]. PATCH 1 is a hang fix with deduplication on, but it's also useful with no deduplication in practice

Re: [RFC PATCH v3 0/2] Online data deduplication

2013-05-03 Thread Liu Bo
You didn't use an INCOMPAT option for this so you need to deal with a user mounting the file system with an older kernel or even forgetting to use mount -o dedup. Otherwise your dedup tree will become out of date and you could corrupt people's data. So if you aren't going to use an

[RFC PATCH v3 0/2] Online data deduplication

2013-05-01 Thread Liu Bo
NOTE: This leads to a FORMAT CHANGE, DO NOT use it on real data! Data deduplication is a specialized data compression technique for eliminating duplicate copies of repeating data.[1] This patch set is also related to Content based storage in project ideas[2]. PATCH 1 is a hang fix when

[PATCH v3 2/2] Btrfs: online data deduplication

2013-05-01 Thread Liu Bo
(NOTE: This leads to a FORMAT CHANGE, DO NOT use it on real data.) This introduces the online data deduplication feature for btrfs. (1) WHY do we need deduplication? To improve our storage efficiency. (2) WHAT is deduplication? Two key ways for practical deduplication implementations

Re: [RFC PATCH v3 0/2] Online data deduplication

2013-05-01 Thread Josef Bacik
On Wed, May 01, 2013 at 10:27:36AM -0600, Liu Bo wrote: NOTE: This leads to a FORMAT CHANGE, DO NOT use it on real data! Data deduplication is a specialized data compression technique for eliminating duplicate copies of repeating data.[1] This patch set is also related to Content based

Re: [PATCH v3 2/2] Btrfs: online data deduplication

2013-05-01 Thread Gabriel de Perthuis
#define BTRFS_IOC_DEV_REPLACE _IOWR(BTRFS_IOCTL_MAGIC, 53, \
	struct btrfs_ioctl_dev_replace_args)
+#define BTRFS_IOC_DEDUP_REGISTER _IO(BTRFS_IOCTL_MAGIC, 54)
This number has already been used by the offline dedup patches.

[PATCH v2 0/2] Online data deduplication

2013-04-14 Thread Liu Bo
This is the second attempt for online data deduplication. NOTE: This leads to a FORMAT CHANGE, DO NOT use it on real data! Data deduplication is a specialized data compression technique for eliminating duplicate copies of repeating data.[1] This patch set is also related to Content based

[PATCH v2 1/2] Btrfs: online data deduplication

2013-04-14 Thread Liu Bo
(NOTE: This leads to a FORMAT CHANGE, DO NOT use it on real data.) This introduces the online data deduplication feature for btrfs. (1) WHY do we need deduplication? To improve our storage efficiency. (2) WHAT is deduplication? Two key ways for practical deduplication implementations

Re: [PATCH 1/2] Btrfs: online data deduplication

2013-04-10 Thread Marek Otahal
.) This introduces the online data deduplication feature for btrfs. (1) WHY do we need deduplication? To improve our storage efficiency. (2) WHAT is deduplication? Two key ways for practical deduplication implementations, * When the data is deduplicated (inband vs background

Re: [PATCH 1/2] Btrfs: online data deduplication

2013-04-10 Thread Liu Bo
NOT use it on real data.) This introduces the online data deduplication feature for btrfs. (1) WHY do we need deduplication? To improve our storage efficiency. (2) WHAT is deduplication? Two key ways for practical deduplication implementations, * When the data

Re: [PATCH 1/2] Btrfs: online data deduplication

2013-04-10 Thread David Sterba
On Mon, Apr 08, 2013 at 10:08:54PM +0800, Liu Bo wrote: Is it safe to use just 64 bits? I'd like to see better reasoning why this is ok. The limitation of btrfs_key to store only 1-2 64bit items is clear and must be handled, but it's IMO a critical design point. Actually I use the whole

Re: [PATCH 1/2] Btrfs: online data deduplication

2013-04-10 Thread David Sterba
On Tue, Apr 09, 2013 at 09:52:42AM +0800, Miao Xie wrote: On Mon, 8 Apr 2013 15:47:27 +0200, David Sterba wrote: This also depends on file data type and access patterns, fixing the dedup basic chunk size to one block does not IMHO fit most use cases. Maybe we can make btrfs (including

Re: [PATCH 1/2] Btrfs: online data deduplication

2013-04-08 Thread Josef Bacik
On Sun, Apr 07, 2013 at 07:12:48AM -0600, Liu Bo wrote: (NOTE: This leads to a FORMAT CHANGE, DO NOT use it on real data.) This introduces the online data deduplication feature for btrfs. (1) WHY do we need deduplication? To improve our storage efficiency. (2) WHAT is deduplication

Re: [PATCH 1/2] Btrfs: online data deduplication

2013-04-08 Thread David Sterba
On Sun, Apr 07, 2013 at 09:12:48PM +0800, Liu Bo wrote: (2) WHAT is deduplication? Two key ways for practical deduplication implementations, * When the data is deduplicated (inband vs background) * The granularity of the deduplication (block level vs file level)

Re: [PATCH 1/2] Btrfs: online data deduplication

2013-04-08 Thread Liu Bo
On Mon, Apr 08, 2013 at 03:47:27PM +0200, David Sterba wrote: On Sun, Apr 07, 2013 at 09:12:48PM +0800, Liu Bo wrote: (2) WHAT is deduplication? Two key ways for practical deduplication implementations, * When the data is deduplicated (inband vs background) * The

Re: [PATCH 1/2] Btrfs: online data deduplication

2013-04-08 Thread Liu Bo
On Mon, Apr 08, 2013 at 08:54:50AM -0400, Josef Bacik wrote: On Sun, Apr 07, 2013 at 07:12:48AM -0600, Liu Bo wrote: (NOTE: This leads to a FORMAT CHANGE, DO NOT use it on real data.) This introduces the online data deduplication feature for btrfs. (1) WHY do we need deduplication

Re: [PATCH 1/2] Btrfs: online data deduplication

2013-04-08 Thread Josef Bacik
On Mon, Apr 08, 2013 at 08:16:26AM -0600, Liu Bo wrote: On Mon, Apr 08, 2013 at 08:54:50AM -0400, Josef Bacik wrote: On Sun, Apr 07, 2013 at 07:12:48AM -0600, Liu Bo wrote: (NOTE: This leads to a FORMAT CHANGE, DO NOT use it on real data.) This introduces the online data deduplication

Re: [PATCH 1/2] Btrfs: online data deduplication

2013-04-08 Thread Liu Bo
On Mon, Apr 08, 2013 at 04:37:20PM -0400, Josef Bacik wrote: On Mon, Apr 08, 2013 at 08:16:26AM -0600, Liu Bo wrote: On Mon, Apr 08, 2013 at 08:54:50AM -0400, Josef Bacik wrote: On Sun, Apr 07, 2013 at 07:12:48AM -0600, Liu Bo wrote: [...]
+	__le64 dedup_hash;
+}

Re: [PATCH 1/2] Btrfs: online data deduplication

2013-04-08 Thread Miao Xie
On Mon, 8 Apr 2013 22:16:26 +0800, Liu Bo wrote: On Mon, Apr 08, 2013 at 08:54:50AM -0400, Josef Bacik wrote: On Sun, Apr 07, 2013 at 07:12:48AM -0600, Liu Bo wrote: (NOTE: This leads to a FORMAT CHANGE, DO NOT use it on real data.) This introduces the online data deduplication feature

Re: [PATCH 1/2] Btrfs: online data deduplication

2013-04-08 Thread Josef Bacik
On Mon, Apr 8, 2013 at 9:34 PM, Liu Bo bo.li@oracle.com wrote: On Mon, Apr 08, 2013 at 04:37:20PM -0400, Josef Bacik wrote: On Mon, Apr 08, 2013 at 08:16:26AM -0600, Liu Bo wrote: On Mon, Apr 08, 2013 at 08:54:50AM -0400, Josef Bacik wrote: On Sun, Apr 07, 2013 at 07:12:48AM -0600, Liu

Re: [PATCH 1/2] Btrfs: online data deduplication

2013-04-08 Thread Miao Xie
On Mon, 8 Apr 2013 15:47:27 +0200, David Sterba wrote: On Sun, Apr 07, 2013 at 09:12:48PM +0800, Liu Bo wrote: (2) WHAT is deduplication? Two key ways for practical deduplication implementations, * When the data is deduplicated (inband vs background) * The

[PATCH 0/2 RFC] Online data deduplication

2013-04-07 Thread Liu Bo
This is the first attempt for online data deduplication. NOTE: This leads to a FORMAT CHANGE, DO NOT use it on real data! Data deduplication is a specialized data compression technique for eliminating duplicate copies of repeating data.[1] This patch set is also related to Content based storage

[PATCH 1/2] Btrfs: online data deduplication

2013-04-07 Thread Liu Bo
(NOTE: This leads to a FORMAT CHANGE, DO NOT use it on real data.) This introduces the online data deduplication feature for btrfs. (1) WHY do we need deduplication? To improve our storage efficiency. (2) WHAT is deduplication? Two key ways for practical deduplication implementations

Re: btrfs vs data deduplication

2011-09-18 Thread Maciej Marcin Piechotka
On Sat, 2011-07-09 at 08:19 +0200, Paweł Brodacki wrote: Hello, I've stumbled upon this article: http://storagemojo.com/2011/06/27/de-dup-too-much-of-good-thing/ Reportedly the Sandforce SF1200 SSD controller does block-level data de-duplication internally. This effectively removes the

Re: btrfs vs data deduplication

2011-09-18 Thread Chris Samuel
On Mon, 19 Sep 2011, 06:15:51 EST, Hubert Kario hub...@kario.pl wrote: You shouldn't depend on a single drive; metadata raid is there to protect against single bad blocks, not a disk crash. I guess the issue here is you no longer even have that protection with this sort of dedup. cheers, Chris

btrfs vs data deduplication

2011-07-09 Thread Paweł Brodacki
Hello, I've stumbled upon this article: http://storagemojo.com/2011/06/27/de-dup-too-much-of-good-thing/ Reportedly the Sandforce SF1200 SSD controller does block-level data de-duplication internally. This effectively removes the additional protection given by writing multiple metadata copies. This

Re: Data Deduplication with the help of an online filesystem check

2009-06-05 Thread Tomasz Chmielewski
Chris Mason wrote: On Thu, Jun 04, 2009 at 10:49:19AM +0200, Thomas Glanzmann wrote: Hello Chris, My question is now, how often can a block in btrfs be referenced? The exact answer depends on whether we are referencing it from a single file or from multiple files. But either way it is roughly

Re: Data Deduplication with the help of an online filesystem check

2009-06-05 Thread Chris Mason
On Fri, Jun 05, 2009 at 02:20:48PM +0200, Tomasz Chmielewski wrote: Chris Mason wrote: On Thu, Jun 04, 2009 at 10:49:19AM +0200, Thomas Glanzmann wrote: Hello Chris, My question is now, how often can a block in btrfs be referenced? The exact answer depends on whether we are referencing it from

Re: Data Deduplication with the help of an online filesystem check

2009-06-05 Thread Tomasz Chmielewski
Chris Mason wrote: I wonder how well would deduplication work with defragmentation? One excludes the other to some extent. Very much so ;) Ideally we end up doing dedup in large extents, but it will definitely increase the overall fragmentation of the FS. Defragmentation could lead to

Re: Data Deduplication with the help of an online filesystem check

2009-06-04 Thread Thomas Glanzmann
Hello Chris, My question is now, how often can a block in btrfs be referenced? The exact answer depends on whether we are referencing it from a single file or from multiple files. But either way it is roughly 2^32. could you please explain to me what underlying data structure is used to

Re: Data Deduplication with the help of an online filesystem check

2009-05-24 Thread Thomas Glanzmann
Hello Heinz, Hi, during the last half year I thought a little bit about doing dedup for my backup program: not only with fixed blocks (which is implemented), but with moving blocks (with all offsets in a file: 1 byte, 2 byte, ...). That means I have to do *lots* of comparisons (size of

Re: Data Deduplication with the help of an online filesystem check

2009-05-06 Thread Sander
Heinz-Josef Claes wrote (ao): On Tuesday, 28 April 2009 19:38:24, Chris Mason wrote: On Tue, 2009-04-28 at 19:34 +0200, Thomas Glanzmann wrote: Hello, I wouldn't rely on crc32: it is not a strong hash. Such deduplication can lead to various problems, including security

Re: Data Deduplication with the help of an online filesystem check

2009-05-05 Thread Heinz-Josef Claes
On Tue, 5 May 2009 07:29:45 +1000 Dmitri Nikulin dniku...@gmail.com wrote: On Tue, May 5, 2009 at 5:11 AM, Heinz-Josef Claes hjcl...@web.de wrote: Hi, during the last half year I thought a little bit about doing dedup for my backup program: not only with fixed blocks (which is implemented),

Re: Data Deduplication with the help of an online filesystem check

2009-05-05 Thread Thomas Glanzmann
Hello Jan, * Jan-Frode Myklebust janfr...@tanso.net [090504 20:20]: thin or shallow clones sounds more like sparse images. I believe linked clones is the word for running multiple virtual machines off a single gold image. Ref, the VMware View Composer section of: not exactly. VMware has one

Re: Data Deduplication with the help of an online filesystem check

2009-05-04 Thread Tomasz Chmielewski
Ric Wheeler wrote: One thing in the above scheme that would be really interesting for all possible hash functions is maintaining good stats on hash collisions, effectiveness of the hash, etc. There has been a lot of press about MD5 hash collisions for example - it would be really neat to be

Re: Data Deduplication with the help of an online filesystem check

2009-05-04 Thread Thomas Glanzmann
Hello Ric, (1) Block level or file level dedup? What is the difference between the two? (2) Inband dedup (during a write) or background dedup? I think inband dedup is way too intensive on resources (memory) and would also kill every performance benchmark. So I think the offline dedup is the

Re: Data Deduplication with the help of an online filesystem check

2009-05-04 Thread Ric Wheeler
Thomas Glanzmann wrote: Hello Ric, (1) Block level or file level dedup? What is the difference between the two? (2) Inband dedup (during a write) or background dedup? I think inband dedup is way too intensive on resources (memory) and would also kill every performance benchmark. So I

Re: Data Deduplication with the help of an online filesystem check

2009-05-04 Thread Ric Wheeler
On 05/04/2009 10:39 AM, Tomasz Chmielewski wrote: Ric Wheeler wrote: One thing in the above scheme that would be really interesting for all possible hash functions is maintaining good stats on hash collisions, effectiveness of the hash, etc. There has been a lot of press about MD5 hash

Re: Data Deduplication with the help of an online filesystem check

2009-05-04 Thread Ric Wheeler
On 04/28/2009 01:41 PM, Michael Tharp wrote: Thomas Glanzmann wrote: no, I just used the md5 checksum. And even if I have a hash collision, which is highly unlikely, it still gives a good ballpark figure. I'd start with a crc32 and/or MD5 to find candidate blocks, then do a bytewise comparison

Re: Data Deduplication with the help of an online filesystem check

2009-05-04 Thread Thomas Glanzmann
Hello Andrey, As far as I understand, VMware already ships this gold image feature (as they call it) for Windows environments and claims it to be very efficient. they call it "thin or shallow clones" and ship it with desktop virtualization (one VM per thin-client user) and for VMware lab

Re: Data Deduplication with the help of an online filesystem check

2009-05-04 Thread Thomas Glanzmann
Ric, I would not categorize it as offline, but just not as inband (i.e., you can run a low priority background process to handle dedup). Offline windows are extremely rare in production sites these days and it could take a very long time to do dedup at the block level over a large file

Re: Data Deduplication with the help of an online filesystem check

2009-05-04 Thread Jan-Frode Myklebust
On 2009-05-04, Thomas Glanzmann tho...@glanzmann.de wrote: As far as I understand, VMware already ships this gold image feature (as they call it) for Windows environments and claims it to be very efficient. they call it "thin or shallow clones" thin or shallow clones sounds more like

Re: Data Deduplication with the help of an online filesystem check

2009-05-04 Thread Heinz-Josef Claes
Thomas Glanzmann schrieb: Ric, I would not categorize it as offline, but just not as inband (i.e., you can run a low priority background process to handle dedup). Offline windows are extremely rare in production sites these days and it could take a very long time to do dedup at

Re: Data Deduplication with the help of an online filesystem check

2009-05-04 Thread Andrey Kuzmin
On Mon, May 4, 2009 at 10:06 PM, Jan-Frode Myklebust janfr...@tanso.net wrote: Looking at the website content, it also revealed that VMware will have a similar feature for their workhorse "esx server" in the upcoming release; however, my point still stands. Ship out a service pack for

Re: Data Deduplication with the help of an online filesystem check

2009-05-04 Thread Dmitri Nikulin
On Tue, May 5, 2009 at 5:11 AM, Heinz-Josef Claes hjcl...@web.de wrote: Hi, during the last half year I thought a little bit about doing dedup for my backup program: not only with fixed blocks (which is implemented), but with moving blocks (with all offsets in a file: 1 byte, 2 byte, ...). That

Re: Data Deduplication with the help of an online filesystem check

2009-04-29 Thread Michael Tharp
Thomas Glanzmann wrote: Looking at this picture, when I'm going to implement the dedup code, do I also have to take care to spread the blocks over the different devices or is there already infrastructure in place that automates that process? If you somehow had blocks duplicated exactly across

Re: Data Deduplication with the help of an online filesystem check

2009-04-29 Thread Chris Mason
On Wed, 2009-04-29 at 14:03 +0200, Thomas Glanzmann wrote: Hello Chris, You can start with the code documentation section on http://btrfs.wiki.kernel.org I read through this and at the moment one question comes to mind:

Re: Data Deduplication with the help of an online filesystem check

2009-04-29 Thread Thomas Glanzmann
Hello Chris, But, in your ioctls you want to deal with [file, offset, len], not directly with block numbers. COW means that blocks can move around without you knowing, and some of the btrfs internals will COW files in order to relocate storage. So, what you want is a dedup file (or files)

Re: Data Deduplication with the help of an online filesystem check

2009-04-29 Thread Chris Mason
On Wed, 2009-04-29 at 15:58 +0200, Thomas Glanzmann wrote: Hello Chris, But, in your ioctls you want to deal with [file, offset, len], not directly with block numbers. COW means that blocks can move around without you knowing, and some of the btrfs internals will COW files in order to

Re: Data Deduplication with the help of an online filesystem check

2009-04-29 Thread Thomas Glanzmann
Hello Chris, Your database should know, and the ioctl could check to see if the source and destination already point to the same thing before doing anything expensive. I see. So, if I only have file, offset, len and not the block number, is there a way from userland to tell if two blocks

Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread jim owens
Andrey Kuzmin wrote: On Tue, Apr 28, 2009 at 2:02 PM, Chris Mason chris.ma...@oracle.com wrote: On Tue, 2009-04-28 at 07:22 +0200, Thomas Glanzmann wrote: Hello Chris, There is a btrfs ioctl to clone individual files, and this could be used to implement an online dedup. But, since it is
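
The clone ioctl referred to here is BTRFS_IOC_CLONE_RANGE. A minimal sketch of driving it from userspace to merge two ranges already known to be identical (struct copied inline rather than taken from a header; error handling elided; the ioctl clones without comparing, so the caller must verify the data matches first):

#include <stdint.h>
#include <sys/ioctl.h>

#define BTRFS_IOCTL_MAGIC 0x94

struct btrfs_ioctl_clone_range_args {
	int64_t  src_fd;
	uint64_t src_offset;
	uint64_t src_length;
	uint64_t dest_offset;
};
#define BTRFS_IOC_CLONE_RANGE \
	_IOW(BTRFS_IOCTL_MAGIC, 13, struct btrfs_ioctl_clone_range_args)

/* Point dst_fd's range at src_fd's extents; offsets and length must be
 * block-aligned. The duplicate blocks become shared; COW keeps later
 * writes to either file safe. */
static int dedup_range(int dst_fd, uint64_t dst_off,
		       int src_fd, uint64_t src_off, uint64_t len)
{
	struct btrfs_ioctl_clone_range_args args = {
		.src_fd      = src_fd,
		.src_offset  = src_off,
		.src_length  = len,
		.dest_offset = dst_off,
	};
	return ioctl(dst_fd, BTRFS_IOC_CLONE_RANGE, &args);
}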

Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Thomas Glanzmann
Chris, what blocksizes can I choose with btrfs? Do you think that it is possible for an outsider like me to submit patches to btrfs which enable dedup in three full-time days? Thomas

Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Tomasz Chmielewski
Thomas Glanzmann wrote: 300 Gbyte of used storage of several productive VMs with the following operating systems running: \begin{itemize} \item Red Hat Linux 32 and 64 Bit (Release 3, 4 and 5) \item SuSE Linux 32 and 64 Bit (SLES 9 and 10) \item Windows 2003 Std.

Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Edward Shishkin
Tomasz Chmielewski wrote: Thomas Glanzmann wrote: 300 Gbyte of used storage of several productive VMs with the following operating systems running: \begin{itemize} \item Red Hat Linux 32 and 64 Bit (Release 3, 4 and 5) \item SuSE Linux 32 and 64 Bit (SLES 9 and 10)

Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Thomas Glanzmann
Hello, I wouldn't rely on crc32: it is not a strong hash. Such deduplication can lead to various problems, including security ones. sure thing, did you think of replacing crc32 with sha1 or md5, is this even possible (is there enough space reserved so that the change can be done without

Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Thomas Glanzmann
Hello Chris, Is there a checksum for every block in btrfs? Yes, but they are only crc32c. I see, is it easily possible to exchange that with sha-1 or md5? Is it possible to retrieve these checksums from userland? Not today. The sage developers sent a patch to make an ioctl for this,

Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Chris Mason
On Tue, 2009-04-28 at 19:34 +0200, Thomas Glanzmann wrote: Hello, I wouldn't rely on crc32: it is not a strong hash. Such deduplication can lead to various problems, including security ones. sure thing, did you think of replacing crc32 with sha1 or md5, is this even possible (is there

Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Chris Mason
On Tue, 2009-04-28 at 19:37 +0200, Thomas Glanzmann wrote: Hello Chris, Is there a checksum for every block in btrfs? Yes, but they are only crc32c. I see, is it easily possible to exchange that with sha-1 or md5? Yes, but for the purposes of dedup, it's not exactly what you want.

Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Thomas Glanzmann
Hello, It is possible, there's room in the metadata for about 4k of checksum for each 4k of data. The initial btrfs code used sha256, but the real limiting factor is the CPU time used. I see. There are very efficient md5 implementations out there, for example, especially if the code is

Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Heinz-Josef Claes
On Tuesday, 28 April 2009 19:38:24, Chris Mason wrote: On Tue, 2009-04-28 at 19:34 +0200, Thomas Glanzmann wrote: Hello, I wouldn't rely on crc32: it is not a strong hash. Such deduplication can lead to various problems, including security ones. sure thing, did you think of

Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Michael Tharp
Thomas Glanzmann wrote: no, I just used the md5 checksum. And even if I have a hash collision, which is highly unlikely, it still gives a good ballpark figure. I'd start with a crc32 and/or MD5 to find candidate blocks, then do a bytewise comparison before actually merging them. Even the risk of
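
The two-stage check Michael describes comes down to something like this sketch (block size and weak-hash type are illustrative, not from the thread):

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 4096

/* Stage 1: a cheap hash (crc32/MD5) only nominates candidate pairs.
 * Stage 2: a bytewise compare makes the final call, so a weak-hash
 * collision costs one wasted memcmp instead of merging different data. */
static bool should_merge(uint32_t weak_a, uint32_t weak_b,
			 const uint8_t *blk_a, const uint8_t *blk_b)
{
	if (weak_a != weak_b)
		return false;	/* not even a candidate */
	return memcmp(blk_a, blk_b, BLOCK_SIZE) == 0;	/* verify, then merge */
}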

Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Thomas Glanzmann
Hello Chris, Right now the blocksize can only be the same as the page size. For this external dedup program you have in mind, you could use any multiple of the page size. perfect. Exactly what I need. Three days is probably not quite enough ;) I'd honestly prefer the dedup happen

Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Thomas Glanzmann
Hello Heinz, It's not only cpu time, it's also memory. You need 32 bytes for each 4k block. It needs to be in RAM for performance reasons. exactly, and that is not going to scale. Thomas
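
The arithmetic behind that objection, as a worked example (figures are illustrative: 32-byte hashes over 4 KiB blocks, 1 TiB of data):

#include <stdio.h>

int main(void)
{
	unsigned long long data   = 1ULL << 40;   /* 1 TiB of data */
	unsigned long long blocks = data / 4096;  /* 2^28 = 268,435,456 blocks */
	unsigned long long hashes = blocks * 32;  /* 2^33 bytes */

	printf("%llu blocks -> %llu GiB of hashes held in RAM\n",
	       blocks, hashes >> 30);  /* prints 8 GiB */
	return 0;
}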

Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Thomas Glanzmann
Hello, * Thomas Glanzmann tho...@glanzmann.de [090428 22:10]: exactly. And if there is a way to retrieve the already calculated checksums from kernel land, then it would be possible to implement a "system call" that gives the kernel a hint of a possible duplicated block (like providing a

Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Heinz-Josef Claes
On Tuesday, 28 April 2009 22:16:19, Thomas Glanzmann wrote: Hello Heinz, It's not only cpu time, it's also memory. You need 32 bytes for each 4k block. It needs to be in RAM for performance reasons. exactly, and that is not going to scale. Thomas Hi Thomas, I wrote a backup

Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Chris Mason
On Tue, 2009-04-28 at 22:52 +0200, Thomas Glanzmann wrote: Hello Heinz, I wrote a backup tool which uses dedup, so I know a little bit about the problem and the performance impact if the checksums are not in memory (optionally in that tool). http://savannah.gnu.org/projects/storebackup

Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Thomas Glanzmann
Hello, - Implement a system call that reports all checksums and unique block identifiers for all stored blocks. This would require storing the larger checksums in the filesystem. It is much better done in the dedup program. I think I misunderstood something here. I

Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Dmitri Nikulin
On Wed, Apr 29, 2009 at 3:43 AM, Chris Mason chris.ma...@oracle.com wrote: So you need an extra index either way.  It makes sense to keep the crc32c csums for fast verification of the data read from disk and only use the expensive csums for dedup. What about self-healing? With only a CRC32 to
