Re: [Gluster-devel] Roadmap for afr, ec

2015-09-18 Thread Dan Lambright


- Original Message -
> From: "Pranith Kumar Karampuri"
> To: "fanghuang data", "Gluster Devel", "Xavier Hernandez",
> "Dan Lambright"
> Sent: Friday, September 18, 2015 1:25:30 AM
> Subject: Re: [Gluster-devel] Roadmap for afr, ec
> 
> 
> 
> On 09/16/2015 03:42 PM, fanghuang.d...@yahoo.com wrote:
> > Hi Pranith,
> >
> > For the EC encoding/decoding algorithm, could we design a plug-in mechanism
> > so that users can choose their own algorithm or use a third-party library,
> > just like Ceph? I am also curious why the IDA algorithm was originally
> > chosen instead of the commonly used Reed-Solomon algorithm.
> Pluggability of algorithms is also planned. I never really checked
> which algorithm is used; I was under the impression that we are using
> nonsystematic Reed-Solomon erasure codes, as Dan (CCed) told me.

Reed-Solomon error correction is a general-purpose coding technique. It is used
with scratched compact discs and noisy WANs, as well as for erasure coding.

The way I read it, Rabin's IDA (information dispersal algorithm) describes a
process for coding files over networks (distributed systems), but I do not
think it mandates a particular coding algorithm. So you can plug Tornado
codes, XOR Cauchy codes, etc. into the scheme.

So my interpretation is that Xavi implemented a nonsystematic IDA using
Reed-Solomon encoding, and we would like to change the implementation to be
systematic with plug-in algorithms.

That is only my interpretation; I make no claim to be an expert.
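
For anyone unfamiliar with the systematic/nonsystematic distinction, here is a
toy sketch. It is not the Gluster EC code: it uses a small prime field (mod 257)
instead of a Galois field, only two data values and three fragments, and all
function names are made up for illustration.

```python
P = 257  # toy prime field; real erasure codes typically work in GF(2^8)

def encode_nonsystematic(d0, d1, xs=(1, 2, 3)):
    # Each fragment is the polynomial d0 + d1*x evaluated at a distinct point.
    # No fragment holds d0 or d1 verbatim.
    return [(d0 + d1 * x) % P for x in xs]

def decode_nonsystematic(y0, y1, x0, x1):
    # Solve the 2x2 system from any two fragments (values y at points x).
    inv = pow((x1 - x0) % P, P - 2, P)   # modular inverse via Fermat
    d1 = ((y1 - y0) * inv) % P
    d0 = (y0 - d1 * x0) % P
    return d0, d1

def encode_systematic(d0, d1):
    # The data appears unchanged, followed by one redundancy fragment.
    return [d0, d1, (d0 + d1) % P]

def recover_d0_systematic(d1, parity):
    # Only needed when the fragment holding d0 is lost.
    return (parity - d1) % P

if __name__ == "__main__":
    frags = encode_nonsystematic(10, 20)
    print(frags, decode_nonsystematic(frags[1], frags[2], 2, 3))
    print(encode_systematic(10, 20), recover_d0_systematic(20, 30))
```

The point of the sketch: with the systematic form a healthy read needs no
decoding at all, while the nonsystematic form has to solve for the data on
every read.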


Re: [Gluster-devel] Roadmap for afr, ec

2015-09-17 Thread Pranith Kumar Karampuri



On 09/16/2015 03:42 PM, fanghuang.d...@yahoo.com wrote:

> Hi Pranith,
>
> For the EC encoding/decoding algorithm, could we design a plug-in mechanism
> so that users can choose their own algorithm or use a third-party library,
> just like Ceph? I am also curious why the IDA algorithm was originally
> chosen instead of the commonly used Reed-Solomon algorithm.

Pluggability of algorithms is also planned. I never really checked
which algorithm is used; I was under the impression that we are using
nonsystematic Reed-Solomon erasure codes, as Dan (CCed) told me.


Pranith
  
Best Regards,

Fang Huang



On Monday, 14 September 2015, 16:30, Pranith Kumar Karampuri 
 wrote:

hi,

Here is a list of common improvements for both ec and afr planned over
the next few months:

1) Granular entry self-heals.
   At the moment both afr and ec do a lot of readdirs and lookups to
figure out the differences between directories before performing heals.
Kritika, Ravi, Anuradha and I are discussing how to prevent this.
The base algorithm is to store only the names that need heal in
.glusterfs/indices/entry-changes// as links to the base
file in .glusterfs/indices/entry-changes of the bricks. So only the
names that need to be healed will go through name heals.
We definitely want to complete this for 3.8.
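
As a rough in-memory model of the idea (not the on-disk layout, which is still
being designed), the index can be thought of as a map from a directory to the
set of names that still need heal. The class, method names and the directory
identifiers below are hypothetical.

```python
from collections import defaultdict

class EntryChangeIndex:
    """Toy model: remember only the names that need heal, per directory."""

    def __init__(self):
        self._pending = defaultdict(set)

    def mark(self, dir_id, name):
        # Called when an entry operation on `name` did not succeed everywhere.
        self._pending[dir_id].add(name)

    def names_to_heal(self, dir_id):
        # Self-heal walks only these names instead of a full readdir + lookups.
        return sorted(self._pending.get(dir_id, ()))

    def healed(self, dir_id, name):
        names = self._pending.get(dir_id)
        if names is None:
            return
        names.discard(name)
        if not names:
            del self._pending[dir_id]

if __name__ == "__main__":
    idx = EntryChangeIndex()
    idx.mark("dir-1", "a.txt")
    idx.mark("dir-1", "b.txt")
    idx.healed("dir-1", "a.txt")
    print(idx.names_to_heal("dir-1"))
```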

2) Granular data self-heals.
   At the moment, even if a single byte changes in a file, afr and ec
read the entire file to fix the problems. We are thinking of preventing
this by remembering in extended attributes where the changes happened in
the file. There will be a new extended attribute on the file which
represents a bitmap of the changes, where each bit represents a range that
needs healing. This extended attribute will have a maximum size it can
represent; the extra chunks will be represented like shards in
.glusterfs/indices/data-changes/> extended attribute on
this block will store ranges that need heals.

For example: if the maximum size of the extended attribute value is 4KB and
each bit represents 128KB (i.e. the first bit represents changes in the
offset range 0 to 128KB-1, the second bit 128KB to 256KB-1, etc.), a single
extended attribute can record changes to the first 4GB of the file (we are
thinking of dynamically increasing the size represented by each bit from,
say, 4k to 128k, but this is still in design). Changes at offsets from 4GB
up to 8GB will be stored in the extended attribute
of .glusterfs/indices/data-changes/. Changes at offsets
from 8GB up to 12GB will be stored in the extended attribute of
.glusterfs/indices/data-changes/, (please note that
these files are empty; they will just contain extended attributes) etc.
We want to complete this for 3.8 (stretch goal).
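
To make the arithmetic in the example concrete, here is a minimal sketch. The
constants mirror the 4KB / 128KB figures above; the function names and the
dict-of-bytearrays representation are assumptions for illustration only, not
the planned implementation.

```python
XATTR_MAX_BYTES = 4 * 1024               # assumed maximum xattr value size
BIT_RANGE = 128 * 1024                   # bytes of file data covered per bit
BITS_PER_XATTR = XATTR_MAX_BYTES * 8     # 32768 bits
CHUNK_SPAN = BITS_PER_XATTR * BIT_RANGE  # file bytes covered per xattr (4GB)

def locate(offset):
    """Return (chunk number, bit number) for a file offset.

    Chunk 0 would be the xattr on the file itself; chunks 1, 2, ... would be
    the xattrs on the empty index files."""
    chunk, within = divmod(offset, CHUNK_SPAN)
    return chunk, within // BIT_RANGE

def mark_dirty(bitmaps, offset, length):
    """Set the bit of every 128KB range touched by a write."""
    if length <= 0:
        return
    first = offset // BIT_RANGE
    last = (offset + length - 1) // BIT_RANGE
    for global_bit in range(first, last + 1):
        chunk, bit = divmod(global_bit, BITS_PER_XATTR)
        bitmap = bitmaps.setdefault(chunk, bytearray(XATTR_MAX_BYTES))
        bitmap[bit // 8] |= 1 << (bit % 8)

def dirty_ranges(bitmaps):
    """Yield (start, end) byte ranges, end exclusive, that need heal."""
    for chunk in sorted(bitmaps):
        bitmap = bitmaps[chunk]
        for bit in range(BITS_PER_XATTR):
            if bitmap[bit // 8] & (1 << (bit % 8)):
                start = chunk * CHUNK_SPAN + bit * BIT_RANGE
                yield (start, start + BIT_RANGE)

if __name__ == "__main__":
    bitmaps = {}
    mark_dirty(bitmaps, 128 * 1024 - 1, 2)   # straddles the first two ranges
    mark_dirty(bitmaps, CHUNK_SPAN + 5, 1)   # lands in the second chunk
    print(list(dirty_ranges(bitmaps)))
```

With this, a one-byte change costs a heal of one 128KB range rather than the
whole file.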

3) Performance & throttling improvements for self-heal:
   We are also looking into the multi-threaded self-heal daemon patch
by Richard for inclusion in 3.8. We are waiting for the discussions by
Raghavendra G on QoS to be over before coming to any decisions on
throttling.

After we have compound fops:
The goal here is to come up with compound fops and prevent unnecessary
round trips:
4) Transaction latency improvements:
   On afr:
     In the unoptimized version of the transaction we have:
       1) Lock 2) Pre-op 3) Op 4) Post-op 5) Unlock
     We will have:
       1) Lock 2) Pre-op + op 3) Post-op + unlock
     This reduces round trips from 5 to 3 in the unoptimized version
     of afr-transaction.
   On EC:
     In the unoptimized version (worst case of an unaligned write) of the
     transaction we have:
       1) Lock 2) Get version, size xattrs 3) Reads of pre, post unaligned
       chunks 4) Op 5) Update version, size 6) Unlock
     We will have:
       1) Lock + get version, size xattrs + reads of pre, post unaligned
       chunks 2) Op 3) Update version, size + unlock
     This reduces round trips from 6 to 3 in the unoptimized version
     of ec-transaction.

5) Entry self-heal per-name latency improvements:
   Before: 1) Lock 2) Lookup to determine if the file needs to be
   deleted/created 3) Create/delete 4) Unlock
   After: 1) Lock + lookup 2) Delete/create + unlock
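
The round-trip counts for items 4 and 5 can be written down as a small model.
This is only bookkeeping, not an implementation of compound fops: each inner
list stands for one network round trip, and the fop labels and function names
are illustrative.

```python
# Each inner list is one round trip; fops in the same list are assumed to
# travel together as a single compound fop.
PLANS = {
    "afr": {
        "before": [["lock"], ["pre-op"], ["op"], ["post-op"], ["unlock"]],
        "after": [["lock"], ["pre-op", "op"], ["post-op", "unlock"]],
    },
    "ec (unaligned write)": {
        "before": [["lock"], ["get version, size xattrs"],
                   ["read pre, post unaligned chunks"], ["op"],
                   ["update version, size"], ["unlock"]],
        "after": [["lock", "get version, size xattrs",
                   "read pre, post unaligned chunks"],
                  ["op"], ["update version, size", "unlock"]],
    },
    "entry self-heal (per name)": {
        "before": [["lock"], ["lookup"], ["create/delete"], ["unlock"]],
        "after": [["lock", "lookup"], ["create/delete", "unlock"]],
    },
}

def round_trips(plan):
    return len(plan)

def flatten(plan):
    return [fop for group in plan for fop in group]

def estimated_latency_ms(plan, rtt_ms):
    # Ignores server-side processing time; only network round trips count.
    return round_trips(plan) * rtt_ms

if __name__ == "__main__":
    for name, plan in PLANS.items():
        # Compounding must not drop or reorder any fop.
        assert flatten(plan["before"]) == flatten(plan["after"])
        print(name, round_trips(plan["before"]), "->",
              round_trips(plan["after"]))
```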

Roadmap that applies only to EC, for 3.8:
- Use SSE2/AVX/NEON extensions when available to speed up Galois Field
calculations
- Use a systematic matrix to improve encoding performance (it will also
improve decoding performance when all bricks are healthy)
- Implement a new algorithm able to detect and repair chunks of data on
the fly.
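
For context on the first item, this is roughly what a Galois Field
multiplication looks like in scalar form. It is a sketch only: GF(2^8) with
the polynomial 0x11D is a common choice but is an assumption here, not
necessarily what the EC translator uses, and the function names are made up.
A SIMD version would apply the same per-byte operation to many bytes at once.

```python
GF_POLY = 0x11D  # x^8 + x^4 + x^3 + x^2 + 1 (assumed reduction polynomial)

def gf_mul(a, b):
    """Multiply two elements of GF(2^8): carry-less multiply, then reduce."""
    result = 0
    while b:
        if b & 1:
            result ^= a          # addition in GF(2^8) is XOR
        a <<= 1
        if a & 0x100:
            a ^= GF_POLY         # reduce back into 8 bits
        b >>= 1
    return result

def mul_table(c):
    """256-entry lookup table for multiplying any byte by the constant c."""
    return bytes(gf_mul(c, x) for x in range(256))

def mul_buffer(c, data):
    """Multiply every byte of a buffer by c using the lookup table."""
    return data.translate(mul_table(c))

if __name__ == "__main__":
    print(gf_mul(3, 7), mul_buffer(2, bytes([1, 0x80])).hex())
```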

Roadmap that applies only to AFR:
1) Once granular entry/data heals and throttling are in, we can look at
generalizing Richard's lazy replication patch to be used for
near-synchronous replication between data centers and possibly just
between the bricks. I haven't looked into the patch myself.

We will send out more mails as soon as the design is complete for each
of these items. We are eagerly waiting for Xavi to come back to get his
comments as well on how EC will be impacted by the common changes.

Re: [Gluster-devel] Roadmap for afr, ec

2015-09-16 Thread fanghuang.data
Hi Pranith,

For the EC encoding/decoding algorithm, could we design a plug-in mechanism
so that users can choose their own algorithm or use a third-party library,
just like Ceph? I am also curious why the IDA algorithm was originally
chosen instead of the commonly used Reed-Solomon algorithm.
 
Best Regards,
Fang Huang

