Re: [Gluster-devel] Disperse volume : Sequential Writes

2017-07-05 Thread Pranith Kumar Karampuri
On Tue, Jul 4, 2017 at 1:39 PM, Xavier Hernandez 
wrote:

> Hi Pranith,
>
> On 03/07/17 05:35, Pranith Kumar Karampuri wrote:
>
>> Ashish, Xavi,
>>I think it is better to implement this change as a separate
>> read-after-write caching xlator which we can load between EC and client
>> xlator. That way EC will not get a lot more functionality than necessary
>> and may be this xlator can be used somewhere else in the stack if
>> possible.
>>
>
> while this seems a good way to separate functionalities, it has a big
> problem. If we add a caching xlator between ec and *all* of its subvolumes,
> it will only be able to cache encoded data. So, when ec needs the "cached"
> data, it will need to issue a request to each of its subvolumes and compute
> the decoded data before being able to use it, so we don't avoid the
> decoding overhead.
>
> Also, if we want to make the xlator generic, it will probably cache a lot
> more data than ec really needs. Increasing memory footprint considerably
> for no real use.
>
> Additionally, this new xlator will need to guarantee that the cached data
> is current, so it will need its own locking logic (that would be another
> copy&paste of the existing logic in one of the current xlators) which is
> slow and difficult to maintain, or it will need to intercept and reuse
> locking calls from parent xlators, which can be quite complex since we have
> multiple xlator levels where locks can be taken, not only ec.
>
> This is a relatively simple change to make inside ec, but a very complex
> change (IMO) if we want to do it as a stand-alone xlator and be generic
> enough to be reused and work safely in other places of the stack.
>
> If we want to separate functionalities I think we should create a new
> concept of xlator which is transversal to the "traditional" xlator stack.
>
> Current xlators are linear in the sense that each one operates only at one
> place (it can be moved by reconfiguration, but once instantiated, it always
> work at the same place) and passes data to the next one.
>
> A transversal xlator (or maybe a service xlator would be better) would be
> one not bound to any place of the stack, but could be used by all other
> xlators to implement some service, like caching, multithreading, locking,
> ... these are features that many xlators need but cannot use easily (nor
> efficiently) if they are implicitly implemented in some specific place of
> the stack outside its control.
>
> The transaction framework we already talked, could be though as one of
> these service xlators. Multithreading could also benefit of this approach
> because xlators would have more control about what things can be processed
> by a background thread and which ones not. Probably there are other
> features that could benefit from this approach.
>
> In the case of brick multiplexing, if some xlators are removed from each
> stack and loaded as global services, most probably the memory footprint
> will be lower and the resource usage more optimized.
>

I like the service xlator approach, but I don't think we have enough time
to make it operational in the short term. Let us go with implementing this
feature inside EC for now. I didn't realize the extra cost of decoding when
I thought about the separation, so I guess we will stick with the old idea.


>
> Just an idea...
>
> Xavi
>
>
>> On Fri, Jun 16, 2017 at 4:19 PM, Ashish Pandey <aspan...@redhat.com> wrote:
>>
>>
>> I think it should be done as we have agreement on basic design.
>>
>> 
>> 
>> *From: *"Pranith Kumar Karampuri" <pkara...@redhat.com>
>> *To: *"Xavier Hernandez" <xhernan...@datalab.es>
>> *Cc: *"Ashish Pandey" <aspan...@redhat.com>, "Gluster Devel" <gluster-devel@gluster.org>
>> *Sent: *Friday, June 16, 2017 3:50:09 PM
>> *Subject: *Re: [Gluster-devel] Disperse volume : Sequential Writes
>>
>>
>>
>>
>> On Fri, Jun 16, 2017 at 3:12 PM, Xavier Hernandez <xhernan...@datalab.es> wrote:
>>
>> On 16/06/17 10:51, Pranith Kumar Karampuri wrote:
>>
>>
>>
>> On Fri, Jun 16, 2017 at 12:02 PM, Xavier Hernandez <xhernan...@datalab.es> wrote:
>>
>>   

Re: [Gluster-devel] Disperse volume : Sequential Writes

2017-07-04 Thread Xavier Hernandez

Hi Pranith,

On 03/07/17 05:35, Pranith Kumar Karampuri wrote:

Ashish, Xavi,
   I think it is better to implement this change as a separate
read-after-write caching xlator which we can load between EC and client
xlator. That way EC will not get a lot more functionality than necessary
and may be this xlator can be used somewhere else in the stack if possible.


While this seems a good way to separate functionalities, it has a big 
problem. If we add a caching xlator between ec and *all* of its 
subvolumes, it will only be able to cache encoded data. So, when ec 
needs the "cached" data, it will still need to issue a request to each of its 
subvolumes and compute the decoded data before being able to use it, so 
we don't avoid the decoding overhead.


Also, if we want to make the xlator generic, it will probably cache a 
lot more data than ec really needs, increasing the memory footprint 
considerably for no real benefit.


Additionally, this new xlator will need to guarantee that the cached 
data is current, so it will need its own locking logic (that would be 
another copy&paste of the existing logic in one of the current xlators) 
which is slow and difficult to maintain, or it will need to intercept 
and reuse locking calls from parent xlators, which can be quite complex 
since we have multiple xlator levels where locks can be taken, not only ec.


This is a relatively simple change to make inside ec, but a very complex 
change (IMO) if we want to do it as a stand-alone xlator and be generic 
enough to be reused and work safely in other places of the stack.


If we want to separate functionalities I think we should create a new 
concept of xlator which is transversal to the "traditional" xlator stack.


Current xlators are linear in the sense that each one operates only at 
one place (it can be moved by reconfiguration, but once instantiated, it 
always works at the same place) and passes data to the next one.


A transversal xlator (or maybe "service xlator" would be a better name) would 
be one not bound to any place in the stack, but one that could be used by all 
other xlators to implement some service, like caching, multithreading, 
locking, etc. These are features that many xlators need but cannot use 
easily (nor efficiently) if they are implicitly implemented in some 
specific place of the stack outside their control.


The transaction framework we already talked about could be thought of as one of 
these service xlators. Multithreading could also benefit from this 
approach because xlators would have more control over which things can 
be processed by a background thread and which cannot. Probably there 
are other features that could benefit from this approach as well.


In the case of brick multiplexing, if some xlators are removed from each 
stack and loaded as global services, the memory footprint will most 
probably be lower and resource usage better optimized.


Just an idea...

Xavi



On Fri, Jun 16, 2017 at 4:19 PM, Ashish Pandey <aspan...@redhat.com> wrote:


I think it should be done as we have agreement on basic design.


*From: *"Pranith Kumar Karampuri" <pkara...@redhat.com>
*To: *"Xavier Hernandez" <xhernan...@datalab.es>
*Cc: *"Ashish Pandey" <aspan...@redhat.com>, "Gluster Devel" <gluster-devel@gluster.org>
*Sent: *Friday, June 16, 2017 3:50:09 PM
*Subject: *Re: [Gluster-devel] Disperse volume : Sequential Writes




On Fri, Jun 16, 2017 at 3:12 PM, Xavier Hernandez <xhernan...@datalab.es> wrote:

On 16/06/17 10:51, Pranith Kumar Karampuri wrote:



On Fri, Jun 16, 2017 at 12:02 PM, Xavier Hernandez <xhernan...@datalab.es> wrote:

On 15/06/17 11:50, Pranith Kumar Karampuri wrote:



On Thu, Jun 15, 2017 at 11:51 AM, Ashish Pandey <aspan...@redhat.com> wrote:

Hi All,

We have been facing some issues in disperse (EC)
volume.
We know that currently EC is not good for random
IO as it
requires
READ-MODIFY-WRITE fop
cycle if an offset and offset+length falls in
the middle of
strip size.

Unfortunately, it could also happen with
sequential writes.

Re: [Gluster-devel] Disperse volume : Sequential Writes

2017-07-03 Thread Ashish Pandey

I think it is a good idea. 
Maybe we can add more enhancements to this xlator to improve things in the future. 

- Original Message -

From: "Pranith Kumar Karampuri"  
To: "Ashish Pandey"  
Cc: "Xavier Hernandez" , "Gluster Devel" 
 
Sent: Monday, July 3, 2017 9:05:54 AM 
Subject: Re: [Gluster-devel] Disperse volume : Sequential Writes 

Ashish, Xavi, 
I think it is better to implement this change as a separate read-after-write 
caching xlator which we can load between EC and client xlator. That way EC will 
not get a lot more functionality than necessary and may be this xlator can be 
used somewhere else in the stack if possible. 

On Fri, Jun 16, 2017 at 4:19 PM, Ashish Pandey < aspan...@redhat.com > wrote: 




I think it should be done as we have agreement on basic design. 


From: "Pranith Kumar Karampuri" < pkara...@redhat.com > 
To: "Xavier Hernandez" < xhernan...@datalab.es > 
Cc: "Ashish Pandey" < aspan...@redhat.com >, "Gluster Devel" < 
gluster-devel@gluster.org > 
Sent: Friday, June 16, 2017 3:50:09 PM 
Subject: Re: [Gluster-devel] Disperse volume : Sequential Writes 




On Fri, Jun 16, 2017 at 3:12 PM, Xavier Hernandez < xhernan...@datalab.es > 
wrote: 


On 16/06/17 10:51, Pranith Kumar Karampuri wrote: 




On Fri, Jun 16, 2017 at 12:02 PM, Xavier Hernandez 
< xhernan...@datalab.es > wrote: 

On 15/06/17 11:50, Pranith Kumar Karampuri wrote: 



On Thu, Jun 15, 2017 at 11:51 AM, Ashish Pandey < aspan...@redhat.com > wrote: 

Hi All, 

We have been facing some issues in disperse (EC) volume. 
We know that currently EC is not good for random IO as it 
requires 
READ-MODIFY-WRITE fop 
cycle if an offset and offset+length falls in the middle of 
strip size. 

Unfortunately, it could also happen with sequential writes. 
Consider an EC volume with configuration 4+2. The stripe 
size for 
this would be 512 * 4 = 2048. That is, 2048 bytes of user data 
stored in one stripe. 
Let's say 2048 + 512 = 2560 bytes are already written on this 
volume. 512 Bytes would be in second stripe. 
Now, if there are sequential writes with offset 2560 and of 
size 1 
Byte, we have to read the whole stripe, encode it with 1 
Byte and 
then again have to write it back. 
Next, write with offset 2561 and size of 1 Byte will again 
READ-MODIFY-WRITE the whole stripe. This is causing bad 
performance. 

There are some tools and scenario's where such kind of load is 
coming and users are not aware of that. 
Example: fio and zip 

Solution: 
One possible solution to deal with this issue is to keep 
last stripe 
in memory. 
This way, we need not to read it again and we can save READ fop 
going over the network. 
Considering the above example, we have to keep last 2048 bytes 
(maximum) in memory per file. This should not be a big 
deal as we already keep some data like xattr's and size info in 
memory and based on that we take decisions. 

Please provide your thoughts on this and also if you have 
any other 
solution. 


Just adding more details. 
The stripe will be in memory only when lock on the inode is active. 


I think that's ok. 

One 
thing we are yet to decide on is: do we want to read the stripe 
everytime we get the lock or just after an extending write is 
performed. 
I am thinking keeping the stripe in memory just after an 
extending write 
is better as it doesn't involve extra network operation. 


I wouldn't read the last stripe unconditionally every time we lock 
the inode. There's no benefit at all on random writes (in fact it's 
worse) and a sequential write will issue the read anyway when 
needed. The only difference is a small delay for the first operation 
after a lock. 


Yes, perfect. 



What I would do is to keep the last stripe of every write (we can 
consider to do it per fd), even if it's not the last stripe of the 
file (to also optimize sequential rewrites). 


Ah! good point. But if we remember it per fd, one fd's cached data can 
be over-written by another fd on the disk so we need to also do cache 
invalidation. 



We only cache data if we have the inodelk, so all related fd's must be from the 
same client, and we'll control all its writes so cache invalidation in this 
case is pretty easy. 

There exists the possibility to have two fd's from the same client writing to 
the same region. To control this we would need some range checking in the 
writes, but all this is local, so it's easy to control it. 

Anyway, this is probably not a common case, so we could start by caching only 
the last stripe of the last write, ignoring the fd. 



May be implementation should consider this possibility. 
Yet to think about how to do this. But it is a good point. We should 
consider this. 


Maybe we could keep a list of cached stripes sorted by offset in the inode (if 
the maximum number of entries is small, we could keep the list not sorted). 

Re: [Gluster-devel] Disperse volume : Sequential Writes

2017-07-02 Thread Pranith Kumar Karampuri
Ashish, Xavi,
   I think it is better to implement this change as a separate
read-after-write caching xlator which we can load between the EC and client
xlators. That way EC will not get a lot more functionality than necessary,
and maybe this xlator can be used somewhere else in the stack if possible.

On Fri, Jun 16, 2017 at 4:19 PM, Ashish Pandey  wrote:

>
> I think it should be done as we have agreement on basic design.
>
> --
> *From: *"Pranith Kumar Karampuri" 
> *To: *"Xavier Hernandez" 
> *Cc: *"Ashish Pandey" , "Gluster Devel" <
> gluster-devel@gluster.org>
> *Sent: *Friday, June 16, 2017 3:50:09 PM
> *Subject: *Re: [Gluster-devel] Disperse volume : Sequential Writes
>
>
>
>
> On Fri, Jun 16, 2017 at 3:12 PM, Xavier Hernandez 
> wrote:
>
>> On 16/06/17 10:51, Pranith Kumar Karampuri wrote:
>>
>>>
>>>
>>> On Fri, Jun 16, 2017 at 12:02 PM, Xavier Hernandez <xhernan...@datalab.es> wrote:
>>>
>>> On 15/06/17 11:50, Pranith Kumar Karampuri wrote:
>>>
>>>
>>>
>>> On Thu, Jun 15, 2017 at 11:51 AM, Ashish Pandey <aspan...@redhat.com> wrote:
>>>
>>> Hi All,
>>>
>>> We have been facing some issues in disperse (EC) volume.
>>> We know that currently EC is not good for random IO as it
>>> requires
>>> READ-MODIFY-WRITE fop
>>> cycle if an offset and offset+length falls in the middle of
>>> strip size.
>>>
>>> Unfortunately, it could also happen with sequential writes.
>>> Consider an EC volume with configuration  4+2. The stripe
>>> size for
>>> this would be 512 * 4 = 2048. That is, 2048 bytes of user
>>> data
>>> stored in one stripe.
>>> Let's say 2048 + 512 = 2560 bytes are already written on this
>>> volume. 512 Bytes would be in second stripe.
>>> Now, if there are sequential writes with offset 2560 and of
>>> size 1
>>> Byte, we have to read the whole stripe, encode it with 1
>>> Byte and
>>> then again have to write it back.
>>> Next, write with offset 2561 and size of 1 Byte will again
>>> READ-MODIFY-WRITE the whole stripe. This is causing bad
>>> performance.
>>>
>>> There are some tools and scenario's where such kind of load
>>> is
>>> coming and users are not aware of that.
>>> Example: fio and zip
>>>
>>> Solution:
>>> One possible solution to deal with this issue is to keep
>>> last stripe
>>> in memory.
>>> This way, we need not to read it again and we can save READ
>>> fop
>>> going over the network.
>>> Considering the above example, we have to keep last 2048
>>> bytes
>>> (maximum)  in memory per file. This should not be a big
>>> deal as we already keep some data like xattr's and size info
>>> in
>>> memory and based on that we take decisions.
>>>
>>> Please provide your thoughts on this and also if you have
>>> any other
>>> solution.
>>>
>>>
>>> Just adding more details.
>>> The stripe will be in memory only when lock on the inode is
>>> active.
>>>
>>>
>>> I think that's ok.
>>>
>>> One
>>> thing we are yet to decide on is: do we want to read the stripe
>>> everytime we get the lock or just after an extending write is
>>> performed.
>>> I am thinking keeping the stripe in memory just after an
>>> extending write
>>> is better as it doesn't involve extra network operation.
>>>
>>>
>>> I wouldn't read the last stripe unconditionally every time we lock
>>> the inode. There's no benefit at all on random writes (in fact it's
>>> worse) and a sequential write will issue the read anyway when
>>> needed. The only difference is a small delay for the first operation after a lock.

Re: [Gluster-devel] Disperse volume : Sequential Writes

2017-06-16 Thread Ashish Pandey

I think it should be done as we have agreement on basic design. 

- Original Message -

From: "Pranith Kumar Karampuri"  
To: "Xavier Hernandez"  
Cc: "Ashish Pandey" , "Gluster Devel" 
 
Sent: Friday, June 16, 2017 3:50:09 PM 
Subject: Re: [Gluster-devel] Disperse volume : Sequential Writes 



On Fri, Jun 16, 2017 at 3:12 PM, Xavier Hernandez < xhernan...@datalab.es > 
wrote: 


On 16/06/17 10:51, Pranith Kumar Karampuri wrote: 




On Fri, Jun 16, 2017 at 12:02 PM, Xavier Hernandez 
< xhernan...@datalab.es > wrote: 

On 15/06/17 11:50, Pranith Kumar Karampuri wrote: 



On Thu, Jun 15, 2017 at 11:51 AM, Ashish Pandey < aspan...@redhat.com > wrote: 

Hi All, 

We have been facing some issues in disperse (EC) volume. 
We know that currently EC is not good for random IO as it 
requires 
READ-MODIFY-WRITE fop 
cycle if an offset and offset+length falls in the middle of 
strip size. 

Unfortunately, it could also happen with sequential writes. 
Consider an EC volume with configuration 4+2. The stripe 
size for 
this would be 512 * 4 = 2048. That is, 2048 bytes of user data 
stored in one stripe. 
Let's say 2048 + 512 = 2560 bytes are already written on this 
volume. 512 Bytes would be in second stripe. 
Now, if there are sequential writes with offset 2560 and of 
size 1 
Byte, we have to read the whole stripe, encode it with 1 
Byte and 
then again have to write it back. 
Next, write with offset 2561 and size of 1 Byte will again 
READ-MODIFY-WRITE the whole stripe. This is causing bad 
performance. 

There are some tools and scenario's where such kind of load is 
coming and users are not aware of that. 
Example: fio and zip 

Solution: 
One possible solution to deal with this issue is to keep 
last stripe 
in memory. 
This way, we need not to read it again and we can save READ fop 
going over the network. 
Considering the above example, we have to keep last 2048 bytes 
(maximum) in memory per file. This should not be a big 
deal as we already keep some data like xattr's and size info in 
memory and based on that we take decisions. 

Please provide your thoughts on this and also if you have 
any other 
solution. 


Just adding more details. 
The stripe will be in memory only when lock on the inode is active. 


I think that's ok. 

One 
thing we are yet to decide on is: do we want to read the stripe 
everytime we get the lock or just after an extending write is 
performed. 
I am thinking keeping the stripe in memory just after an 
extending write 
is better as it doesn't involve extra network operation. 


I wouldn't read the last stripe unconditionally every time we lock 
the inode. There's no benefit at all on random writes (in fact it's 
worse) and a sequential write will issue the read anyway when 
needed. The only difference is a small delay for the first operation 
after a lock. 


Yes, perfect. 



What I would do is to keep the last stripe of every write (we can 
consider to do it per fd), even if it's not the last stripe of the 
file (to also optimize sequential rewrites). 


Ah! good point. But if we remember it per fd, one fd's cached data can 
be over-written by another fd on the disk so we need to also do cache 
invalidation. 



We only cache data if we have the inodelk, so all related fd's must be from the 
same client, and we'll control all its writes so cache invalidation in this 
case is pretty easy. 

There exists the possibility to have two fd's from the same client writing to 
the same region. To control this we would need some range checking in the 
writes, but all this is local, so it's easy to control it. 

Anyway, this is probably not a common case, so we could start by caching only 
the last stripe of the last write, ignoring the fd. 



May be implementation should consider this possibility. 
Yet to think about how to do this. But it is a good point. We should 
consider this. 



Maybe we could keep a list of cached stripes sorted by offset in the inode (if 
the maximum number of entries is small, we could keep the list not sorted). 
Each fd should store the offset of the last write. Cached stripes should have a 
ref counter just to account for the case that two fd's point to the same 
offset. 

When a new write arrives, we check the offset stored in the fd and see if it 
corresponds to a sequential write. If so, we look at the inode list to find the 
cached stripe, otherwise we can release the cached stripe. 

We can limit the number of cached entries and release the least recently used 
when we reach some maximum. 




Yeah, this works :-). 
Ashish, 
Can all of this be implemented by 3.12? 








One thing I've observed is that a 'dd' with block size of 1MB gets 
split into multiple 128KB blocks that are sent in parallel and not 
necessarily processed in the sequential order. This means that big 
block sizes won't benefit much from this optimization since they will be seen 
as partially non-sequential writes. Anyway the change won't hurt. 

Re: [Gluster-devel] Disperse volume : Sequential Writes

2017-06-16 Thread Pranith Kumar Karampuri
On Fri, Jun 16, 2017 at 3:12 PM, Xavier Hernandez 
wrote:

> On 16/06/17 10:51, Pranith Kumar Karampuri wrote:
>
>>
>>
>> On Fri, Jun 16, 2017 at 12:02 PM, Xavier Hernandez <xhernan...@datalab.es> wrote:
>>
>> On 15/06/17 11:50, Pranith Kumar Karampuri wrote:
>>
>>
>>
>> On Thu, Jun 15, 2017 at 11:51 AM, Ashish Pandey <aspan...@redhat.com> wrote:
>>
>> Hi All,
>>
>> We have been facing some issues in disperse (EC) volume.
>> We know that currently EC is not good for random IO as it
>> requires
>> READ-MODIFY-WRITE fop
>> cycle if an offset and offset+length falls in the middle of
>> strip size.
>>
>> Unfortunately, it could also happen with sequential writes.
>> Consider an EC volume with configuration  4+2. The stripe
>> size for
>> this would be 512 * 4 = 2048. That is, 2048 bytes of user data
>> stored in one stripe.
>> Let's say 2048 + 512 = 2560 bytes are already written on this
>> volume. 512 Bytes would be in second stripe.
>> Now, if there are sequential writes with offset 2560 and of
>> size 1
>> Byte, we have to read the whole stripe, encode it with 1
>> Byte and
>> then again have to write it back.
>> Next, write with offset 2561 and size of 1 Byte will again
>> READ-MODIFY-WRITE the whole stripe. This is causing bad
>> performance.
>>
>> There are some tools and scenario's where such kind of load is
>> coming and users are not aware of that.
>> Example: fio and zip
>>
>> Solution:
>> One possible solution to deal with this issue is to keep
>> last stripe
>> in memory.
>> This way, we need not to read it again and we can save READ
>> fop
>> going over the network.
>> Considering the above example, we have to keep last 2048 bytes
>> (maximum)  in memory per file. This should not be a big
>> deal as we already keep some data like xattr's and size info
>> in
>> memory and based on that we take decisions.
>>
>> Please provide your thoughts on this and also if you have
>> any other
>> solution.
>>
>>
>> Just adding more details.
>> The stripe will be in memory only when lock on the inode is
>> active.
>>
>>
>> I think that's ok.
>>
>> One
>> thing we are yet to decide on is: do we want to read the stripe
>> everytime we get the lock or just after an extending write is
>> performed.
>> I am thinking keeping the stripe in memory just after an
>> extending write
>> is better as it doesn't involve extra network operation.
>>
>>
>> I wouldn't read the last stripe unconditionally every time we lock
>> the inode. There's no benefit at all on random writes (in fact it's
>> worse) and a sequential write will issue the read anyway when
>> needed. The only difference is a small delay for the first operation
>> after a lock.
>>
>>
>> Yes, perfect.
>>
>>
>>
>> What I would do is to keep the last stripe of every write (we can
>> consider to do it per fd), even if it's not the last stripe of the
>> file (to also optimize sequential rewrites).
>>
>>
>> Ah! good point. But if we remember it per fd, one fd's cached data can
>> be over-written by another fd on the disk so we need to also do cache
>> invalidation.
>>
>
> We only cache data if we have the inodelk, so all related fd's must be
> from the same client, and we'll control all its writes so cache
> invalidation in this case is pretty easy.
>
> There exists the possibility to have two fd's from the same client writing
> to the same region. To control this we would need some range checking in
> the writes, but all this is local, so it's easy to control it.
>
> Anyway, this is probably not a common case, so we could start by caching
> only the last stripe of the last write, ignoring the fd.
>
> May be implementation should consider this possibility.
>> Yet to think about how to do this. But it is a good point. We should
>> consider this.
>>
>
> Maybe we could keep a list of cached stripes sorted by offset in the inode
> (if the maximum number of entries is small, we could keep the list not
> sorted). Each fd should store the offset of the last write. Cached stripes
> should have a ref counter just to account for the case that two fd's point
> to the same offset.
>
> When a new write arrives, we check the offset stored in the fd and see if
> it corresponds to a sequential write. If so, we look at the inode list to
> find the cached stripe, otherwise we can release the cached stripe.
>
> We can limit the number of cached entries and release the least recently used
> when we reach some maximum.

Re: [Gluster-devel] Disperse volume : Sequential Writes

2017-06-16 Thread Xavier Hernandez

On 16/06/17 10:51, Pranith Kumar Karampuri wrote:



On Fri, Jun 16, 2017 at 12:02 PM, Xavier Hernandez <xhernan...@datalab.es> wrote:

On 15/06/17 11:50, Pranith Kumar Karampuri wrote:



On Thu, Jun 15, 2017 at 11:51 AM, Ashish Pandey <aspan...@redhat.com> wrote:

Hi All,

We have been facing some issues in disperse (EC) volume.
We know that currently EC is not good for random IO as it
requires
READ-MODIFY-WRITE fop
cycle if an offset and offset+length falls in the middle of
strip size.

Unfortunately, it could also happen with sequential writes.
Consider an EC volume with configuration  4+2. The stripe
size for
this would be 512 * 4 = 2048. That is, 2048 bytes of user data
stored in one stripe.
Let's say 2048 + 512 = 2560 bytes are already written on this
volume. 512 Bytes would be in second stripe.
Now, if there are sequential writes with offset 2560 and of
size 1
Byte, we have to read the whole stripe, encode it with 1
Byte and
then again have to write it back.
Next, write with offset 2561 and size of 1 Byte will again
READ-MODIFY-WRITE the whole stripe. This is causing bad
performance.

There are some tools and scenario's where such kind of load is
coming and users are not aware of that.
Example: fio and zip

Solution:
One possible solution to deal with this issue is to keep
last stripe
in memory.
This way, we need not to read it again and we can save READ fop
going over the network.
Considering the above example, we have to keep last 2048 bytes
(maximum)  in memory per file. This should not be a big
deal as we already keep some data like xattr's and size info in
memory and based on that we take decisions.

Please provide your thoughts on this and also if you have
any other
solution.


Just adding more details.
The stripe will be in memory only when lock on the inode is active.


I think that's ok.

One
thing we are yet to decide on is: do we want to read the stripe
everytime we get the lock or just after an extending write is
performed.
I am thinking keeping the stripe in memory just after an
extending write
is better as it doesn't involve extra network operation.


I wouldn't read the last stripe unconditionally every time we lock
the inode. There's no benefit at all on random writes (in fact it's
worse) and a sequential write will issue the read anyway when
needed. The only difference is a small delay for the first operation
after a lock.


Yes, perfect.



What I would do is to keep the last stripe of every write (we can
consider to do it per fd), even if it's not the last stripe of the
file (to also optimize sequential rewrites).


Ah! good point. But if we remember it per fd, one fd's cached data can
be over-written by another fd on the disk so we need to also do cache
invalidation.


We only cache data if we have the inodelk, so all related fd's must be 
from the same client, and we'll control all its writes so cache 
invalidation in this case is pretty easy.


There exists the possibility to have two fd's from the same client 
writing to the same region. To control this we would need some range 
checking in the writes, but all this is local, so it's easy to control it.
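
For illustration only, a hedged sketch in C of the kind of purely local range
check described above; the names (cached_stripe, must_invalidate) are made up
for this example and are not ec internals:

#include <stdbool.h>
#include <stdint.h>

/* One cached stripe of decoded user data. */
struct cached_stripe {
    uint64_t offset;   /* stripe-aligned offset of the cached data   */
    uint64_t size;     /* stripe size, e.g. 2048 for the 4+2 example */
    /* ... the decoded bytes themselves would live here ...          */
};

static bool ranges_overlap(uint64_t off1, uint64_t len1,
                           uint64_t off2, uint64_t len2)
{
    return off1 < off2 + len2 && off2 < off1 + len1;
}

/* Called for every write issued while this client holds the inodelk: a
 * stripe cached for one fd is dropped (or updated) when a write from
 * another fd overlaps its range. All of this stays local to the client. */
static bool must_invalidate(const struct cached_stripe *cs,
                            uint64_t write_offset, uint64_t write_size)
{
    return ranges_overlap(cs->offset, cs->size, write_offset, write_size);
}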


Anyway, this is probably not a common case, so we could start by caching 
only the last stripe of the last write, ignoring the fd.



May be implementation should consider this possibility.
Yet to think about how to do this. But it is a good point. We should
consider this.


Maybe we could keep a list of cached stripes sorted by offset in the 
inode (if the maximum number of entries is small, we could keep the list 
not sorted). Each fd should store the offset of the last write. Cached 
stripes should have a ref counter just to account for the case that two 
fd's point to the same offset.


When a new write arrives, we check the offset stored in the fd and see 
if it corresponds to a sequential write. If so, we look at the inode 
list to find the cached stripe, otherwise we can release the cached stripe.


We can limit the number of cached entries and release the least recently 
used when we reach some maximum.
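
As a rough sketch only, the bookkeeping described in the last few paragraphs
could look something like the following (illustrative C with made-up names and
an arbitrary cap, not actual ec/xlator structures): a per-inode list of cached
stripes with a refcount and LRU information, plus a per-fd record of where the
last write ended, used to detect sequential writes.

#include <stdint.h>

#define STRIPE_SIZE        2048   /* 4+2 example */
#define MAX_CACHED_STRIPES 4      /* arbitrary cap; evict the LRU entry beyond it */

struct cached_stripe {
    uint64_t offset;               /* stripe-aligned offset                */
    uint32_t refcount;             /* fd's currently pointing here         */
    uint64_t last_used;            /* for least-recently-used eviction     */
    uint8_t  data[STRIPE_SIZE];    /* decoded (plain) user data            */
    struct cached_stripe *next;    /* inode-level list (sorted or not)     */
};

struct inode_cache {
    struct cached_stripe *stripes; /* cached stripes of this inode         */
    int count;
};

struct fd_ctx {
    uint64_t last_write_end;       /* offset right after this fd's last write */
};

/* On a new write: if it continues exactly where this fd left off, look up
 * the cached stripe covering it; otherwise there is nothing to reuse and
 * the previously cached stripe can be released. */
static struct cached_stripe *lookup_for_write(struct inode_cache *ic,
                                              struct fd_ctx *fctx,
                                              uint64_t offset)
{
    if (offset != fctx->last_write_end)
        return NULL;                         /* not sequential for this fd */

    uint64_t stripe_off = offset - (offset % STRIPE_SIZE);
    for (struct cached_stripe *cs = ic->stripes; cs != NULL; cs = cs->next)
        if (cs->offset == stripe_off)
            return cs;                       /* cache hit */
    return NULL;
}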






One thing I've observed is that a 'dd' with block size of 1MB gets
split into multiple 128KB blocks that are sent in parallel and not
necessarily processed in the sequential order. This means that big
block sizes won't benefit much from this optimization since they
will be seen as partially non-sequential writes. Anyway the change won't hurt.

Re: [Gluster-devel] Disperse volume : Sequential Writes

2017-06-16 Thread Pranith Kumar Karampuri
On Fri, Jun 16, 2017 at 12:02 PM, Xavier Hernandez 
wrote:

> On 15/06/17 11:50, Pranith Kumar Karampuri wrote:
>
>>
>>
>> On Thu, Jun 15, 2017 at 11:51 AM, Ashish Pandey wrote:
>>
>> Hi All,
>>
>> We have been facing some issues in disperse (EC) volume.
>> We know that currently EC is not good for random IO as it requires
>> READ-MODIFY-WRITE fop
>> cycle if an offset and offset+length falls in the middle of strip
>> size.
>>
>> Unfortunately, it could also happen with sequential writes.
>> Consider an EC volume with configuration  4+2. The stripe size for
>> this would be 512 * 4 = 2048. That is, 2048 bytes of user data
>> stored in one stripe.
>> Let's say 2048 + 512 = 2560 bytes are already written on this
>> volume. 512 Bytes would be in second stripe.
>> Now, if there are sequential writes with offset 2560 and of size 1
>> Byte, we have to read the whole stripe, encode it with 1 Byte and
>> then again have to write it back.
>> Next, write with offset 2561 and size of 1 Byte will again
>> READ-MODIFY-WRITE the whole stripe. This is causing bad performance.
>>
>> There are some tools and scenario's where such kind of load is
>> coming and users are not aware of that.
>> Example: fio and zip
>>
>> Solution:
>> One possible solution to deal with this issue is to keep last stripe
>> in memory.
>> This way, we need not to read it again and we can save READ fop
>> going over the network.
>> Considering the above example, we have to keep last 2048 bytes
>> (maximum)  in memory per file. This should not be a big
>> deal as we already keep some data like xattr's and size info in
>> memory and based on that we take decisions.
>>
>> Please provide your thoughts on this and also if you have any other
>> solution.
>>
>>
>> Just adding more details.
>> The stripe will be in memory only when lock on the inode is active.
>>
>
> I think that's ok.
>
> One
>> thing we are yet to decide on is: do we want to read the stripe
>> everytime we get the lock or just after an extending write is performed.
>> I am thinking keeping the stripe in memory just after an extending write
>> is better as it doesn't involve extra network operation.
>>
>
> I wouldn't read the last stripe unconditionally every time we lock the
> inode. There's no benefit at all on random writes (in fact it's worse) and
> a sequential write will issue the read anyway when needed. The only
> difference is a small delay for the first operation after a lock.
>

Yes, perfect.


>
> What I would do is to keep the last stripe of every write (we can consider
> to do it per fd), even if it's not the last stripe of the file (to also
> optimize sequential rewrites).
>

Ah! Good point. But if we remember it per fd, one fd's cached data can be
overwritten on disk by another fd, so we also need to do cache
invalidation. Maybe the implementation should consider this possibility;
I'm yet to think about how to do this, but it is a good point and we should
consider it.


>
> One thing I've observed is that a 'dd' with block size of 1MB gets split
> into multiple 128KB blocks that are sent in parallel and not necessarily
> processed in the sequential order. This means that big block sizes won't
> benefit much from this optimization since they will be seen as partially
> non-sequential writes. Anyway the change won't hurt.
>

In this case, as per the solution, we won't cache anything, right? Because we
didn't request anything from the disk. We will only keep the data in cache
if it is an unaligned write at the current EOF. At least that is what I had
in mind.
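
A small sketch of the condition described here, with made-up names and the 4+2
stripe size from the example (an assumption, not ec code): the stripe is kept
in memory only when the write was unaligned and at the current end of file,
i.e. exactly when ec already had to read that stripe for the
read-modify-write cycle, so caching its result is essentially free.

#include <stdbool.h>
#include <stdint.h>

#define STRIPE_SIZE 2048   /* 4+2 example */

/* Keep the last stripe in memory only when the write was unaligned and
 * landed at the current end of file. */
static bool should_cache_last_stripe(uint64_t write_offset,
                                     uint64_t write_size,
                                     uint64_t current_file_size)
{
    bool unaligned = (write_offset % STRIPE_SIZE) != 0 ||
                     ((write_offset + write_size) % STRIPE_SIZE) != 0;
    bool at_eof = (write_offset == current_file_size);

    return unaligned && at_eof;
}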


>
> Xavi
>
>
>>
>>
>>
>> ---
>> Ashish
>>
>>
>>
>> ___
>> Gluster-devel mailing list
>> Gluster-devel@gluster.org 
>> http://lists.gluster.org/mailman/listinfo/gluster-devel
>> 
>>
>>
>>
>>
>> --
>> Pranith
>>
>
>


-- 
Pranith
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Disperse volume : Sequential Writes

2017-06-15 Thread Xavier Hernandez

On 15/06/17 11:50, Pranith Kumar Karampuri wrote:



On Thu, Jun 15, 2017 at 11:51 AM, Ashish Pandey <aspan...@redhat.com> wrote:

Hi All,

We have been facing some issues in disperse (EC) volume.
We know that currently EC is not good for random IO as it requires
READ-MODIFY-WRITE fop
cycle if an offset and offset+length falls in the middle of strip size.

Unfortunately, it could also happen with sequential writes.
Consider an EC volume with configuration  4+2. The stripe size for
this would be 512 * 4 = 2048. That is, 2048 bytes of user data
stored in one stripe.
Let's say 2048 + 512 = 2560 bytes are already written on this
volume. 512 Bytes would be in second stripe.
Now, if there are sequential writes with offset 2560 and of size 1
Byte, we have to read the whole stripe, encode it with 1 Byte and
then again have to write it back.
Next, write with offset 2561 and size of 1 Byte will again
READ-MODIFY-WRITE the whole stripe. This is causing bad performance.

There are some tools and scenario's where such kind of load is
coming and users are not aware of that.
Example: fio and zip

Solution:
One possible solution to deal with this issue is to keep last stripe
in memory.
This way, we need not to read it again and we can save READ fop
going over the network.
Considering the above example, we have to keep last 2048 bytes
(maximum)  in memory per file. This should not be a big
deal as we already keep some data like xattr's and size info in
memory and based on that we take decisions.

Please provide your thoughts on this and also if you have any other
solution.


Just adding more details.
The stripe will be in memory only when lock on the inode is active.


I think that's ok.


One
thing we are yet to decide on is: do we want to read the stripe
everytime we get the lock or just after an extending write is performed.
I am thinking keeping the stripe in memory just after an extending write
is better as it doesn't involve extra network operation.


I wouldn't read the last stripe unconditionally every time we lock the 
inode. There's no benefit at all on random writes (in fact it's worse) 
and a sequential write will issue the read anyway when needed. The only 
difference is a small delay for the first operation after a lock.


What I would do is to keep the last stripe of every write (we can 
consider to do it per fd), even if it's not the last stripe of the file 
(to also optimize sequential rewrites).


One thing I've observed is that a 'dd' with block size of 1MB gets split 
into multiple 128KB blocks that are sent in parallel and not necessarily 
processed in the sequential order. This means that big block sizes won't 
benefit much from this optimization since they will be seen as partially 
non-sequential writes. Anyway the change won't hurt.


Xavi






---
Ashish



___
Gluster-devel mailing list
Gluster-devel@gluster.org 
http://lists.gluster.org/mailman/listinfo/gluster-devel





--
Pranith


___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Disperse volume : Sequential Writes

2017-06-15 Thread Pranith Kumar Karampuri
On Thu, Jun 15, 2017 at 11:51 AM, Ashish Pandey  wrote:

> Hi All,
>
> We have been facing some issues in disperse (EC) volume.
> We know that currently EC is not good for random IO as it requires
> READ-MODIFY-WRITE fop
> cycle if an offset and offset+length falls in the middle of strip size.
>
> Unfortunately, it could also happen with sequential writes.
> Consider an EC volume with configuration  4+2. The stripe size for this
> would be 512 * 4 = 2048. That is, 2048 bytes of user data stored in one
> stripe.
> Let's say 2048 + 512 = 2560 bytes are already written on this volume. 512
> Bytes would be in second stripe.
> Now, if there are sequential writes with offset 2560 and of size 1 Byte,
> we have to read the whole stripe, encode it with 1 Byte and then again have
> to write it back.
> Next, write with offset 2561 and size of 1 Byte will again
> READ-MODIFY-WRITE the whole stripe. This is causing bad performance.
>
> There are some tools and scenario's where such kind of load is coming and
> users are not aware of that.
> Example: fio and zip
>
> Solution:
> One possible solution to deal with this issue is to keep last stripe in
> memory.
> This way, we need not to read it again and we can save READ fop going over
> the network.
> Considering the above example, we have to keep last 2048 bytes (maximum)
> in memory per file. This should not be a big
> deal as we already keep some data like xattr's and size info in memory and
> based on that we take decisions.
>
> Please provide your thoughts on this and also if you have any other
> solution.
>

Just adding more details.
The stripe will be in memory only while the lock on the inode is active. One
thing we are yet to decide on is: do we want to read the stripe every time
we get the lock, or only after an extending write is performed? I am
thinking that keeping the stripe in memory just after an extending write is
better, as it doesn't involve an extra network operation.



>
> ---
> Ashish
>
>
>
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Pranith
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] Disperse volume : Sequential Writes

2017-06-14 Thread Ashish Pandey
Hi All, 

We have been facing some issues in disperse (EC) volumes. 
We know that currently EC is not good for random IO, as it requires a 
READ-MODIFY-WRITE fop cycle if an offset or offset+length falls in the middle 
of the stripe size. 

Unfortunately, it can also happen with sequential writes. 
Consider an EC volume with configuration 4+2. The stripe size for this would be 
512 * 4 = 2048, that is, 2048 bytes of user data stored in one stripe. 
Let's say 2048 + 512 = 2560 bytes are already written on this volume; 512 bytes 
would be in the second stripe. 
Now, if there is a sequential write at offset 2560 of size 1 byte, we have to 
read the whole stripe, encode it with the 1 byte and then write it back. 
The next write, at offset 2561 with a size of 1 byte, will again 
READ-MODIFY-WRITE the whole stripe. This causes bad performance. 
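
To make the arithmetic above concrete, here is a minimal sketch in C
(illustrative only, not GlusterFS code) that flags which writes in this example
trigger a read-modify-write cycle; the constants and the needs_rmw() helper are
assumptions matching the 4+2 / 512-byte fragment configuration described above.
Every write in the sequence touches the same last stripe, which is why keeping
that stripe in memory (see the solution below) turns the repeated network READ
into a local update.

#include <stdio.h>
#include <stdbool.h>

#define FRAGMENT_SIZE 512
#define DATA_BRICKS   4
#define STRIPE_SIZE   (FRAGMENT_SIZE * DATA_BRICKS)   /* 2048 */

/* A write needs READ-MODIFY-WRITE when it does not start and end on
 * stripe boundaries, because the untouched bytes of the stripe have to
 * be read back before the stripe can be re-encoded. */
static bool needs_rmw(unsigned long long offset, unsigned long long size)
{
    return (offset % STRIPE_SIZE) != 0 || ((offset + size) % STRIPE_SIZE) != 0;
}

int main(void)
{
    /* The sequence from the example: 2560 bytes already written, then
     * 1-byte sequential writes at offsets 2560, 2561, 2562, ... */
    unsigned long long offsets[] = { 2560, 2561, 2562 };

    for (int i = 0; i < 3; i++) {
        unsigned long long off = offsets[i];
        printf("write(offset=%llu, size=1): stripe %llu -> %s\n",
               off, off / STRIPE_SIZE,
               needs_rmw(off, 1) ? "READ-MODIFY-WRITE" : "full-stripe write");
    }
    return 0;
}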

There are some tools and scenarios where this kind of load occurs and users 
are not aware of it. 
Examples: fio and zip 

Solution: 
One possible solution to deal with this issue is to keep the last stripe in 
memory. 
This way, we need not read it again and we can save a READ fop going over the 
network. 
Considering the above example, we have to keep the last 2048 bytes (maximum) in 
memory per file. This should not be a big deal, as we already keep some data 
like xattrs and size info in memory and take decisions based on that. 

Please provide your thoughts on this and also if you have any other solution. 

--- 
Ashish 


___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel