[Gluster-users] Sharding on 7.x - file sizes are wrong after a large copy.

2020-07-08 Thread Claus Jeppesen
In April of this year I reported the problem using sharding on gluster 7.4:


We're using GlusterFS in a replicated brick setup with 2 bricks with
sharding turned on (shardsize 128MB).

Something odd is going on: if we copy large VM files to the volume, we can
end up with files that are slightly larger than the source files, DEPENDING
on the speed at which we copied them - e.g.:

   dd if=SOURCE bs=1M | pv -L NNm | ssh gluster_server "dd of=/gluster/VOL_NAME/TARGET bs=1M"

It seems that if NN is <= 25 (i.e. 25 MB/s) the size of SOURCE and TARGET
will be the same.

If we crank NN to, say, 50 we sometimes risk that a 25G file ends up having
a slightly larger size, e.g. 26844413952 or 26844233728 - larger than the
expected 26843545600.
Unfortunately this is not an illusion! If we dd the files out of Gluster
we receive exactly the amount of data that 'ls' showed us.

In the brick directory (incl. the .shard directory) we have the expected
number of shards for a 25G file (200), each with a size of precisely 128MB -
but there is an additional 0-size shard file created.
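
For anyone trying to reproduce this, a rough check along these lines should
show the discrepancy (a sketch only - BRICK_PATH and GFID are placeholders
for the brick's backend path and the target file's gfid):

   # compare logical sizes via the mount
   ls -l SOURCE
   ls -l /gluster/VOL_NAME/TARGET

   # on a brick: read the file's gfid (hex value corresponds to the dashed
   # GFID used in the shard file names), then count its shards and look for
   # a trailing 0-byte shard
   getfattr -n trusted.gfid -e hex BRICK_PATH/TARGET
   ls -l BRICK_PATH/.shard/GFID.* | wc -l
   find BRICK_PATH/.shard -name 'GFID.*' -size 0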

Has anyone else seen a phenomenon like this ?


After upgrading to 7.6 we're still seeing this problem - now, the extra
bytes that appear can be removed using truncate on the mounted gluster
volume, and md5sum confirms that after the truncate the content is
identical to the source - however, it may point to an underlying issue.
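
Concretely, the workaround amounts to something like the following (a sketch
only - the path is the mount used above and the size is the expected 25G
figure from the example):

   # shrink the copy on the mounted gluster volume back to the source size
   truncate -s 26843545600 /gluster/VOL_NAME/TARGET
   # verify the content now matches the source
   md5sum SOURCE /gluster/VOL_NAME/TARGET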

I hope someone can reproduce this behaviour,

Thanx,

Claus.

-- 
*Claus Jeppesen*
Manager, Network Services
Datto, Inc.
p +45 6170 5901 | Copenhagen Office
www.datto.com




Community Meeting Calendar:

Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://bluejeans.com/441850968

Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] Sharding on 7.4 - filesizes may be wrong

2020-04-02 Thread Dmitry Antipov

On 4/1/20 6:20 PM, Claus Jeppesen wrote:


Has anyone else seen a phenomenon like this ?


Well, this one (observed on 8dev git) may be related:

# qemu-img convert gluster://192.168.111.2/TEST/qcow-32G.qcow2 
gluster://192.168.111.3/TEST/out-32G.raw
[2020-04-02 09:38:08.968750] E [ec-inode-write.c:2004:ec_writev_start] (-->/usr/lib64/glusterfs/8dev/xlator/cluster/disperse.so(+0x12d92) [0x7f3d02a61d92] 
-->/usr/lib64/glusterfs/8dev/xlator/cluster/disperse.so(+0x365fd) [0x7f3d02a855fd] -->/usr/lib64/glusterfs/8dev/xlator/cluster/disperse.so(+0x35ccd) [0x7f3d02a84ccd] ) 0-: Assertion failed: 
ec_get_inode_size(fop, fop->fd->inode, )
[2020-04-02 09:38:08.971827] E [ec-inode-write.c:2004:ec_writev_start] (-->/usr/lib64/glusterfs/8dev/xlator/cluster/disperse.so(+0x12d92) [0x7f3d02a61d92] 
-->/usr/lib64/glusterfs/8dev/xlator/cluster/disperse.so(+0x365fd) [0x7f3d02a855fd] -->/usr/lib64/glusterfs/8dev/xlator/cluster/disperse.so(+0x35ccd) [0x7f3d02a84ccd] ) 0-: Assertion failed: 
ec_get_inode_size(fop, fop->fd->inode, )
[2020-04-02 09:38:08.975386] E [ec-inode-write.c:2201:ec_manager_writev] (-->/usr/lib64/glusterfs/8dev/xlator/cluster/disperse.so(+0x12f9f) [0x7f3d02a61f9f] 
-->/usr/lib64/glusterfs/8dev/xlator/cluster/disperse.so(+0x12d92) [0x7f3d02a61d92] -->/usr/lib64/glusterfs/8dev/xlator/cluster/disperse.so(+0x3666d) [0x7f3d02a8566d] ) 0-: Assertion failed: 
__ec_get_inode_size(fop, fop->fd->inode, >iatt[0].ia_size)


# gluster volume info

Volume Name: TEST
Type: Distributed-Disperse
Volume ID: 1b6c4980-dad5-4daa-b662-53995470f891
Status: Started
Snapshot Count: 0
Number of Bricks: 5 x (2 + 1) = 15
Transport-type: tcp
Bricks:
Brick1: 192.168.111.1:/vair/SSD-
Brick2: 192.168.111.2:/vair/SSD-
Brick3: 192.168.111.3:/vair/SSD-
Brick4: 192.168.111.4:/vair/SSD-
Brick5: 192.168.111.1:/vair/SSD-0001
Brick6: 192.168.111.2:/vair/SSD-0001
Brick7: 192.168.111.3:/vair/SSD-0001
Brick8: 192.168.111.4:/vair/SSD-0001
Brick9: 192.168.111.1:/vair/SSD-0002
Brick10: 192.168.111.2:/vair/SSD-0002
Brick11: 192.168.111.3:/vair/SSD-0002
Brick12: 192.168.111.4:/vair/SSD-0002
Brick13: 192.168.111.1:/vair/SSD-0003
Brick14: 192.168.111.2:/vair/SSD-0003
Brick15: 192.168.111.3:/vair/SSD-0003
Options Reconfigured:
features.shard-block-size: 128MB
features.shard: on
storage.fips-mode-rchecksum: on
transport.address-family: inet
nfs.disable: on

Dmitry






[Gluster-users] Sharding on 7.4 - filesizes may be wrong

2020-04-01 Thread Claus Jeppesen
We're using GlusterFS in a replicated brick setup with 2 bricks with
sharding turned on (shardsize 128MB).

Something odd is going on: if we copy large VM files to the volume, we can
end up with files that are slightly larger than the source files, DEPENDING
on the speed at which we copied them - e.g.:

   dd if=SOURCE bs=1M | pv -L NNm | ssh gluster_server "dd of=/gluster/VOL_NAME/TARGET bs=1M"

It seems that if NN is <= 25 (i.e. 25 MB/s) the size of SOURCE and TARGET
will be the same.

If we crank NN to, say, 50 we sometimes risk that a 25G file ends up having
a slightly larger size, e.g. 26844413952 or 26844233728 - larger than the
expected 26843545600.
Unfortunately this is not an illusion! If we dd the files out of Gluster
we receive exactly the amount of data that 'ls' showed us.

In the brick directory (incl. the .shard directory) we have the expected
number of shards for a 25G file (200), each with a size of precisely 128MB -
but there is an additional 0-size shard file created.

Has anyone else seen a phenomenon like this ?

Thanx,

Claus.

-- 
*Claus Jeppesen*
Manager, Network Services
Datto, Inc.
p +45 6170 5901 | Copenhagen Office
www.datto.com






Re: [Gluster-users] sharding in glusterfs

2018-10-05 Thread Krutika Dhananjay
Hi,

Apologies for the late reply. My email filters are messed up, I missed
reading this.

Answers to questions around shard algorithm inline ...

On Sun, Sep 30, 2018 at 9:54 PM Ashayam Gupta 
wrote:

> Hi Pranith,
>
> Thanks for your reply, it would be helpful if you can please help us with
> the following issues with respect to sharding.
> The gluster version we are using is *glusterfs 4.1.4 *on Ubuntu 18.04.1
> LTS
>
>
>- *Shards-Creation Algo*: We were interested in understanding the way
>in which shards are distributed across bricks and nodes, is it Round-Robin
>or some other algo and can we change this mechanism using some config file.
>E.g. If we have 2 nodes with each nodes having 2 bricks , with a total
>of 4 (2*2) bricks how will the shards be distributed, will it be always
>even distribution?(Volume type in this case is plain)
>
>-  *Sharding+Distributed-Volume*: Currently we are using plain volume
>with sharding enabled and we do not see even distribution of shards across
>bricks .Can we use sharding with distributed volume to achieve evenly and
>better distribution of shards? Would be helpful if you can suggest the most
>efficient way of using sharding , our goal is to have a evenly distributed
>file system(we have large files hence using sharding) and we are not
>concerned with replication as of now.
>
> I think Raghavendra already answered the two questions above.

>
>- *Shard-Block-Size: *In case we change the
>* features.shard-block-size* value from X -> Y after lots of data has
>been populated , how does this affect the existing shards are they auto
>corrected as per the new size or do we need to run some commands to get
>this done or is this even recommended to do the change?
>
Existing files will retain their shard-block-size. shard-block-size is a
property of a file that is set at the time the file is created (in the
form of an extended attribute "trusted.glusterfs.shard.block-size") and
remains the same throughout the lifetime of the file.
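
As a quick illustration (a sketch only - the brick path is a placeholder),
the attribute can be inspected on the brick backend with getfattr:

   # read the per-file shard block size recorded at file creation time
   getfattr -n trusted.glusterfs.shard.block-size -e hex BRICK_PATH/path/to/file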

If you want the shard-block-size to be changed across these files, you'll
need to perform either of the two steps below:

1. move the existing files out to a local fs from your glusterfs volume and
then move them back into the volume.
2. copy the existing files to temporary filenames on the same volume
and rename them back to their original names (a sketch of this follows below).
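
A rough sketch of option 2, assuming the volume is mounted at /mnt/VOL
(a placeholder):

   # the copy is a brand new file, so it picks up the volume's current
   # shard-block-size; renaming it back preserves the original name
   cp /mnt/VOL/bigfile /mnt/VOL/bigfile.tmp
   mv /mnt/VOL/bigfile.tmp /mnt/VOL/bigfile
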
In our tests with the VM store workload, we've found a 64MB shard-block-size
to be a good fit for both IO and self-heal performance.


>- *Rebalance-Shard*: As per the docs whenever we add new server/node
>to the existing gluster we need to run Rebalance command, we would like to
>know if there are any known issues for re-balancing with sharding enabled.
>
> We did find some shard-DHT inter-op issues in rebalance in the past, again
in the supported VM storage use case. The good news is that the problems
known to us have been fixed, but their validation is still pending.
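
For completeness, the rebalance itself is the standard sequence after an
add-brick (VOLNAME is a placeholder):

   gluster volume rebalance VOLNAME start
   gluster volume rebalance VOLNAME status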


> We would highly appreciate if you can point us to the latest sharding
> docs, we tried to search but could not find better than this
> https://staged-gluster-docs.readthedocs.io/en/release3.7.0beta1/Features/shard/
> .
>

The doc is still valid (except for minor changes in the To-Do list at the
bottom). But I agree, the answers to all of the questions you asked above
are well worth documenting. I'll fix this. Thanks for the feedback.
Let us know if you have any more questions or if you run into any problems.
Happy to help.
Also, since you're using a non-vm storage use case, I'd suggest that you
try shard on a test cluster first before even putting it into production. :)

-Krutika


> Thanks
> Ashayam
>
>
> On Thu, Sep 20, 2018 at 7:47 PM Pranith Kumar Karampuri <
> pkara...@redhat.com> wrote:
>
>>
>>
>> On Wed, Sep 19, 2018 at 11:37 AM Ashayam Gupta <
>> ashayam.gu...@alpha-grep.com> wrote:
>>
>>> Please find our workload details as requested by you :
>>>
>>> * Only 1 write-mount point as of now
>>> * Read-Mount : Since we auto-scale our machines this can be as big as
>>> 300-400 machines during peak times
>>> * >" multiple concurrent reads means that Reads will not happen until
>>> the file is completely written to"  Yes , in our current scenario we can
>>> ensure that indeed this is the case.
>>>
>>> But when you say it only supports single writer workload we would like
>>> to understand the following scenarios with respect to multiple writers and
>>> the current behaviour of glusterfs with sharding
>>>
>>>- Multiple Writer writes to different files
>>>
>>> When I say multiple writers, I mean multiple mounts. Since you were
>> saying earlier there is only one mount which does all writes, everything
>> should work as expected.
>>
>>>
>>>- Multiple Writer writes to same file
>>>   - they write to same file but different shards of same file
>>>   - they write to same file (no guarantee if they write to different
>>>   shards)
>>>
>>> As long as 

Re: [Gluster-users] sharding in glusterfs

2018-10-02 Thread Raghavendra Gowdappa
On Sun, Sep 30, 2018 at 9:54 PM Ashayam Gupta 
wrote:

> Hi Pranith,
>
> Thanks for your reply, it would be helpful if you can please help us with
> the following issues with respect to sharding.
> The gluster version we are using is *glusterfs 4.1.4 *on Ubuntu 18.04.1
> LTS
>
>
>- *Shards-Creation Algo*: We were interested in understanding the way
>in which shards are distributed across bricks and nodes, is it Round-Robin
>or some other algo and can we change this mechanism using some config file.
>E.g. If we have 2 nodes with each nodes having 2 bricks , with a total
>of 4 (2*2) bricks how will the shards be distributed, will it be always
>even distribution?(Volume type in this case is plain)
>
>-  *Sharding+Distributed-Volume*: Currently we are using plain volume
>with sharding enabled and we do not see even distribution of shards across
>bricks .Can we use sharding with distributed volume to achieve evenly and
>better distribution of shards? Would be helpful if you can suggest the most
>efficient way of using sharding , our goal is to have a evenly distributed
>file system(we have large files hence using sharding) and we are not
>concerned with replication as of now.
>
>
For distribution you need DHT as a descendant of the shard xlator in the
graph. The features/shard xlator implements sharding by creating an
independent file for each shard, so an individual shard is visible as an
independent file to the xlators below shard. The entire distribution logic
is off-loaded to the xlator that handles distribution (DHT).
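
In practice that means enabling sharding on a volume that already has a
distribute layer, for example (a sketch only - host names, brick paths and
the block size are placeholders):

   gluster volume create VOLNAME node1:/bricks/b1 node1:/bricks/b2 \
                                 node2:/bricks/b1 node2:/bricks/b2
   gluster volume set VOLNAME features.shard on
   gluster volume set VOLNAME features.shard-block-size 64MB
   gluster volume start VOLNAME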


>
>- *Shard-Block-Size: *In case we change the
>* features.shard-block-size* value from X -> Y after lots of data has
>been populated , how does this affect the existing shards are they auto
>corrected as per the new size or do we need to run some commands to get
>this done or is this even recommended to do the change?
>- *Rebalance-Shard*: As per the docs whenever we add new server/node
>to the existing gluster we need to run Rebalance command, we would like to
>know if there are any known issues for re-balancing with sharding enabled.
>
> We would highly appreciate if you can point us to the latest sharding
> docs, we tried to search but could not find better than this
> https://staged-gluster-docs.readthedocs.io/en/release3.7.0beta1/Features/shard/
> .
>
> Thanks
> Ashayam
>
>
> On Thu, Sep 20, 2018 at 7:47 PM Pranith Kumar Karampuri <
> pkara...@redhat.com> wrote:
>
>>
>>
>> On Wed, Sep 19, 2018 at 11:37 AM Ashayam Gupta <
>> ashayam.gu...@alpha-grep.com> wrote:
>>
>>> Please find our workload details as requested by you :
>>>
>>> * Only 1 write-mount point as of now
>>> * Read-Mount : Since we auto-scale our machines this can be as big as
>>> 300-400 machines during peak times
>>> * >" multiple concurrent reads means that Reads will not happen until
>>> the file is completely written to"  Yes , in our current scenario we can
>>> ensure that indeed this is the case.
>>>
>>> But when you say it only supports single writer workload we would like
>>> to understand the following scenarios with respect to multiple writers and
>>> the current behaviour of glusterfs with sharding
>>>
>>>- Multiple Writer writes to different files
>>>
>>> When I say multiple writers, I mean multiple mounts. Since you were
>> saying earlier there is only one mount which does all writes, everything
>> should work as expected.
>>
>>>
>>>- Multiple Writer writes to same file
>>>   - they write to same file but different shards of same file
>>>   - they write to same file (no guarantee if they write to different
>>>   shards)
>>>
>>> As long as the above happens from same mount, things should be fine.
>> Otherwise there could be problems.
>>
>>
>>> There might be some more cases which are known to you , would be helpful
>>> if you can describe us about those scenarios as well or may point us to the
>>> relevant documents.
>>>
>> Also it would be helpful if you can suggest the most stable version of
>>> glusterfs with sharding feature to use , since we would like to use this in
>>> production.
>>>
>>
>> It has been stable for a while, so use any of the latest maintained
>> releases like 3.12.x or 4.1.x
>>
>> As I was mentioning already, sharding is mainly tested with
>> VM/gluster-block workloads. So there could be some corner cases with single
>> writer workload which we never ran into for the VM/block workloads we test.
>> But you may run into them. Do let us know and we can take a look if you
>> find something out of the ordinary. What I would suggest is to use one of
>> the maintained releases and run the workloads you have for some time to
>> test things out, once you feel confident, you can put it in production.
>>
>> HTH
>>
>>>
>>> Thanks
>>> Ashayam Gupta
>>>
>>> On Tue, Sep 18, 2018 at 11:00 AM Pranith Kumar Karampuri <
>>> pkara...@redhat.com> wrote:
>>>


 On Mon, Sep 17, 2018 at 

Re: [Gluster-users] sharding in glusterfs

2018-09-30 Thread Ashayam Gupta
Hi Pranith,

Thanks for your reply, it would be helpful if you can please help us with
the following issues with respect to sharding.
The gluster version we are using is *glusterfs 4.1.4 *on Ubuntu 18.04.1 LTS


   - *Shards-Creation Algo*: We were interested in understanding the way in
   which shards are distributed across bricks and nodes, is it Round-Robin or
   some other algo and can we change this mechanism using some config file.
   E.g. If we have 2 nodes with each nodes having 2 bricks , with a total
   of 4 (2*2) bricks how will the shards be distributed, will it be always
   even distribution?(Volume type in this case is plain)

   -  *Sharding+Distributed-Volume*: Currently we are using plain volume
   with sharding enabled and we do not see even distribution of shards across
   bricks .Can we use sharding with distributed volume to achieve evenly and
   better distribution of shards? Would be helpful if you can suggest the most
   efficient way of using sharding , our goal is to have a evenly distributed
   file system(we have large files hence using sharding) and we are not
   concerned with replication as of now.
   - *Shard-Block-Size: *In case we change the *
features.shard-block-size* value
   from X -> Y after lots of data has been populated , how does this affect
   the existing shards are they auto corrected as per the new size or do we
   need to run some commands to get this done or is this even recommended to
   do the change?
   - *Rebalance-Shard*: As per the docs whenever we add new server/node to
   the existing gluster we need to run Rebalance command, we would like to
   know if there are any known issues for re-balancing with sharding enabled.

We would highly appreciate if you can point us to the latest sharding docs,
we tried to search but could not find better than this
https://staged-gluster-docs.readthedocs.io/en/release3.7.0beta1/Features/shard/
.

Thanks
Ashayam


On Thu, Sep 20, 2018 at 7:47 PM Pranith Kumar Karampuri 
wrote:

>
>
> On Wed, Sep 19, 2018 at 11:37 AM Ashayam Gupta <
> ashayam.gu...@alpha-grep.com> wrote:
>
>> Please find our workload details as requested by you :
>>
>> * Only 1 write-mount point as of now
>> * Read-Mount : Since we auto-scale our machines this can be as big as
>> 300-400 machines during peak times
>> * >" multiple concurrent reads means that Reads will not happen until the
>> file is completely written to"  Yes , in our current scenario we can ensure
>> that indeed this is the case.
>>
>> But when you say it only supports single writer workload we would like to
>> understand the following scenarios with respect to multiple writers and the
>> current behaviour of glusterfs with sharding
>>
>>- Multiple Writer writes to different files
>>
>> When I say multiple writers, I mean multiple mounts. Since you were
> saying earlier there is only one mount which does all writes, everything
> should work as expected.
>
>>
>>- Multiple Writer writes to same file
>>   - they write to same file but different shards of same file
>>   - they write to same file (no guarantee if they write to different
>>   shards)
>>
>> As long as the above happens from same mount, things should be fine.
> Otherwise there could be problems.
>
>
>> There might be some more cases which are known to you , would be helpful
>> if you can describe us about those scenarios as well or may point us to the
>> relevant documents.
>>
> Also it would be helpful if you can suggest the most stable version of
>> glusterfs with sharding feature to use , since we would like to use this in
>> production.
>>
>
> It has been stable for a while, so use any of the latest maintained
> releases like 3.12.x or 4.1.x
>
> As I was mentioning already, sharding is mainly tested with
> VM/gluster-block workloads. So there could be some corner cases with single
> writer workload which we never ran into for the VM/block workloads we test.
> But you may run into them. Do let us know and we can take a look if you
> find something out of the ordinary. What I would suggest is to use one of
> the maintained releases and run the workloads you have for some time to
> test things out, once you feel confident, you can put it in production.
>
> HTH
>
>>
>> Thanks
>> Ashayam Gupta
>>
>> On Tue, Sep 18, 2018 at 11:00 AM Pranith Kumar Karampuri <
>> pkara...@redhat.com> wrote:
>>
>>>
>>>
>>> On Mon, Sep 17, 2018 at 4:14 AM Ashayam Gupta <
>>> ashayam.gu...@alpha-grep.com> wrote:
>>>
 Hi All,

 We are currently using glusterfs for storing large files with
 write-once and multiple concurrent reads, and were interested in
 understanding one of the features of glusterfs called sharding for our use
 case.

 So far from the talk given by the developer [
 https://www.youtube.com/watch?v=aAlLy9k65Gw] and the git issue [
 https://github.com/gluster/glusterfs/issues/290] , we know that it was
 developed for large VM images as use case and the second link does 

Re: [Gluster-users] sharding in glusterfs

2018-09-20 Thread Pranith Kumar Karampuri
On Wed, Sep 19, 2018 at 11:37 AM Ashayam Gupta 
wrote:

> Please find our workload details as requested by you :
>
> * Only 1 write-mount point as of now
> * Read-Mount : Since we auto-scale our machines this can be as big as
> 300-400 machines during peak times
> * >" multiple concurrent reads means that Reads will not happen until the
> file is completely written to"  Yes , in our current scenario we can ensure
> that indeed this is the case.
>
> But when you say it only supports single writer workload we would like to
> understand the following scenarios with respect to multiple writers and the
> current behaviour of glusterfs with sharding
>
>- Multiple Writer writes to different files
>
> When I say multiple writers, I mean multiple mounts. Since you were saying
earlier there is only one mount which does all writes, everything should
work as expected.

>
>- Multiple Writer writes to same file
>   - they write to same file but different shards of same file
>   - they write to same file (no guarantee if they write to different
>   shards)
>
> As long as the above happens from same mount, things should be fine.
Otherwise there could be problems.


> There might be some more cases which are known to you , would be helpful
> if you can describe us about those scenarios as well or may point us to the
> relevant documents.
>
Also it would be helpful if you can suggest the most stable version of
> glusterfs with sharding feature to use , since we would like to use this in
> production.
>

It has been stable for a while, so use any of the latest maintained
releases like 3.12.x or 4.1.x

As I was mentioning already, sharding is mainly tested with
VM/gluster-block workloads. So there could be some corner cases with single
writer workload which we never ran into for the VM/block workloads we test.
But you may run into them. Do let us know and we can take a look if you
find something out of the ordinary. What I would suggest is to use one of
the maintained releases and run the workloads you have for some time to
test things out, once you feel confident, you can put it in production.

HTH

>
> Thanks
> Ashayam Gupta
>
> On Tue, Sep 18, 2018 at 11:00 AM Pranith Kumar Karampuri <
> pkara...@redhat.com> wrote:
>
>>
>>
>> On Mon, Sep 17, 2018 at 4:14 AM Ashayam Gupta <
>> ashayam.gu...@alpha-grep.com> wrote:
>>
>>> Hi All,
>>>
>>> We are currently using glusterfs for storing large files with write-once
>>> and multiple concurrent reads, and were interested in understanding one of
>>> the features of glusterfs called sharding for our use case.
>>>
>>> So far from the talk given by the developer [
>>> https://www.youtube.com/watch?v=aAlLy9k65Gw] and the git issue [
>>> https://github.com/gluster/glusterfs/issues/290] , we know that it was
>>> developed for large VM images as use case and the second link does talk
>>> about a more general purpose usage , but we are not clear if there are some
>>> issues if used for non-VM image large files [which is the use case for us].
>>>
>>> Therefore it would be helpful if we can have some pointers or more
>>> information about the more general use-case scenario for sharding and any
>>> shortcomings if any , in case we use it for our scenario which is non-VM
>>> large files with write-once and multiple concurrent reads.Also it would be
>>> very helpful if you can suggest the best approach/settings for our use case
>>> scenario.
>>>
>>
>> Sharding is developed for Big file usecases and at the moment only
>> supports single writer workload. I also added the maintainers for sharding
>> to the thread. May be giving a bit of detail about access pattern w.r.t.
>> number of mounts that are used for writing/reading would be helpful. I am
>> assuming write-once and multiple concurrent reads means that Reads will not
>> happen until the file is completely written to. Could you explain  a bit
>> more about the workload?
>>
>>
>>>
>>> Thanks
>>> Ashayam Gupta
>>> ___
>>> Gluster-users mailing list
>>> Gluster-users@gluster.org
>>> https://lists.gluster.org/mailman/listinfo/gluster-users
>>
>>
>>
>> --
>> Pranith
>>
>

-- 
Pranith
___
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] sharding in glusterfs

2018-09-19 Thread Ashayam Gupta
Please find our workload details as requested by you :

* Only 1 write-mount point as of now
* Read-Mount : Since we auto-scale our machines this can be as big as
300-400 machines during peak times
* >" multiple concurrent reads means that Reads will not happen until the
file is completely written to"  Yes , in our current scenario we can ensure
that indeed this is the case.

But when you say it only supports single writer workload we would like to
understand the following scenarios with respect to multiple writers and the
current behaviour of glusterfs with sharding

   - Multiple Writer writes to different files
   - Multiple Writer writes to same file
  - they write to same file but different shards of same file
  - they write to same file (no guarantee if they write to different
  shards)

There might be some more cases which are known to you , would be helpful if
you can describe us about those scenarios as well or may point us to the
relevant documents.
Also it would be helpful if you can suggest the most stable version of
glusterfs with sharding feature to use , since we would like to use this in
production.

Thanks
Ashayam Gupta

On Tue, Sep 18, 2018 at 11:00 AM Pranith Kumar Karampuri <
pkara...@redhat.com> wrote:

>
>
> On Mon, Sep 17, 2018 at 4:14 AM Ashayam Gupta <
> ashayam.gu...@alpha-grep.com> wrote:
>
>> Hi All,
>>
>> We are currently using glusterfs for storing large files with write-once
>> and multiple concurrent reads, and were interested in understanding one of
>> the features of glusterfs called sharding for our use case.
>>
>> So far from the talk given by the developer [
>> https://www.youtube.com/watch?v=aAlLy9k65Gw] and the git issue [
>> https://github.com/gluster/glusterfs/issues/290] , we know that it was
>> developed for large VM images as use case and the second link does talk
>> about a more general purpose usage , but we are not clear if there are some
>> issues if used for non-VM image large files [which is the use case for us].
>>
>> Therefore it would be helpful if we can have some pointers or more
>> information about the more general use-case scenario for sharding and any
>> shortcomings if any , in case we use it for our scenario which is non-VM
>> large files with write-once and multiple concurrent reads.Also it would be
>> very helpful if you can suggest the best approach/settings for our use case
>> scenario.
>>
>
> Sharding is developed for Big file usecases and at the moment only
> supports single writer workload. I also added the maintainers for sharding
> to the thread. May be giving a bit of detail about access pattern w.r.t.
> number of mounts that are used for writing/reading would be helpful. I am
> assuming write-once and multiple concurrent reads means that Reads will not
> happen until the file is completely written to. Could you explain  a bit
> more about the workload?
>
>
>>
>> Thanks
>> Ashayam Gupta
>> ___
>> Gluster-users mailing list
>> Gluster-users@gluster.org
>> https://lists.gluster.org/mailman/listinfo/gluster-users
>
>
>
> --
> Pranith
>
___
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] sharding in glusterfs

2018-09-17 Thread Pranith Kumar Karampuri
On Mon, Sep 17, 2018 at 4:14 AM Ashayam Gupta 
wrote:

> Hi All,
>
> We are currently using glusterfs for storing large files with write-once
> and multiple concurrent reads, and were interested in understanding one of
> the features of glusterfs called sharding for our use case.
>
> So far from the talk given by the developer [
> https://www.youtube.com/watch?v=aAlLy9k65Gw] and the git issue [
> https://github.com/gluster/glusterfs/issues/290] , we know that it was
> developed for large VM images as use case and the second link does talk
> about a more general purpose usage , but we are not clear if there are some
> issues if used for non-VM image large files [which is the use case for us].
>
> Therefore it would be helpful if we can have some pointers or more
> information about the more general use-case scenario for sharding and any
> shortcomings if any , in case we use it for our scenario which is non-VM
> large files with write-once and multiple concurrent reads.Also it would be
> very helpful if you can suggest the best approach/settings for our use case
> scenario.
>

Sharding is developed for big-file use cases and at the moment only supports
a single-writer workload. I also added the maintainers for sharding to the
thread. Maybe giving a bit of detail about the access pattern w.r.t. the
number of mounts that are used for writing/reading would be helpful. I am
assuming write-once and multiple concurrent reads means that reads will not
happen until the file is completely written. Could you explain a bit more
about the workload?


>
> Thanks
> Ashayam Gupta
> ___
> Gluster-users mailing list
> Gluster-users@gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-users



-- 
Pranith
___
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] sharding in glusterfs

2018-09-17 Thread Serkan Çoban
Did you try a disperse volume? It may work for your workload, I think.
We are using disperse volumes for archive workloads with 2GB files and
I have not encountered any problems.
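
For example, a dispersed (erasure-coded) volume can be created along these
lines (a sketch only - host names, brick paths and the 4+2 layout are
placeholders):

   gluster volume create VOLNAME disperse 6 redundancy 2 \
       node1:/bricks/b1 node2:/bricks/b1 node3:/bricks/b1 \
       node4:/bricks/b1 node5:/bricks/b1 node6:/bricks/b1
   gluster volume start VOLNAME
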
On Mon, Sep 17, 2018 at 1:43 AM Ashayam Gupta
 wrote:
>
> Hi All,
>
> We are currently using glusterfs for storing large files with write-once and 
> multiple concurrent reads, and were interested in understanding one of the 
> features of glusterfs called sharding for our use case.
>
> So far from the talk given by the developer 
> [https://www.youtube.com/watch?v=aAlLy9k65Gw] and the git issue 
> [https://github.com/gluster/glusterfs/issues/290] , we know that it was 
> developed for large VM images as use case and the second link does talk about 
> a more general purpose usage , but we are not clear if there are some issues 
> if used for non-VM image large files [which is the use case for us].
>
> Therefore it would be helpful if we can have some pointers or more 
> information about the more general use-case scenario for sharding and any 
> shortcomings if any , in case we use it for our scenario which is non-VM 
> large files with write-once and multiple concurrent reads.Also it would be 
> very helpful if you can suggest the best approach/settings for our use case 
> scenario.
>
> Thanks
> Ashayam Gupta
> ___
> Gluster-users mailing list
> Gluster-users@gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-users
___
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users


[Gluster-users] sharding in glusterfs

2018-09-16 Thread Ashayam Gupta
Hi All,

We are currently using glusterfs for storing large files with write-once
and multiple concurrent reads, and were interested in understanding one of
the features of glusterfs called sharding for our use case.

So far from the talk given by the developer [
https://www.youtube.com/watch?v=aAlLy9k65Gw] and the git issue [
https://github.com/gluster/glusterfs/issues/290] , we know that it was
developed for large VM images as use case and the second link does talk
about a more general purpose usage , but we are not clear if there are some
issues if used for non-VM image large files [which is the use case for us].

Therefore it would be helpful if we can have some pointers or more
information about the more general use-case scenario for sharding and any
shortcomings if any , in case we use it for our scenario which is non-VM
large files with write-once and multiple concurrent reads.Also it would be
very helpful if you can suggest the best approach/settings for our use case
scenario.

Thanks
Ashayam Gupta
___
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] Sharding problem - multiple shard copies with mismatching gfids

2018-04-06 Thread Ian Halliday

Raghavendra,

Thanks! I'll get you this info within the next few days and will file a 
bug report at the same time.


For what its worth, we were able to reproduce the issue on a completely 
new cluster running 3.13. The IO pattern that most easily causes it to 
fail is a VM image format with XFS. Formatting VMs with Ext4 will create 
the additional shard files, but the GFIDs will usually match. I'm not 
sure if there are supposed to be 2 identical shard filenames, with one 
being empty, but they don't seem to cause VMs to pause or fail when the 
GFID matches.
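
For anyone else trying to reproduce it, the trigger described above boils
down to something like the following (a sketch only - the image path, size
and partitioning commands are assumptions on my part):

   # create a fresh disk image on the sharded gluster volume
   qemu-img create -f raw /mnt/gluster-vol/test-disk.img 50G
   # attach it to a test VM, then inside the guest:
   #   parted /dev/vdb mklabel gpt
   #   parted /dev/vdb mkpart primary xfs 0% 100%
   #   mkfs.xfs /dev/vdb1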


Both of these clusters are pure SSD (one replica 3 arbiter 1, the other 
replica 3). I haven't seen any issues with our non-SSD clusters yet, but 
they aren't pushed as hard.


Ian

-- Original Message --
From: "Raghavendra Gowdappa" <rgowd...@redhat.com>
To: "Ian Halliday" <ihalli...@ndevix.com>
Cc: "Krutika Dhananjay" <kdhan...@redhat.com>; "gluster-user" 
<gluster-users@gluster.org>; "Nithya Balachandran" <nbala...@redhat.com>

Sent: 4/5/2018 10:39:47 PM
Subject: Re: Re[2]: [Gluster-users] Sharding problem - multiple shard 
copies with mismatching gfids



Sorry for the delay, Ian :).

This looks to be a genuine issue which will require some effort to fix.
Can you file a bug? I need the following information attached to the bug:


* Client and brick logs. If you can reproduce the issue, please set
diagnostics.client-log-level and diagnostics.brick-log-level to TRACE.
If you cannot reproduce the issue, or cannot accommodate such big logs,
please set the log level to DEBUG.
* If possible, a simple reproducer. A simple script or a list of steps is
appreciated.
* strace of the VM (to find out the I/O pattern). If possible, a dump of
the traffic between the kernel and glusterfs. This can be captured by
mounting glusterfs with the --dump-fuse option (a sketch follows below).
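
A sketch of what that diagnostic setup might look like (volume name, server
and mount point are placeholders):

   gluster volume set VOLNAME diagnostics.client-log-level TRACE
   gluster volume set VOLNAME diagnostics.brick-log-level TRACE
   # mount a test client that also dumps the kernel<->glusterfs fuse traffic
   glusterfs --volfile-server=SERVER --volfile-id=VOLNAME \
             --dump-fuse=/var/log/glusterfs/fuse-dump /mnt/test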


Note that the logs you've posted here capture the scenario _after_ the
shard file has gone into a bad state. But I need information on what led
to that situation. So, please start collecting this diagnostic
information as early as you can.


regards,
Raghavendra

On Tue, Apr 3, 2018 at 7:52 AM, Ian Halliday <ihalli...@ndevix.com> 
wrote:

Raghavendra,

Sorry for the late follow up. I have some more data on the issue.

The issue tends to happen when the shards are created. The easiest 
time to reproduce this is during an initial VM disk format. This is a 
log from a test VM that was launched, and then partitioned and 
formatted with LVM / XFS:


[2018-04-03 02:05:00.838440] W [MSGID: 109048] 
[dht-common.c:9732:dht_rmdir_cached_lookup_cbk] 0-ovirt-350-zone1-dht: 
/489c6fb7-fe61-4407-8160-35c0aac40c85/images/_remove_me_9a0660e1-bd86-47ea-8e09-865c14f11f26/e2645bd1-a7f3-4cbd-9036-3d3cbc7204cd.meta 
found on cached subvol ovirt-350-zone1-replicate-5
[2018-04-03 02:07:57.967489] I [MSGID: 109070] 
[dht-common.c:2796:dht_lookup_linkfile_cbk] 0-ovirt-350-zone1-dht: 
Lookup of /.shard/927c6620-848b-4064-8c88-68a332b645c2.7 on 
ovirt-350-zone1-replicate-3 (following linkfile) failed ,gfid = 
---- [No such file or directory]
[2018-04-03 02:07:57.974815] I [MSGID: 109069] 
[dht-common.c:2095:dht_lookup_unlink_stale_linkto_cbk] 
0-ovirt-350-zone1-dht: Returned with op_ret 0 and op_errno 0 for 
/.shard/927c6620-848b-4064-8c88-68a332b645c2.3
[2018-04-03 02:07:57.979851] W [MSGID: 109009] 
[dht-common.c:2831:dht_lookup_linkfile_cbk] 0-ovirt-350-zone1-dht: 
/.shard/927c6620-848b-4064-8c88-68a332b645c2.3: gfid different on data 
file on ovirt-350-zone1-replicate-3, gfid local = 
----, gfid node = 
55f86aa0-e7a0-4075-b46b-a11f8bdbbceb
[2018-04-03 02:07:57.980716] W [MSGID: 109009] 
[dht-common.c:2570:dht_lookup_everywhere_cbk] 0-ovirt-350-zone1-dht: 
/.shard/927c6620-848b-4064-8c88-68a332b645c2.3: gfid differs on 
subvolume ovirt-350-zone1-replicate-3, gfid local = 
b1e3f299-32ff-497e-918b-090e957090f6, gfid node = 
55f86aa0-e7a0-4075-b46b-a11f8bdbbceb
[2018-04-03 02:07:57.980763] E [MSGID: 133010] 
[shard.c:1724:shard_common_lookup_shards_cbk] 0-ovirt-350-zone1-shard: 
Lookup on shard 3 failed. Base file gfid = 
927c6620-848b-4064-8c88-68a332b645c2 [Stale file handle]
[2018-04-03 02:07:57.983016] I [MSGID: 109069] 
[dht-common.c:2095:dht_lookup_unlink_stale_linkto_cbk] 
0-ovirt-350-zone1-dht: Returned with op_ret 0 and op_errno 0 for 
/.shard/927c6620-848b-4064-8c88-68a332b645c2.7
[2018-04-03 02:07:57.988761] W [MSGID: 109009] 
[dht-common.c:2570:dht_lookup_everywhere_cbk] 0-ovirt-350-zone1-dht: 
/.shard/927c6620-848b-4064-8c88-68a332b645c2.3: gfid differs on 
subvolume ovirt-350-zone1-replicate-3, gfid local = 
b1e3f299-32ff-497e-918b-090e957090f6, gfid node = 
55f86aa0-e7a0-4075-b46b-a11f8bdbbceb
[2018-04-03 02:07:57.988844] W [MSGID: 109009] 
[dht-common.c:2831:dht_lookup_linkfile_cbk] 0-ovirt-350-zone1-dht: 
/.shard/927c6620-848b-4064-8c88-68a332b645c2.7: gfid differe

Re: [Gluster-users] Sharding problem - multiple shard copies with mismatching gfids

2018-04-05 Thread Raghavendra Gowdappa
rick1: 10.0.6.100:/gluster/brick1/brick
> Brick2: 10.0.6.101:/gluster/brick1/brick
> Brick3: 10.0.6.102:/gluster/arbrick1/brick (arbiter)
> Brick4: 10.0.6.100:/gluster/brick2/brick
> Brick5: 10.0.6.101:/gluster/brick2/brick
> Brick6: 10.0.6.102:/gluster/arbrick2/brick (arbiter)
> Brick7: 10.0.6.100:/gluster/brick3/brick
> Brick8: 10.0.6.101:/gluster/brick3/brick
> Brick9: 10.0.6.102:/gluster/arbrick3/brick (arbiter)
> Brick10: 10.0.6.100:/gluster/brick4/brick
> Brick11: 10.0.6.101:/gluster/brick4/brick
> Brick12: 10.0.6.102:/gluster/arbrick4/brick (arbiter)
> Brick13: 10.0.6.100:/gluster/brick5/brick
> Brick14: 10.0.6.101:/gluster/brick5/brick
> Brick15: 10.0.6.102:/gluster/arbrick5/brick (arbiter)
> Brick16: 10.0.6.100:/gluster/brick6/brick
> Brick17: 10.0.6.101:/gluster/brick6/brick
> Brick18: 10.0.6.102:/gluster/arbrick6/brick (arbiter)
> Brick19: 10.0.6.100:/gluster/brick7/brick
> Brick20: 10.0.6.101:/gluster/brick7/brick
> Brick21: 10.0.6.102:/gluster/arbrick7/brick (arbiter)
> Options Reconfigured:
> cluster.server-quorum-type: server
> cluster.data-self-heal-algorithm: full
> performance.client-io-threads: off
> server.allow-insecure: on
> client.event-threads: 8
> storage.owner-gid: 36
> storage.owner-uid: 36
> server.event-threads: 16
> features.shard-block-size: 5GB
> features.shard: on
> transport.address-family: inet
> nfs.disable: yes
>
> Any suggestions?
>
>
> -- Ian
>
>
> -- Original Message --
> From: "Raghavendra Gowdappa" <rgowd...@redhat.com>
> To: "Krutika Dhananjay" <kdhan...@redhat.com>
> Cc: "Ian Halliday" <ihalli...@ndevix.com>; "gluster-user" <
> gluster-users@gluster.org>; "Nithya Balachandran" <nbala...@redhat.com>
> Sent: 3/26/2018 2:37:21 AM
> Subject: Re: [Gluster-users] Sharding problem - multiple shard copies with
> mismatching gfids
>
> Ian,
>
> Do you have a reproducer for this bug? If not a specific one, a general
> outline of what operations were done on the file will help.
>
> regards,
> Raghavendra
>
> On Mon, Mar 26, 2018 at 12:55 PM, Raghavendra Gowdappa <
> rgowd...@redhat.com> wrote:
>
>>
>>
>> On Mon, Mar 26, 2018 at 12:40 PM, Krutika Dhananjay <kdhan...@redhat.com>
>> wrote:
>>
>>> The gfid mismatch here is between the shard and its "link-to" file, the
>>> creation of which happens at a layer below that of shard translator on the
>>> stack.
>>>
>>> Adding DHT devs to take a look.
>>>
>>
>> Thanks Krutika. I assume shard doesn't do any dentry operations like
>> rename, link, unlink on the path of file (not the gfid handle based path)
>> internally while managing shards. Can you confirm? If it does these
>> operations, what fops does it do?
>>
>> @Ian,
>>
>> I can suggest following way to fix the problem:
>> * Since one of files listed is a DHT linkto file, I am assuming there is
>> only one shard of the file. If not, please list out gfids of other shards
>> and don't proceed with healing procedure.
>> * If gfids of all shards happen to be same and only linkto has a
>> different gfid, please proceed to step 3. Otherwise abort the healing
>> procedure.
>> * If cluster.lookup-optimize is set to true abort the healing procedure
>> * Delete the linkto file - the file with permissions ---T and xattr
>> trusted.dht.linkto and do a lookup on the file from mount point after
>> turning off readdirplus [1].
>>
>> As to reasons on how we ended up in this situation, Can you explain me
>> what is the I/O pattern on this file - like are there lots of entry
>> operations like rename, link, unlink etc on the file? There have been known
>> races in rename/lookup-heal-creating-linkto where linkto and data file
>> have different gfids. [2] fixes some of these cases
>>
>> [1] http://lists.gluster.org/pipermail/gluster-users/2017-March/
>> 030148.html
>> [2] https://review.gluster.org/#/c/19547/
>>
>> regards,
>> Raghavendra
>>
>>>
>>>
>>>> -Krutika
>>>
>>> On Mon, Mar 26, 2018 at 1:09 AM, Ian Halliday <ihalli...@ndevix.com>
>>> wrote:
>>>
>>>> Hello all,
>>>>
>>>> We are having a rather interesting problem with one of our VM storage
>>>> systems. The GlusterFS client is throwing errors relating to GFID
>>>> mismatches. We traced this down to multiple shards being present on the
>>>> gluster nodes, with different gfids.
>>>>
>>>> Hypervisor gluster mount log:
&

Re: [Gluster-users] Sharding problem - multiple shard copies with mismatching gfids

2018-04-02 Thread Ian Halliday
mon.c:2570:dht_lookup_everywhere_cbk] 0-ovirt-350-zone1-dht: 
/.shard/927c6620-848b-4064-8c88-68a332b645c2.3: gfid differs on 
subvolume ovirt-350-zone1-replicate-3, gfid local = 
0a701104-e9a2-44c0-8181-4a9a6edecf9f, gfid node = 
55f86aa0-e7a0-4075-b46b-a11f8bdbbceb
[2018-04-03 02:07:57.999899] E [MSGID: 133010] 
[shard.c:1724:shard_common_lookup_shards_cbk] 0-ovirt-350-zone1-shard: 
Lookup on shard 3 failed. Base file gfid = 
927c6620-848b-4064-8c88-68a332b645c2 [Stale file handle]
[2018-04-03 02:07:57.42] W [fuse-bridge.c:896:fuse_attr_cbk] 
0-glusterfs-fuse: 22338: FSTAT() 
/489c6fb7-fe61-4407-8160-35c0aac40c85/images/a717e25c-f108-4367-9d28-9235bd432bb7/5a8e541e-8883-4dec-8afd-aa29f38ef502 
=> -1 (Stale file handle)
[2018-04-03 02:07:57.987941] I [MSGID: 109069] 
[dht-common.c:2095:dht_lookup_unlink_stale_linkto_cbk] 
0-ovirt-350-zone1-dht: Returned with op_ret 0 and op_errno 0 for 
/.shard/927c6620-848b-4064-8c88-68a332b645c2.3



Duplicate shards are created. Output from one of the gluster nodes:

# find -name 927c6620-848b-4064-8c88-68a332b645c2.*
./brick1/brick/.shard/927c6620-848b-4064-8c88-68a332b645c2.19
./brick1/brick/.shard/927c6620-848b-4064-8c88-68a332b645c2.9
./brick1/brick/.shard/927c6620-848b-4064-8c88-68a332b645c2.7
./brick3/brick/.shard/927c6620-848b-4064-8c88-68a332b645c2.5
./brick3/brick/.shard/927c6620-848b-4064-8c88-68a332b645c2.3
./brick4/brick/.shard/927c6620-848b-4064-8c88-68a332b645c2.19
./brick4/brick/.shard/927c6620-848b-4064-8c88-68a332b645c2.9
./brick4/brick/.shard/927c6620-848b-4064-8c88-68a332b645c2.5
./brick4/brick/.shard/927c6620-848b-4064-8c88-68a332b645c2.3
./brick4/brick/.shard/927c6620-848b-4064-8c88-68a332b645c2.7

[root@n1 gluster]# getfattr -d -m . -e hex 
./brick1/brick/.shard/927c6620-848b-4064-8c88-68a332b645c2.19

# file: brick1/brick/.shard/927c6620-848b-4064-8c88-68a332b645c2.19
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.gfid=0x46083184a0e5468e89e6cc1db0bfc63b
trusted.gfid2path.77528eefc6a11c45=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f39323763363632302d383438622d343036342d386338382d363861326236343563322e3139
trusted.glusterfs.dht.linkto=0x6f766972742d3335302d7a6f6e65312d7265706c69636174652d3300

[root@n1 gluster]# getfattr -d -m . -e hex 
./brick4/brick/.shard/927c6620-848b-4064-8c88-68a332b645c2.19

# file: brick4/brick/.shard/927c6620-848b-4064-8c88-68a332b645c2.19
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.afr.dirty=0x
trusted.gfid=0x46083184a0e5468e89e6cc1db0bfc63b
trusted.gfid2path.77528eefc6a11c45=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f39323763363632302d383438622d343036342d386338382d363861326236343563322e3139


In the above example, the shard on Brick 1 is the bad one.

At this point, the VM will pause with an unknown storage error and will 
not boot until the offending shards are removed.



# gluster volume info
Volume Name: ovirt-350-zone1
Type: Distributed-Replicate
Volume ID: 106738ed-9951-4270-822e-63c9bcd0a20e
Status: Started
Snapshot Count: 0
Number of Bricks: 7 x (2 + 1) = 21
Transport-type: tcp
Bricks:
Brick1: 10.0.6.100:/gluster/brick1/brick
Brick2: 10.0.6.101:/gluster/brick1/brick
Brick3: 10.0.6.102:/gluster/arbrick1/brick (arbiter)
Brick4: 10.0.6.100:/gluster/brick2/brick
Brick5: 10.0.6.101:/gluster/brick2/brick
Brick6: 10.0.6.102:/gluster/arbrick2/brick (arbiter)
Brick7: 10.0.6.100:/gluster/brick3/brick
Brick8: 10.0.6.101:/gluster/brick3/brick
Brick9: 10.0.6.102:/gluster/arbrick3/brick (arbiter)
Brick10: 10.0.6.100:/gluster/brick4/brick
Brick11: 10.0.6.101:/gluster/brick4/brick
Brick12: 10.0.6.102:/gluster/arbrick4/brick (arbiter)
Brick13: 10.0.6.100:/gluster/brick5/brick
Brick14: 10.0.6.101:/gluster/brick5/brick
Brick15: 10.0.6.102:/gluster/arbrick5/brick (arbiter)
Brick16: 10.0.6.100:/gluster/brick6/brick
Brick17: 10.0.6.101:/gluster/brick6/brick
Brick18: 10.0.6.102:/gluster/arbrick6/brick (arbiter)
Brick19: 10.0.6.100:/gluster/brick7/brick
Brick20: 10.0.6.101:/gluster/brick7/brick
Brick21: 10.0.6.102:/gluster/arbrick7/brick (arbiter)
Options Reconfigured:
cluster.server-quorum-type: server
cluster.data-self-heal-algorithm: full
performance.client-io-threads: off
server.allow-insecure: on
client.event-threads: 8
storage.owner-gid: 36
storage.owner-uid: 36
server.event-threads: 16
features.shard-block-size: 5GB
features.shard: on
transport.address-family: inet
nfs.disable: yes

Any suggestions?


-- Ian


-- Original Message --
From: "Raghavendra Gowdappa" <rgowd...@redhat.com>
To: "Krutika Dhananjay" <kdhan...@redhat.com>
Cc: "Ian Halliday" <ihalli...@ndevix.com>; "gluster-user" 
<gluster-users@gluster.org>; "Nithya Balachandran" <nbala...@redhat.com>

Sent: 3/26/2018 2:37:21 AM
Subject: Re: [Gluster-users] Sharding problem - multiple shard copie

Re: [Gluster-users] Sharding problem - multiple shard copies with mismatching gfids

2018-03-26 Thread Ian Halliday

Raghavenda,

The issue typically appears during heavy write operations to the VM 
image. It's most noticeable during the filesystem creation process on a 
virtual machine image. I'll get some specific data while executing that 
process and will get back to you soon.


thanks


-- Ian

-- Original Message --
From: "Raghavendra Gowdappa" <rgowd...@redhat.com>
To: "Krutika Dhananjay" <kdhan...@redhat.com>
Cc: "Ian Halliday" <ihalli...@ndevix.com>; "gluster-user" 
<gluster-users@gluster.org>; "Nithya Balachandran" <nbala...@redhat.com>

Sent: 3/26/2018 2:37:21 AM
Subject: Re: [Gluster-users] Sharding problem - multiple shard copies 
with mismatching gfids



Ian,

Do you have a reproducer for this bug? If not a specific one, a general
outline of what operations were done on the file will help.


regards,
Raghavendra

On Mon, Mar 26, 2018 at 12:55 PM, Raghavendra Gowdappa 
<rgowd...@redhat.com> wrote:



On Mon, Mar 26, 2018 at 12:40 PM, Krutika Dhananjay 
<kdhan...@redhat.com> wrote:
The gfid mismatch here is between the shard and its "link-to" file, 
the creation of which happens at a layer below that of shard 
translator on the stack.


Adding DHT devs to take a look.


Thanks Krutika. I assume shard doesn't do any dentry operations like 
rename, link, unlink on the path of file (not the gfid handle based 
path) internally while managing shards. Can you confirm? If it does 
these operations, what fops does it do?


@Ian,

I can suggest following way to fix the problem:
* Since one of files listed is a DHT linkto file, I am assuming there 
is only one shard of the file. If not, please list out gfids of other 
shards and don't proceed with healing procedure.
* If gfids of all shards happen to be same and only linkto has a 
different gfid, please proceed to step 3. Otherwise abort the healing 
procedure.
* If cluster.lookup-optimize is set to true abort the healing 
procedure
* Delete the linkto file - the file with permissions ---T and 
xattr trusted.dht.linkto and do a lookup on the file from mount point 
after turning off readdirplus [1].


As to reasons on how we ended up in this situation, Can you explain me 
what is the I/O pattern on this file - like are there lots of entry 
operations like rename, link, unlink etc on the file? There have been 
known races in rename/lookup-heal-creating-linkto where linkto and 
data file have different gfids. [2] fixes some of these cases


[1] 
http://lists.gluster.org/pipermail/gluster-users/2017-March/030148.html 
<http://lists.gluster.org/pipermail/gluster-users/2017-March/030148.html>
[2] https://review.gluster.org/#/c/19547/ 
<https://review.gluster.org/#/c/19547/>


regards,
Raghavendra





-Krutika

On Mon, Mar 26, 2018 at 1:09 AM, Ian Halliday <ihalli...@ndevix.com> 
wrote:

Hello all,

We are having a rather interesting problem with one of our VM 
storage systems. The GlusterFS client is throwing errors relating to 
GFID mismatches. We traced this down to multiple shards being 
present on the gluster nodes, with different gfids.


Hypervisor gluster mount log:

[2018-03-25 18:54:19.261733] E [MSGID: 133010] 
[shard.c:1724:shard_common_lookup_shards_cbk] 0-ovirt-zone1-shard: 
Lookup on shard 7 failed. Base file gfid = 
87137cac-49eb-492a-8f33-8e33470d8cb7 [Stale file handle]
The message "W [MSGID: 109009] 
[dht-common.c:2162:dht_lookup_linkfile_cbk] 0-ovirt-zone1-dht: 
/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7: gfid different on 
data file on ovirt-zone1-replicate-3, gfid local = 
----, gfid node = 
57c6fcdf-52bb-4f7a-aea4-02f0dc81ff56 " repeated 2 times between 
[2018-03-25 18:54:19.253748] and [2018-03-25 18:54:19.263576]
[2018-03-25 18:54:19.264349] W [MSGID: 109009] 
[dht-common.c:1901:dht_lookup_everywhere_cbk] 0-ovirt-zone1-dht: 
/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7: gfid differs on 
subvolume ovirt-zone1-replicate-3, gfid local = 
fdf0813b-718a-4616-a51b-6999ebba9ec3, gfid node = 
57c6fcdf-52bb-4f7a-aea4-02f0dc81ff56



On the storage nodes, we found this:

[root@n1 gluster]# find -name 87137cac-49eb-492a-8f33-8e33470d8cb7.7
./brick2/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
./brick4/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7

[root@n1 gluster]# ls -lh 
./brick2/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
-T. 2 root root 0 Mar 25 13:55 
./brick2/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
[root@n1 gluster]# ls -lh 
./brick4/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
-rw-rw. 2 root root 3.8G Mar 25 13:55 
./brick4/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7


[root@n1 gluster]# getfattr -d -m . -e hex 
./brick2/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7

# file: brick2/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
security.selinux=0x73797374656d5

Re: [Gluster-users] Sharding problem - multiple shard copies with mismatching gfids

2018-03-26 Thread Raghavendra Gowdappa
Ian,

Do you have a reproducer for this bug? If not a specific one, a general
outline of what operations were done on the file will help.

regards,
Raghavendra

On Mon, Mar 26, 2018 at 12:55 PM, Raghavendra Gowdappa 
wrote:

>
>
> On Mon, Mar 26, 2018 at 12:40 PM, Krutika Dhananjay 
> wrote:
>
>> The gfid mismatch here is between the shard and its "link-to" file, the
>> creation of which happens at a layer below that of shard translator on the
>> stack.
>>
>> Adding DHT devs to take a look.
>>
>
> Thanks Krutika. I assume shard doesn't do any dentry operations like
> rename, link, unlink on the path of file (not the gfid handle based path)
> internally while managing shards. Can you confirm? If it does these
> operations, what fops does it do?
>
> @Ian,
>
> I can suggest following way to fix the problem:
> * Since one of files listed is a DHT linkto file, I am assuming there is
> only one shard of the file. If not, please list out gfids of other shards
> and don't proceed with healing procedure.
> * If gfids of all shards happen to be same and only linkto has a different
> gfid, please proceed to step 3. Otherwise abort the healing procedure.
> * If cluster.lookup-optimize is set to true abort the healing procedure
> * Delete the linkto file - the file with permissions ---T and xattr
> trusted.dht.linkto and do a lookup on the file from mount point after
> turning off readdirplus [1].
>
> As to reasons on how we ended up in this situation, Can you explain me
> what is the I/O pattern on this file - like are there lots of entry
> operations like rename, link, unlink etc on the file? There have been known
> races in rename/lookup-heal-creating-linkto where linkto and data file
> have different gfids. [2] fixes some of these cases
>
> [1] http://lists.gluster.org/pipermail/gluster-users/2017-
> March/030148.html
> [2] https://review.gluster.org/#/c/19547/
>
> regards,
> Raghavendra
>
>>
>>
>>> -Krutika
>>
>> On Mon, Mar 26, 2018 at 1:09 AM, Ian Halliday 
>> wrote:
>>
>>> Hello all,
>>>
>>> We are having a rather interesting problem with one of our VM storage
>>> systems. The GlusterFS client is throwing errors relating to GFID
>>> mismatches. We traced this down to multiple shards being present on the
>>> gluster nodes, with different gfids.
>>>
>>> Hypervisor gluster mount log:
>>>
>>> [2018-03-25 18:54:19.261733] E [MSGID: 133010]
>>> [shard.c:1724:shard_common_lookup_shards_cbk] 0-ovirt-zone1-shard:
>>> Lookup on shard 7 failed. Base file gfid = 
>>> 87137cac-49eb-492a-8f33-8e33470d8cb7
>>> [Stale file handle]
>>> The message "W [MSGID: 109009] [dht-common.c:2162:dht_lookup_linkfile_cbk]
>>> 0-ovirt-zone1-dht: /.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7: gfid
>>> different on data file on ovirt-zone1-replicate-3, gfid local =
>>> ----, gfid node =
>>> 57c6fcdf-52bb-4f7a-aea4-02f0dc81ff56 " repeated 2 times between
>>> [2018-03-25 18:54:19.253748] and [2018-03-25 18:54:19.263576]
>>> [2018-03-25 18:54:19.264349] W [MSGID: 109009]
>>> [dht-common.c:1901:dht_lookup_everywhere_cbk] 0-ovirt-zone1-dht:
>>> /.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7: gfid differs on
>>> subvolume ovirt-zone1-replicate-3, gfid local =
>>> fdf0813b-718a-4616-a51b-6999ebba9ec3, gfid node =
>>> 57c6fcdf-52bb-4f7a-aea4-02f0dc81ff56
>>>
>>>
>>> On the storage nodes, we found this:
>>>
>>> [root@n1 gluster]# find -name 87137cac-49eb-492a-8f33-8e33470d8cb7.7
>>> ./brick2/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
>>> ./brick4/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
>>>
>>> [root@n1 gluster]# ls -lh ./brick2/brick/.shard/87137cac
>>> -49eb-492a-8f33-8e33470d8cb7.7
>>> -T. 2 root root 0 Mar 25 13:55 ./brick2/brick/.shard/87137cac
>>> -49eb-492a-8f33-8e33470d8cb7.7
>>> [root@n1 gluster]# ls -lh ./brick4/brick/.shard/87137cac
>>> -49eb-492a-8f33-8e33470d8cb7.7
>>> -rw-rw. 2 root root 3.8G Mar 25 13:55 ./brick4/brick/.shard/87137cac
>>> -49eb-492a-8f33-8e33470d8cb7.7
>>>
>>> [root@n1 gluster]# getfattr -d -m . -e hex
>>> ./brick2/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
>>> # file: brick2/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6
>>> c6162656c65645f743a733000
>>> trusted.gfid=0xfdf0813b718a4616a51b6999ebba9ec3
>>> trusted.glusterfs.dht.linkto=0x6f766972742d3335302d7a6f6e653
>>> 12d7265706c69636174652d3300
>>>
>>> [root@n1 gluster]# getfattr -d -m . -e hex
>>> ./brick4/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
>>> # file: brick4/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6
>>> c6162656c65645f743a733000
>>> trusted.afr.dirty=0x
>>> trusted.bit-rot.version=0x02005991419ce672
>>> trusted.gfid=0x57c6fcdf52bb4f7aaea402f0dc81ff56
>>>
>>>
>>> I'm wondering how they got created in the first place, and if anyone has
>>> any insight on how to fix it?

Re: [Gluster-users] Sharding problem - multiple shard copies with mismatching gfids

2018-03-26 Thread Raghavendra Gowdappa
On Mon, Mar 26, 2018 at 12:40 PM, Krutika Dhananjay 
wrote:

> The gfid mismatch here is between the shard and its "link-to" file, the
> creation of which happens at a layer below that of shard translator on the
> stack.
>
> Adding DHT devs to take a look.
>

Thanks Krutika. I assume shard doesn't do any dentry operations like
rename, link, unlink on the path of file (not the gfid handle based path)
internally while managing shards. Can you confirm? If it does these
operations, what fops does it do?

@Ian,

I can suggest the following way to fix the problem:
* Since one of the files listed is a DHT linkto file, I am assuming there is
only one shard of the file. If not, please list out the gfids of the other
shards and don't proceed with the healing procedure.
* If the gfids of all shards happen to be the same and only the linkto file
has a different gfid, please proceed to step 3. Otherwise abort the healing
procedure.
* If cluster.lookup-optimize is set to true, abort the healing procedure.
* Delete the linkto file - the 0-byte file with permissions ---T and the
xattr trusted.glusterfs.dht.linkto - and do a lookup on the file from the
mount point after turning off readdirplus [1]. A rough command sketch
follows below.
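
For illustration only, a rough sketch of what those steps could look like with
the shard from this thread (the brick path, volume name and mount point below
are placeholders, and the exact options needed to disable readdirplus are the
ones described in [1]):

   # On the storage node: confirm this really is the linkto file (0 bytes,
   # mode ending in T, trusted.glusterfs.dht.linkto xattr present), then remove it.
   getfattr -d -m . -e hex /gluster/brick2/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
   rm -f /gluster/brick2/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
   # If present, the matching gfid hard link under the brick's .glusterfs
   # directory (e.g. .glusterfs/fd/f0/fdf0813b-...) should go with it.

   # On a client: mount with readdirplus turned off and look up the affected
   # file so DHT can recreate a consistent linkto entry.
   mount -t glusterfs -o use-readdirp=no 10.0.6.100:/ovirt-350-zone1 /mnt/heal-test
   stat /mnt/heal-test/<path-to-the-affected-image>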

As to how we ended up in this situation, can you explain what the I/O pattern
on this file is - are there lots of entry operations like rename, link, unlink
etc. on the file? There have been known races in
rename/lookup-heal-creating-linkto where the linkto and data file end up with
different gfids. [2] fixes some of these cases.

[1] http://lists.gluster.org/pipermail/gluster-users/2017-March/030148.html
[2] https://review.gluster.org/#/c/19547/

regards,
Raghavendra

>
>
>> -Krutika
>
> On Mon, Mar 26, 2018 at 1:09 AM, Ian Halliday 
> wrote:
>
>> Hello all,
>>
>> We are having a rather interesting problem with one of our VM storage
>> systems. The GlusterFS client is throwing errors relating to GFID
>> mismatches. We traced this down to multiple shards being present on the
>> gluster nodes, with different gfids.
>>
>> Hypervisor gluster mount log:
>>
>> [2018-03-25 18:54:19.261733] E [MSGID: 133010]
>> [shard.c:1724:shard_common_lookup_shards_cbk] 0-ovirt-zone1-shard:
>> Lookup on shard 7 failed. Base file gfid = 
>> 87137cac-49eb-492a-8f33-8e33470d8cb7
>> [Stale file handle]
>> The message "W [MSGID: 109009] [dht-common.c:2162:dht_lookup_linkfile_cbk]
>> 0-ovirt-zone1-dht: /.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7: gfid
>> different on data file on ovirt-zone1-replicate-3, gfid local =
>> ----, gfid node =
>> 57c6fcdf-52bb-4f7a-aea4-02f0dc81ff56 " repeated 2 times between
>> [2018-03-25 18:54:19.253748] and [2018-03-25 18:54:19.263576]
>> [2018-03-25 18:54:19.264349] W [MSGID: 109009]
>> [dht-common.c:1901:dht_lookup_everywhere_cbk] 0-ovirt-zone1-dht:
>> /.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7: gfid differs on
>> subvolume ovirt-zone1-replicate-3, gfid local =
>> fdf0813b-718a-4616-a51b-6999ebba9ec3, gfid node =
>> 57c6fcdf-52bb-4f7a-aea4-02f0dc81ff56
>>
>>
>> On the storage nodes, we found this:
>>
>> [root@n1 gluster]# find -name 87137cac-49eb-492a-8f33-8e33470d8cb7.7
>> ./brick2/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
>> ./brick4/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
>>
>> [root@n1 gluster]# ls -lh ./brick2/brick/.shard/87137cac
>> -49eb-492a-8f33-8e33470d8cb7.7
>> -T. 2 root root 0 Mar 25 13:55 ./brick2/brick/.shard/87137cac
>> -49eb-492a-8f33-8e33470d8cb7.7
>> [root@n1 gluster]# ls -lh ./brick4/brick/.shard/87137cac
>> -49eb-492a-8f33-8e33470d8cb7.7
>> -rw-rw. 2 root root 3.8G Mar 25 13:55 ./brick4/brick/.shard/87137cac
>> -49eb-492a-8f33-8e33470d8cb7.7
>>
>> [root@n1 gluster]# getfattr -d -m . -e hex ./brick2/brick/.shard/87137cac
>> -49eb-492a-8f33-8e33470d8cb7.7
>> # file: brick2/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
>> security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6
>> c6162656c65645f743a733000
>> trusted.gfid=0xfdf0813b718a4616a51b6999ebba9ec3
>> trusted.glusterfs.dht.linkto=0x6f766972742d3335302d7a6f6e653
>> 12d7265706c69636174652d3300
>>
>> [root@n1 gluster]# getfattr -d -m . -e hex ./brick4/brick/.shard/87137cac
>> -49eb-492a-8f33-8e33470d8cb7.7
>> # file: brick4/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
>> security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6
>> c6162656c65645f743a733000
>> trusted.afr.dirty=0x
>> trusted.bit-rot.version=0x02005991419ce672
>> trusted.gfid=0x57c6fcdf52bb4f7aaea402f0dc81ff56
>>
>>
>> I'm wondering how they got created in the first place, and if anyone has
>> any insight on how to fix it?
>>
>> Storage nodes:
>> [root@n1 gluster]# gluster --version
>> glusterfs 4.0.0
>>
>> [root@n1 gluster]# gluster volume info
>>
>> Volume Name: ovirt-350-zone1
>> Type: Distributed-Replicate
>> Volume ID: 106738ed-9951-4270-822e-63c9bcd0a20e
>> Status: Started
>> Snapshot Count: 0
>> Number of Bricks: 7 x (2 + 1) = 21
>> Transport-type: tcp
>> 

Re: [Gluster-users] Sharding problem - multiple shard copies with mismatching gfids

2018-03-26 Thread Krutika Dhananjay
The gfid mismatch here is between the shard and its "link-to" file, the
creation of which happens at a layer below that of shard translator on the
stack.

Adding DHT devs to take a look.

-Krutika

On Mon, Mar 26, 2018 at 1:09 AM, Ian Halliday  wrote:

> Hello all,
>
> We are having a rather interesting problem with one of our VM storage
> systems. The GlusterFS client is throwing errors relating to GFID
> mismatches. We traced this down to multiple shards being present on the
> gluster nodes, with different gfids.
>
> Hypervisor gluster mount log:
>
> [2018-03-25 18:54:19.261733] E [MSGID: 133010] 
> [shard.c:1724:shard_common_lookup_shards_cbk]
> 0-ovirt-zone1-shard: Lookup on shard 7 failed. Base file gfid =
> 87137cac-49eb-492a-8f33-8e33470d8cb7 [Stale file handle]
> The message "W [MSGID: 109009] [dht-common.c:2162:dht_lookup_linkfile_cbk]
> 0-ovirt-zone1-dht: /.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7: gfid
> different on data file on ovirt-zone1-replicate-3, gfid local =
> ----, gfid node = 
> 57c6fcdf-52bb-4f7a-aea4-02f0dc81ff56
> " repeated 2 times between [2018-03-25 18:54:19.253748] and [2018-03-25
> 18:54:19.263576]
> [2018-03-25 18:54:19.264349] W [MSGID: 109009]
> [dht-common.c:1901:dht_lookup_everywhere_cbk] 0-ovirt-zone1-dht:
> /.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7: gfid differs on subvolume
> ovirt-zone1-replicate-3, gfid local = fdf0813b-718a-4616-a51b-6999ebba9ec3,
> gfid node = 57c6fcdf-52bb-4f7a-aea4-02f0dc81ff56
>
>
> On the storage nodes, we found this:
>
> [root@n1 gluster]# find -name 87137cac-49eb-492a-8f33-8e33470d8cb7.7
> ./brick2/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
> ./brick4/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
>
> [root@n1 gluster]# ls -lh ./brick2/brick/.shard/87137cac-49eb-492a-8f33-
> 8e33470d8cb7.7
> -T. 2 root root 0 Mar 25 13:55 ./brick2/brick/.shard/
> 87137cac-49eb-492a-8f33-8e33470d8cb7.7
> [root@n1 gluster]# ls -lh ./brick4/brick/.shard/87137cac-49eb-492a-8f33-
> 8e33470d8cb7.7
> -rw-rw. 2 root root 3.8G Mar 25 13:55 ./brick4/brick/.shard/
> 87137cac-49eb-492a-8f33-8e33470d8cb7.7
>
> [root@n1 gluster]# getfattr -d -m . -e hex ./brick2/brick/.shard/
> 87137cac-49eb-492a-8f33-8e33470d8cb7.7
> # file: brick2/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
> security.selinux=0x73797374656d5f753a6f626a6563
> 745f723a756e6c6162656c65645f743a733000
> trusted.gfid=0xfdf0813b718a4616a51b6999ebba9ec3
> trusted.glusterfs.dht.linkto=0x6f766972742d3335302d7a6f6e65
> 312d7265706c69636174652d3300
>
> [root@n1 gluster]# getfattr -d -m . -e hex ./brick4/brick/.shard/
> 87137cac-49eb-492a-8f33-8e33470d8cb7.7
> # file: brick4/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
> security.selinux=0x73797374656d5f753a6f626a6563
> 745f723a756e6c6162656c65645f743a733000
> trusted.afr.dirty=0x
> trusted.bit-rot.version=0x02005991419ce672
> trusted.gfid=0x57c6fcdf52bb4f7aaea402f0dc81ff56
>
>
> I'm wondering how they got created in the first place, and if anyone has
> any insight on how to fix it?
>
> Storage nodes:
> [root@n1 gluster]# gluster --version
> glusterfs 4.0.0
>
> [root@n1 gluster]# gluster volume info
>
> Volume Name: ovirt-350-zone1
> Type: Distributed-Replicate
> Volume ID: 106738ed-9951-4270-822e-63c9bcd0a20e
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 7 x (2 + 1) = 21
> Transport-type: tcp
> Bricks:
> Brick1: 10.0.6.100:/gluster/brick1/brick
> Brick2: 10.0.6.101:/gluster/brick1/brick
> Brick3: 10.0.6.102:/gluster/arbrick1/brick (arbiter)
> Brick4: 10.0.6.100:/gluster/brick2/brick
> Brick5: 10.0.6.101:/gluster/brick2/brick
> Brick6: 10.0.6.102:/gluster/arbrick2/brick (arbiter)
> Brick7: 10.0.6.100:/gluster/brick3/brick
> Brick8: 10.0.6.101:/gluster/brick3/brick
> Brick9: 10.0.6.102:/gluster/arbrick3/brick (arbiter)
> Brick10: 10.0.6.100:/gluster/brick4/brick
> Brick11: 10.0.6.101:/gluster/brick4/brick
> Brick12: 10.0.6.102:/gluster/arbrick4/brick (arbiter)
> Brick13: 10.0.6.100:/gluster/brick5/brick
> Brick14: 10.0.6.101:/gluster/brick5/brick
> Brick15: 10.0.6.102:/gluster/arbrick5/brick (arbiter)
> Brick16: 10.0.6.100:/gluster/brick6/brick
> Brick17: 10.0.6.101:/gluster/brick6/brick
> Brick18: 10.0.6.102:/gluster/arbrick6/brick (arbiter)
> Brick19: 10.0.6.100:/gluster/brick7/brick
> Brick20: 10.0.6.101:/gluster/brick7/brick
> Brick21: 10.0.6.102:/gluster/arbrick7/brick (arbiter)
> Options Reconfigured:
> cluster.min-free-disk: 50GB
> performance.strict-write-ordering: off
> performance.strict-o-direct: off
> nfs.disable: off
> performance.readdir-ahead: on
> transport.address-family: inet
> performance.cache-size: 1GB
> features.shard: on
> features.shard-block-size: 5GB
> server.event-threads: 8
> server.outstanding-rpc-limit: 128
> storage.owner-uid: 36
> storage.owner-gid: 36
> performance.quick-read: off
> performance.read-ahead: off
> performance.io-cache: off
> performance.stat-prefetch: on
> 

[Gluster-users] Sharding problem - multiple shard copies with mismatching gfids

2018-03-25 Thread Ian Halliday

Hello all,

We are having a rather interesting problem with one of our VM storage 
systems. The GlusterFS client is throwing errors relating to GFID 
mismatches. We traced this down to multiple shards being present on the 
gluster nodes, with different gfids.


Hypervisor gluster mount log:

[2018-03-25 18:54:19.261733] E [MSGID: 133010] 
[shard.c:1724:shard_common_lookup_shards_cbk] 0-ovirt-zone1-shard: 
Lookup on shard 7 failed. Base file gfid = 
87137cac-49eb-492a-8f33-8e33470d8cb7 [Stale file handle]
The message "W [MSGID: 109009] 
[dht-common.c:2162:dht_lookup_linkfile_cbk] 0-ovirt-zone1-dht: 
/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7: gfid different on data 
file on ovirt-zone1-replicate-3, gfid local = 
----, gfid node = 
57c6fcdf-52bb-4f7a-aea4-02f0dc81ff56 " repeated 2 times between 
[2018-03-25 18:54:19.253748] and [2018-03-25 18:54:19.263576]
[2018-03-25 18:54:19.264349] W [MSGID: 109009] 
[dht-common.c:1901:dht_lookup_everywhere_cbk] 0-ovirt-zone1-dht: 
/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7: gfid differs on 
subvolume ovirt-zone1-replicate-3, gfid local = 
fdf0813b-718a-4616-a51b-6999ebba9ec3, gfid node = 
57c6fcdf-52bb-4f7a-aea4-02f0dc81ff56



On the storage nodes, we found this:

[root@n1 gluster]# find -name 87137cac-49eb-492a-8f33-8e33470d8cb7.7
./brick2/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
./brick4/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7

[root@n1 gluster]# ls -lh 
./brick2/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
-T. 2 root root 0 Mar 25 13:55 
./brick2/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
[root@n1 gluster]# ls -lh 
./brick4/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
-rw-rw. 2 root root 3.8G Mar 25 13:55 
./brick4/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7


[root@n1 gluster]# getfattr -d -m . -e hex 
./brick2/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7

# file: brick2/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.gfid=0xfdf0813b718a4616a51b6999ebba9ec3
trusted.glusterfs.dht.linkto=0x6f766972742d3335302d7a6f6e65312d7265706c69636174652d3300

[root@n1 gluster]# getfattr -d -m . -e hex 
./brick4/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7

# file: brick4/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.afr.dirty=0x
trusted.bit-rot.version=0x02005991419ce672
trusted.gfid=0x57c6fcdf52bb4f7aaea402f0dc81ff56


I'm wondering how they got created in the first place, and if anyone has 
any insight on how to fix it?


Storage nodes:
[root@n1 gluster]# gluster --version
glusterfs 4.0.0

[root@n1 gluster]# gluster volume info

Volume Name: ovirt-350-zone1
Type: Distributed-Replicate
Volume ID: 106738ed-9951-4270-822e-63c9bcd0a20e
Status: Started
Snapshot Count: 0
Number of Bricks: 7 x (2 + 1) = 21
Transport-type: tcp
Bricks:
Brick1: 10.0.6.100:/gluster/brick1/brick
Brick2: 10.0.6.101:/gluster/brick1/brick
Brick3: 10.0.6.102:/gluster/arbrick1/brick (arbiter)
Brick4: 10.0.6.100:/gluster/brick2/brick
Brick5: 10.0.6.101:/gluster/brick2/brick
Brick6: 10.0.6.102:/gluster/arbrick2/brick (arbiter)
Brick7: 10.0.6.100:/gluster/brick3/brick
Brick8: 10.0.6.101:/gluster/brick3/brick
Brick9: 10.0.6.102:/gluster/arbrick3/brick (arbiter)
Brick10: 10.0.6.100:/gluster/brick4/brick
Brick11: 10.0.6.101:/gluster/brick4/brick
Brick12: 10.0.6.102:/gluster/arbrick4/brick (arbiter)
Brick13: 10.0.6.100:/gluster/brick5/brick
Brick14: 10.0.6.101:/gluster/brick5/brick
Brick15: 10.0.6.102:/gluster/arbrick5/brick (arbiter)
Brick16: 10.0.6.100:/gluster/brick6/brick
Brick17: 10.0.6.101:/gluster/brick6/brick
Brick18: 10.0.6.102:/gluster/arbrick6/brick (arbiter)
Brick19: 10.0.6.100:/gluster/brick7/brick
Brick20: 10.0.6.101:/gluster/brick7/brick
Brick21: 10.0.6.102:/gluster/arbrick7/brick (arbiter)
Options Reconfigured:
cluster.min-free-disk: 50GB
performance.strict-write-ordering: off
performance.strict-o-direct: off
nfs.disable: off
performance.readdir-ahead: on
transport.address-family: inet
performance.cache-size: 1GB
features.shard: on
features.shard-block-size: 5GB
server.event-threads: 8
server.outstanding-rpc-limit: 128
storage.owner-uid: 36
storage.owner-gid: 36
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: on
cluster.eager-lock: enable
network.remote-dio: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
cluster.data-self-heal-algorithm: full
performance.flush-behind: off
performance.write-behind-window-size: 8MB
client.event-threads: 8
server.allow-insecure: on


Client version:
[root@kvm573 ~]# gluster --version
glusterfs 3.12.5


Thanks!

- Ian
___
Gluster-users mailing list
Gluster-users@gluster.org

Re: [Gluster-users] Sharding option for distributed volumes

2017-09-22 Thread Pavel Kutishchev

Hello Ji-Hyeon,

Thanks, is that option available in the 3.12 gluster release? We're still on
3.8 and just playing around with the latest version in order to get our
solution migrated.


Thank you!


On 9/21/17 2:26 PM, Ji-Hyeon Gim wrote:

Hello Pavel!

In my opinion, you need to check the features.shard-block-size option first.
If a file is no bigger than this value, it will not be sharded :)

Also, if you enable sharding on a volume that is already in use, it may not
work. Sharding is supported in new deployments only, as there is currently no
upgrade path for this feature.
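
For reference, a quick way to check both settings, and to turn sharding on for
a brand-new volume before any data is written (the volume name here is just an
example):

   gluster volume get myvol features.shard
   gluster volume get myvol features.shard-block-size

   gluster volume set myvol features.shard on
   gluster volume set myvol features.shard-block-size 64MB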


On 2017-09-20 23:23, Pavel Kutishchev wrote:

Hello folks,

Would someone please advise how to use the sharding option for distributed
volumes? At the moment I'm facing a problem with exporting big files,
which are not being distributed across the bricks inside one volume.

Thank you in advance.



Best regards.

--

Ji-Hyeon Gim
Research Engineer, Gluesys

Address. Gluesys R Center, 5F, 11-31, Simin-daero 327beon-gil,
  Dongan-gu, Anyang-si,
  Gyeonggi-do, Korea
  (14055)
Phone.   +82-70-8787-1053
Fax. +82-31-388-3261
Mobile.  +82-10-7293-8858
E-Mail.  potato...@potatogim.net
Website. www.potatogim.net

The time I wasted today is the tomorrow the dead man was eager to see yesterday.
   - Sophocles




--
Best regards
Pavel Kutishchev
DevOPS Engineer at
Self employed.

___
Gluster-users mailing list
Gluster-users@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-users

[Gluster-users] Sharding option for distributed volumes

2017-09-21 Thread Pavel Kutishchev

Hello folks,

Would someone please advise how to use the sharding option for distributed 
volumes? At the moment I'm facing a problem with exporting big files, which 
are not being distributed across the bricks inside one volume.


Thank you in advance.


--
Best regards
Pavel Kutishchev
Golang DevOPS Engineer at
Self employed.

___
Gluster-users mailing list
Gluster-users@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] Sharding/Local Gluster Volumes

2017-04-06 Thread Vijay Bellur
On Thu, Apr 6, 2017 at 7:09 AM, Holger Rojahn  wrote:

> Hi,
>
> i ask a question several Days ago ...
> in short: Is it Possible to have 5 Bricks with Sharding enabled and
> replica count 3 to ensure that all files are on 3 of 5 bricks ?
> Manual says i can add any count of bricks to shared volumes but when i
> test it with replica count 2 i can use 4 bricks but not the 5th because it
> says must be 2 bricks to add...
>
> any idea ?
>
>
One possibility could be to split the 5th brick into two different arbiter
bricks. With that you would be able to have 4 data bricks, 2 arbiter bricks
and a replica 3 volume with arbiter.
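
For example, something along these lines would give a 2 x (2 + 1) layout with
both arbiter bricks carved out of the fifth server (hostnames and paths below
are made up):

   gluster volume create myvol replica 3 arbiter 1 \
       host1:/bricks/b1 host2:/bricks/b2 host5:/bricks/arb1 \
       host3:/bricks/b3 host4:/bricks/b4 host5:/bricks/arb2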

Regards,
Vijay
___
Gluster-users mailing list
Gluster-users@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] Sharding/Local Gluster Volumes

2017-04-06 Thread lemonnierk
Hi,

If you want replica 3, you must have a multiple of 3 bricks.
So no, you can't use 5 bricks for a replica 3; that's one of the
things gluster can't do, unfortunately.
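
The same multiple-of-replica rule applies when growing a volume later; a
sketch, with made-up hostnames and paths:

   # bricks must be added in sets matching the replica count (3 here)
   gluster volume add-brick myvol host6:/bricks/b7 host7:/bricks/b8 host8:/bricks/b9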


On Thu, Apr 06, 2017 at 01:09:32PM +0200, Holger Rojahn wrote:
> Hi,
> 
> i ask a question several Days ago ...
> in short: Is it Possible to have 5 Bricks with Sharding enabled and 
> replica count 3 to ensure that all files are on 3 of 5 bricks ?
> Manual says i can add any count of bricks to shared volumes but when i 
> test it with replica count 2 i can use 4 bricks but not the 5th because 
> it says must be 2 bricks to add...
> 
> any idea ?
> 
> greets from dinslaken (germany)
> holger ak icebear
> 
> ___
> Gluster-users mailing list
> Gluster-users@gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-users


signature.asc
Description: Digital signature
___
Gluster-users mailing list
Gluster-users@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-users

[Gluster-users] Sharding/Local Gluster Volumes

2017-04-06 Thread Holger Rojahn

Hi,

I asked a question several days ago ...
In short: is it possible to have 5 bricks with sharding enabled and a 
replica count of 3, to ensure that all files are on 3 of the 5 bricks?
The manual says I can add any number of bricks to sharded volumes, but when I 
test it with replica count 2 I can use 4 bricks but not the 5th, because it 
says bricks must be added in multiples of 2...


any idea ?

greets from dinslaken (germany)
holger ak icebear

___
Gluster-users mailing list
Gluster-users@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] Sharding?

2017-03-10 Thread Cedric Lemarchand
> On 10 Mar 2017, at 12:05, Krutika Dhananjay  wrote:
> 
> On Fri, Mar 10, 2017 at 4:09 PM, Cedric Lemarchand  > wrote:
> 
> > On 10 Mar 2017, at 10:33, Alessandro Briosi  > > wrote:
> >
> > On 10/03/2017 10:28, Kevin Lemonnier wrote:
> >>> I haven't done any test yet, but I was under the impression that
> >>> sharding feature isn't so stable/mature yet.
> >>> In the remote of my mind I remember reading something about a
> >>> bug/situation which caused data corruption.
> >>> Can someone confirm that sharding is stable enough to be used in
> >>> production and won't cause any data loss?
> >> There were a few bugs yeah. I can tell you that in 3.7.15 (and I assume
> >> later versions) it works well as long as you don't try to add new bricks
> >> to your volumes (we use it in production for HA virtual machine disks).
> >> Apparently that bug was fixed recently, so latest versions should be
> >> pretty stable yeah.
> >
> > I'm using 3.8.9, so I suppose all known bugs have been fixed there (also 
> > the one with adding briks)
> >
> > I'll then proceed with some tests before going to production.
> 
> I am still asking myself how such bug could happen on a clustered storage 
> software, where adding bricks is a base feature for scalable solution, like 
> Gluster. Or maybe is it that STM releases are really under tested compared to 
> LTM ones ? Could we states that STM release are really not made for 
> production, or at least really risky ?
> 
> Not entirely true. The same bug existed in LTM release too.
> 
> I did try reproducing the bug on my setup as soon as Lindsay, Kevin and 
> others started reporting about it, but it was never reproducible on my setup.
> Absence of proper logging in libgfapi upon failures only made it harder to 
> debug, even when the users successfully recreated the issue and shared
> their logs. It was only after Satheesaran recreated it successfully with FUSE 
> mount that the real debugging could begin, when fuse-bridge translator
> logged the exact error code for failure.

Indeed, an unreproducible bug is pretty hard to fix … thanks for the feedback. 
What would be the best way to find out about critical bugs in the different 
Gluster releases? Maybe browsing https://review.gluster.org/ or 
https://bugzilla.redhat.com - any advice?

Cheers

Cédric

___
Gluster-users mailing list
Gluster-users@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] Sharding?

2017-03-10 Thread Gandalf Corvotempesta
2017-03-10 11:39 GMT+01:00 Cedric Lemarchand :
> I am still asking myself how such bug could happen on a clustered storage 
> software, where adding bricks is a base feature for scalable solution, like 
> Gluster. Or maybe is it that STM releases are really under tested compared to 
> LTM ones ? Could we states that STM release are really not made for 
> production, or at least really risky ?

This is the same thing I reported some months ago.
I think it's probably the worst thing in gluster. Tons of critical
bugs for critical features (that are also the basic features of a
storage software) that lead to data loss, with fixes still waiting to be merged.

This kind of bug *MUST* be addressed, fixed and released *ASAP*, not
left waiting for review after months and months.
___
Gluster-users mailing list
Gluster-users@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] Sharding?

2017-03-10 Thread Krutika Dhananjay
On Fri, Mar 10, 2017 at 4:09 PM, Cedric Lemarchand 
wrote:

>
> > On 10 Mar 2017, at 10:33, Alessandro Briosi  wrote:
> >
> > On 10/03/2017 10:28, Kevin Lemonnier wrote:
> >>> I haven't done any test yet, but I was under the impression that
> >>> sharding feature isn't so stable/mature yet.
> >>> In the remote of my mind I remember reading something about a
> >>> bug/situation which caused data corruption.
> >>> Can someone confirm that sharding is stable enough to be used in
> >>> production and won't cause any data loss?
> >> There were a few bugs yeah. I can tell you that in 3.7.15 (and I assume
> >> later versions) it works well as long as you don't try to add new bricks
> >> to your volumes (we use it in production for HA virtual machine disks).
> >> Apparently that bug was fixed recently, so latest versions should be
> >> pretty stable yeah.
> >
> > I'm using 3.8.9, so I suppose all known bugs have been fixed there (also
> the one with adding briks)
> >
> > I'll then proceed with some tests before going to production.
>
> I am still asking myself how such bug could happen on a clustered storage
> software, where adding bricks is a base feature for scalable solution, like
> Gluster. Or maybe is it that STM releases are really under tested compared
> to LTM ones ? Could we states that STM release are really not made for
> production, or at least really risky ?
>

Not entirely true. The same bug existed in LTM release too.

I did try reproducing the bug on my setup as soon as Lindsay, Kevin and
others started reporting about it, but it was never reproducible on my
setup.
Absence of proper logging in libgfapi upon failures only made it harder to
debug, even when the users successfully recreated the issue and shared
their logs. It was only after Satheesaran recreated it successfully with
FUSE mount that the real debugging could begin, when fuse-bridge translator
logged the exact error code for failure.

-Krutika


> Sorry if the question could sounds a bit rude, but I think it still
> remains for newish peoples that had to make a choice on which release is
> better for production ;-)
>
> Cheers
>
> Cédric
>
> >
> > Thank you
> >
> > ___
> > Gluster-users mailing list
> > Gluster-users@gluster.org
> > http://lists.gluster.org/mailman/listinfo/gluster-users
>
> ___
> Gluster-users mailing list
> Gluster-users@gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-users
>
___
Gluster-users mailing list
Gluster-users@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] Sharding?

2017-03-10 Thread Cedric Lemarchand

> On 10 Mar 2017, at 10:33, Alessandro Briosi  wrote:
> 
> On 10/03/2017 10:28, Kevin Lemonnier wrote:
>>> I haven't done any test yet, but I was under the impression that
>>> sharding feature isn't so stable/mature yet.
>>> In the remote of my mind I remember reading something about a
>>> bug/situation which caused data corruption.
>>> Can someone confirm that sharding is stable enough to be used in
>>> production and won't cause any data loss?
>> There were a few bugs yeah. I can tell you that in 3.7.15 (and I assume
>> later versions) it works well as long as you don't try to add new bricks
>> to your volumes (we use it in production for HA virtual machine disks).
>> Apparently that bug was fixed recently, so latest versions should be
>> pretty stable yeah.
> 
> I'm using 3.8.9, so I suppose all known bugs have been fixed there (also the 
> one with adding briks)
> 
> I'll then proceed with some tests before going to production.

I am still asking myself how such a bug could happen in clustered storage 
software where adding bricks is a base feature of a scalable solution like 
Gluster. Or is it that STM releases are really under-tested compared to the 
LTM ones? Could we state that STM releases are really not made for production, 
or are at least really risky?

Sorry if the question sounds a bit rude, but I think it still matters for 
newish people who have to make a choice on which release is better for 
production ;-)

Cheers

Cédric

> 
> Thank you
> 
> ___
> Gluster-users mailing list
> Gluster-users@gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-users

___
Gluster-users mailing list
Gluster-users@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] Sharding?

2017-03-10 Thread Krutika Dhananjay
On Fri, Mar 10, 2017 at 3:03 PM, Alessandro Briosi  wrote:

> On 10/03/2017 10:28, Kevin Lemonnier wrote:
>
> I haven't done any test yet, but I was under the impression that
> sharding feature isn't so stable/mature yet.
> In the remote of my mind I remember reading something about a
> bug/situation which caused data corruption.
> Can someone confirm that sharding is stable enough to be used in
> production and won't cause any data loss?
>
> There were a few bugs yeah. I can tell you that in 3.7.15 (and I assume
> later versions) it works well as long as you don't try to add new bricks
> to your volumes (we use it in production for HA virtual machine disks).
> Apparently that bug was fixed recently, so latest versions should be
> pretty stable yeah.
>
>
> I'm using 3.8.9, so I suppose all known bugs have been fixed there (also
> the one with adding briks)
>

No. That one is out for review and yet to be merged.

... which again reminds me ...

Niels,

Care to merge the two patches?

https://review.gluster.org/#/c/16749/
https://review.gluster.org/#/c/16750/

-Krutika


> I'll then proceed with some tests before going to production.
>
> Thank you
>
>
> ___
> Gluster-users mailing list
> Gluster-users@gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-users
>
___
Gluster-users mailing list
Gluster-users@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] Sharding?

2017-03-10 Thread Kevin Lemonnier
> I'm using 3.8.9, so I suppose all known bugs have been fixed there (also
> the one with adding briks)

Can't comment on that, I just saw they fixed it, not sure in which version.
I'd wait for someone who knows to confirm that before going into production
if adding bricks is something you'll need !

-- 
Kevin Lemonnier
PGP Fingerprint : 89A5 2283 04A0 E6E9 0111


signature.asc
Description: Digital signature
___
Gluster-users mailing list
Gluster-users@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] Sharding?

2017-03-10 Thread Alessandro Briosi
On 10/03/2017 10:28, Kevin Lemonnier wrote:
>> I haven't done any test yet, but I was under the impression that
>> sharding feature isn't so stable/mature yet.
>> In the remote of my mind I remember reading something about a
>> bug/situation which caused data corruption.
>> Can someone confirm that sharding is stable enough to be used in
>> production and won't cause any data loss?
> There were a few bugs yeah. I can tell you that in 3.7.15 (and I assume
> later versions) it works well as long as you don't try to add new bricks
> to your volumes (we use it in production for HA virtual machine disks).
> Apparently that bug was fixed recently, so latest versions should be
> pretty stable yeah.

I'm using 3.8.9, so I suppose all known bugs have been fixed there (also
the one with adding bricks).

I'll then proceed with some tests before going to production.

Thank you

___
Gluster-users mailing list
Gluster-users@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] Sharding?

2017-03-10 Thread Kevin Lemonnier
> I haven't done any test yet, but I was under the impression that
> sharding feature isn't so stable/mature yet.
> In the remote of my mind I remember reading something about a
> bug/situation which caused data corruption.
> Can someone confirm that sharding is stable enough to be used in
> production and won't cause any data loss?

There were a few bugs yeah. I can tell you that in 3.7.15 (and I assume
later versions) it works well as long as you don't try to add new bricks
to your volumes (we use it in production for HA virtual machine disks).
Apparently that bug was fixed recently, so latest versions should be
pretty stable yeah.

-- 
Kevin Lemonnier
PGP Fingerprint : 89A5 2283 04A0 E6E9 0111


signature.asc
Description: Digital signature
___
Gluster-users mailing list
Gluster-users@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] Sharding?

2017-03-10 Thread Alessandro Briosi
On 09/03/2017 17:17, Vijay Bellur wrote:
>
>
> On Thu, Mar 9, 2017 at 11:10 AM, Kevin Lemonnier  > wrote:
>
> > I've seen the term sharding pop up on the list a number of times
> but I
> > haven't found any documentation or explanation of what it is.
> Would someone
> > please enlighten me?
>
> It's a way to split the files you put on the volume. With a shard
> size of 64 MB
> for example, the biggest file on the volume will be 64 MB. It's
> transparent
> when accessing the files though, you can still of course write
> your 2 TB file
> and access it as usual.
>
> It's usefull for things like healing (only the shard being headed
> is locked,
> and you have a lot less data to transfert) and for things like
> hosting a single
> huge file that would be bigger than one of your replicas.
>
> We use it for VM disks, as it decreases heal times a lot.
>
>
>
> Some more details on sharding can be found at [1].
>
>  


I haven't done any tests yet, but I was under the impression that the
sharding feature isn't so stable/mature yet.
In the back of my mind I remember reading something about a
bug/situation which caused data corruption.
Can someone confirm that sharding is stable enough to be used in
production and won't cause any data loss?

I'd really like to use it, as it would probably speed up healing a
lot (as I use it to store VM disks).

Thanks,
Alessandro
___
Gluster-users mailing list
Gluster-users@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] Sharding?

2017-03-09 Thread Laura Bailey
Hi folks,

This chapter on sharding and how to configure it went into the RHGS 3.1
Administration Guide some time ago:
https://access.redhat.com/documentation/en-US/Red_Hat_Storage/3.1/html/Administration_Guide/chap-Managing_Sharding.html

If there's anything in here that isn't clear, let me know so I can fix it.

It doesn't seem to show up if you search on the customer portal; I'll get
in touch with JP Sherman and see what we can do about that.

Cheers,
Laura B

On Fri, Mar 10, 2017 at 2:17 AM, Vijay Bellur  wrote:

>
>
> On Thu, Mar 9, 2017 at 11:10 AM, Kevin Lemonnier 
> wrote:
>
>> > I've seen the term sharding pop up on the list a number of times but I
>> > haven't found any documentation or explanation of what it is. Would
>> someone
>> > please enlighten me?
>>
>> It's a way to split the files you put on the volume. With a shard size of
>> 64 MB
>> for example, the biggest file on the volume will be 64 MB. It's
>> transparent
>> when accessing the files though, you can still of course write your 2 TB
>> file
>> and access it as usual.
>>
>> It's usefull for things like healing (only the shard being headed is
>> locked,
>> and you have a lot less data to transfert) and for things like hosting a
>> single
>> huge file that would be bigger than one of your replicas.
>>
>> We use it for VM disks, as it decreases heal times a lot.
>>
>>
>
> Some more details on sharding can be found at [1].
>
> Regards,
> Vijay
>
> [1] http://blog.gluster.org/2015/12/introducing-shard-translator/
>
> ___
> Gluster-users mailing list
> Gluster-users@gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-users
>



-- 
Laura Bailey
Senior Technical Writer
Customer Content Services BNE
___
Gluster-users mailing list
Gluster-users@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] Sharding?

2017-03-09 Thread Vijay Bellur
On Thu, Mar 9, 2017 at 11:10 AM, Kevin Lemonnier 
wrote:

> > I've seen the term sharding pop up on the list a number of times but I
> > haven't found any documentation or explanation of what it is. Would
> someone
> > please enlighten me?
>
> It's a way to split the files you put on the volume. With a shard size of
> 64 MB
> for example, the biggest file on the volume will be 64 MB. It's transparent
> when accessing the files though, you can still of course write your 2 TB
> file
> and access it as usual.
>
> It's usefull for things like healing (only the shard being headed is
> locked,
> and you have a lot less data to transfert) and for things like hosting a
> single
> huge file that would be bigger than one of your replicas.
>
> We use it for VM disks, as it decreases heal times a lot.
>
>

Some more details on sharding can be found at [1].

Regards,
Vijay

[1] http://blog.gluster.org/2015/12/introducing-shard-translator/
___
Gluster-users mailing list
Gluster-users@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] Sharding?

2017-03-09 Thread Kevin Lemonnier
> I've seen the term sharding pop up on the list a number of times but I
> haven't found any documentation or explanation of what it is. Would someone
> please enlighten me?

It's a way to split the files you put on the volume. With a shard size of 64 MB,
for example, the biggest file stored on any brick will be 64 MB. It's transparent
when accessing the files though; you can still of course write your 2 TB file
and access it as usual.

It's useful for things like healing (only the shard being healed is locked,
and you have a lot less data to transfer) and for things like hosting a single
huge file that would be bigger than one of your replicas.

We use it for VM disks, as it decreases heal times a lot.
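
Concretely, on the bricks a sharded file is stored roughly like this (the
brick path and file name below are made up):

   # the file keeps its normal path on the brick and holds the first 64 MB
   ls -lh /bricks/brick1/vol/images/big.img
   # the remaining pieces live as separate files named <gfid-of-the-file>.<N>
   # under the hidden .shard directory, each at most one shard-block-size long
   ls -lh /bricks/brick1/vol/.shard/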

-- 
Kevin Lemonnier
PGP Fingerprint : 89A5 2283 04A0 E6E9 0111


signature.asc
Description: Digital signature
___
Gluster-users mailing list
Gluster-users@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-users

[Gluster-users] Sharding?

2017-03-09 Thread Jake Davis
I've seen the term sharding pop up on the list a number of times but I
haven't found any documentation or explanation of what it is. Would someone
please enlighten me?

Many Thanks,
-Jake
___
Gluster-users mailing list
Gluster-users@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] Sharding - what next?

2015-12-16 Thread Lindsay Mathieson

On 16/12/15 22:59, Krutika Dhananjay wrote:
I guess I did not make myself clear. Apologies. I meant to say that 
printing a single list of counts aggregated
from all bricks can be tricky and is susceptible to the possibility of 
same entry getting counted multiple times
if the inode needs a heal on multiple bricks. Eliminating such 
duplicates would be rather difficult.


Or, we could have a sub-command of heal-info dump all the file 
paths/gfids that need heal from all bricks and
you could pipe the output to 'sort | uniq | wc -l' to eliminate 
duplicates. Would that be OK? :)



Sorry, my fault - I did understand that. Aggregate counts per brick 
would be fine, I have no desire to complicate things for the devs :)




--
Lindsay Mathieson

___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] Sharding - what next?

2015-12-16 Thread Krutika Dhananjay
- Original Message -

> From: "Lindsay Mathieson" 
> To: "Krutika Dhananjay" 
> Cc: "Gluster Devel" , "gluster-users"
> 
> Sent: Wednesday, December 16, 2015 6:56:03 AM
> Subject: Re: Sharding - what next?

> Hi, late reply again ...

> On 10/12/2015 5:33 PM, Krutika Dhananjay wrote:

> > There is a 'heal-info summary' command that is under review, written by
> > Mohammed Ashiq @ http://review.gluster.org/#/c/12154/3 which prints the
> > number of files that are yet to be healed.
> 
> > It could perhaps be enhanced to print files in split-brain and also files
> > which are possibly being healed. Note that these counts are printed per
> > brick.
> 
> > It does not print a single list of counts with aggregated values. Would
> > that
> > be something you would consider useful?
> 

> Very much so, that would be perfect.

> I can get close to this just with the following

> gluster volume heal datastore1 info | grep 'Brick\|Number'

> And if one is feeling fancy or just wants to keep an eye on progress

> watch "gluster volume heal datastore1 info | grep 'Brick\|Number'"

> though of course this runs afoul of the heal info delay.
I guess I did not make myself clear. Apologies. I meant to say that printing a 
single list of counts aggregated 
from all bricks can be tricky and is susceptible to the possibility of same 
entry getting counted multiple times 
if the inode needs a heal on multiple bricks. Eliminating such duplicates would 
be rather difficult. 

Or, we could have a sub-command of heal-info dump all the file paths/gfids that 
need heal from all bricks and 
you could pipe the output to 'sort | uniq | wc -l' to eliminate duplicates. 
Would that be OK? :) 
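
Something along those lines can already be approximated from the shell today;
a sketch (it assumes the entry lines heal-info prints start with either "/" or
"<gfid:"):

   gluster volume heal datastore1 info | grep -E '^(/|<gfid:)' | sort -u | wc -l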

-Krutika 

> > > Also, it would be great if the heal info command could return faster,
> > > sometimes it takes over a minute.
> > 
> 
> > Yeah, I think part of the problem could be eager-lock feature which is
> > causing the GlusterFS client process to not relinquish the network lock on
> > the file soon enough, causing the heal info utility to be blocked for
> > longer
> > duration.
> 
> > There is an enhancement Anuradha Talur is working on where heal-info would
> > do
> > away with taking locks altogether. Once that is in place, heal-info should
> > return faster.
> 

> Excellent, I look fwd to that. Even if removing the locks results in the
> occasional inaccurate cout, I don't think that would mattter - From my POV
> its an indicator, not a absolute.

> Thanks,
> --
> Lindsay Mathieson
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] Sharding - what next?

2015-12-15 Thread Lindsay Mathieson

Hi, late reply again ...

On 10/12/2015 5:33 PM, Krutika Dhananjay wrote:
There is a 'heal-info summary' command that is under review, written 
by Mohammed Ashiq @ http://review.gluster.org/#/c/12154/3 which prints 
the number of files that are yet to be healed.
It could perhaps be enhanced to print files in split-brain and also 
files which are possibly being healed. Note that these counts are 
printed per brick.
It does not print a single list of counts with aggregated values. 
Would that be something you would consider useful?


Very much so, that would be perfect.

I can get close to this just with the following

gluster volume heal datastore1 info | grep 'Brick\|Number'


And if one is feeling fancy or just wants to keep an eye on progress

watch "gluster volume heal datastore1 info | grep 'Brick\|Number'"

though of course this runs afoul of the heal info delay.






Also, it would be great if the heal info command could return
faster, sometimes it takes over a minute.

Yeah, I think part of the problem could be eager-lock feature which is 
causing the GlusterFS client process to not relinquish the network 
lock on the file soon enough, causing the heal info utility to be 
blocked for longer duration.
There is an enhancement Anuradha Talur is working on where heal-info 
would do away with taking locks altogether. Once that is in place, 
heal-info should return faster.





Excellent, I look forward to that. Even if removing the locks results in the 
occasional inaccurate count, I don't think that would matter - from my 
POV it's an indicator, not an absolute.


Thanks,

--
Lindsay Mathieson

___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] Sharding - what next?

2015-12-09 Thread Lindsay Mathieson
Hi Guys, sorry for the late reply, my attention tends to be somewhat 
sporadic due to work and the large number of rescue dogs/cats I care for :)


On 3/12/2015 8:34 PM, Krutika Dhananjay wrote:
We would love to hear from you on what you think of the feature and 
where it could be improved.

Specifically, the following are the questions we are seeking feedback on:
a) your experience testing sharding with VM store use-case - any bugs 
you ran into, any performance issues, etc


Testing was initially somewhat stressful as I regularly encountered file 
corruption. However I don't think that was due to bugs, but rather to incorrect 
settings for the VM use case. Once I got that sorted out it has been very 
stable - I have really stressed the failure modes we run into at work: 
nodes going down while heavy writes were happening, live migrations 
during heals, gluster software being killed while VMs were running on the 
host. So far it's held up without a hitch.


To that end, one thing I think should be made more obvious is the 
settings required for VM Hosting:


   quick-read=off
   read-ahead=off
   io-cache=off
   stat-prefetch=off
   eager-lock=enable
   remote-dio=enable
   quorum-type=auto
   server-quorum-type=server

They are quite crucial and very easy to miss in the online docs. And 
they are only listed as recommended, with no mention that you will corrupt KVM 
VMs if you live migrate them between gluster nodes without them set. 
Also the virt group is missing from the Debian packages.
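
For reference, a sketch of applying those settings by hand when the packaged
virt group file is not available (the volume name is the one used elsewhere in
this thread; the short names above map to the full option names below):

   gluster volume set datastore1 performance.quick-read off
   gluster volume set datastore1 performance.read-ahead off
   gluster volume set datastore1 performance.io-cache off
   gluster volume set datastore1 performance.stat-prefetch off
   gluster volume set datastore1 cluster.eager-lock enable
   gluster volume set datastore1 network.remote-dio enable
   gluster volume set datastore1 cluster.quorum-type auto
   gluster volume set datastore1 cluster.server-quorum-type server

   # or, where /var/lib/glusterd/groups/virt is present:
   gluster volume set datastore1 group virt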


Setting them does seem to have slowed sequential writes by about 10% but 
I need to test that more.



Something related - sharding is useful because it makes heals much more 
granular and hence faster. To that end it would be really useful if 
there was a heal info variant that gave an overview of the process - 
rather than listing the shards that are being healed, just an aggregate 
total, e.g.


$ gluster volume heal datastore1 status
volume datastore1
- split brain: 0
- Wounded:65
- healing:4

It gives one a easy feeling of progress - heals aren't happening faster, 
but it would feel that way :)



Also, it would be great if the heal info command could return faster, 
sometimes it takes over a minute.


Thanks for the great work,

Lindsay
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] Sharding - what next?

2015-12-09 Thread Krutika Dhananjay
- Original Message -

> From: "Lindsay Mathieson" 
> To: "Krutika Dhananjay" , "Gluster Devel"
> , "gluster-users" 
> Sent: Wednesday, December 9, 2015 6:48:40 PM
> Subject: Re: Sharding - what next?

> Hi Guys, sorry for the late reply, my attention tends to be somewhat sporadic
> due to work and the large number of rescue dogs/cats I care for :)

> On 3/12/2015 8:34 PM, Krutika Dhananjay wrote:

> > We would love to hear from you on what you think of the feature and where
> > it
> > could be improved.
> 
> > Specifically, the following are the questions we are seeking feedback on:
> 
> > a) your experience testing sharding with VM store use-case - any bugs you
> > ran
> > into, any performance issues, etc
> 

> Testing was initially somewhat stressful as I regularly encountered file
> corruption. However I don't think that was due to bugs, rather incorrect
> settings for the VM usecase. Once I got that sorted out it has been very
> stable - I have really stressed failure modes we run into at work - nodes
> going down while heavy writes were happening. Live migrations during heals.
> gluster software being killed while VM were running on the host. So far its
> held up without a hitch.

> To that end, one thing I think should be made more obvious is the settings
> required for VM Hosting:

> > quick-read=off
> 
> > read-ahead=off
> 
> > io-cache=off
> 
> > stat-prefetch=off
> 
> > eager-lock=enable
> 
> > remote-dio=enable
> 
> > quorum-type=auto
> 
> > server-quorum-type=server
> 

> They are quite crucial and very easy to miss in the online docs. And they are
> only recommended with noo mention that you will corrupt KVM VM's if you live
> migrate them between gluster nodes without them set. Also the virt group is
> missing from the debian packages.
Hi Lindsay, 
Thanks for the feedback. I will get in touch with Humble to find out what can 
be done about the docs. 

> Setting them does seem to have slowed sequential writes by about 10% but I
> need to test that more.

> Something related - sharding is useful because it makes heals much more
> granular and hence faster. To that end it would be really useful if there
> was a heal info variant that gave a overview of the process - rather than
> list the shards that are being healed, just a aggregate total, e.g.

> $ gluster volume heal datastore1 status
> volume datastore1
> - split brain: 0
> - Wounded:65
> - healing:4

> It gives one a easy feeling of progress - heals aren't happening faster, but
> it would feel that way :)
There is a 'heal-info summary' command that is under review, written by 
Mohammed Ashiq @ http://review.gluster.org/#/c/12154/3 which prints the number 
of files that are yet to be healed. 
It could perhaps be enhanced to print files in split-brain and also files which 
are possibly being healed. Note that these counts are printed per brick. 
It does not print a single list of counts with aggregated values. Would that be 
something you would consider useful? 

> Also, it would be great if the heal info command could return faster,
> sometimes it takes over a minute.
Yeah, I think part of the problem could be eager-lock feature which is causing 
the GlusterFS client process to not relinquish the network lock on the file 
soon enough, causing the heal info utility to be blocked for longer duration. 
There is an enhancement Anuradha Talur is working on where heal-info would do 
away with taking locks altogether. Once that is in place, heal-info should 
return faster. 

-Krutika 

> Thanks for the great work,

> Lindsay
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users

[Gluster-users] Sharding - what next?

2015-12-03 Thread Krutika Dhananjay
Hi, 

When we designed and wrote sharding feature in GlusterFS, our focus had been 
single-writer-to-large-files use cases, chief among these being the virtual 
machine image store use-case. 
Sharding, for the uninitiated, is a feature that was introduced in 
glusterfs-3.7.0 release with 'experimental' status. 
Here is some documentation that explains what it does at a high level: 
http://www.gluster.org/community/documentation/index.php/Features/sharding-xlator
 
https://gluster.readthedocs.org/en/release-3.7.0/Features/shard/ 

We have now reached that stage where the feature is considered stable for the 
VM store use case 
after several rounds of testing (thanks to Lindsay Mathieson, Paul Cuzner and 
Satheesaran Sundaramoorthi), 
bug fixing and reviews (thanks to Pranith Karampuri). Also in this regard, 
patches have been sent to make 
sharding work with geo-replication, thanks to Kotresh's efforts (testing still 
in progress). 

We would love to hear from you on what you think of the feature and where it 
could be improved. 
Specifically, the following are the questions we are seeking feedback on: 
a) your experience testing sharding with VM store use-case - any bugs you ran 
into, any performance issues, etc 
b) what are the other large-file use-cases (apart from the VM store workload) 
you know of or use, 
where you think having sharding capability will be useful. 

Based on your feedback we will start work on making sharding work in other 
workloads and/or with other existing GlusterFS features. 

Thanks, 
Krutika 


___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users