> I think the problem is in the temporary name that distcp gives to the
> file while it's being copied before renaming it to the real name. Do you
> know what is the structure of this name ?

Distcp's temporary file name format is:

.distcp.tmp.attempt_1460381790773_0248_m_000001_0

and the same temporary file name is reused by a single map process. For
example, I can see in the logs that one map copies the files part-m-00031,
part-m-00047 and part-m-00063 sequentially, and they all use the same
temporary file name above. So no part of the original file name appears in
the temporary file name.
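As an illustration (a minimal Python sketch, assuming DHT applies the regex Xavi quotes below to this name), this shows what the hashing actually sees for that temporary name:

```python
import re

# The rsync-hash-regex that DHT applies to file names, as quoted
# in Xavi's mail below.
RSYNC_REGEX = re.compile(r'^\.(.+)\.[^.]+$')

tmp = ".distcp.tmp.attempt_1460381790773_0248_m_000001_0"
m = RSYNC_REGEX.match(tmp)

# The greedy (.+) stops at the last dot, so DHT hashes the constant
# string 'distcp.tmp' for every file being copied: all of them map
# to the same subvolume.
print(m.group(1))  # -> distcp.tmp
```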
I will check if we can modify distcp's behaviour, or whether we have to
write our own mapreduce procedures instead of using distcp.

> 2. define the option 'extra-hash-regex' to an expression that matches
> your temporary file names and returns the same name that the file will
> finally have. Depending on the differences between original and
> temporary file names, this option could be useless.
>
> 3. set the option 'rsync-hash-regex' to 'none'. This will prevent the
> name conversion, so the files will be evenly distributed. However this
> will cause a lot of files to be placed in incorrect subvolumes, creating
> a lot of link files until a rebalance is executed.

How can I set these options?

On Thu, Apr 21, 2016 at 10:00 AM, Xavier Hernandez <[email protected]> wrote:
> Hi Serkan,
>
> I think the problem is in the temporary name that distcp gives to the
> file while it's being copied before renaming it to the real name. Do you
> know what is the structure of this name ?
>
> DHT selects the subvolume (in this case the ec set) on which the file
> will be stored based on the name of the file. This is a problem when a
> file is being renamed, because the rename could change the subvolume
> where the file should be found.
>
> DHT has a feature to avoid incorrect file placements when executing
> renames for the rsync case. What it does is check whether the file name
> matches the following regular expression:
>
> ^\.(.+)\.[^.]+$
>
> If a match is found, it only considers the part between the parentheses
> to calculate the destination subvolume.
>
> This is useful for rsync because temporary file names are constructed in
> the following way: suppose the original filename is 'test'. The
> temporary filename while rsync is being executed is made by prepending a
> dot and appending '.<random chars>': .test.712hd
>
> As you can see, the original name and the part of the name that matches
> the parenthesized group of the regular expression are the same.
> This means that, after renaming the temporary file to its original
> filename, both names will be considered by DHT to belong to the same
> subvolume.
>
> In your case it's very probable that distcp uses a temporary name like
> '.part.<number>'. In this case the portion of the name used to select
> the subvolume is always 'part'. This would explain why all files go to
> the same subvolume. Once a file is renamed to another name, DHT realizes
> that it should go to another subvolume. At this point it creates a link
> file (those files with access rights = '---------T') in the correct
> subvolume, but it doesn't move the data. As you can see, this kind of
> file is better balanced.
>
> To solve this problem you have three options:
>
> 1. change the temporary filename used by distcp to correctly match the
> regular expression. I'm not sure if this can be configured, but if it
> can, this is the best option.
>
> 2. define the option 'extra-hash-regex' to an expression that matches
> your temporary file names and returns the same name that the file will
> finally have. Depending on the differences between original and
> temporary file names, this option could be useless.
>
> 3. set the option 'rsync-hash-regex' to 'none'. This will prevent the
> name conversion, so the files will be evenly distributed. However this
> will cause a lot of files to be placed in incorrect subvolumes, creating
> a lot of link files until a rebalance is executed.
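As a short Python sketch of the two cases described above (the file names are the examples from the mail; the helper function is mine, for illustration only):

```python
import re

# DHT's rsync-hash-regex, from the explanation above.
RSYNC_REGEX = re.compile(r'^\.(.+)\.[^.]+$')

def hash_name(name):
    # Return the name DHT would use for subvolume selection.
    m = RSYNC_REGEX.match(name)
    return m.group(1) if m else name

# rsync temporary names hash like the final name, so the rename
# leaves the file on the subvolume it is already on:
print(hash_name(".test.712hd"))   # -> test
print(hash_name("test"))          # -> test

# A '.part.<number>' style temporary name always hashes as 'part',
# so every such file is first placed on the same subvolume:
print(hash_name(".part.00031"))   # -> part
print(hash_name(".part.00047"))   # -> part
```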
>
> Xavi
>
>
> On 20/04/16 14:13, Serkan Çoban wrote:
>>
>> Here are the steps I follow in detail, and the relevant output from the
>> bricks:
>>
>> I am using the command below for volume creation:
>> gluster volume create v0 disperse 20 redundancy 4 \
>> 1.1.1.{185..204}:/bricks/02 \
>> 1.1.1.{205..224}:/bricks/02 \
>> 1.1.1.{225..244}:/bricks/02 \
>> 1.1.1.{185..204}:/bricks/03 \
>> 1.1.1.{205..224}:/bricks/03 \
>> 1.1.1.{225..244}:/bricks/03 \
>> 1.1.1.{185..204}:/bricks/04 \
>> 1.1.1.{205..224}:/bricks/04 \
>> 1.1.1.{225..244}:/bricks/04 \
>> 1.1.1.{185..204}:/bricks/05 \
>> 1.1.1.{205..224}:/bricks/05 \
>> 1.1.1.{225..244}:/bricks/05 \
>> 1.1.1.{185..204}:/bricks/06 \
>> 1.1.1.{205..224}:/bricks/06 \
>> 1.1.1.{225..244}:/bricks/06 \
>> 1.1.1.{185..204}:/bricks/07 \
>> 1.1.1.{205..224}:/bricks/07 \
>> 1.1.1.{225..244}:/bricks/07 \
>> 1.1.1.{185..204}:/bricks/08 \
>> 1.1.1.{205..224}:/bricks/08 \
>> 1.1.1.{225..244}:/bricks/08 \
>> 1.1.1.{185..204}:/bricks/09 \
>> 1.1.1.{205..224}:/bricks/09 \
>> 1.1.1.{225..244}:/bricks/09 \
>> 1.1.1.{185..204}:/bricks/10 \
>> 1.1.1.{205..224}:/bricks/10 \
>> 1.1.1.{225..244}:/bricks/10 \
>> 1.1.1.{185..204}:/bricks/11 \
>> 1.1.1.{205..224}:/bricks/11 \
>> 1.1.1.{225..244}:/bricks/11 \
>> 1.1.1.{185..204}:/bricks/12 \
>> 1.1.1.{205..224}:/bricks/12 \
>> 1.1.1.{225..244}:/bricks/12 \
>> 1.1.1.{185..204}:/bricks/13 \
>> 1.1.1.{205..224}:/bricks/13 \
>> 1.1.1.{225..244}:/bricks/13 \
>> 1.1.1.{185..204}:/bricks/14 \
>> 1.1.1.{205..224}:/bricks/14 \
>> 1.1.1.{225..244}:/bricks/14 \
>> 1.1.1.{185..204}:/bricks/15 \
>> 1.1.1.{205..224}:/bricks/15 \
>> 1.1.1.{225..244}:/bricks/15 \
>> 1.1.1.{185..204}:/bricks/16 \
>> 1.1.1.{205..224}:/bricks/16 \
>> 1.1.1.{225..244}:/bricks/16 \
>> 1.1.1.{185..204}:/bricks/17 \
>> 1.1.1.{205..224}:/bricks/17 \
>> 1.1.1.{225..244}:/bricks/17 \
>> 1.1.1.{185..204}:/bricks/18 \
>> 1.1.1.{205..224}:/bricks/18 \
>> 1.1.1.{225..244}:/bricks/18 \
>> 1.1.1.{185..204}:/bricks/19 \
>> 1.1.1.{205..224}:/bricks/19 \
>> 1.1.1.{225..244}:/bricks/19 \
>> 1.1.1.{185..204}:/bricks/20 \
>> 1.1.1.{205..224}:/bricks/20 \
>> 1.1.1.{225..244}:/bricks/20 \
>> 1.1.1.{185..204}:/bricks/21 \
>> 1.1.1.{205..224}:/bricks/21 \
>> 1.1.1.{225..244}:/bricks/21 \
>> 1.1.1.{185..204}:/bricks/22 \
>> 1.1.1.{205..224}:/bricks/22 \
>> 1.1.1.{225..244}:/bricks/22 \
>> 1.1.1.{185..204}:/bricks/23 \
>> 1.1.1.{205..224}:/bricks/23 \
>> 1.1.1.{225..244}:/bricks/23 \
>> 1.1.1.{185..204}:/bricks/24 \
>> 1.1.1.{205..224}:/bricks/24 \
>> 1.1.1.{225..244}:/bricks/24 \
>> 1.1.1.{185..204}:/bricks/25 \
>> 1.1.1.{205..224}:/bricks/25 \
>> 1.1.1.{225..244}:/bricks/25 \
>> 1.1.1.{185..204}:/bricks/26 \
>> 1.1.1.{205..224}:/bricks/26 \
>> 1.1.1.{225..244}:/bricks/26 \
>> 1.1.1.{185..204}:/bricks/27 \
>> 1.1.1.{205..224}:/bricks/27 \
>> 1.1.1.{225..244}:/bricks/27 force
>>
>> Then I mount the volume on 50 clients:
>> mount -t glusterfs 1.1.1.185:/v0 /mnt/gluster
>>
>> Then I create a directory from one of the clients and chmod it:
>> mkdir /mnt/gluster/s1 && chmod 777 /mnt/gluster/s1
>>
>> Then I start distcp on the clients. There are 1059 x 8.8GB files in one
>> folder, and they will be copied to /mnt/gluster/s1 with 100-way
>> parallelism, which means 2 copy jobs per client at the same time:
>> hadoop distcp -m 100 http://nn1:8020/path/to/teragen-10tb file:///mnt/gluster/s1
>>
>> After the job finished, here is the status of the s1 directory on the
>> bricks:
>> The s1 directory is present on all 1560 bricks.
>> The s1/teragen-10tb folder is present on all 1560 bricks.
>>
>> Full listing of files on the bricks:
>> https://www.dropbox.com/s/rbgdxmrtwz8oya8/teragen_list.zip?dl=0
>>
>> You can ignore the .crc files in the brick output above, they are
>> checksum files...
>>
>> As you can see, the part-m-xxxx files were written to only some bricks
>> on nodes 0205..0224.
>> All bricks have some files, but they have zero size.
>>
>> I increased the file descriptor limit to 65k, so that is not the
>> issue...
>>
>> On Wed, Apr 20, 2016 at 9:34 AM, Xavier Hernandez <[email protected]>
>> wrote:
>>>
>>> Hi Serkan,
>>>
>>> On 19/04/16 15:16, Serkan Çoban wrote:
>>>>>>>
>>>>>>> I assume that gluster is used to store the intermediate files
>>>>>>> before the reduce phase
>>>>
>>>> Nope, gluster is the destination of the distcp command:
>>>> hadoop distcp -m 50 http://nn1:8020/path/to/folder file:///mnt/gluster
>>>> This runs maps on datanodes, all of which have /mnt/gluster mounted.
>>>
>>> I don't know hadoop, so I'm of little help here. However, it seems
>>> that -m 50 means to execute 50 copies in parallel. This means that
>>> even if the distribution worked fine, at most 50 (probably fewer) of
>>> the 78 ec sets would be used in parallel.
>>>
>>>>>>> This means that this is caused by some peculiarity of the
>>>>>>> mapreduce.
>>>>
>>>> Yes, but how can a client write 500 files to the gluster mount and
>>>> have those files written to only a subset of the subvolumes? I cannot
>>>> use gluster as a backup cluster if I cannot write with distcp.
>>>
>>> All 500 files were created on only one of the 78 ec sets and the
>>> remaining 77 stayed empty?
>>>
>>>>>>> You should look at which files are created in each brick, and how
>>>>>>> many, while the process is running.
>>>>
>>>> Files were only created on nodes 185..204 or 205..224 or 225..244.
>>>> Only on 20 nodes in each test.
>>>
>>> How many files were there in each brick?
>>>
>>> Not sure if this can be related, but standard linux distributions have
>>> a default limit of 1024 open file descriptors. With such a big volume
>>> and a massive copy, maybe this limit is affecting something?
>>>
>>> Are there any error or warning messages in the mount or brick logs?
>>>
>>> Xavi
>>>
>>>> On Tue, Apr 19, 2016 at 1:05 PM, Xavier Hernandez
>>>> <[email protected]> wrote:
>>>>>
>>>>> Hi Serkan,
>>>>>
>>>>> moved to gluster-users since this doesn't belong on the devel list.
>>>>>
>>>>> On 19/04/16 11:24, Serkan Çoban wrote:
>>>>>>
>>>>>> I am copying 10.000 files to the gluster volume using mapreduce on
>>>>>> the clients. Each map process takes one file at a time and copies
>>>>>> it to the gluster volume.
>>>>>
>>>>> I assume that gluster is used to store the intermediate files before
>>>>> the reduce phase.
>>>>>
>>>>>> My disperse volume consists of 78 subvolumes of 16+4 disks each. So
>>>>>> if I copy >78 files in parallel, I expect each file to go to a
>>>>>> different subvolume, right?
>>>>>
>>>>> If you only copy 78 files, most probably you will get some
>>>>> subvolumes empty and some others with more than one or two files.
>>>>> It's not an exact distribution, it's a statistically balanced
>>>>> distribution: over time and with enough files, each brick will
>>>>> contain an amount of files in the same order of magnitude, but they
>>>>> won't have the *same* number of files.
>>>>>
>>>>>> In my tests with fio I can see that every file goes to a different
>>>>>> subvolume, but when I start the mapreduce process from the clients,
>>>>>> only 78/3=26 subvolumes are used for writing files.
>>>>>
>>>>> This means that this is caused by some peculiarity of the mapreduce.
>>>>>
>>>>>> I can see that clearly from the network traffic. Mapreduce on the
>>>>>> client side can be run multi-threaded. I tested with 1-5-10 threads
>>>>>> on each client, but every time only 26 subvolumes were used.
>>>>>> How can I debug the issue further?
>>>>>
>>>>> You should look at which files are created in each brick, and how
>>>>> many, while the process is running.
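The "statistically balanced" distribution described above can be sketched in Python. This uses md5 as a stand-in for DHT's real name hash (an assumption; the exact placement differs, but the statistics behave the same way):

```python
import hashlib
from collections import Counter

SUBVOLUMES = 78  # 78 ec sets, as in this volume

def pick_subvolume(name):
    # md5 here is only a stand-in for DHT's actual name hash; any
    # uniform hash of the file name shows the same statistics.
    return int(hashlib.md5(name.encode()).hexdigest(), 16) % SUBVOLUMES

# With only 78 files, some ec sets stay empty and others get
# several files:
few = Counter(pick_subvolume("part-m-%05d" % i) for i in range(78))

# With many files, every ec set is used and the per-set counts are
# in the same order of magnitude, but not identical:
many = Counter(pick_subvolume("part-m-%05d" % i) for i in range(7800))

print(len(few), len(many))
```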
>>>>>
>>>>> Xavi
>>>>>
>>>>>> On Tue, Apr 19, 2016 at 11:22 AM, Xavier Hernandez
>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>> Hi Serkan,
>>>>>>>
>>>>>>> On 19/04/16 09:18, Serkan Çoban wrote:
>>>>>>>>
>>>>>>>> Hi, I just reinstalled a fresh 3.7.11 and I am seeing the same
>>>>>>>> behavior. 50 clients are copying part-0-xxxx named files using
>>>>>>>> mapreduce to gluster, using one thread per server, and they are
>>>>>>>> using only 20 servers out of 60. On the other hand, fio tests use
>>>>>>>> all the servers. Is there anything I can do to solve the issue?
>>>>>>>
>>>>>>> Distribution of files to ec sets is done by DHT. In theory, if you
>>>>>>> create many files, each ec set will receive the same amount of
>>>>>>> files. However, when the number of files is small enough,
>>>>>>> statistics can fail.
>>>>>>>
>>>>>>> Not sure what you are doing exactly, but a mapreduce procedure
>>>>>>> generally only creates a single output. In that case it makes
>>>>>>> sense that only one ec set is used. If you want to use all ec sets
>>>>>>> for a single file, you should enable sharding (I haven't tested
>>>>>>> that) or split the result into multiple files.
>>>>>>>
>>>>>>> Xavi
>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Serkan
>>>>>>>>
>>>>>>>> ---------- Forwarded message ----------
>>>>>>>> From: Serkan Çoban <[email protected]>
>>>>>>>> Date: Mon, Apr 18, 2016 at 2:39 PM
>>>>>>>> Subject: disperse volume file to subvolume mapping
>>>>>>>> To: Gluster Users <[email protected]>
>>>>>>>>
>>>>>>>> Hi, I have a problem where clients are using only 1/3 of the
>>>>>>>> nodes in a disperse volume for writing.
>>>>>>>> I am testing from 50 clients using 1 to 10 threads with file
>>>>>>>> names part-0-xxxx.
>>>>>>>> What I see is that clients only use 20 nodes for writing.
>>>>>>>> How is the file name to subvolume hashing done? Is it related to
>>>>>>>> the file names being similar?
>>>>>>>>
>>>>>>>> My cluster is 3.7.10 with 60 nodes, each with 26 disks. The
>>>>>>>> disperse volume is 78 x (16+4). Only 26 out of 78 subvolumes are
>>>>>>>> used during writes..
>>>>>>>>
_______________________________________________
Gluster-users mailing list
[email protected]
http://www.gluster.org/mailman/listinfo/gluster-users
