> I think the problem is in the temporary name that distcp gives to the
> file while it's being copied before renaming it to the real name. Do you
> know what is the structure of this name ?

Distcp's temporary file name format is:

.distcp.tmp.attempt_1460381790773_0248_m_000001_0

and the same temporary file name is reused by a single map process. For
example, I can see in the logs that one map copies the files part-m-00031,
part-m-00047 and part-m-00063 sequentially, and they all use the same
temporary file name above. So no part of the original file name appears in
the temporary file name.
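As an illustration (a minimal Python sketch, assuming DHT applies the regex Xavi quotes below to this name), this shows what the hashing actually sees for that temporary name:

```python
import re

# The rsync-hash-regex that DHT applies to file names, as quoted
# in Xavi's mail below.
RSYNC_REGEX = re.compile(r'^\.(.+)\.[^.]+$')

tmp = ".distcp.tmp.attempt_1460381790773_0248_m_000001_0"
m = RSYNC_REGEX.match(tmp)

# The greedy (.+) stops at the last dot, so DHT hashes the constant
# string 'distcp.tmp' for every file being copied: all of them map
# to the same subvolume.
print(m.group(1))  # -> distcp.tmp
```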
I will check if we can modify distcp's behaviour, or whether we have to
write our own mapreduce procedures instead of using distcp.

> 2. define the option 'extra-hash-regex' to an expression that matches
> your temporary file names and returns the same name that the file will
> finally have. Depending on the differences between original and
> temporary file names, this option could be useless.
>
> 3. set the option 'rsync-hash-regex' to 'none'. This will prevent the
> name conversion, so the files will be evenly distributed. However this
> will cause a lot of files to be placed in incorrect subvolumes, creating
> a lot of link files until a rebalance is executed.

How can I set these options?

On Thu, Apr 21, 2016 at 10:00 AM, Xavier Hernandez <[email protected]> wrote:
> Hi Serkan,
>
> I think the problem is in the temporary name that distcp gives to the
> file while it's being copied before renaming it to the real name. Do you
> know what is the structure of this name ?
>
> DHT selects the subvolume (in this case the ec set) on which the file
> will be stored based on the name of the file. This is a problem when a
> file is being renamed, because the rename could change the subvolume
> where the file should be found.
>
> DHT has a feature to avoid incorrect file placements when executing
> renames for the rsync case. What it does is check whether the file name
> matches the following regular expression:
>
> ^\.(.+)\.[^.]+$
>
> If a match is found, it only considers the part between the parentheses
> to calculate the destination subvolume.
>
> This is useful for rsync because temporary file names are constructed in
> the following way: suppose the original filename is 'test'. The
> temporary filename while rsync is being executed is made by prepending a
> dot and appending '.<random chars>': .test.712hd
>
> As you can see, the original name and the part of the name that matches
> the parenthesized group of the regular expression are the same.
> This means that, after renaming the temporary file to its original
> filename, both names will be considered by DHT to belong to the same
> subvolume.
>
> In your case it's very probable that distcp uses a temporary name like
> '.part.<number>'. In this case the portion of the name used to select
> the subvolume is always 'part'. This would explain why all files go to
> the same subvolume. Once a file is renamed to another name, DHT realizes
> that it should go to another subvolume. At this point it creates a link
> file (those files with access rights = '---------T') in the correct
> subvolume, but it doesn't move the data. As you can see, this kind of
> file is better balanced.
>
> To solve this problem you have three options:
>
> 1. change the temporary filename used by distcp to correctly match the
> regular expression. I'm not sure if this can be configured, but if it
> can, this is the best option.
>
> 2. define the option 'extra-hash-regex' to an expression that matches
> your temporary file names and returns the same name that the file will
> finally have. Depending on the differences between original and
> temporary file names, this option could be useless.
>
> 3. set the option 'rsync-hash-regex' to 'none'. This will prevent the
> name conversion, so the files will be evenly distributed. However this
> will cause a lot of files to be placed in incorrect subvolumes, creating
> a lot of link files until a rebalance is executed.
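As a short Python sketch of the two cases described above (the file names are the examples from the mail; the helper function is mine, for illustration only):

```python
import re

# DHT's rsync-hash-regex, from the explanation above.
RSYNC_REGEX = re.compile(r'^\.(.+)\.[^.]+$')

def hash_name(name):
    # Return the name DHT would use for subvolume selection.
    m = RSYNC_REGEX.match(name)
    return m.group(1) if m else name

# rsync temporary names hash like the final name, so the rename
# leaves the file on the subvolume it is already on:
print(hash_name(".test.712hd"))   # -> test
print(hash_name("test"))          # -> test

# A '.part.<number>' style temporary name always hashes as 'part',
# so every such file is first placed on the same subvolume:
print(hash_name(".part.00031"))   # -> part
print(hash_name(".part.00047"))   # -> part
```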
>
> Xavi
>
>
> On 20/04/16 14:13, Serkan Çoban wrote:
>>
>> Here are the steps I follow in detail, and the relevant output from the
>> bricks:
>>
>> I am using the command below for volume creation:
>> gluster volume create v0 disperse 20 redundancy 4 \
>> 1.1.1.{185..204}:/bricks/02 \
>> 1.1.1.{205..224}:/bricks/02 \
>> 1.1.1.{225..244}:/bricks/02 \
>> 1.1.1.{185..204}:/bricks/03 \
>> 1.1.1.{205..224}:/bricks/03 \
>> 1.1.1.{225..244}:/bricks/03 \
>> 1.1.1.{185..204}:/bricks/04 \
>> 1.1.1.{205..224}:/bricks/04 \
>> 1.1.1.{225..244}:/bricks/04 \
>> 1.1.1.{185..204}:/bricks/05 \
>> 1.1.1.{205..224}:/bricks/05 \
>> 1.1.1.{225..244}:/bricks/05 \
>> 1.1.1.{185..204}:/bricks/06 \
>> 1.1.1.{205..224}:/bricks/06 \
>> 1.1.1.{225..244}:/bricks/06 \
>> 1.1.1.{185..204}:/bricks/07 \
>> 1.1.1.{205..224}:/bricks/07 \
>> 1.1.1.{225..244}:/bricks/07 \
>> 1.1.1.{185..204}:/bricks/08 \
>> 1.1.1.{205..224}:/bricks/08 \
>> 1.1.1.{225..244}:/bricks/08 \
>> 1.1.1.{185..204}:/bricks/09 \
>> 1.1.1.{205..224}:/bricks/09 \
>> 1.1.1.{225..244}:/bricks/09 \
>> 1.1.1.{185..204}:/bricks/10 \
>> 1.1.1.{205..224}:/bricks/10 \
>> 1.1.1.{225..244}:/bricks/10 \
>> 1.1.1.{185..204}:/bricks/11 \
>> 1.1.1.{205..224}:/bricks/11 \
>> 1.1.1.{225..244}:/bricks/11 \
>> 1.1.1.{185..204}:/bricks/12 \
>> 1.1.1.{205..224}:/bricks/12 \
>> 1.1.1.{225..244}:/bricks/12 \
>> 1.1.1.{185..204}:/bricks/13 \
>> 1.1.1.{205..224}:/bricks/13 \
>> 1.1.1.{225..244}:/bricks/13 \
>> 1.1.1.{185..204}:/bricks/14 \
>> 1.1.1.{205..224}:/bricks/14 \
>> 1.1.1.{225..244}:/bricks/14 \
>> 1.1.1.{185..204}:/bricks/15 \
>> 1.1.1.{205..224}:/bricks/15 \
>> 1.1.1.{225..244}:/bricks/15 \
>> 1.1.1.{185..204}:/bricks/16 \
>> 1.1.1.{205..224}:/bricks/16 \
>> 1.1.1.{225..244}:/bricks/16 \
>> 1.1.1.{185..204}:/bricks/17 \
>> 1.1.1.{205..224}:/bricks/17 \
>> 1.1.1.{225..244}:/bricks/17 \
>> 1.1.1.{185..204}:/bricks/18 \
>> 1.1.1.{205..224}:/bricks/18 \
>> 1.1.1.{225..244}:/bricks/18 \
>> 1.1.1.{185..204}:/bricks/19 \
>> 1.1.1.{205..224}:/bricks/19 \
>> 1.1.1.{225..244}:/bricks/19 \
>> 1.1.1.{185..204}:/bricks/20 \
>> 1.1.1.{205..224}:/bricks/20 \
>> 1.1.1.{225..244}:/bricks/20 \
>> 1.1.1.{185..204}:/bricks/21 \
>> 1.1.1.{205..224}:/bricks/21 \
>> 1.1.1.{225..244}:/bricks/21 \
>> 1.1.1.{185..204}:/bricks/22 \
>> 1.1.1.{205..224}:/bricks/22 \
>> 1.1.1.{225..244}:/bricks/22 \
>> 1.1.1.{185..204}:/bricks/23 \
>> 1.1.1.{205..224}:/bricks/23 \
>> 1.1.1.{225..244}:/bricks/23 \
>> 1.1.1.{185..204}:/bricks/24 \
>> 1.1.1.{205..224}:/bricks/24 \
>> 1.1.1.{225..244}:/bricks/24 \
>> 1.1.1.{185..204}:/bricks/25 \
>> 1.1.1.{205..224}:/bricks/25 \
>> 1.1.1.{225..244}:/bricks/25 \
>> 1.1.1.{185..204}:/bricks/26 \
>> 1.1.1.{205..224}:/bricks/26 \
>> 1.1.1.{225..244}:/bricks/26 \
>> 1.1.1.{185..204}:/bricks/27 \
>> 1.1.1.{205..224}:/bricks/27 \
>> 1.1.1.{225..244}:/bricks/27 force
>>
>> Then I mount the volume on 50 clients:
>> mount -t glusterfs 1.1.1.185:/v0 /mnt/gluster
>>
>> Then I create a directory from one of the clients and chmod it:
>> mkdir /mnt/gluster/s1 && chmod 777 /mnt/gluster/s1
>>
>> Then I start distcp on the clients. There are 1059 x 8.8GB files in one
>> folder, and they will be copied to /mnt/gluster/s1 with 100-way
>> parallelism, which means 2 copy jobs per client at the same time:
>> hadoop distcp -m 100 http://nn1:8020/path/to/teragen-10tb file:///mnt/gluster/s1
>>
>> After the job finished, here is the status of the s1 directory on the
>> bricks:
>> The s1 directory is present on all 1560 bricks.
>> The s1/teragen-10tb folder is present on all 1560 bricks.
>>
>> Full listing of files on the bricks:
>> https://www.dropbox.com/s/rbgdxmrtwz8oya8/teragen_list.zip?dl=0
>>
>> You can ignore the .crc files in the brick output above, they are
>> checksum files...
>>
>> As you can see, the part-m-xxxx files were written to only some bricks
>> on nodes 0205..0224.
>> All bricks have some files, but they have zero size.
>>
>> I increased the file descriptor limit to 65k, so that is not the
>> issue...
>>
>> On Wed, Apr 20, 2016 at 9:34 AM, Xavier Hernandez <[email protected]>
>> wrote:
>>>
>>> Hi Serkan,
>>>
>>> On 19/04/16 15:16, Serkan Çoban wrote:
>>>>>>>
>>>>>>> I assume that gluster is used to store the intermediate files
>>>>>>> before the reduce phase
>>>>
>>>> Nope, gluster is the destination of the distcp command:
>>>> hadoop distcp -m 50 http://nn1:8020/path/to/folder file:///mnt/gluster
>>>> This runs maps on datanodes, all of which have /mnt/gluster mounted.
>>>
>>> I don't know hadoop, so I'm of little help here. However, it seems
>>> that -m 50 means to execute 50 copies in parallel. This means that
>>> even if the distribution worked fine, at most 50 (probably fewer) of
>>> the 78 ec sets would be used in parallel.
>>>
>>>>>>> This means that this is caused by some peculiarity of the
>>>>>>> mapreduce.
>>>>
>>>> Yes, but how can a client write 500 files to the gluster mount and
>>>> have those files written to only a subset of the subvolumes? I cannot
>>>> use gluster as a backup cluster if I cannot write with distcp.
>>>
>>> All 500 files were created on only one of the 78 ec sets and the
>>> remaining 77 stayed empty?
>>>
>>>>>>> You should look at which files are created in each brick, and how
>>>>>>> many, while the process is running.
>>>>
>>>> Files were only created on nodes 185..204 or 205..224 or 225..244.
>>>> Only on 20 nodes in each test.
>>>
>>> How many files were there in each brick?
>>>
>>> Not sure if this can be related, but standard linux distributions have
>>> a default limit of 1024 open file descriptors. With such a big volume
>>> and a massive copy, maybe this limit is affecting something?
>>>
>>> Are there any error or warning messages in the mount or brick logs?
>>>
>>> Xavi
>>>
>>>> On Tue, Apr 19, 2016 at 1:05 PM, Xavier Hernandez
>>>> <[email protected]> wrote:
>>>>>
>>>>> Hi Serkan,
>>>>>
>>>>> moved to gluster-users since this doesn't belong on the devel list.
>>>>>
>>>>> On 19/04/16 11:24, Serkan Çoban wrote:
>>>>>>
>>>>>> I am copying 10.000 files to the gluster volume using mapreduce on
>>>>>> the clients. Each map process takes one file at a time and copies
>>>>>> it to the gluster volume.
>>>>>
>>>>> I assume that gluster is used to store the intermediate files before
>>>>> the reduce phase.
>>>>>
>>>>>> My disperse volume consists of 78 subvolumes of 16+4 disks each. So
>>>>>> if I copy >78 files in parallel, I expect each file to go to a
>>>>>> different subvolume, right?
>>>>>
>>>>> If you only copy 78 files, most probably you will get some
>>>>> subvolumes empty and some others with more than one or two files.
>>>>> It's not an exact distribution, it's a statistically balanced
>>>>> distribution: over time and with enough files, each brick will
>>>>> contain an amount of files in the same order of magnitude, but they
>>>>> won't have the *same* number of files.
>>>>>
>>>>>> In my tests with fio I can see that every file goes to a different
>>>>>> subvolume, but when I start the mapreduce process from the clients,
>>>>>> only 78/3=26 subvolumes are used for writing files.
>>>>>
>>>>> This means that this is caused by some peculiarity of the mapreduce.
>>>>>
>>>>>> I can see that clearly from the network traffic. Mapreduce on the
>>>>>> client side can be run multi-threaded. I tested with 1-5-10 threads
>>>>>> on each client, but every time only 26 subvolumes were used.
>>>>>> How can I debug the issue further?
>>>>>
>>>>> You should look at which files are created in each brick, and how
>>>>> many, while the process is running.
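The "statistically balanced" distribution described above can be sketched in Python. This uses md5 as a stand-in for DHT's real name hash (an assumption; the exact placement differs, but the statistics behave the same way):

```python
import hashlib
from collections import Counter

SUBVOLUMES = 78  # 78 ec sets, as in this volume

def pick_subvolume(name):
    # md5 here is only a stand-in for DHT's actual name hash; any
    # uniform hash of the file name shows the same statistics.
    return int(hashlib.md5(name.encode()).hexdigest(), 16) % SUBVOLUMES

# With only 78 files, some ec sets stay empty and others get
# several files:
few = Counter(pick_subvolume("part-m-%05d" % i) for i in range(78))

# With many files, every ec set is used and the per-set counts are
# in the same order of magnitude, but not identical:
many = Counter(pick_subvolume("part-m-%05d" % i) for i in range(7800))

print(len(few), len(many))
```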
>>>>>
>>>>> Xavi
>>>>>
>>>>>> On Tue, Apr 19, 2016 at 11:22 AM, Xavier Hernandez
>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>> Hi Serkan,
>>>>>>>
>>>>>>> On 19/04/16 09:18, Serkan Çoban wrote:
>>>>>>>>
>>>>>>>> Hi, I just reinstalled a fresh 3.7.11 and I am seeing the same
>>>>>>>> behavior. 50 clients are copying part-0-xxxx named files using
>>>>>>>> mapreduce to gluster, using one thread per server, and they are
>>>>>>>> using only 20 servers out of 60. On the other hand, fio tests use
>>>>>>>> all the servers. Is there anything I can do to solve the issue?
>>>>>>>
>>>>>>> Distribution of files to ec sets is done by DHT. In theory, if you
>>>>>>> create many files, each ec set will receive the same amount of
>>>>>>> files. However, when the number of files is small enough,
>>>>>>> statistics can fail.
>>>>>>>
>>>>>>> Not sure what you are doing exactly, but a mapreduce procedure
>>>>>>> generally only creates a single output. In that case it makes
>>>>>>> sense that only one ec set is used. If you want to use all ec sets
>>>>>>> for a single file, you should enable sharding (I haven't tested
>>>>>>> that) or split the result into multiple files.
>>>>>>>
>>>>>>> Xavi
>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Serkan
>>>>>>>>
>>>>>>>> ---------- Forwarded message ----------
>>>>>>>> From: Serkan Çoban <[email protected]>
>>>>>>>> Date: Mon, Apr 18, 2016 at 2:39 PM
>>>>>>>> Subject: disperse volume file to subvolume mapping
>>>>>>>> To: Gluster Users <[email protected]>
>>>>>>>>
>>>>>>>> Hi, I have a problem where clients are using only 1/3 of the
>>>>>>>> nodes in a disperse volume for writing.
>>>>>>>> I am testing from 50 clients using 1 to 10 threads with file
>>>>>>>> names part-0-xxxx.
>>>>>>>> What I see is that clients only use 20 nodes for writing.
>>>>>>>> How is the file name to subvolume hashing done? Is it related to
>>>>>>>> the file names being similar?
>>>>>>>>
>>>>>>>> My cluster is 3.7.10 with 60 nodes, each with 26 disks. The
>>>>>>>> disperse volume is 78 x (16+4). Only 26 out of 78 subvolumes are
>>>>>>>> used during writes..
>>>>>>>>
_______________________________________________
Gluster-users mailing list
[email protected]
http://www.gluster.org/mailman/listinfo/gluster-users
