Hi George,

You can try your tests in two scenarios:
1. Switch off the following performance translators:
   a. write-behind
   b. quick-read
   c. readdir-ahead
2. Once you see that the issue goes away in scenario 1, apply the write-behind fix [1] and turn write-behind back on.

(Example volume-set commands for both scenarios are sketched further down in this mail.)

We also have a fix for quick-read at [2], so you can additionally test with the quick-read fix applied and quick-read turned on. However, since your issue is a stale stat, I doubt quick-read is the culprit here (the issue quick-read causes is stale data, not a stale stat).

We see that readdir-ahead is not updating/flushing the stats stored along with dentries when writes or other file modifications happen; that seems to be the issue with readdir-ahead. For now, please keep readdir-ahead off during both of the above scenarios.

[1] http://review.gluster.org/15757/2
[2] http://review.gluster.org/#/c/15759/2

regards,
Raghavendra

On Tue, Nov 1, 2016 at 7:55 AM, Pranith Kumar Karampuri <pkara...@redhat.com > wrote: > > > On Tue, Nov 1, 2016 at 7:32 AM, Lian, George (Nokia - CN/Hangzhou) < > george.l...@nokia.com> wrote: > >> Hi, >> >> >> >> I will test it with your patches and update to you when I have result. >> > > hi George, > Please use http://review.gluster.org/#/c/15757/2 i.e. second > version of Raghavendra's patch. I tested it and it worked fine. We are > still trying to figure out quick-read and readdir-ahead as I type this mail. > > >> >> Thanks a lots >> >> >> >> Best Regards, >> >> George >> >> >> >> *From:* Pranith Kumar Karampuri [mailto:pkara...@redhat.com] >> *Sent:* Monday, October 31, 2016 11:23 AM >> *To:* Lian, George (Nokia - CN/Hangzhou) <george.l...@nokia.com> >> *Cc:* Raghavendra Gowdappa <rgowd...@redhat.com>; Zhang, Bingxuan (Nokia >> - CN/Hangzhou) <bingxuan.zh...@nokia.com>; Gluster-devel@gluster.org; >> Zizka, Jan (Nokia - CZ/Prague) <jan.zi...@nokia.com> >> >> *Subject:* Re: [Gluster-devel] Issue about the size of fstat is less >> than the really size of the syslog file >> >> >> >> Removing i_ext_mbb_wcdma_swd3_da1_mat...@internal.nsn.com, it is causing >> mail delivery problems for me. >> >> George, >> >> Raghavendra and I made some progress on this issue. We were in >> parallel working on another issue which is similar where elastic search >> indices are getting corrupted because of wrong stat sizes in our opinion. >> So I have been running different translator stacks in identifying the >> problematic xlators which are leading to indices corruption. >> >> We found the list to be 1) Write-behind, 2) Quick-read, 3) >> Readdir-ahead. Raghavendra and I just had a chat and we are suspecting that >> lack of lookup/readdirp implementation in write-behind could be the reason >> for this problem. Similar problems may exist in other two xlators too. But >> we are working on write-behind with priority. >> >> Our theory is this: >> >> If we do a 4KB write for example and it is cached in write-behind and we >> do a lookup on the file/do a readdirp on the directory with this file we >> send out wrong stat value to the kernel. There are different caches between >> kernel and gluster which may lead to fstat never coming till write-behind. >> So we need to make sure that we don't get into this situation. >> >> Action items: >> >> At the moment Raghavendra is working on a patch to implement >> lookup/readdirp in write-behind. I am going to test the same for elastic >> search. Will it be possible for you to test your application against the >> same patch and confirm that the patch fixes the problem?
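Coming back to the two scenarios at the top of this mail: the translators can be toggled per volume with "gluster volume set". This is only a rough sketch, assuming the volume is named "log" as elsewhere in this thread; substitute your own volume name:

    # Scenario 1: switch the three performance translators off
    gluster volume set log performance.write-behind off
    gluster volume set log performance.quick-read off
    gluster volume set log performance.readdir-ahead off

    # Scenario 2: after applying the write-behind patch [1],
    # turn only write-behind back on (leave readdir-ahead off)
    gluster volume set log performance.write-behind on

Quick-read can be re-enabled the same way (performance.quick-read on) once the fix at [2] is applied, if you want to test that combination as well.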
>> >> >> >> On Fri, Oct 28, 2016 at 12:08 PM, Pranith Kumar Karampuri < >> pkara...@redhat.com> wrote: >> >> hi George, >> >> It would help if we can identify the bare minimum xlators which >> are contributing to the issue like Raghavendra was mentioning earlier. We >> were wondering if it is possible for you to help us in identifying the >> issue by running the workload on a modified setup? We can suggest testing >> out using custom volfiles so that we can slowly build the graph which could >> be causing this issue. We would like you guys to try out this problem with >> just posix-xlator and fuse and nothing else. >> >> >> >> On Thu, Oct 27, 2016 at 1:40 PM, Lian, George (Nokia - CN/Hangzhou) < >> george.l...@nokia.com> wrote: >> >> Hi, Raghavendra, >> >> Could you please give some suggestion for this issue? we try to find the >> clue for this issue for a long time, but it has no progress:( >> >> Thanks & Best Regards, >> George >> >> -----Original Message----- >> From: Lian, George (Nokia - CN/Hangzhou) >> Sent: Wednesday, October 19, 2016 4:40 PM >> To: 'Raghavendra Gowdappa' <rgowd...@redhat.com> >> Cc: Gluster-devel@gluster.org; I_EXT_MBB_WCDMA_SWD3_DA1_MATRIX_GMS < >> i_ext_mbb_wcdma_swd3_da1_mat...@internal.nsn.com>; Zhang, Bingxuan >> (Nokia - CN/Hangzhou) <bingxuan.zh...@nokia.com>; Zizka, Jan (Nokia - >> CZ/Prague) <jan.zi...@nokia.com> >> Subject: RE: [Gluster-devel] Issue about the size of fstat is less than >> the really size of the syslog file >> >> Hi, Raghavendra >> >> Just now, we test it with glusterfs log with debug-level "TRACE", and let >> some application trigger "glusterfs" produce large log, in that case, when >> we set write-behind and stat-prefetch both OFF, >> Tail the glusterfs log such like mnt-{VOLUME-NAME}.log, it still failed >> with "file truncated", >> >> So that means if file's IO in huge amount, the issue will still be there >> even write-behind and stat-prefetch both OFF. >> >> Best Regards, >> George >> >> -----Original Message----- >> From: Raghavendra Gowdappa [mailto:rgowd...@redhat.com] >> >> Sent: Wednesday, October 19, 2016 2:54 PM >> To: Lian, George (Nokia - CN/Hangzhou) <george.l...@nokia.com> >> Cc: Gluster-devel@gluster.org; I_EXT_MBB_WCDMA_SWD3_DA1_MATRIX_GMS < >> i_ext_mbb_wcdma_swd3_da1_mat...@internal.nsn.com>; Zhang, Bingxuan >> (Nokia - CN/Hangzhou) <bingxuan.zh...@nokia.com>; Zizka, Jan (Nokia - >> CZ/Prague) <jan.zi...@nokia.com> >> Subject: Re: [Gluster-devel] Issue about the size of fstat is less than >> the really size of the syslog file >> >> >> >> ----- Original Message ----- >> > From: "George Lian (Nokia - CN/Hangzhou)" <george.l...@nokia.com> >> > To: "Raghavendra Gowdappa" <rgowd...@redhat.com> >> > Cc: Gluster-devel@gluster.org, "I_EXT_MBB_WCDMA_SWD3_DA1_MATRIX_GMS" >> > <i_ext_mbb_wcdma_swd3_da1_mat...@internal.nsn.com>, "Bingxuan Zhang >> (Nokia - CN/Hangzhou)" >> > <bingxuan.zh...@nokia.com>, "Jan Zizka (Nokia - CZ/Prague)" < >> jan.zi...@nokia.com> >> > Sent: Wednesday, October 19, 2016 12:05:01 PM >> > Subject: RE: [Gluster-devel] Issue about the size of fstat is less than >> the really size of the syslog file >> > >> > Hi, Raghavendra, >> > >> > Thanks a lots for your quickly update! >> > In my case, there are so many process(write) is writing to the syslog >> file, >> > it do involve the writer is in the same host and writing in same mount >> point >> > while the tail(reader) is reading it. 
>> > >> > The bug I just guess is: >> > When a writer write the data with write-behind, it call the call-back >> > function " mdc_writev_cbk" and called "mdc_inode_iatt_set_validate" to >> > validate the "iatt" data, but with the code I mentioned last mail, it do >> > nothing. >> >> mdc_inode_iatt_set_validate has following code >> >> <snippet> >> if (!iatt || !iatt->ia_ctime) { >> mdc->ia_time = 0; >> goto unlock; >> } >> </snippet> >> >> Which means a NULL iatt sets mdc->ia_time to 0. This results in >> subsequent lookup/stat calls to be NOT served from md-cache. Instead, the >> stat is served from backend bricks. So, I don't see an issue here. >> >> However, one case where a NULL iatt is different from a valid iatt (which >> differs from the value stored in md-cache) is that the latter results in a >> call to inode_invalidate. This invalidation propagates to kernel and all >> dentry and page cache corresponding to file is purged. So, I am suspecting >> whether the stale stat you saw was served from kernel cache (not from >> glusterfs). If this is the case, having mount options "attribute-timeout=0" >> and "entry-timeout=0" should've helped. >> >> I am still at loss to point out the RCA for this issue. >> >> >> > And in same time, the reader(tail) read the "iatt" data, but in case of >> the >> > cache-time is not timeout, it will return the "iatt" data without the >> last >> > change. >> > >> > Do your think it is a possible bug? >> > >> > Thanks & Best Regards, >> > George >> > >> > -----Original Message----- >> > From: Raghavendra Gowdappa [mailto:rgowd...@redhat.com] >> > Sent: Wednesday, October 19, 2016 2:06 PM >> > To: Lian, George (Nokia - CN/Hangzhou) <george.l...@nokia.com> >> > Cc: Gluster-devel@gluster.org; I_EXT_MBB_WCDMA_SWD3_DA1_MATRIX_GMS >> > <i_ext_mbb_wcdma_swd3_da1_mat...@internal.nsn.com>; Zhang, Bingxuan >> (Nokia - >> > CN/Hangzhou) <bingxuan.zh...@nokia.com>; Zizka, Jan (Nokia - CZ/Prague) >> > <jan.zi...@nokia.com> >> > Subject: Re: [Gluster-devel] Issue about the size of fstat is less than >> the >> > really size of the syslog file >> > >> > >> > >> > ----- Original Message ----- >> > > From: "George Lian (Nokia - CN/Hangzhou)" <george.l...@nokia.com> >> > > To: "Raghavendra Gowdappa" <rgowd...@redhat.com> >> > > Cc: Gluster-devel@gluster.org, "I_EXT_MBB_WCDMA_SWD3_DA1_MATRIX_GMS" >> > > <i_ext_mbb_wcdma_swd3_da1_mat...@internal.nsn.com>, "Bingxuan Zhang >> (Nokia >> > > - CN/Hangzhou)" >> > > <bingxuan.zh...@nokia.com>, "Jan Zizka (Nokia - CZ/Prague)" >> > > <jan.zi...@nokia.com> >> > > Sent: Wednesday, October 19, 2016 10:51:24 AM >> > > Subject: RE: [Gluster-devel] Issue about the size of fstat is less >> than the >> > > really size of the syslog file >> > > >> > > Hi, Raghavendra, >> > > >> > > When we disable md-cache(gluster volume set log >> > > performance.md-cache-timeout >> > > 0), the issue seems gone. >> > > (we can't disable with " gluster volume set log performance.md-cache >> off" >> > > why?) >> > >> > Please use >> > #gluster volume set log performance.stat-prefetch off >> > >> > > >> > > So I double confuse that the code I abstract in last mail maybe have >> some >> > > issue for this case. >> > > Could you please share your comments? >> > >> > Please find my comments below. 
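As a quick reference for the two checks discussed above, disabling md-cache and ruling out the kernel attribute cache could look roughly like this (a sketch only; "log" is again the volume name used in this thread, and <server> and /mnt/log are placeholders):

    # md-cache is controlled by the stat-prefetch volume option
    gluster volume set log performance.stat-prefetch off

    # Mount with kernel attribute/entry caching disabled, so a stale stat
    # cannot be served from the kernel's own cache instead of from glusterfs
    mount -t glusterfs -o attribute-timeout=0,entry-timeout=0 <server>:/log /mnt/log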
>> > >> > > >> > > Thanks & Best Regards, >> > > George >> > > >> > > -----Original Message----- >> > > From: Lian, George (Nokia - CN/Hangzhou) >> > > Sent: Friday, October 14, 2016 1:44 PM >> > > To: 'Raghavendra Gowdappa' <rgowd...@redhat.com> >> > > Cc: Gluster-devel@gluster.org; I_EXT_MBB_WCDMA_SWD3_DA1_MATRIX_GMS >> > > <i_ext_mbb_wcdma_swd3_da1_mat...@internal.nsn.com>; Zhang, Bingxuan >> (Nokia >> > > - >> > > CN/Hangzhou) <bingxuan.zh...@nokia.com>; Zizka, Jan (Nokia - >> CZ/Prague) >> > > <jan.zi...@nokia.com> >> > > Subject: RE: [Gluster-devel] Issue about the size of fstat is less >> than the >> > > really size of the syslog file >> > > >> > > Hi, Raghavendra, >> > > >> > > Our version of GlusterFS is 3.6.9, and I also check the newest code >> of main >> > > branch, the function of " mdc_inode_iatt_set_validate" is almost >> same, from >> > > the following code of this function, >> > > We could see a "TODO" comments inline, does it mean if we enhance >> > > write-behind feature, the "iatt" field in callback will be NULL, so >> that >> > > inode_invalidate will not be called? So the size of file will not >> update >> > > since "write behind" enabled ? >> > > Is it the root cause for "tail" application failed with "file >> truncated" >> > > issue ? >> > > >> > > LOCK (&mdc->lock); >> > > { >> > > if (!iatt || !iatt->ia_ctime) { >> > > mdc->ia_time = 0; >> > > goto unlock; >> > > } >> > > >> > > /* >> > > * Invalidate the inode if the mtime or ctime has changed >> > > * and the prebuf doesn't match the value we have cached. >> > > * TODO: writev returns with a NULL iatt due to >> > > * performance/write-behind, causing invalidation on >> writes. >> > > */ >> > >> > The issue explained in this comment is hit only when writes are done. >> But, in >> > your use-case only "tail" is the application running on the mount (If I >> am >> > not wrong, the writer is running on a different mountpoint). So, I >> doubt >> > you are hitting this issue. But, you are saying that the issue goes away >> > when write-behind/md-cache is turned off pointing to some interaction >> > between md-cache and write-behind causing the issue. I need more time to >> > look into this issue. Can you file a bug on this? 
>> > >> > > if (IA_ISREG(inode->ia_type) && >> > > ((iatt->ia_mtime != mdc->md_mtime) || >> > > (iatt->ia_ctime != mdc->md_ctime))) >> > > if (!prebuf || (prebuf->ia_ctime != >> mdc->md_ctime) || >> > > (prebuf->ia_mtime != mdc->md_mtime)) >> > > inode_invalidate(inode); >> > > >> > > mdc_from_iatt (mdc, iatt); >> > > >> > > time (&mdc->ia_time); >> > > } >> > > >> > > Best Regards, >> > > George >> > > -----Original Message----- >> > > From: Raghavendra Gowdappa [mailto:rgowd...@redhat.com] >> > > Sent: Thursday, October 13, 2016 8:58 PM >> > > To: Lian, George (Nokia - CN/Hangzhou) <george.l...@nokia.com> >> > > Cc: Gluster-devel@gluster.org; I_EXT_MBB_WCDMA_SWD3_DA1_MATRIX_GMS >> > > <i_ext_mbb_wcdma_swd3_da1_mat...@internal.nsn.com>; Zhang, Bingxuan >> (Nokia >> > > - >> > > CN/Hangzhou) <bingxuan.zh...@nokia.com>; Zizka, Jan (Nokia - >> CZ/Prague) >> > > <jan.zi...@nokia.com> >> > > Subject: Re: [Gluster-devel] Issue about the size of fstat is less >> than the >> > > really size of the syslog file >> > > >> > > >> > > >> > > ----- Original Message ----- >> > > > From: "George Lian (Nokia - CN/Hangzhou)" <george.l...@nokia.com> >> > > > To: Gluster-devel@gluster.org >> > > > Cc: "I_EXT_MBB_WCDMA_SWD3_DA1_MATRIX_GMS" >> > > > <i_ext_mbb_wcdma_swd3_da1_mat...@internal.nsn.com>, "Bingxuan Zhang >> > > > (Nokia >> > > > - CN/Hangzhou)" <bingxuan.zh...@nokia.com>, "Jan Zizka (Nokia - >> > > > CZ/Prague)" >> > > > <jan.zi...@nokia.com> >> > > > Sent: Thursday, October 13, 2016 2:33:53 PM >> > > > Subject: [Gluster-devel] Issue about the size of fstat is less than >> the >> > > > really size of the syslog file >> > > > >> > > > Hi, Dear Expert, >> > > > We have use glusterfs as a network filesystem, and syslog store in >> there, >> > > > some clients on different host may write the syslog file via >> “glusterfs” >> > > > mount point. >> > > > Now we encounter an issue when we “tail” the syslog file, it will >> > > > occasional >> > > > failed with error “ file truncated ” >> > > > As we study and trace with the “tail” source code, it failed with >> the >> > > > following code: >> > > > if ( S_ISREG (mode) && stats.st_size < f[i].size ) >> > > > { >> > > > error (0, 0, _("%s: file truncated"), quotef (name)); >> > > > /* Assume the file was truncated to 0, >> > > > and therefore output all "new" data. */ >> > > > xlseek (fd, 0, SEEK_SET, name); >> > > > f[i].size = 0; >> > > > } >> > > > When stats.st_size < f[i].size, what mean the size report by fstat >> is >> > > > less >> > > > than “tail” had read, it lead to “file truncated”, we also use >> “strace” >> > > > tools to trace the tail application, the related tail strace log as >> the >> > > > below: >> > > > nanosleep({1, 0}, NULL) = 0 >> > > > fstat(3, {st_mode=S_IFREG|0644, st_size=192543105, ...}) = 0 >> > > > nanosleep({1, 0}, NULL) = 0 >> > > > fstat(3, {st_mode=S_IFREG|0644, st_size=192543105, ...}) = 0 >> > > > nanosleep({1, 0}, NULL) = 0 >> > > > fstat(3, {st_mode=S_IFREG|0644, st_size=192543105, ...}) = 0 >> > > > nanosleep({1, 0}, NULL) = 0 >> > > > fstat(3, {st_mode=S_IFREG|0644, st_size=192544549, ...}) = 0 >> > > > read(3, " Data … -"..., 8192) = 1444 >> > > > read(3, " Data.. 
"..., 8192) = 720 >> > > > read(3, "", 8192) = 0 >> > > > fstat(3, {st_mode=S_IFREG|0644, st_size=192544789, ...}) = 0 >> > > > write(1, “DATA…..” ) = 2164 >> > > > write(2, "tail: ", 6tail: ) = 6 >> > > > write(2, "/mnt/log/master/syslog: file tru"..., >> 38/mnt/log/master/syslog: >> > > > file truncated) = 38 >> > > > as the above strace log, tail has read 1444+720=2164 bytes, >> > > > but fstat tell “tail” 192544789 – 192543105 = 1664 which less than >> 2164, >> > > > so >> > > > it lead to “tail” application “file truncated”. >> > > > And if we turn off “write-behind” feature, the issue will not be >> > > > reproduced >> > > > any more. >> > > >> > > That seems strange. There are no writes happening on the fd/inode >> through >> > > which tail is reading/stating from. So, it seems strange that >> write-behind >> > > is involved here. I suspect whether any of >> md-cache/read-ahead/io-cache is >> > > causing the issue. Can you, >> > > >> > > 1. Turn off md-cache, read-ahead, io-cache xlators >> > > 2. mount glusterfs with --attribute-timeout=0 >> > > 3. set write-behind on >> > > >> > > and rerun the tests? If you don't hit the issue, you can experiment by >> > > turning on/off of md-cache, read-ahead and io-cache translators and >> see >> > > what >> > > are the minimal number of xlators that need to be turned off to not >> hit the >> > > issue (with write-behind on)? >> > > >> > > regards, >> > > Raghavendra >> > > >> > > > So we think it may be related to cache consistence issue due to >> > > > performance >> > > > consider, but we still have concern that: >> > > > The syslog file is used only with “Append” mode, so the size of file >> > > > shouldn’t be reduced, when a client read the file, why “fstat” can’t >> > > > return >> > > > the really size match to the cache? >> > > > From current investigation, we doubt that the current implement of >> > > > “glusterfs” has a bug on “fstat” when cache is on. >> > > > Your comments is our highly appreciated! >> > > > Thanks & Best Regards >> > > > George >> > > > >> > > > _______________________________________________ >> > > > Gluster-devel mailing list >> > > > Gluster-devel@gluster.org >> > > > http://www.gluster.org/mailman/listinfo/gluster-devel >> > > >> > >> _______________________________________________ >> Gluster-devel mailing list >> Gluster-devel@gluster.org >> http://www.gluster.org/mailman/listinfo/gluster-devel >> >> >> >> -- >> >> Pranith >> >> >> >> >> -- >> >> Pranith >> > > > > -- > Pranith > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel@gluster.org > http://www.gluster.org/mailman/listinfo/gluster-devel > -- Raghavendra G
_______________________________________________ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
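One further sketch, for Pranith's earlier suggestion (Oct 28, quoted above) of testing with just the posix xlator and fuse: a hand-written single-xlator volfile can be mounted directly. The paths below are hypothetical, and this assumes a local-only test where the brick directory is on the same host as the mount:

    # Hypothetical minimal volfile: storage/posix is the only xlator,
    # with the fuse mount process sitting directly on top of it.
    cat > /tmp/posix-only.vol <<'EOF'
    volume log-posix
        type storage/posix
        option directory /bricks/log-test
    end-volume
    EOF

    # Mount the volfile and re-run the writer/tail workload against it
    glusterfs -f /tmp/posix-only.vol /mnt/log-test

If the stale fstat size does not reproduce on this graph, translators (write-behind, md-cache, readdir-ahead, quick-read) can be layered back in one at a time to find the smallest combination that triggers it, which is what the custom-volfile approach suggested in the thread is aiming at.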