Amar,

Thank you for helping me troubleshoot the issues. I don't have the resources to test the software at this point, but I will keep it in mind.
Regards,
Dmitry

On Tue, Jan 22, 2019 at 1:02 AM Amar Tumballi Suryanarayan <[email protected]> wrote:

> Dmitry,
>
> Thanks for the detailed updates on this thread. Let us know how your 'production' setup is running. For a much smoother next upgrade, we request you to help out with some early testing of glusterfs-6 RC builds, which are expected to be out by the first week of February.
>
> Also, if it is possible for you to automate the tests, it would be great to have them in our regression suite, so we can always be sure your setup will never break in future releases.
>
> Regards,
> Amar
>
> On Mon, Jan 7, 2019 at 11:42 PM Dmitry Isakbayev <[email protected]> wrote:
>
>> This system is going into production. I will try to replicate this problem on the next installation.
>>
>> On Wed, Jan 2, 2019 at 9:25 PM Raghavendra Gowdappa <[email protected]> wrote:
>>
>>> On Wed, Jan 2, 2019 at 9:59 PM Dmitry Isakbayev <[email protected]> wrote:
>>>
>>>> Still no JVM crashes. Is it possible that running glusterfs with performance options turned off for a couple of days cleared out the "stale metadata issue"?
>>>
>>> Restarting these options would've cleared the existing cache, and hence the previous stale metadata would've been cleared. Hitting stale metadata again depends on races. That might be the reason you are still not seeing the issue. Can you try with all perf xlators enabled (the default configuration)?
>>>
>>>> On Mon, Dec 31, 2018 at 1:38 PM Dmitry Isakbayev <[email protected]> wrote:
>>>>
>>>>> The software ran with all of the options turned off over the weekend without any problems.
>>>>> I will try to collect the debug info for you. I have re-enabled the three options, but have yet to see the problem reoccur.
>>>>>
>>>>> On Sat, Dec 29, 2018 at 6:46 PM Raghavendra Gowdappa <[email protected]> wrote:
>>>>>
>>>>>> Thanks Dmitry. Can you provide the following debug info I asked for earlier:
>>>>>>
>>>>>> * strace -ff -v ... of the java application
>>>>>> * dump of the I/O traffic seen by the mountpoint (use --dump-fuse while mounting).
>>>>>>
>>>>>> Regards,
>>>>>> Raghavendra
>>>>>>
>>>>>> On Sat, Dec 29, 2018 at 2:08 AM Dmitry Isakbayev <[email protected]> wrote:
>>>>>>
>>>>>>> These 3 options seem to trigger both problems (reading the zip file and renaming files).
>>>>>>>
>>>>>>> Options Reconfigured:
>>>>>>> performance.io-cache: off
>>>>>>> performance.stat-prefetch: off
>>>>>>> performance.quick-read: off
>>>>>>> performance.parallel-readdir: off
>>>>>>> *performance.readdir-ahead: on*
>>>>>>> *performance.write-behind: on*
>>>>>>> *performance.read-ahead: on*
>>>>>>> performance.client-io-threads: off
>>>>>>> nfs.disable: on
>>>>>>> transport.address-family: inet
>>>>>>>
>>>>>>> On Fri, Dec 28, 2018 at 10:24 AM Dmitry Isakbayev <[email protected]> wrote:
>>>>>>>
>>>>>>>> Turning a single option on at a time still worked fine. I will keep trying.
>>>>>>>>
>>>>>>>> We had used 4.1.5 on KVM/CentOS 7.5 at AWS without these issues or log messages. Do you suppose these issues are triggered by the new environment, or did they not exist in 4.1.5?
>>>>>>>>
>>>>>>>> [root@node1 ~]# glusterfs --version
>>>>>>>> glusterfs 4.1.5
>>>>>>>>
>>>>>>>> On AWS, using
>>>>>>>> [root@node1 ~]# hostnamectl
>>>>>>>> Static hostname: node1
>>>>>>>> Icon name: computer-vm
>>>>>>>> Chassis: vm
>>>>>>>> Machine ID: b30d0f2110ac3807b210c19ede3ce88f
>>>>>>>> Boot ID: 52bb159a0aa94043a40e7c7651967bd9
>>>>>>>> Virtualization: kvm
>>>>>>>> Operating System: CentOS Linux 7 (Core)
>>>>>>>> CPE OS Name: cpe:/o:centos:centos:7
>>>>>>>> Kernel: Linux 3.10.0-862.3.2.el7.x86_64
>>>>>>>> Architecture: x86-64
>>>>>>>>
>>>>>>>> On Fri, Dec 28, 2018 at 8:56 AM Raghavendra Gowdappa <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> On Fri, Dec 28, 2018 at 7:23 PM Dmitry Isakbayev <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Ok. I will try different options.
>>>>>>>>>>
>>>>>>>>>> This system is scheduled to go into production soon. What version would you recommend rolling back to?
>>>>>>>>>
>>>>>>>>> These are long-standing issues, so rolling back may not make them go away. Instead, if the performance is agreeable to you, please keep these xlators off in production.
>>>>>>>>>
>>>>>>>>>> On Thu, Dec 27, 2018 at 10:55 PM Raghavendra Gowdappa <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> On Fri, Dec 28, 2018 at 3:13 AM Dmitry Isakbayev <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Raghavendra,
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for the suggestion.
>>>>>>>>>>>>
>>>>>>>>>>>> I am using
>>>>>>>>>>>>
>>>>>>>>>>>> [root@jl-fanexoss1p glusterfs]# gluster --version
>>>>>>>>>>>> glusterfs 5.0
>>>>>>>>>>>>
>>>>>>>>>>>> On
>>>>>>>>>>>> [root@jl-fanexoss1p glusterfs]# hostnamectl
>>>>>>>>>>>> Icon name: computer-vm
>>>>>>>>>>>> Chassis: vm
>>>>>>>>>>>> Machine ID: e44b8478ef7a467d98363614f4e50535
>>>>>>>>>>>> Boot ID: eed98992fdda4c88bdd459a89101766b
>>>>>>>>>>>> Virtualization: vmware
>>>>>>>>>>>> Operating System: Red Hat Enterprise Linux Server 7.5 (Maipo)
>>>>>>>>>>>> CPE OS Name: cpe:/o:redhat:enterprise_linux:7.5:GA:server
>>>>>>>>>>>> Kernel: Linux 3.10.0-862.14.4.el7.x86_64
>>>>>>>>>>>> Architecture: x86-64
>>>>>>>>>>>>
>>>>>>>>>>>> I have configured the following options:
>>>>>>>>>>>>
>>>>>>>>>>>> [root@jl-fanexoss1p glusterfs]# gluster volume info
>>>>>>>>>>>> Volume Name: gv0
>>>>>>>>>>>> Type: Replicate
>>>>>>>>>>>> Volume ID: 5ffbda09-c5e2-4abc-b89e-79b5d8a40824
>>>>>>>>>>>> Status: Started
>>>>>>>>>>>> Snapshot Count: 0
>>>>>>>>>>>> Number of Bricks: 1 x 3 = 3
>>>>>>>>>>>> Transport-type: tcp
>>>>>>>>>>>> Bricks:
>>>>>>>>>>>> Brick1: jl-fanexoss1p.cspire.net:/data/brick1/gv0
>>>>>>>>>>>> Brick2: sl-fanexoss2p.cspire.net:/data/brick1/gv0
>>>>>>>>>>>> Brick3: nxquorum1p.cspire.net:/data/brick1/gv0
>>>>>>>>>>>> Options Reconfigured:
>>>>>>>>>>>> performance.io-cache: off
>>>>>>>>>>>> performance.stat-prefetch: off
>>>>>>>>>>>> performance.quick-read: off
>>>>>>>>>>>> performance.parallel-readdir: off
>>>>>>>>>>>> performance.readdir-ahead: off
>>>>>>>>>>>> performance.write-behind: off
>>>>>>>>>>>> performance.read-ahead: off
>>>>>>>>>>>> performance.client-io-threads: off
>>>>>>>>>>>> nfs.disable: on
>>>>>>>>>>>> transport.address-family: inet
>>>>>>>>>>>>
>>>>>>>>>>>> I don't know if it is related, but I am seeing a lot of
>>>>>>>>>>>> [2018-12-27 20:19:23.776080] W [MSGID: 114031] [client-rpc-fops_v2.c:1932:client4_0_seek_cbk] 2-gv0-client-0: remote operation failed [No such device or address]
>>>>>>>>>>>> [2018-12-27 20:19:47.735190] E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch handler
>>>>>>>>>>>
>>>>>>>>>>> These messages were introduced by patch [1]. To the best of my knowledge they are benign. We'll be sending a patch to fix these messages, though.
>>>>>>>>>>>
>>>>>>>>>>> +Mohit Agrawal <[email protected]> +Milind Changire <[email protected]>. Can you try to identify why we are seeing these messages? If possible, please send a patch to fix this.
>>>>>>>>>>>
>>>>>>>>>>> [1] https://review.gluster.org/r/I578c3fc67713f4234bd3abbec5d3fbba19059ea5
>>>>>>>>>>>
>>>>>>>>>>>> And java.io exceptions trying to rename files.
>>>>>>>>>>>
>>>>>>>>>>> When you see the errors, is it possible to collect:
>>>>>>>>>>> * strace of the java application (strace -ff -v ...)
>>>>>>>>>>> * fuse-dump of the glusterfs mount (use option --dump-fuse while mounting)?
>>>>>>>>>>>
>>>>>>>>>>> I also need another favour from you. By trial and error, can you point out which of the many performance xlators you've turned off is causing the issue?
>>>>>>>>>>>
>>>>>>>>>>> The above two data points will help us fix the problem.
>>>>>>>>>>>
>>>>>>>>>>>> Thank you,
>>>>>>>>>>>> Dmitry
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Dec 27, 2018 at 3:48 PM Raghavendra Gowdappa <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> What version of glusterfs are you using? It might be either
>>>>>>>>>>>>> * a stale metadata issue, or
>>>>>>>>>>>>> * an inconsistent ctime issue.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Can you try turning off all performance xlators? If the issue is the first one, that should help.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Dec 28, 2018 at 1:51 AM Dmitry Isakbayev <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Attempted to set `performance.read-ahead off` according to https://jira.apache.org/jira/browse/AMQ-7041
>>>>>>>>>>>>>> That did not help.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Dec 24, 2018 at 2:11 PM Dmitry Isakbayev <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The core file generated by the JVM suggests that it happens because the file is changing while it is being read: https://bugs.java.com/bugdatabase/view_bug.do?bug_id=8186557.
>>>>>>>>>>>>>>> The application reads in the zip file and goes through the zip entries, then reloads the file and goes through the zip entries again. It does so 3 times. The application never crashes on the 1st cycle but sometimes crashes on the 2nd or 3rd cycle.
>>>>>>>>>>>>>>> The zip file is generated about 20 seconds prior to being used and is not updated or even used by any other application. I have never seen this problem on a plain file system.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I would appreciate any suggestions on how to go about debugging this issue. I can change the source code of the Java application. (See the sketch after the quoted thread below.)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>> Dmitry
>
> --
> Amar Tumballi (amarts)
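For reference, the zip-read pattern described in the Dec 24 message above (open the archive, walk its entries, repeat three times) boils down to roughly the following minimal sketch. It assumes the application uses java.util.zip.ZipFile; the class name, argument handling, and buffer size below are hypothetical and not taken from the actual application code.

import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class ZipReadLoop {
    public static void main(String[] args) throws IOException {
        File archive = new File(args[0]);
        // The application re-opens and re-reads the same archive three times;
        // the crash reportedly never happens on the first pass.
        for (int pass = 1; pass <= 3; pass++) {
            try (ZipFile zip = new ZipFile(archive)) {
                Enumeration<? extends ZipEntry> entries = zip.entries();
                while (entries.hasMoreElements()) {
                    ZipEntry entry = entries.nextElement();
                    // Read each entry fully and discard the bytes.
                    try (InputStream in = zip.getInputStream(entry)) {
                        byte[] buf = new byte[8192];
                        while (in.read(buf) != -1) {
                            // keep reading
                        }
                    }
                }
            }
            System.out.println("pass " + pass + " completed");
        }
    }
}

Run it against a zip file on the glusterfs mount, e.g. java ZipReadLoop /mnt/gv0/some.zip (path hypothetical). If JDK-8186557 is the failure mode, a ZipException or JVM crash on the 2nd or 3rd pass would suggest the file appears to change between opens, which would be consistent with the stale-metadata suspicion discussed in the thread.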
_______________________________________________
Gluster-users mailing list
[email protected]
https://lists.gluster.org/mailman/listinfo/gluster-users
