[Gluster-devel] Hello, I have a question about the erasure code translator, hope someone can give me some advice, thank you!
Hi, I am a storage software developer who is interested in Gluster, and I am trying to improve its read/write performance.

I noticed that Gluster uses a Vandermonde matrix in the erasure-code encoding and decoding process. However, it is quite complicated to generate the inverse of a Vandermonde matrix, which is necessary for decoding; the cost is O(n³). Using a Cauchy matrix instead greatly cuts the cost of finding an inverse matrix, down to O(n²). I used the Intel storage acceleration library (ISA-L) to replace the original EC encode/decode part of Gluster, and it reduced the encode and decode time to about 50% of the original.

However, when I tested the whole system, the read/write performance was almost the same as with the original Gluster. I tested on three machines as servers, each with two bricks, both SSDs, so six bricks in total, two of which are used as coding bricks; that is a 4+2 disperse volume configuration. The network cards are 10 Gbps, so in theory the network can sustain reads and writes faster than 1000 MB/s. The measured read throughput is about 492 MB/s and the write throughput about 336 MB/s, while the original Gluster reads at 461 MB/s and writes at 322 MB/s.

Can someone give me advice on how to improve the performance? If the EC translator is not the critical bottleneck, which part is? I timed the translators, and the EC translator accounts for only about 7% of the whole read/write path, although I know some translators run asynchronously, so the real share may be somewhat larger than that.

Thank you for your patience in reading my question!
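P.S. For concreteness, here is a minimal standalone sketch of how ISA-L's Cauchy-matrix encoder can be driven for the 4+2 case above (illustration only, not my actual Gluster patch; the names DATA_BRICKS, CODE_BRICKS and CHUNK_SIZE are made up for this example):

/* Sketch: encoding one 4+2 stripe with ISA-L using a Cauchy matrix.
 * Build with: gcc enc.c -lisal */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <isa-l/erasure_code.h>

#define DATA_BRICKS 4                 /* k: data fragments   */
#define CODE_BRICKS 2                 /* p: coding fragments */
#define CHUNK_SIZE  (128 * 1024)      /* bytes per fragment  */

int main(void)
{
    int k = DATA_BRICKS, p = CODE_BRICKS, m = k + p;
    unsigned char encode_matrix[(DATA_BRICKS + CODE_BRICKS) * DATA_BRICKS];
    unsigned char g_tbls[DATA_BRICKS * CODE_BRICKS * 32];
    unsigned char *frag[DATA_BRICKS + CODE_BRICKS];

    /* one buffer per brick: k data fragments followed by p coding fragments */
    for (int i = 0; i < m; i++) {
        frag[i] = malloc(CHUNK_SIZE);
        if (!frag[i])
            return 1;
        memset(frag[i], i < k ? 'A' + i : 0, CHUNK_SIZE);
    }

    /* Cauchy matrix: the first k rows are the identity, the last p rows are
     * the coding rows.  Any k x k submatrix stays invertible, and computing
     * the inverse needed for decode is cheaper than with the Vandermonde
     * based matrix from gf_gen_rs_matrix(). */
    gf_gen_cauchy1_matrix(encode_matrix, m, k);

    /* expand the p coding rows into multiplication tables, then encode */
    ec_init_tables(k, p, &encode_matrix[k * k], g_tbls);
    ec_encode_data(CHUNK_SIZE, k, p, g_tbls, frag, &frag[k]);

    printf("encoded %d data fragments into %d coding fragments\n", k, p);

    for (int i = 0; i < m; i++)
        free(frag[i]);
    return 0;
}

Also note that if the EC translator really is only about 7% of the end-to-end path, halving its cost can improve total throughput by at most roughly 1/(0.93 + 0.035) ≈ 3.6% (Amdahl's law), which would explain why the whole-system numbers barely move even though encode/decode itself became twice as fast.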
[Gluster-devel] glusterd stuck with glusterfs version 3.12.15
Hi glusterfs experts,

Good day! In my test environment, glusterd sometimes gets stuck and no longer responds to any gluster commands. When I checked the issue, I found that glusterd threads 8 and 9 are dealing with the same socket. I thought the following patch should solve the issue, but after I merged it the problem still exists:

SHA-1: f747d55a7fd364e2b9a74fe40360ab3cb7b11537
* socket: fix issue on concurrent handle of a socket

Looking at the code, socket_event_poll_in calls event_handled before rpc_transport_pollin_destroy. I think this gives another thread the chance to poll exactly the same socket, which causes this glusterd hang. Also, I noticed there is no LOCK_DESTROY(&iobref->lock) in iobref_destroy; I think it would be better to destroy the lock there.

Below is the gdb info from when the issue happened: thread 8 is blocked in pthread_cond_wait and thread 9 is blocked in iobref_unref. I would like to know your opinion on this issue, thanks!

GDB INFO:

Thread 9 (Thread 0x7f9edf7fe700 (LWP 1933)):
#0  0x7f9ee9fd785c in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x7f9ee9fda657 in __lll_lock_elision () from /lib64/libpthread.so.0
#2  0x7f9eeb24cae6 in iobref_unref (iobref=0x7f9ed00063b0) at iobuf.c:944
#3  0x7f9eeafd2f29 in rpc_transport_pollin_destroy (pollin=0x7f9ed00452d0) at rpc-transport.c:123
#4  0x7f9ee4fbf319 in socket_event_poll_in (this=0x7f9ed0049cc0, notify_handled=_gf_true) at socket.c:2322
#5  0x7f9ee4fbf932 in socket_event_handler (fd=36, idx=27, gen=4, data=0x7f9ed0049cc0, poll_in=1, poll_out=0, poll_err=0) at socket.c:2471
#6  0x7f9eeb2825d4 in event_dispatch_epoll_handler (event_pool=0x17feb00, event=0x7f9edf7fde84) at event-epoll.c:583
#7  0x7f9eeb2828ab in event_dispatch_epoll_worker (data=0x180d0c0) at event-epoll.c:659
#8  0x7f9ee9fce5da in start_thread () from /lib64/libpthread.so.0
#9  0x7f9ee98a4eaf in clone () from /lib64/libc.so.6

Thread 8 (Thread 0x7f9ed700 (LWP 1932)):
#0  0x7f9ee9fd785c in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x7f9ee9fd2b42 in __pthread_mutex_cond_lock () from /lib64/libpthread.so.0
#2  0x7f9ee9fd44a8 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#3  0x7f9ee4fbadab in socket_event_poll_err (this=0x7f9ed0049cc0, gen=4, idx=27) at socket.c:1201
#4  0x7f9ee4fbf99c in socket_event_handler (fd=36, idx=27, gen=4, data=0x7f9ed0049cc0, poll_in=1, poll_out=0, poll_err=0) at socket.c:2480
#5  0x7f9eeb2825d4 in event_dispatch_epoll_handler (event_pool=0x17feb00, event=0x7f9edfffee84) at event-epoll.c:583
#6  0x7f9eeb2828ab in event_dispatch_epoll_worker (data=0x180cf20) at event-epoll.c:659
#7  0x7f9ee9fce5da in start_thread () from /lib64/libpthread.so.0
#8  0x7f9ee98a4eaf in clone () from /lib64/libc.so.6

(gdb) thread 9
[Switching to thread 9 (Thread 0x7f9edf7fe700 (LWP 1933))]
#0  0x7f9ee9fd785c in __lll_lock_wait () from /lib64/libpthread.so.0
(gdb) bt
#0  0x7f9ee9fd785c in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x7f9ee9fda657 in __lll_lock_elision () from /lib64/libpthread.so.0
#2  0x7f9eeb24cae6 in iobref_unref (iobref=0x7f9ed00063b0) at iobuf.c:944
#3  0x7f9eeafd2f29 in rpc_transport_pollin_destroy (pollin=0x7f9ed00452d0) at rpc-transport.c:123
#4  0x7f9ee4fbf319 in socket_event_poll_in (this=0x7f9ed0049cc0, notify_handled=_gf_true) at socket.c:2322
#5  0x7f9ee4fbf932 in socket_event_handler (fd=36, idx=27, gen=4, data=0x7f9ed0049cc0, poll_in=1, poll_out=0, poll_err=0) at socket.c:2471
#6  0x7f9eeb2825d4 in event_dispatch_epoll_handler (event_pool=0x17feb00, event=0x7f9edf7fde84) at event-epoll.c:583
#7  0x7f9eeb2828ab in event_dispatch_epoll_worker (data=0x180d0c0) at event-epoll.c:659
#8  0x7f9ee9fce5da in start_thread () from /lib64/libpthread.so.0
#9  0x7f9ee98a4eaf in clone () from /lib64/libc.so.6
(gdb) frame 2
#2  0x7f9eeb24cae6 in iobref_unref (iobref=0x7f9ed00063b0) at iobuf.c:944
944     iobuf.c: No such file or directory.
(gdb) print *iobref
$1 = {lock = {spinlock = 2, mutex = {__data = {__lock = 2, __count = 222, __owner = -2120437760, __nusers = 1, __kind = 8960, __spins = 512, __elision = 0, __list = {__prev = 0x4000, __next = 0x7f9ed00063b000}}, __size = "\002\000\000\000\336\000\000\000\000\260\234\201\001\000\000\000\000#\000\000\000\002\000\000\000@\000\000\000\000\000\000\000\260c\000О\177", __align = 953482739714}}, ref = -256, iobrefs = 0x, alloced = -1, used = -1}
(gdb) quit
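P.S. To make the window I am describing concrete, this is the ordering I have in mind for socket_event_poll_in() (a simplified paraphrase, not a tested patch; the locals and helper calls are the usual ones from rpc/rpc-transport/socket/src/socket.c and error handling is trimmed):

static int
socket_event_poll_in (rpc_transport_t *this, gf_boolean_t notify_handled)
{
        int                      ret    = -1;
        rpc_transport_pollin_t  *pollin = NULL;
        socket_private_t        *priv   = this->private;
        glusterfs_ctx_t         *ctx    = this->ctx;

        ret = socket_proto_state_machine (this, &pollin);

        if (pollin) {
                /* use and destroy the pollin (and its iobref) BEFORE the
                 * fd is re-armed, so a second poll-in cannot run on this
                 * socket while the iobref is still being torn down */
                ret = rpc_transport_notify (this, RPC_TRANSPORT_MSG_RECEIVED,
                                            pollin);
                rpc_transport_pollin_destroy (pollin);
        }

        /* only now tell the event layer it may dispatch the next event
         * for this socket */
        if (notify_handled && (ret != -1))
                event_handled (ctx->event_pool, priv->sock, priv->idx,
                               priv->gen);

        return ret;
}

I do not know whether event_handled() was deliberately placed before the notify/destroy for latency reasons, so please correct me if this reordering would regress something.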
[Gluster-devel] Weekly Untriaged Bugs
[...truncated 6 lines...]
https://bugzilla.redhat.com/1688226 / core: Brick Still Died After Restart Glusterd & Glusterfsd Services
https://bugzilla.redhat.com/1695416 / core: client log flooding with intentional socket shutdown message when a brick is down
https://bugzilla.redhat.com/1691833 / core: Client sends 128KByte network packet for 0 length file copy
https://bugzilla.redhat.com/1695480 / core: Global Thread Pool
https://bugzilla.redhat.com/1694943 / core: parallel-readdir slows down directory listing
https://bugzilla.redhat.com/1696721 / geo-replication: geo-replication failing after upgrade from 5.5 to 6.0
https://bugzilla.redhat.com/1694637 / geo-replication: Geo-rep: Rename to an existing file name destroys its content on slave
https://bugzilla.redhat.com/1689981 / geo-replication: OSError: [Errno 1] Operation not permitted - failing with socket files?
https://bugzilla.redhat.com/1694139 / glusterd: Error waiting for job 'heketi-storage-copy-job' to complete on one-node k3s deployment.
https://bugzilla.redhat.com/1695099 / glusterd: The number of glusterfs processes keeps increasing, using all available resources
https://bugzilla.redhat.com/1690454 / posix-acl: mount-shared-storage.sh does not implement mount options
https://bugzilla.redhat.com/1696518 / project-infrastructure: builder203 does not have a valid hostname set
https://bugzilla.redhat.com/1691617 / project-infrastructure: clang-scan tests are failing nightly.
https://bugzilla.redhat.com/1691357 / project-infrastructure: core archive link from regression jobs throw not found error
https://bugzilla.redhat.com/1692349 / project-infrastructure: gluster-csi-containers job is failing
https://bugzilla.redhat.com/1693385 / project-infrastructure: request to change the version of fedora in fedora-smoke-job
https://bugzilla.redhat.com/1693295 / project-infrastructure: rpc.statd not started on builder204.aws.gluster.org
https://bugzilla.redhat.com/1691789 / project-infrastructure: rpc-statd service stops on AWS builders
https://bugzilla.redhat.com/1695484 / project-infrastructure: smoke fails with "Build root is locked by another process"
https://bugzilla.redhat.com/1693184 / replicate: A brick process (glusterfsd) died with 'memory violation'
https://bugzilla.redhat.com/1696075 / replicate: Client lookup is unable to heal missing directory GFID entry
https://bugzilla.redhat.com/1696633 / tests: GlusterFs v4.1.5 Tests from /tests/bugs/ module failing on Intel
https://bugzilla.redhat.com/1694976 / unclassified: On Fedora 29 GlusterFS 4.1 repo has bad/missing rpm signs
[...truncated 2 lines...]