Re: Question about file descriptor swapping
Hi Stefan,

we had a discussion on Slack [1] (found by Jan at [2]) about the atomicity of "rename". Could it be a similar problem here (Linux/qemu/fs stack)? In [2] they were able to work around their problem by waiting some time after renaming.

@Stefan, maybe you could try waiting some time after renaming/closing the db?

Cheers,
-Ronny

[1] https://couchdb.slack.com/archives/C01TBE2J197/p1678355980122119
[2] https://toot.cat/@zkat/109973167110793372
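The wait-after-rename idea suggested above could be tried with a small retry loop around the reopen. The following is only a sketch, written in Python rather than Erlang, under stated assumptions: the helper name `open_with_retry`, the `attempts` and `delay` values, and the idea that the failure would surface as an `OSError` are all hypothetical and are not taken from CouchDB or from the linked discussions.

```python
import os
import tempfile
import time


def open_with_retry(path, attempts=5, delay=0.1):
    """Open `path` for reading, retrying if the open fails.

    Hypothetical helper: `attempts` and `delay` are illustrative values,
    and treating the failure as an OSError is an assumption.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return open(path, "rb")
        except OSError as err:
            last_error = err
            # Wait a little longer after every failed attempt.
            time.sleep(delay * (attempt + 1))
    raise last_error


def demo():
    """Rename a compacted file into place, then reopen it with retries."""
    with tempfile.TemporaryDirectory() as tmp:
        compact = os.path.join(tmp, "asdf.couch.compact")
        target = os.path.join(tmp, "asdf.couch")
        with open(compact, "wb") as fh:
            fh.write(b"compacted")
        os.rename(compact, target)
        with open_with_retry(target) as fh:
            return fh.read()


if __name__ == "__main__":
    print(demo())
```

If a delay like this made the hang disappear, that would hint at timing in the filesystem stack; if it does not, the cause is more likely elsewhere.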
Re: Question about file descriptor swapping
Hi Jan,

here you go: https://github.com/emlix/couchdb-yocto

The patch I mentioned is here:
https://github.com/emlix/couchdb-yocto/blob/main/meta-couchdb/recipes-core/couchdb/files/0001-swap-fds.patch

When you run the compaction test (see the README to get there)
/usr/lib/test-couchdb/test-compaction.sh

you will find this as the last line in the log (/var/log/couchdb/couch.log):
[debug] [<0.173.0>] before gen_server:call

Thanks,
Stefan
Re: Question about file descriptor swapping
Hi Stefan,

Thanks for the additional info. I’m happy to try a yocto build here.

Best
Jan
Re: Question about file descriptor swapping
Hi,

I can give you some background context: our CouchDB instance is running on an embedded device (with a minimal attack surface, so we are under no pressure to mitigate CVEs). CouchDB was chosen because of its append-only writes and power-fail-safe properties (and because of the easily scriptable curl/JSON interface).

Currently there is a production system running on an SMB1 share (mounted on a Linux host) which works well (at least for our use cases). SMB1 is no longer the default on the Windows remote side, and SMB2/3 has an issue with opening a renamed but not closed file descriptor. The question is whether we can solve this issue with minimal changes.

> 1. How did you verify that the gen_server:call/3 call never returns?
> 2. Do you get any pertinent lines (especially crashes) in your
>    couch.log?

By adding:

> +?LOG_DEBUG("before gen_server:call", []),
>  ok = gen_server:call(Db#db.main_pid, {db_updated, NewDb3}, infinity),
> +?LOG_DEBUG("after gen_server:call", []),

the log gives:

> [Thu, 02 Mar 2023 10:36:24 GMT] [debug] [<0.391.0>] Compaction process spawned for db "asdf"
> [Thu, 02 Mar 2023 10:36:24 GMT] [debug] [<0.84.0>] New task status for <0.391.0>: [{changes_done,1},
>     {database,<<"asdf">>},
>     {progress,100},
>     {started_on,1677753384},
>     {total_changes,1},
>     {type,database_compaction},
>     {updated_on,1677753384}]
> [Thu, 02 Mar 2023 10:36:24 GMT] [debug] [<0.366.0>] CouchDB swapping files .../asdf.couch and .../asdf.couch.compact.
> [Thu, 02 Mar 2023 10:36:24 GMT] [debug] [<0.366.0>] before gen_server:call

Then nothing for a long time...

Refreshing the db in the Futon web GUI gives no response,

and the log continues with:

> [Thu, 02 Mar 2023 11:02:54 GMT] [error] [<0.144.0>] ** Generic server couch_compaction_daemon terminating
> ** Last message in was {'EXIT',<0.145.0>,
>     {timeout,
>      {gen_server,call,[couch_server,get_server]}}}
> ** When Server state == {state,<0.145.0>}
> ** Reason for termination ==
> ** {compaction_loop_died,
>     {timeout,{gen_server,call,[couch_server,get_server]}}}
>
> [Thu, 02 Mar 2023 11:02:54 GMT] [error] [<0.144.0>] {error_report,<0.31.0>,
>     {<0.144.0>,crash_report,
>      [[{initial_call,
>         {couch_compaction_daemon,init,['Argument__1']}},
>        {pid,<0.144.0>},
>        {registered_name,couch_compaction_daemon},
>        {error_info,
>         {exit,
>          {compaction_loop_died,
>           {timeout,
>            {gen_server,call,[couch_server,get_server]}}},
>          [{gen_server,terminate,7,
>            [{file,"gen_server.erl"},{line,804}]},
>           {proc_lib,init_p_do_apply,3,
>            [{file,"proc_lib.erl"},{line,237}]}]}},
...

> 3. Can you share your environment where you get to compile 1.6.1
>    successfully, so we can try and reproduce this?

I could prepare a Yocto setup for you to build a toolchain and packages for a qemu/docker image, if you are familiar with that build system...

> 4. Could it be that your SMB implementation doesn’t allow for opening
>    and closing files in this quick succession (with or without a rename
>    in the mix)?

For testing, it doesn't need to run on an SMB share; the timeout issue occurs with the given fd-swap patch on a default (Linux) setup.

And an strace log does not show any underlying FS issues.

Best,
Stefan
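For context on the SMB2/3 remark above: on a local POSIX filesystem, a descriptor opened before a rename normally stays valid and keeps referring to the same file afterwards. The sketch below checks exactly that. It is written in Python, assumes Linux or another POSIX system, uses hypothetical file names modelled on the log above, and says nothing about how an SMB mount behaves.

```python
import os
import tempfile


def read_through_renamed_fd():
    """Return (data read via the pre-rename descriptor, same-inode flag)."""
    with tempfile.TemporaryDirectory() as tmp:
        old_path = os.path.join(tmp, "asdf.couch.compact")
        new_path = os.path.join(tmp, "asdf.couch")
        with open(old_path, "wb") as fh:
            fh.write(b"header+docs")

        # Open the descriptor BEFORE the rename and keep it open across it.
        fd = os.open(old_path, os.O_RDONLY)
        try:
            os.rename(old_path, new_path)
            # Positional read through the old descriptor, comparable in
            # spirit to the file:pread/3 call mentioned in the thread.
            data = os.pread(fd, 64, 0)
            # The open descriptor and the new path name the same inode.
            same_inode = os.fstat(fd).st_ino == os.stat(new_path).st_ino
        finally:
            os.close(fd)
        return data, same_inode


if __name__ == "__main__":
    print(read_through_renamed_fd())
```

Running the same check against a directory on the SMB2/3 mount (by pointing `tempfile.TemporaryDirectory(dir=...)` at it) might help narrow down which step the share rejects.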
Re: Question about file descriptor swapping
Hi Stefan,

first off, CouchDB 1.6.1 is no longer supported by this project AND it has a long list of CVEs[1] against it. You REALLY should be operating on a newer version.

Secondly, just to understand your motivation: you think closing and reopening the fds after the file:rename/2 call will make things work for your SMB operation?

If yes, the only thing I could spot that is substantially different is that the NewFd position is advanced implicitly by the underlying file:pread/3 in [2] and your SwappedFd doesn’t get the same treatment, but I don’t know why that should block the gen_server call, as that only does some refcounting updates[3]. While this includes stopping the gen_server[4], I don’t see how the Pid this operates on should be any different under your patch.

So:

1. How did you verify that the gen_server:call/3 call never returns?
2. Do you get any pertinent lines (especially crashes) in your couch.log?
3. Can you share your environment where you get to compile 1.6.1 successfully, so we can try and reproduce this?
4. Could it be that your SMB implementation doesn’t allow for opening and closing files in this quick succession (with or without a rename in the mix)?

[1]: https://docs.couchdb.org/en/stable/cve/index.html
[2]: https://github.com/apache/couchdb/blob/1.6.x/src/couchdb/couch_db_updater.erl#L179
[3]: https://github.com/apache/couchdb/blob/1.6.x/src/couchdb/couch_db.erl#L1122-L1130
[4]: https://github.com/apache/couchdb/blob/1.6.x/src/couchdb/couch_ref_counter.erl#L84

Best
Jan

Professional Support for Apache CouchDB: https://neighbourhood.ie/couchdb-support/
24/7 Observation for your CouchDB Instances: https://opservatory.app

> On 28. Feb 2023, at 10:19, Stefan Kral wrote:
>
> Hi,
>
> I'm experimenting with a CouchDB setup on an SMB mount point. I know this
> is not supported, but I ran into a (maybe simple) problem I don't
> understand. Maybe one of you can easily give a hint (that would be
> amazing).
>
> Given the following patch (I need to close/reopen the file descriptors
> after renaming) for the function
> https://github.com/apache/couchdb/blob/1.6.x/src/couchdb/couch_db_updater.erl#L176
>
>>  1 --- a/src/couchdb/couch_db_updater.erl
>>  2 +++ b/src/couchdb/couch_db_updater.erl
>>  3 @@ -202,8 +202,18 @@ handle_call({compact_done, CompactFilepath}, _From, #db{filepath=Path}=Db) ->
>>  4      RootDir = couch_config:get("couchdb", "database_dir", "."),
>>  5      couch_file:delete(RootDir, Filepath),
>>  6      ok = file:rename(CompactFilepath, Filepath),
>>  7 +
>>  8 +    ok = couch_file:close(NewDb#db.updater_fd),
>>  9 +    ok = couch_file:close(NewDb#db.fd),
>> 10 +    {ok, SwappedFd} = couch_file:open(Filepath),
>> 11 +    SwappedReaderFd = open_reader_fd(Filepath, Db#db.options),
>> 12 +    SwappedDb = NewDb2#db{
>> 13 +        fd = SwappedReaderFd,
>> 14 +        updater_fd = SwappedFd
>> 15 +    },
>> 16 +    unlink(SwappedFd),
>> 17      close_db(Db),
>> 18 -    NewDb3 = refresh_validate_doc_funs(NewDb2),
>> 19 +    NewDb3 = refresh_validate_doc_funs(SwappedDb),
>> 20      ok = gen_server:call(Db#db.main_pid, {db_updated, NewDb3}, infinity),
>> 21      couch_db_update_notifier:notify({compacted, NewDb3#db.name}),
>> 22      ?LOG_INFO("Compaction for db \"~s\" completed.", [Db#db.name]),
>
> then the gen_server:call() of line 20 never returns.
>
> Is there a major issue with this approach or just a minor mistake in my
> implementation?
>
>
> Thank you for having a look,
> Stefan
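The order of file operations in the quoted patch (rename the compacted file into place, close the handles opened before the rename, reopen by the final path) can be sketched at the operating-system level as follows. This is a loose Python illustration, not a translation of the Erlang code: `swap_and_reopen` and `demo` are hypothetical names, the two reopened handles merely stand in for `SwappedReaderFd` and `SwappedFd`, and none of the gen_server or ref-counting behaviour discussed in the thread is modelled.

```python
import os
import tempfile


def swap_and_reopen(path, compact_path, old_handles):
    """Move `compact_path` over `path`, then replace the open handles.

    Hypothetical helper; it only mirrors the file-level steps of the patch.
    """
    # Rename over the old file (os.replace overwrites an existing target).
    os.replace(compact_path, path)
    # Drop every handle that was opened before the rename.
    for handle in old_handles:
        handle.close()
    # Reopen by the final path: one read-only handle, one read/write handle.
    reader = open(path, "rb")
    updater = open(path, "r+b")
    return reader, updater


def demo():
    """Return (content seen by the new reader, whether old handles closed)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "asdf.couch")
        compact_path = path + ".compact"
        with open(path, "wb") as fh:
            fh.write(b"old")
        with open(compact_path, "wb") as fh:
            fh.write(b"compacted")
        # Handles on the compacted file, opened before the swap.
        pre_rename = [open(compact_path, "rb"), open(compact_path, "r+b")]
        reader, updater = swap_and_reopen(path, compact_path, pre_rename)
        try:
            return reader.read(), all(h.closed for h in pre_rename)
        finally:
            reader.close()
            updater.close()


if __name__ == "__main__":
    print(demo())
```

On a local filesystem this sequence is unremarkable, which fits the report that strace shows no underlying FS issues; the sketch cannot say why the Erlang call blocks.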