On Wed, Jul 2, 2025 at 7:19 AM Jukka Laitinen <jukka.laiti...@iki.fi> wrote:
> Hi,
>
> I didn't follow the full discussion, but we had similar problems with
> sockets, file descriptors and dup. Just to draw your attention, this was
> the PR / patches where the issue was fixed:
>
> https://github.com/apache/nuttx/pull/16499 (originally
> https://github.com/apache/nuttx/pull/16361)
>
> - I am mentioning the original PR just because the patches were taken
> over and rewritten by Xiaomi, and we never updated to the upstream
> version nor have tested them. But perhaps the upstream PR also works, as
> I suppose it was tested by the people who decided to rewrite it.
>
> So please check if the above fixes are in your branch, otherwise dup
> just won't work and you will end up using file descriptions to random
> files/devices.
>
> Br,
>
> Jukka
>
>
> On 1.7.2025 22.43, Tim Hardisty wrote:
> > I have wasted WAY too many days trying to understand what's going on
> > here: you start on the premise that it "must have worked" but I am
> > really not so sure with this!
> >
> > I think the idea of the timeout is simply to cover the case that the
> > CGI "app" goes AWOL and needs to be killed, if so configured in
> > Kconfig to do that.
> >
> > Please bear in mind that my POSIX/NuttX/RTOS skills are
> > limited...but...the file descriptors don't seem to behave as I would
> > expect, based on NuttX and POSIX documentation.
> >
> >   * The file descriptors correctly exist in the cgi() function but as
> >     soon as task_create is called the FD becomes invalid in the
> >     cgi() function that called task_create. NuttX says that a newly
> >     created task only inherits the first 3 descriptors, but it seems
> >     that the very fact of calling task_create kills higher
> >     FDs. Is that correct behaviour? Note: I was trying so many
> >     different things, my "scatter gun" approach might have misled me.
> >   * I tried pthread_create (and posix_spawn) instead of task_create()
> >     but the FD still seemed to get trashed.
> >     So there may be more going on here than my skills allow me to
> >     investigate.
> >   * I also saw issues that the group ownership of a pthread (instead
> >     of a task) was seemingly wrong. But I deferred to my "lack of
> >     experience" as the reason.
> >
> > Right now, reverting the PR that added the O_CLOEXEC has made it work.
> > Maybe not fixed it "properly", but at least made it work.
> >
> > On 01/07/2025 19:50, Bernd Walter wrote:
> >> Out of curiosity I just took a look into the thttpd cgi code and noticed
> >> something I don't understand myself.
> >>
> >> There is the cgi_child(), which looks like it is to be called after a
> >> fork for the child case.
> >> It prepares the file descriptors and calls exec.
> >> The result of the exec is stored in a variable named child, which is
> >> an odd name.
> >> Normally exec never returns and if it does it is an error.
> >> The normal thing would be for the child to kill itself.
> >> It goes into error handling in case the exec returned < 0, which it
> >> always does when exec returns.
> >> However the next thing it does is setting up a timeout for the child.
> >> It is the child and it already failed the exec, why the timeout when
> >> we are already in an error state because we returned from an exec?
> >>
> >> However:
> >> The function is called with task_create, so not a fork.
> >> If I got it right then the task has separate file descriptors as a
> >> forked process would have and exec closing the task copies of the
> >> sockets should be fine, as long as the use count on those is
> >> properly increased.
> >> But if it is intended to behave like fork/exec, why does stuff
> >> happen after the exec?
> >>
> >>
> >> On Tue, Jul 01, 2025 at 03:00:23PM -0300, Alan C. Assis wrote:
> >>> Hi Tim,
> >>>
> >>> You are right, it doesn't execute, but some subprocess (like a CGI)
> >>> could try to execute.
> >>>
> >>> This comment there sheds some light on it:
> >>>
> >>> "I wouldn't describe O_CLOEXEC as there principally for privilege
> >>> escalation / security reasons -- it's also very, very common to have
> >>> non-security bugs happen (frequently of the indefinite-blocking
> >>> variety) if a FD is left open beyond when it's intended to be closed
> >>> because a subprocess still has it."
> >>>
> >>> So, why does removing SOCK_CLOEXEC make http work? If the fd is not
> >>> executed, the socket shouldn't be closed, right?
> >>>
> >>> And why was it working in the past? Which modification broke this?
> >>> Maybe understanding it is important to have the right fix (maybe
> >>> removing it is acting as a band-aid).
> >>>
> >>> Wengzhe, could you please help us to understand this network issue?
> >>>
> >>> BR,
> >>>
> >>> Alan
> >>>
> >>> On Tue, Jul 1, 2025 at 12:28 PM Tim Hardisty <timhardist...@gmail.com>
> >>> wrote:
> >>>
> >>>> But that's the point - thttpd *does* call exec() so the open socket
> >>>> file descriptor gets closed when it is still needed by the exec'd
> >>>> application.
> >>>>
> >>>> If there's another way of doing this I'm listening!
> >>>>
> >>>> On 01/07/2025 16:13, Alan C. Assis wrote:
> >>>>> Hi Tim,
> >>>>>
> >>>>> Nice finding!
> >>>>>
> >>>>> Now we need to understand why this worked in the past and now it
> >>>>> doesn't.
> >>>>>
> >>>>> Also, what are the implications of removing SOCK_CLOEXEC? A few
> >>>>> pointers here:
> >>>>>
> >>>>> https://stackoverflow.com/questions/22304631/what-is-the-purpose-to-set-sock-cloexec-flag-with-accept4-same-as-o-cloexec
> >>>>>
> >>>>> BR,
> >>>>>
> >>>>> Alan
> >>>>>
> >>>>> On Tue, Jul 1, 2025 at 11:27 AM Tim Hardisty <timhardist...@gmail.com>
> >>>>> wrote:
> >>>>>
> >>>>>> The error was, indeed, the socket being opened with the SOCK_CLOEXEC
> >>>>>> flag set.
> >>>>>>
> >>>>>> PR to follow.
> >>>>>>
> >>>>>> On 28/06/2025 16:16, Tim Hardisty wrote:
> >>>>>>> Actually - it might be a change last year. The socket is now opened
> >>>>>>> like this and I assume CLOEXEC will mess up the operation of the
> >>>>>>> executed CGI app (will investigate on Monday; not sure what socket
> >>>>>>> mode it needs to be):
> >>>>>>>
> >>>>>>> hc->conn_fd = accept4(listen_fd, (struct sockaddr *)&sa, &sz,
> >>>>>>>                       SOCK_CLOEXEC);
> >>>>>>>
> >>>>>>> On 28/06/2025 13:22, Alan C. Assis wrote:
> >>>>>>>> Hi Tim,
> >>>>>>>>
> >>>>>>>> Yes, I think send() is the preferred form to work with sockets
> >>>>>>>> because you can have fine control, i.e. passing flags in the
> >>>>>>>> fourth argument (MSG_DONTWAIT, etc).
> >>>>>>>>
> >>>>>>>> If you suspect that the bug was caused by some recent modification,
> >>>>>>>> try to find a supported board that was used to test thttpd in the
> >>>>>>>> past and test an old NuttX release with it.
> >>>>>>>> This is the approach I use to double check if something is broken
> >>>>>>>> in the mainline.
> >>>>>>>>
> >>>>>>>> BR,
> >>>>>>>>
> >>>>>>>> Alan
> >>>>>>>>
> >>>>>>>> On Fri, Jun 27, 2025 at 3:39 PM Tim Hardisty
> >>>>>>>> <timhardist...@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>>> Is it as "simple" as thttpd should do:
> >>>>>>>>>
> >>>>>>>>> nwritten = send(sock_fd, buffer, totalbytesread, 0);
> >>>>>>>>>
> >>>>>>>>> rather than the generic:
> >>>>>>>>>
> >>>>>>>>> nwritten = write(sock_fd, buffer, nbytes);
> >>>>>>>>>
> >>>>>>>>> On 27/06/2025 18:40, Tim Hardisty wrote:
> >>>>>>>>>> Trying to get thttpd's CGI handling working and have found that
> >>>>>>>>>> the dup(2) calls of stdin and stdout return a file descriptor
> >>>>>>>>>> that's already been allocated to the NET socket (via thttpd I
> >>>>>>>>>> think).
> >>>>>>>>>>
> >>>>>>>>>> That isn't right, is it?
> >>>>>>>>>>
> >>>>>>>>>> I am not sure if it's a side effect of something that thttpd does
> >>>>>>>>>> (that might have been OK in the past but is now not right) or a
> >>>>>>>>>> NuttX bug, or a missing Kconfig setting that relates to this.
> >>>>>>>>>>
> >>>>>>>>>> The result is that the ultimate copying of buffered html that
> >>>>>>>>>> should be written via the socket FD gets rejected as the FD
> >>>>>>>>>> doesn't have WR access (and is now the wrong FD anyway!).
> >>>>>>>>>>
> >>>>>>>>>> Perhaps there's been a change in the way NuttX deals with all of
> >>>>>>>>>> this that didn't get sorted in thttpd?
> >>>>>>>>>>
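
On the SOCK_CLOEXEC and dup() questions quoted above, one possible
alternative to dropping the flag at accept4() is to clear FD_CLOEXEC on
just the connection descriptor right before it is handed to the CGI
task. The following is only a sketch under plain POSIX fcntl()
semantics and has not been tried on NuttX. The names fd_is_cloexec(),
fd_clear_cloexec() and cloexec_selfcheck() are hypothetical helpers for
illustration, not thttpd or NuttX functions.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: 1 if FD_CLOEXEC is set on fd, 0 if clear,
 * -1 on error.
 */

int fd_is_cloexec(int fd)
{
  int flags;

  assert(fd >= 0);
  flags = fcntl(fd, F_GETFD);
  if (flags < 0)
    {
      return -1;
    }

  return (flags & FD_CLOEXEC) != 0;
}

/* Hypothetical helper: clear FD_CLOEXEC so fd stays open across exec.
 * Returns 0 on success, -1 on error (errno is set by fcntl).
 */

int fd_clear_cloexec(int fd)
{
  int flags;

  assert(fd >= 0);
  flags = fcntl(fd, F_GETFD);
  if (flags < 0)
    {
      return -1;
    }

  return fcntl(fd, F_SETFD, flags & ~FD_CLOEXEC) < 0 ? -1 : 0;
}

/* Self-check on a pipe: set the flag, confirm that a dup() of the
 * close-on-exec descriptor comes back with the flag clear, then clear
 * the flag on the original. Returns 0 if all steps behaved as POSIX
 * describes, -1 otherwise.
 */

int cloexec_selfcheck(void)
{
  int p[2];
  int copy;
  int ok;

  if (pipe(p) < 0)
    {
      return -1;
    }

  ok = fcntl(p[0], F_SETFD, FD_CLOEXEC) == 0 &&
       fd_is_cloexec(p[0]) == 1;

  copy = dup(p[0]);   /* POSIX: dup() clears FD_CLOEXEC on the copy */
  ok = ok && copy >= 0 && fd_is_cloexec(copy) == 0;

  ok = ok && fd_clear_cloexec(p[0]) == 0 &&
       fd_is_cloexec(p[0]) == 0;

  if (copy >= 0)
    {
      close(copy);
    }

  close(p[0]);
  close(p[1]);
  return ok ? 0 : -1;
}
```

In thttpd this would presumably mean calling something like
fd_clear_cloexec(hc->conn_fd) just before the CGI is launched, so every
other descriptor keeps its close-on-exec protection. Whether that
interacts correctly with how task_create() and exec duplicate
descriptors on NuttX would still need checking against the PRs
mentioned at the top of the thread.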