Hi,
I didn't follow the full discussion, but we had similar problems with
sockets, file descriptors and dup. Just to draw your attention, this was
the PR / patches where the issue was fixed:
https://github.com/apache/nuttx/pull/16499 (originally
https://github.com/apache/nuttx/pull/16361)
- I am mentioning the original PR just because the patches were taken
over and rewritten by Xiaomi, and we have never updated to the upstream
version nor tested it. But perhaps the upstream PR works as well, as I
suppose it was tested by the people who decided to rewrite it.
So please check whether the above fixes are in your branch; otherwise dup
just won't work and you will end up with file descriptors that refer to
random files/devices.
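As a quick sanity check (just a sketch, nothing here is from thttpd; the
function name is made up and it assumes an AF_INET socket), you can verify
that a dup'd descriptor still refers to the same socket by comparing
getsockname() results:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>

/* Duplicate sockfd and confirm that the copy still behaves like the
 * same socket: getsockname() on both should report the same local
 * address.  On a branch without the fixes above, the dup'd fd may
 * refer to some unrelated file or device and the call will fail or
 * report something different.
 */

static int check_dup_refers_to_socket(int sockfd)
{
  struct sockaddr_in a;
  struct sockaddr_in b;
  socklen_t alen = sizeof(a);
  socklen_t blen = sizeof(b);
  int dupfd = dup(sockfd);

  if (dupfd < 0)
    {
      return -1;
    }

  if (getsockname(sockfd, (struct sockaddr *)&a, &alen) < 0 ||
      getsockname(dupfd, (struct sockaddr *)&b, &blen) < 0)
    {
      close(dupfd);
      return -1;
    }

  printf("dup %d -> %d, same local address: %s\n", sockfd, dupfd,
         memcmp(&a, &b, sizeof(a)) == 0 ? "yes" : "no");

  close(dupfd);
  return 0;
}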
Br,
Jukka
On 1.7.2025 22.43, Tim Hardisty wrote:
I have wasted WAY too many days trying to understand what's going on
here: you start from the premise that it "must have worked", but I am
really not so sure about that!
I think the idea of the timeout is simply to cover the case where the
CGI "app" goes AWOL and needs to be killed, if Kconfig is configured to
do that.
Please bear in mind that my POSIX/NuttX/RTOS skills are
limited...but...the file descriptors don't seem to behave as I would
expect, based on NuttX and POSIX documentation.
* The file descriptors correctly exist in the cgi() function, but as
  soon as task_create is called the FD becomes invalid in the cgi()
  function that called task_create. NuttX documentation says that a
  newly created task only inherits the first 3 descriptors, but the
  mere act of calling task_create seems to kill the higher FDs in the
  caller as well. Is that correct behaviour? (See the sketch after
  this list.) Note: I was trying so many different things that my
  "scatter gun" approach might have misled me.
* I tried pthread_create (and posix_spawn) instead of task_create()
but the FD still seemed to get trashed. So there may be more going
on here than my skills allow me to investigate.
* I also saw issues where the group ownership of a pthread (as opposed
  to a task) was seemingly wrong, but I put that down to my lack of
  experience.
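For what it's worth, a minimal way to test the first point above (a
sketch only; child_main, the task name and the priority/stack values are
arbitrary, and it assumes task_create() is callable from the app) would
be to probe the descriptor with fcntl(F_GETFD) before and after the call:

#include <stdio.h>
#include <fcntl.h>
#include <sched.h>

/* Trivial child entry point; it exits immediately. */

static int child_main(int argc, char *argv[])
{
  return 0;
}

/* Probe whether "fd" is still valid in the *calling* task before and
 * after task_create().  fcntl(fd, F_GETFD) returns -1 (EBADF) if the
 * descriptor has been closed or no longer refers to an open file.
 */

static void check_fd_across_task_create(int fd)
{
  printf("before task_create: F_GETFD=%d\n", fcntl(fd, F_GETFD));

  task_create("cgi_test", 100, 2048, child_main, NULL);

  printf("after  task_create: F_GETFD=%d\n", fcntl(fd, F_GETFD));
}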
Right now, reverting the PR that added O_CLOEXEC has made it work.
Maybe it hasn't fixed it "properly", but it has at least made it work.
On 01/07/2025 19:50, Bernd Walter wrote:
Out of curiosity I just took a look at the thttpd CGI code and noticed
something I don't understand myself.
There is cgi_child(), which looks like it is meant to be called after a
fork, for the child case.
It prepares the file descriptors and calls exec.
The result of the exec is stored in a variable named child, which is
an odd name.
Normally exec never returns, and if it does, it is an error; the normal
thing would be for the child to kill itself.
The code goes into error handling if the exec returned < 0, which it
always does whenever exec returns at all.
However, the next thing it does is set up a timeout for the child.
It *is* the child and it has already failed the exec, so why the
timeout when we are already in an error state because we returned from
an exec?
However:
The function is called via task_create, so not a fork.
If I got it right, the task has separate file descriptors, as a forked
process would, and exec closing the task's copies of the sockets should
be fine, as long as the use counts on those are properly increased.
But if it is intended to behave like fork/exec, why does stuff happen
after the exec?
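For comparison, the conventional fork()-based child path would look
roughly like the sketch below (illustrative only, not the actual thttpd
code); everything after exec is unreachable unless exec failed, which is
why running a timeout there looks strange:

#include <unistd.h>
#include <stdlib.h>

/* Conventional child-side path of a fork/exec CGI handler: wire the
 * connection socket to stdin/stdout, exec the CGI binary, and exit
 * immediately if exec returns (exec only returns on error).
 */

static void cgi_child_conventional(int conn_fd, const char *path,
                                   char * const argv[],
                                   char * const envp[])
{
  dup2(conn_fd, STDIN_FILENO);
  dup2(conn_fd, STDOUT_FILENO);
  close(conn_fd);

  execve(path, argv, envp);

  /* Only reached if execve() failed.  The child must not fall back
   * into parent-style logic (timeouts, bookkeeping, etc.); it just
   * terminates.
   */

  _exit(EXIT_FAILURE);
}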
On Tue, Jul 01, 2025 at 03:00:23PM -0300, Alan C. Assis wrote:
Hi Tim,
You are right, it doesn't execute, but some subprocess (like a CGI)
could
try to execute.
This comment there sheds some light on it:
"I wouldn't describe O_CLOEXEC as there principally for privilege
escalation / security reasons -- it's also very, very common to have
non-security bugs happen (frequently of the indefinite-blocking variety)
if a FD is left open beyond when it's intended to be closed because a
subprocess still has it."
So, why does removing SOCK_CLOEXEC make HTTP work? If nothing is
exec'd, the socket shouldn't be closed, right?
And why was it working in the past? Which modification broke this?
Maybe understanding that is important for getting the right fix
(removing the flag may just be acting as a band-aid).
Wengzhe, could you please help us to understand this network issue?
BR,
Alan
On Tue, Jul 1, 2025 at 12:28 PM Tim Hardisty<timhardist...@gmail.com>
wrote:
But that's the point - thttpd *does* call exec(), so the open socket
file descriptor gets closed while it is still needed by the exec'd
application.
If there's another way of doing this, I'm listening!
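One possible alternative (just a sketch, not tested on NuttX): keep
SOCK_CLOEXEC on the accept so other descriptors don't leak, but
explicitly clear the close-on-exec flag on the one descriptor the CGI
child actually needs to inherit, before the exec:

#include <fcntl.h>

/* Clear the close-on-exec flag on a single descriptor so it survives
 * the exec, while every other CLOEXEC descriptor is still closed.
 */

static int keep_fd_across_exec(int fd)
{
  int flags = fcntl(fd, F_GETFD);

  if (flags < 0)
    {
      return -1;
    }

  return fcntl(fd, F_SETFD, flags & ~FD_CLOEXEC);
}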
On 01/07/2025 16:13, Alan C. Assis wrote:
Hi Tim,
Nice find!
Now we need to understand why this worked in the past but doesn't now.
Also, what are the implications of removing SOCK_CLOEXEC? A few
pointers
here:
https://stackoverflow.com/questions/22304631/what-is-the-purpose-to-set-sock-cloexec-flag-with-accept4-same-as-o-cloexec
BR,
Alan
On Tue, Jul 1, 2025 at 11:27 AM Tim Hardisty<timhardist...@gmail.com>
wrote:
The error was, indeed, the socket being opened with the SOCK_CLOEXEC
flag set.
PR to follow.
On 28/06/2025 16:16, Tim Hardisty wrote:
Actually - it might be a change last year. The socket is now opened
like this and I assume CLOEXEC will mess up the operation of the
executed CGI app (will investigate on Monday; not sure what socket
mode it needs to be):
hc->conn_fd = accept4(listen_fd, (struct sockaddr *)&sa, &sz,
SOCK_CLOEXEC);
On 28/06/2025 13:22, Alan C. Assis wrote:
Hi Tim,
Yes, I think send() is the preferred way to work with sockets because
you have finer control, i.e. you can pass flags in the fourth argument
(MSG_DONTWAIT, etc.).
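Something like this, as a rough sketch (the wrapper name is made up):

#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Non-blocking send: with MSG_DONTWAIT the call returns -1 with errno
 * set to EAGAIN/EWOULDBLOCK instead of blocking when the send buffer
 * is full -- something plain write() cannot express.
 */

static ssize_t send_nowait(int sock_fd, const void *buffer, size_t nbytes)
{
  ssize_t n = send(sock_fd, buffer, nbytes, MSG_DONTWAIT);

  if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
    {
      /* Send buffer full: retry later, e.g. after poll() reports
       * POLLOUT on sock_fd.
       */
    }

  return n;
}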
If you suspect that the bug was caused by some recent modification, try
to find a supported board that was used to test thttpd in the past and
test an old NuttX release with it.
This is the approach I use to double check if something is broken in
the mainline.
BR,
Alan
On Fri, Jun 27, 2025 at 3:39 PM Tim Hardisty
<timhardist...@gmail.com> wrote:
Is it as "simple" as thttpd should do:
nwritten = send(sock_fd, buffer, totalbytesread, 0);
rather than the generic:
nwritten = write(sock_fd, buffer, nbytes);
On 27/06/2025 18:40, Tim Hardisty wrote:
I am trying to get thttpd's CGI handling working and have found that
the dup(2) calls for stdin and stdout return a file descriptor that has
already been allocated to the NET socket (via thttpd, I think).
That isn't right, is it?
I am not sure if it's a side effect of something that thttpd does (that
might have been OK in the past but is not right now), a NuttX bug, or a
missing Kconfig setting that relates to this.
The result is that the eventual copying of buffered HTML that should be
written via the socket FD gets rejected because the FD doesn't have
write access (and is now the wrong FD anyway!).
Perhaps there's been a change in the way NuttX deals with all of
this
that didn't get sorted in thttpd?
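For reference, the POSIX behaviour I'd expect (a minimal sketch, nothing
thttpd-specific; the function name is made up) is that dup() returns the
lowest-numbered descriptor that is not currently open, so it should
never hand back the number of a socket that is still open:

#include <assert.h>
#include <unistd.h>

/* Expected POSIX semantics: dup() returns the lowest-numbered unused
 * descriptor, so as long as sock_fd is still open, neither copy below
 * can collide with it.
 */

static void dup_should_not_collide(int sock_fd)
{
  int in_copy  = dup(STDIN_FILENO);
  int out_copy = dup(STDOUT_FILENO);

  assert(in_copy >= 0 && out_copy >= 0);
  assert(in_copy != sock_fd && out_copy != sock_fd);

  close(in_copy);
  close(out_copy);
}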