Re: bash sockets: printf \x0a does TCP fragmentation

Bob Proulx Sat, 22 Sep 2018 17:56:28 -0700

I see that you have subscribed now.  Awesome!  If you and others would
be so kind as to list-reply instead of CC'ing me directly that would
be great.  I read the replies on the mailing list.

[email protected] wrote:
> Bob Proulx wrote:
> > You are doing something that is quite unusual.  You are using a shell
> > script redirection on a TCP socket.  That isn't very common.  
> 
> Do you think there should be a paragraph NOT COMMON where bash sockets
> should rather belong to?

You actually had not included enough background information to know if
you were using the bash built in network implementation or not.  You
only showed that you had set up fd 5 connected to a network socket.
That can happen because, for example, a script was used to service an
inetd configuration or similar.  It doesn't actually need to be the
built in network protocol at all.  But now that you have said the
above I guess I can assume that you are using the built in
implementation.

As to whether the documentation should say this or not that is not
really practical.  There are a godzillian different things that are
not typically addressed by writing a shell script.  As a practical
matter it is impossible to list everything out explicitly.  And if one
tries then the complaint is that the documentation is so long and
detailed that it is unusable due to it.

Primarily a shell script is a command and control program.  It is very
good for that purpose.  It is typically used for that purpose.  That
is the mainstream use and it is very unlikely one will run into
unusual situations there.

But programming tasks that are much different from command and control
tasks, such as your program interacting by TCP with other devices on
the network, are not as common.  I don't have facts to back that up
but I do believe that to be true based upon the way I have seen shell
scripts being programmed and used over a long period of time.  Of
course if you have spent the last 20 years programming network shell
scripts then your observations will bias you the other way. :-)

> > More
> > typically one would use a C program instead.  So it isn't surprising
> > that you are finding interactions that are not well known.
> 
> Bob, my intention was not to discuss program languages and what is typical
> with you or anybody else here.

Hmm...  Put yourself in our shoes.  You stood up on the podium that is
this public mailing list and spoke into the megaphone addressing all
of us complaining that bash's printf was buggy.  But from my
perspective printf is behaving as expected.  It is designed to deal
with line oriented data.  It will also deal with binary data if one is
careful.  But it appears that your application wasn't careful enough
and had tripped over some problems.

Should we (me!) keep silent about those very obvious problems?  It
feels obvious to me but apparently not to the author of the above.  As
has often been said many eyes make all bugs apparent.  I was pointing
this out to you as a public service.  But in response you seem hostile
by the language above and below.  That isn't encouraging any help. :-(

> >> printf -- "$data" >&5 2>/dev/null
> > 
> > Why is stderr discarded?  That is almost always bad because it
> > discards any errors that might occur.  You probably shouldn't do this.
> > 
> > What happens if $data contains % format strings?  What happens if the
> > format contains a sequence such as \c?  This looks problematic.  This
> > is not a safe programming practice.
> 
> I doubt you can judge on this by just looking at a single line
> of code -- the project has > 18k LoC in bash.

That single line of code was problematic just by itself standing alone
without the rest of the program around it.  That is independent of
anything the rest of the program might contain.
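
To make the concern concrete, here is a minimal sketch of what goes
wrong when data is used as the format.  The payload is made up for
illustration, and octal escapes are used so that the sketch also runs
in shells whose printf has no \x support.

```shell
# Hypothetical payload containing a '%' conversion and an escape.
data='50%d\012end'

# Unsafe: $data is the FORMAT.  The %d consumes a missing argument
# and silently prints 0, so the output no longer matches the data.
unsafe=$(printf "$data")

# Safer: a fixed format.  %b expands backslash escapes found in the
# argument but leaves any '%' in it alone.
safe=$(printf '%b' "$data")
```

With the fixed format the percent sign survives and only the
backslash escape is expanded.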

However if you would like to pass sections of the rest of the program
through the help-bash mailing list then I am sure the group there
would help improve the quality of it.

> Github is the place to discuss and do PRs for our project.

No.  Sorry.  You came here to this mailing list.  Therefore this is
the place to discuss it.  Please put yourself in my shoes.  If the
case were reversed and I came over to Github and then stated that
Github was not the place for the discussion but that you needed to set
up email and come over to my mailing list and discuss it there
instead.  How would you feel?  I had come into your house, asked you
for help, then wanted you to go elsewhere?  How would you feel?  I can
tell you that I do not feel very welcome by it.

Also remember that Github is a non-free service.  That is free as in
freedom, not free as in beer.  The free in Free Software.  Or in this
case the opposite of it being non-free.  We try not to use software
that does not respect our freedoms nor ask others to do so either.
It's a philosophy of life thing.  I hope you will understand.

> >> If there's a workaround, please let me know. (tried to add "%b" with no
> >> effect). Otherwise I believe it's a bug.

Note that I *did* provide you with a way to do what you wanted to do. :-)

It was also noted in another message that the external standalone
printf command line utility did buffer as you desired.  That seems
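another very good solution too.  Simply use "env printf ..." (or the
full path such as /usr/bin/printf) to force using the external
version.  Note that "command printf ..." only bypasses shell functions
and still runs the builtin.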
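
As a small sketch of being sure the external binary is the one that
runs, one can go through env(1), which executes a program found in
PATH and knows nothing about shell builtins.  The function name
ext_printf is only an illustration, not something from the script
being discussed.

```shell
# Run the external printf binary rather than the shell builtin.
ext_printf() {
    env printf "$@"
}

# The data is passed as an argument to %s, so it is copied verbatim.
result=$(ext_printf '%s' 'a%b')
```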

Anyway...  Since printf is a text oriented utility it makes sense to
me that it would operate in line buffered output mode.

Let's look at the bash documentation for 'help printf':

  printf: printf [-v var] format [arguments]
    Formats and prints ARGUMENTS under control of the FORMAT.
...    
    FORMAT is a character string which contains three types of
    objects: plain characters, which are simply copied to standard
    output; character escape sequences, which are converted and copied
    to the standard output; and format specifications, each of which
    causes printing of the next successive argument.

The format provided in your example in $data is interpreted as a
"character string".  Apparently newlines (\n a.k.a. 0x0a characters)
are used in the binary data in your implementation!  However as a
newline character it is causing line buffered output to be flushed
resulting in line oriented write(2) calls.

If you are trying to print raw binary data then I don't think you
should be using 'printf' to do it.  It just feels like the wrong
utility to be used to me.  Also there was the problematic use of it in
the format string.

Instead I would use utilities designed to work with binary data.  Such
as 'cat'.  I personally might prepare a temporary file containing
exactly the raw data that is needed to be transmitted and then use
"cat $tmpfile >&5" to transmit it.  Or if I wanted strict control of
the block size making cat less appropriate then I would use "dd
if=$tmpfile status=none bs=1M >&5" or some such where no
interpretation of the data is done.
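
A minimal sketch of that temporary file approach follows.  Here fd 5
is opened on a scratch file as a stand-in for the network socket, and
the five byte payload is invented for the example.  The payload is
still prepared with printf, but into a regular file where the number
of write(2) calls does not matter.

```shell
# Assemble the payload in a temporary file, then send it in one go.
tmpfile=$(mktemp)
outfile=$(mktemp)          # stand-in for the socket in this sketch

# Hypothetical payload that includes a newline byte (octal 012).
printf '\026\003\001\012\377' > "$tmpfile"

exec 5> "$outfile"         # in the real script fd 5 is the socket
cat "$tmpfile" >&5         # cat copies bytes without interpreting them
exec 5>&-

# Hex dump of what arrived on fd 5, for checking.
sent=$(od -An -tx1 "$outfile" | tr -d ' \n')
rm -f "$tmpfile" "$outfile"
```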

However there may be a bug in the way bash opens that fd number 5 and
sets up buffering.  If it were me then I would look closely there.  It
is possible that the file descriptor is being opened line buffered when
it should be using block buffering instead.  Since the
network socket is not a tty I would suspect that it should be using
block buffering.  That is what I would expect.  Therefore that is
where I would look for a bug.  Obviously I can be wrong though too.

One should double check that fd 5 is not a tty.

  if [ -t 5 ]; then echo "fd 5 is a tty"; else echo "fd 5 is not a tty"; fi

If it is a tty then I expect line buffering.  If it is not then I
would expect block buffering.  Just as a general statement about
programs using libc's stdio to write to it.

> > You can re-block the output stream using other tools such as 'cat' or
> > 'dd'.  Since you are concerned about block size then perhaps dd is the
> > better of the two.
> > 
> >   | cat
> 
> cat has a problem with binary chars, right? And: see below.

No.  It does not.  The 'cat' utility concatenates files.  From the cat
documentation:

  ‘cat’ copies each FILE (‘-’ means standard input), or standard input if
  none are given, to standard output.  Synopsis:
  ...
     On systems like MS-DOS that distinguish between text and binary
  files, ‘cat’ normally reads and writes in binary mode.  However, ‘cat’
  reads in text mode if one of the options ‘-bensAE’ is used or if ‘cat’
  is reading from standard input and standard input is a terminal.
  Similarly, ‘cat’ writes in text mode if one of the options ‘-bensAE’ is
  used or if standard output is a terminal.

> > Or probably better:
> > 
> >   | dd status=none bs=1M
> > 
> > Or use whatever block size you wish.  The 'dd' program will read the
> > input into its buffer and then output that block of data all in one
> > write(2).  That seems to be what you are wanting.
> 
> We actually use dd to read from the socket. Of course we could use
> writing to it as well -- at a certain point of time.

Great!  Problem solved then. :-)

I didn't say it before but since this is such a long email making it a
little longer won't hurt more.  The status=none dd option is a GNU
extension.  It is useful in this context.  But it is not a portable dd
option.  Other platforms may or may not implement it.  *BSD implements
it now but some of my beloved legacy Unix platforms do not.

  http://pubs.opengroup.org/onlinepubs/9699919799/utilities/dd.html
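
For platforms without status=none, a portable sketch is to discard
dd's record counts on stderr instead.  The trade-off is that real dd
errors are hidden too.  The function name reblock and the 4096 byte
default are assumptions for illustration.  Using obs= rather than bs=
is deliberate: as I read POSIX, bs= copies each short read as its own
block, while obs= gathers input into full output blocks.

```shell
# Re-block a stream into larger writes without the GNU status=none.
# obs= sets only the output block size; 2>/dev/null hides the
# "records in / records out" summary that dd prints on stderr.
reblock() {
    dd obs="${1:-4096}" 2>/dev/null
}

# Two lines go in; dd passes the bytes through unchanged.
copied=$(printf 'line one\nline two\n' | reblock 4096)
```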

> Still, a prerequisite would be that printf is the culprit and not
> how bash + libs do sockets.

The repeated mention of sockets nudges me to point out that sockets
are just files.  There is nothing special about them as such.  Trying
to find fault there is just a false path to follow.  Programs writing
to a file descriptor connected to a network socket don't "know"
anything about the network.  It is the network layer that is taking
each write(2) and sending out packets.

What is special is whether the device is a tty or not.  If it is a tty
then libc's standard I/O buffering does one thing.  If it is not a tty
then libc's standard I/O buffering does a different thing.  Let's look
at the documentation.

For me when I want to look up documentation matching my system I use
the locally installed info pages.  But for the purposes of showing
where this documentation exists I will point to the top of tree
version online.  However note that it may be newer than what you have
installed locally.

https://www.gnu.org/software/libc/manual/html_node/Stream-Buffering.html#Stream-Buffering

https://www.gnu.org/software/libc/manual/html_node/Buffering-Concepts.html#Buffering-Concepts

    Newly opened streams are normally fully buffered, with one
    exception: a stream connected to an interactive device such as a
    terminal is initially line buffered.
    ...
    The use of line buffering for interactive devices implies that
    output messages ending in a newline will appear immediately-which
    is usually what you want.

Additionally the stdio man page says:

    man stdio

       Output streams that refer to terminal devices are always line buffered
       by default; pending output to such streams is written automatically
       whenever an input stream that refers to a terminal device is read.  In
       cases where a large amount of computation is done after printing part
       of a line on an output terminal, it is necessary to fflush(3) the
       standard output before going off and computing so that the output will
       appear.

However I did not look at how bash's implementation of printf was
coded.  The above is just general information that generally applies
to all utilities.

> > P.S. You can possibly use the 'stdbuf' command to control the output
> > buffering depending upon the program.
> > 
> >   info stdbuf
> 
> That could be an option, thanks. Need to check though whether
> 
> a) it doesn't fragment then -- not sure while reading it

I feel compelled to say that the network stack is going to transmit a
packet every time write(2) is called.  Programs doing the writing
don't know that they are writing to a network stream.  They are just
writing data using write(2).  If it is a fully network aware program
then of course it may be using sendto(2) or other network specific
call.  But general filter utilities are not going to be using those
calls and are just going to read(2) and write(2) and not have any
specific network coding.  That's part of the beauty of the Unix
Philosophy.  Everything is a file.  In your case though you are trying
to pump around binary data and are using line oriented text utilities
that are using line buffering and that is where problems are being
tripped over.

You are thinking of this as fragmentation.  Because in your
application it appears to you in your context as fragmentation.  But
as a general statement it isn't fragmentation.  It is just a data
stream being written every time it is being written.  Certainly any
text program writing lines out isn't going to be coded in any way that
knows about TCP data blocks.  For any program in the middle it is just
lines of text in and lines of text out.  Or in the case of other
programs that deal with binary data such as 'cat' it is just bytes in
and bytes out.  The concept of fragmentation belongs to a different
layer of the software block diagram.

[[
There is an old joke related to this too.  "The Unix way -- everything
is a file.  The Linux way -- everything is a filesystem."  Haha!

And also a quote, "I think the major good idea in Unix was its clean
and simple interface: open, close, read, and write." --Ken Thompson
]]

> b) it's per default available on every platform supported by testssl.sh.

The 'stdbuf' utility is included in GNU coreutils starting with
version 7.5 onward.  It may not be available on other platforms.  It
didn't feel like the right solution to me.  But I mentioned it in
passing in the P.S. because it is related.  Perhaps it will be useful
to you.
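
If you do try it, a sketch along these lines avoids depending on it
being installed.  The wrapper name run_buffered and the 65536 byte
buffer size are made up for the example.  Also note that, as far as I
know, stdbuf only influences external programs that use libc's stdio
and leave their buffering at the default.  It cannot change what a
shell builtin such as bash's own printf does.

```shell
# Run a command with a large fully buffered stdout when stdbuf is
# available, and run it unchanged when it is not.
run_buffered() {
    if command -v stdbuf >/dev/null 2>&1; then
        stdbuf -o65536 "$@"
    else
        "$@"
    fi
}

upper=$(printf 'abc\n' | run_buffered tr a-z A-Z)
```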

Hope this helps! :-)

Bob
