On Sun, Oct 16, 2022 at 11:48:35AM +0100, cho...@jtan.com wrote:
> Kastus Shchuka writes:
> > On Sat, Oct 15, 2022 at 11:42:17PM -0300, Lucas de Sena wrote:
> > > Hi,
> > > 
> > > After trying to split a string into fields delimited with colons and
> > > spaces, I found this bug in how ksh(1) does substitution.  The actual
> > > behavior contradicts what other shells like bash and mksh do and also
> > > contradicts its own manual.
> > > 
> > > Running the following on other shells (say, bash) prints "/foo/bar/".
> > > This command splits the string " foo : bar " into two fields: "foo"
> > > and "bar", considering colon and space as delimiters.
> > > 
> > >   echo " foo : bar " | {
> > >           IFS=": "
> > >           read -r a b
> > >           printf -- "/%s/%s/\n" "$a" "$b"
> > >   }
> > > 
> > > However, running the same command in OpenBSD ksh(1) (or sh(1)) splits
> > > the string into "foo" and ": bar".
> >
> > This is because the last parameter (b) is a concatenation of two fields.
> > Parsing is done properly if you add c to the read command:
> >
> > + echo  foo : bar 
> > + IFS=: 
> > + read -r a b c
> > + printf -- /%s/%s/%s/\n foo  bar
> > /foo//bar/
> >
> >
> > > 
> > > The manual ksh(1) provides the following, similar example:
> > > 
> > > > Example: If IFS is set to “<space>:”, and VAR is set to
> > > > “<space>A<space>:<space><space>B::D”, the substitution for $VAR
> > > > results in four fields: ‘A’, ‘B’, ‘’ (an empty field), and ‘D’.
> > > > Note that if the IFS parameter is set to the NULL string, no field
> > > > splitting is done; if the parameter is unset, the default value of
> > > > space, tab, and newline is used.
> > > 
> > > Let's try it:
> > > 
> > >   echo " A :  B::D" | {
> > >           IFS=" :"
> > >           read -r arg1 arg2 arg3 arg4
> > >           printf -- '1st: "%s"\n' "$arg1"
> > >           printf -- '2nd: "%s"\n' "$arg2"
> > >           printf -- '3rd: "%s"\n' "$arg3"
> > >           printf -- '4th: "%s"\n' "$arg4"
> > >   }
> > > 
> > > bash(1) splits the line into the following fields:
> > > 
> > >   1st: "A"
> > >   2nd: "B"
> > >   3rd: ""
> > >   4th: "D"
> > > 
> > > This is actually the expected output, as described in the manual.
> > > 
> > > However, running the same command in OpenBSD ksh, prints this:
> > > 
> > >   1st: "A"
> > >   2nd: ""
> > >   3rd: "B"
> > >   4th: ":D"
> > > 
> > > A completely different thing.
> > > The same occurs with OpenBSD sh(1).
> >
> > What you observe is the result of the next paragraph in the man page
> > after the example you quoted:
> >
> >      Also, note that the field splitting applies only to the immediate result
> >      of the substitution.  Using the previous example, the substitution for
> >      $VAR:E results in the fields: `A', `B', `', and `D:E', not `A', `B', `',
> >      `D', and `E'.  This behavior is POSIX compliant, but incompatible with
> >      some other shell implementations which do field splitting on the word
> >      which contained the substitution or use IFS as a general whitespace
> >      delimiter.
> 
> Actually you need to look further into the manual since the word
> splitting is performed not by parameter substitution but by read:
> 
>              Reads a line of input from the standard input, separates the line
>              into fields using the IFS parameter (see Substitution above), and
>              assigns each field to the specified parameters.
> 
> This is why adding an extra variable to read above (and here) makes
> it capture the remainder of the string.
> 
> For example:
> 
>       $ alias dump="perl -MData::Dumper -e 'print Dumper @ARGV'"
>       $ dump a b c
>       $VAR1 = 'a';
>       $VAR2 = 'b';
>       $VAR3 = 'c';
> 
>       $ ( IFS=:; dump $PATH )
>       $VAR1 = '/bin';
>       $VAR2 = '/sbin';
>       $VAR3 = '/usr/bin';
>       $VAR4 = '/usr/sbin';
>       $VAR5 = '/usr/X11R6/bin';
>       $VAR6 = '/usr/local/bin';
>       $VAR7 = '/usr/local/sbin';
>       $VAR8 = '/usr/games';
> 
> So given $X:
> 
>       $ X=' A :  B::D'
> 
> Parameter substitution:
> 
>       $ ( IFS=' :'; dump $X )
>       $VAR1 = 'A';
>       $VAR2 = 'B';
>       $VAR3 = '';
>       $VAR4 = 'D';
> 
> Similarly:
> 
>       $ fn() { dump "$@"; }; fn $X
>       $VAR1 = 'A';
>       $VAR2 = ':';
>       $VAR3 = 'B::D';
> 
>       $ fn() { dump "$@"; }; ( IFS=' :'; fn $X )
>       $VAR1 = 'A';
>       $VAR2 = 'B';
>       $VAR3 = '';
>       $VAR4 = 'D';
> 
> read substitution:
> 
>       $ echo "$X" | ( IFS=' :'; read a1 a2 a3; dump "$a1" "$a2" "$a3" )
>       $VAR1 = 'A';
>       $VAR2 = '';
>       $VAR3 = 'B::D';
> 
>       $ echo "$X" | ( IFS=' :'; read a1 a2 a3 a4; dump "$a1" "$a2" "$a3" "$a4" )
>       $VAR1 = 'A';
>       $VAR2 = '';
>       $VAR3 = 'B';
>       $VAR4 = ':D';
> 
>       $ echo "$X" | ( IFS=' :'; read a1 a2 a3 a4 a5; dump "$a1" "$a2" "$a3" "$a4" "$a5" )
>       $VAR1 = 'A';
>       $VAR2 = '';
>       $VAR3 = 'B';
>       $VAR4 = '';
>       $VAR5 = 'D';
> 
> It does look like read, which uses its own expansion routine, has
> a bug: a2/VAR2 should be 'B' (or 'B::D') not ''.

Not sure if it is a bug or a feature. 

The parser in /usr/src/bin/ksh/c_sh.c skips a field's leading whitespace
(if that whitespace is included in IFS) and ends the field at the first
IFS character it encounters after non-IFS characters. It does not skip
trailing whitespace. So, going with our example:

X=' A :  B::D'

The leading space is skipped, then A is accumulated in the first field,
and the field is closed by the space after A. The parser reads the next
character, which happens to be an IFS character (the colon). The parser
is now dealing with the second field. The accumulated second field is
empty at this point, and it is closed by that colon. This is how we get
an empty second field.
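
A minimal sketch of the logic walked through above, written as plain sh
functions so it can be run in any POSIX shell. This is not the code from
c_sh.c. The function names (emulate_read, is_delim, is_ws_delim) are
made up for illustration, and the handling of the last field (take the
remainder, strip trailing IFS whitespace) is an assumption inferred from
the outputs earlier in this thread.

```shell
#!/bin/sh
# Emulation of the field parser as described in this message.
# Hypothetical helper names; NOT the actual OpenBSD ksh source.

tab=$(printf '\t')

# Succeeds if the single character $1 occurs in $delims.
is_delim() {
	[ -n "$1" ] || return 1
	case $delims in
	*"$1"*) return 0 ;;
	esac
	return 1
}

# Succeeds if $1 is a space or tab that also occurs in $delims.
is_ws_delim() {
	case $1 in
	' ' | "$tab") is_delim "$1" ;;
	*) return 1 ;;
	esac
}

# usage: emulate_read DELIMS LINE NFIELDS
# Prints one line per field. Variables are global (POSIX sh has no local).
emulate_read() {
	delims=$1 rest=$2 nfields=$3 i=1
	while [ "$i" -le "$nfields" ]; do
		# Skip leading whitespace delimiters of this field.
		while c=${rest%"${rest#?}"}; is_ws_delim "$c"; do
			rest=${rest#?}
		done
		if [ "$i" -eq "$nfields" ]; then
			# Assumption: the last field takes the remainder,
			# minus trailing whitespace delimiters.
			while c=${rest#"${rest%?}"}; is_ws_delim "$c"; do
				rest=${rest%?}
			done
			field=$rest
		else
			# Accumulate until the first delimiter of any kind;
			# the delimiter is consumed, and whitespace after the
			# field is NOT skipped before looking for it.
			field=
			while [ -n "$rest" ]; do
				c=${rest%"${rest#?}"}
				rest=${rest#?}
				if is_delim "$c"; then
					break
				fi
				field=$field$c
			done
		fi
		printf '%s: "%s"\n' "$i" "$field"
		i=$((i + 1))
	done
}

emulate_read ' :' ' A :  B::D' 4
```

With four fields this prints "A", "", "B" and ":D", the same result
reported for OpenBSD ksh earlier in the thread. Called with 5 fields it
gives "A", "", "B", "" and "D".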

Please keep in mind that this logic is correct for an empty first field.
If we parse a string starting with whitespace followed by a
non-whitespace IFS character, we expect to get an empty first field
(and this is where all shells agree). OpenBSD ksh applies the same logic
to every field (second, third, and so on). That is why a sequence of one
or more whitespace characters followed by a non-whitespace IFS
character, anywhere in the string fed to the read builtin, creates an
empty field.

I guess other shells process the first field differently from the
remaining fields.
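
Since Matthew's examples show that parameter substitution splits this
string the way the manual describes, one possible workaround is to split
with an unquoted expansion and "set --" instead of read. This is only a
sketch: split_line is a made-up name, and unlike read it does not put
the remainder of the line into the last variable.

```shell
#!/bin/sh
# Split a line using field splitting from parameter substitution
# rather than the read builtin. Hypothetical helper, for illustration.
split_line() {
	(
		set -f          # no pathname expansion on the unquoted $1
		IFS=' :'
		set -- $1       # field splitting happens here
		n=1
		for f in "$@"; do
			printf '%s: "%s"\n' "$n" "$f"
			n=$((n + 1))
		done
	)
}

split_line ' A :  B::D'
```

The subshell keeps the IFS and set -f changes from leaking into the
caller. For the example string this yields the four fields "A", "B", ""
and "D".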

> 
> Matthew
> 
