Re: [Issue 8 drafts 0001564]: clariy on what (character/byte) strings pattern matching notation should work

2022-04-18 Thread Christoph Anton Mitterer via austin-group-l at The Open Group
Hey.


On Tue, 2022-04-19 at 01:52 +0100, Harald van Dijk wrote:
> Even I did not apply this to pattern matching. The
> lexical locale, the locale used for lexing, is only used for lexing,
> i.e. for recognising tokens, not for how those tokens are then
> interpreted later on. If locale comes into play for that, as it does
> in pattern matching, it is the then-current value of LC_CTYPE that
> comes into play, as it does in other shells.

So... how is it intended to work, as per the standard?

My understanding was that if during lexing it sees a pattern '*∈' it
would store the binary representation (as following from the lexical
locale, in which the shell script/input is in principle expected to be)
of these characters for the pattern.

But when the actual pattern matching is done, it would interpret that
binary representation with respect to the current locale (LC_CTYPE).
So if by then, the binary representation of the script's '*∈' would
mean '*z?' in the current locale, it would use that meaning as the
pattern.

Does that sound right?
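
To make the question concrete, here is a rough C sketch. It does not use real locales: two hand-rolled helpers (both names are made up for illustration) stand in for "interpret these bytes as UTF-8" and "interpret these bytes in a single-byte encoding". The UTF-8 check is simplified (lead-byte range and continuation bytes only), so treat it as an illustration rather than a validator.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: character count if every byte is one character,
 * as in a single-byte locale. */
size_t count_chars_single_byte(const unsigned char *s, size_t n)
{
    assert(s != NULL || n == 0);
    return n;
}

/* Hypothetical helper: character count if the bytes are read as UTF-8.
 * A byte that does not start a well-formed sequence is counted as one
 * unit on its own. Simplified: overlong forms and surrogates are not
 * rejected. */
size_t count_chars_utf8(const unsigned char *s, size_t n)
{
    size_t i = 0, count = 0;

    assert(s != NULL || n == 0);
    while (i < n) {
        size_t len = 1;

        if (s[i] >= 0xC2 && s[i] <= 0xDF)
            len = 2;
        else if (s[i] >= 0xE0 && s[i] <= 0xEF)
            len = 3;
        else if (s[i] >= 0xF0 && s[i] <= 0xF4)
            len = 4;

        if (i + len > n)
            len = 1;                    /* truncated sequence */
        for (size_t k = 1; k < len; k++) {
            if ((s[i + k] & 0xC0) != 0x80) {
                len = 1;                /* bad continuation byte */
                break;
            }
        }
        i += len;
        count++;
    }
    return count;
}
```

The bytes stored for '*∈' in a UTF-8 script are 2A E2 88 88. Read as UTF-8 that is two characters; read in a single-byte encoding it is four, so the same stored pattern would mean something different at match time.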


'∈' not being a member of the portable character set would make it,
AFAIU, in principle valid for being mapped to `z?` in another locale -
while changing the mapping of '*' would be possible, but according to
POSIX produce undefined results.

("If the encoded values associated with each member of the portable
character set are not invariant across all locales supported by the
implementation, if an application uses any pair of locales where the
character encodings differ, or accesses data from an application using
a locale which has different encodings from the locales used by the
application, the results are unspecified.")


> As for future directions, no opinion on that from me.

That would IMO only make sense if, e.g., there was only one shell, and
not even a well-maintained one, that behaves differently from all others.

The "future directions" would indicate to possible new implementers
where things may go and what they should do.
10 years later, one could re-visit the topic, and if that one shell
that behaved differently from all others had died in the meantime, and
any possible new ones followed the future directions... one could
standardise it. If not, one could simply leave everything as is and no
one would get into trouble.

Whether such an approach actually works out as intended is of course not
guaranteed.


> I would not think this should be a special case: «${foo%.}» should
> strip a trailing «.» in exactly those cases where the shell considers
> foo to match the pattern «*.». However, I can see value in doing some
> extra tests to verify that this matches what shells do.

Remember that it might not be enough to check whether such shells strip
correctly when one has the case
  
but also the case where one or more trailing bytes of the first group
and the bytes of the valid character form a new valid character.

While this wouldn't be possible if '.' is the character (because of
its special properties)... it can happen with other characters in some
special locales.


> Very well, I will post tests and test results as soon I can make the 
> time for it.

Thanks.


FYI: I think the outcome will also affect the current proposal for
#1561:
https://www.austingroupbugs.net/view.php?id=1561#c5795

specifically the part:
On page 2321 line 74857 section 2.6.2 Parameter Expansion, change:


Thanks,
Chris.



Re: [Issue 8 drafts 0001564]: clariy on what (character/byte) strings pattern matching notation should work

2022-04-18 Thread Harald van Dijk via austin-group-l at The Open Group

On 15/04/2022 04:57, Christoph Anton Mitterer wrote:

On Fri, 2022-04-15 at 00:44 +0100, Harald van Dijk via austin-group-l
at The Open Group wrote:

Hmm, I would.


I like that :-D This would have been the preferred alternative I
asked to have looked at in the ticket.




Shells are not in agreement on whether such single bytes can be matched
with [...], nor in those shells where they can be, whether multiple
bracket expressions can be used to match the individual bytes of a valid
multi-byte character.

The cases with [...] only come up when scripts themselves use patterns
that are not valid character strings


You mean in the lexical locale?


I do not, but interesting question. I am one of the few, if not the only,
shell authors that actually implemented the "Changing the value of LC_CTYPE
after the shell has started shall not affect the lexical processing of
shell commands in the current shell execution environment or its
subshells" rule. Even I did not apply this to pattern matching. The
lexical locale, the locale used for lexing, is only used for lexing,
i.e. for recognising tokens, not for how those tokens are then
interpreted later on. If locale comes into play for that, as it does in
pattern matching, it is the then-current value of LC_CTYPE that comes
into play, as it does in other shells.



they are unlikely to affect existing scripts and I imagine there is not
much harm in leaving those unspecified.


It should however be clearly described that behaviour in this field is
undefined, perhaps with some "future directions" that this might change
some day.


I prefer explicit over implicit as well myself. Perhaps it does not even 
need to be undefined though, perhaps unspecified with a few limited 
options is good enough. I am not sure at this time whether that is feasible.


As for future directions, no opinion on that from me.


The cases with * and ? do come up in existing scripts, but if shells are
in agreement as they appear to be, there is no need to coordinate with
shell authors on whether they would be willing to change their
implementations, it is possible to change POSIX to describe the shells'
current behaviour.


Well but it's not only * and ? ... it's also a single character
matching that character in a byte string that contains bytes or
sequences thereof which do not form any valid character ... both before
or after that character to be matched.


Yes, I did mention those earlier on in my message but forgot to repeat 
it here. It's where shells also appear to be in agreement, except in the 
same corner case that also applies to [...] where an invalid byte in a 
pattern is used to match part of a valid character in the string.



And since pattern matching notation isn't just used for matching alone,
but e.g. also for string manipulation in parameter expansion (e.g.
"${foo%.}" case)... these shells would also need to agree how to handle
that, wouldn't they?


I would not think this should be a special case: «${foo%.}» should strip 
a trailing «.» in exactly those cases where the shell considers foo to 
match the pattern «*.». However, I can see value in doing some extra 
tests to verify that this matches what shells do.
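
The rule stated above can be sketched in C with fnmatch(), which matches according to the process's current LC_CTYPE. The helper name strip_trailing_dot() is made up for illustration. Whether a given shell's own matcher agrees with fnmatch() on strings containing invalid byte sequences is exactly what the tests would have to show, so this is a sketch of the rule and not of any shell's behaviour.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fnmatch.h>
#include <string.h>

/* Hypothetical helper mirroring ${foo%.}: remove a trailing '.' from s
 * in exactly those cases where s matches the pattern "*." in the
 * current locale. Returns 1 if a '.' was removed, 0 otherwise. */
int strip_trailing_dot(char *s)
{
    size_t n;

    assert(s != NULL);
    if (fnmatch("*.", s, 0) != 0)
        return 0;           /* no match (or matching error): leave s alone */

    /* '.' is in the portable character set and is a single byte, so a
     * successful match means the last byte of s is that '.'. */
    n = strlen(s);
    s[n - 1] = '\0';
    return 1;
}
```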



If there is interest in getting this standardised, I can spend some more
time on creating some hopefully comprehensive tests for this to confirm
in what cases shells agree and disagree, and use that as a basis for
proposing wording to cover it.


I'd love to see that and if you'd actually do so, I'd kindly ask
Geoff to defer any changes in the ticket #1564 of mine, until it can be
said whether it might be possible to get that standardised.


Very well, I will post tests and test results as soon I can make the 
time for it.


Cheers,
Harald van Dijk



Re: How do I get the buffered bytes in a FILE *?

2022-04-18 Thread Rob Landley via austin-group-l at The Open Group
Sigh, spam filter impounded this. (Gotta move off gmail...)

On 4/12/22 04:42, Geoff Clare via austin-group-l at The Open Group wrote:
> Rob Landley wrote, on 11 Apr 2022:
>>
>> A bunch of protocols (git, http, mbox, etc) start with lines of data
>> followed by a block of data, so it's natural to want to call getline()
>> and then handle the data block. But getline() takes a FILE * and things
>> like zlib and sendfile() take an integer file descriptor.
>>
>> Posix lets me get the file descriptor out of a FILE * with fileno(), but the
>> point of FILE * is to readahead and buffer. How do I get the buffered data
>> out without reading more from the file descriptor?
>> 
>> I can't find a portable way to do this?
> 
> I tried this sequence of calls on a few systems, and it worked in the
> way you would expect:
> 
> fgets(buf, sizeof buf, fp);
> int fd = dup(fileno(fp));
> close(fileno(fp));
> while ((ret = fread(buf, 1, sizeof buf, fp)) > 0) { ... }
> read(fd, buf, sizeof buf);
> 
> It relies on fread() not detecting EBADF until it tries to read more
> data from the underlying fd.

Hmmm. That's an interesting approach.
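
For reference, a rough C sketch of the quoted sequence wrapped into one function. It rests on the same assumption the quoted text names: fread() returns already-buffered data before it notices EBADF. The name drain_and_detach() is made up for illustration; nothing like it exists in POSIX.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: copy whatever stdio has already buffered for fp
 * into out (at most cap bytes), close fp, and return a duplicate
 * descriptor positioned where stdio's last read() left off.
 * Returns the new fd, or -1 if dup() fails (fp is then left untouched).
 * *got receives the number of buffered bytes copied.
 * cap should be at least as large as fp's buffer; buffered data that
 * does not fit is lost. */
int drain_and_detach(FILE *fp, char *out, size_t cap, size_t *got)
{
    size_t total = 0, n;
    int fd;

    assert(fp != NULL && out != NULL && got != NULL);

    fd = dup(fileno(fp));
    if (fd < 0)
        return -1;

    /* Invalidate the stream's own descriptor so that stdio cannot read
     * ahead any further. This also drops fcntl() locks on the file. */
    close(fileno(fp));

    /* Assumption: this yields buffered bytes and then fails with EBADF. */
    while (total < cap && (n = fread(out + total, 1, cap - total, fp)) > 0)
        total += n;

    /* Close the stream right away, before the old descriptor number can
     * be reused; the close() inside fclose() is expected to fail. */
    (void)fclose(fp);

    *got = total;
    return fd;
}
```

After the call, the caller would process the got bytes in out and then continue with read(), sendfile() or similar on the returned descriptor.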

> It has some caveats:
> 
> 1. It needs a file descriptor to be available.

Understood, but acceptable.

> 2. The close() will remove any fcntl() locks that the calling process
>    holds for the file.

Fine.

> 3. In a multi-threaded process it has the usual problem around fd
>    inheritance, but that's addressed in Issue 8 with the addition
>    of dup3().

Threads break everything anyway, but you could dup2(/dev/null) if you cared
about them.

> Also, for the standard to require it to work, I think we would need to
> tweak the EBADF error for fgetc() (which fread() references) to say:
> 
> The file descriptor underlying stream is not a valid file
> descriptor open for reading and there is no buffered data
> available to be returned.
> 
> (adding the "and ..." part).

Sounds reasonable. I'll give this a try.

Thanks,

Rob



Re: 答复: How do I get the buffered bytes in a FILE *?

2022-04-18 Thread Chet Ramey via austin-group-l at The Open Group

On 4/18/22 12:53 AM, Rob Landley wrote:

On 4/17/22 18:10, Chet Ramey wrote:

On 4/16/22 2:58 PM, Rob Landley via austin-group-l at The Open Group wrote:

Q) "How do I switch from FILE * to fd via fileno() without losing data."

A) "Don't use FILE *"

That's not the question I asked?


The answer is correct, but incomplete. The missing piece is that if you
want to use FILE *, the operation you want, and the information you need to
implement it, are not part of the public API.


Which is a fixable problem.


Sure, everything's fixable. It's not what you asked, though.




Other than using a strategy like Geoff suggested early on, or trying
something like setvbuf to turn off buffering on the FILE * completely, the
buffer associated with a FILE * and the indexes into it that say how much
data you've consumed from the underlying source are opaque.
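
A minimal sketch of the setvbuf route, for comparison. The helper name read_header_unbuffered() is made up for illustration. The assumption, which I believe holds on common implementations but which the standard does not spell out, is that an unbuffered stream does not read past the bytes it returns; the cost is that the line is fetched a byte at a time.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: read one line from fp with buffering turned off,
 * so the underlying descriptor should end up positioned just past the
 * newline. Must be called before any other operation on fp, because
 * setvbuf() is only defined at that point.
 * Returns fileno(fp) on success, -1 on failure. */
int read_header_unbuffered(FILE *fp, char *line, int cap)
{
    assert(fp != NULL && line != NULL && cap > 0);

    if (setvbuf(fp, NULL, _IONBF, 0) != 0)
        return -1;
    if (fgets(line, cap, fp) == NULL)
        return -1;
    return fileno(fp);
}
```

Once the header lines are done, the returned descriptor can be handed to read() or sendfile() directly, with nothing stranded inside the FILE *.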


https://github.com/coreutils/gnulib/blob/master/lib/freadahead.c


So the gnulib folks looked at a bunch of different stdio implementations
and used non-public (or at least non-standard) portions of the
implementation to augment the stdio API.

If that's what you want to do, propose adding freadahead to the standard.

Or reimplement the gnulib work and accept that the stdio implementation
can potentially change out from under you. Current POSIX provides no help
here.



If you want to
manipulate that information, or expose it to a caller, you can't use FILE *
(or, if you want a direct answer, "you can't").


The if/else staircase in m4 and gnulib and so on says I can.


Not in a way that protects you against changes to one of the underlying
stdio implementations. And isn't that the point? You can always offer that
functionality if you have stable access to stdio internals, but it's not in
the standard.


I was just wondering if there was a _clean_ way to do it. 


OK. Do you think you've gotten an answer to that?



The C99 guys point out they haven't got file descriptors and thus this would
logically belong in posix, for the same reason fileno() does. "But FILE *
doesn't have a way to fetch the file descriptor" was answered by adding
fileno(). That is ALSO grabbing an integer out of the guts of FILE *.


Sure. And adding that to the standard would require the usual things, for
which there's a process.


This exists. It would be nice if it got standardized.


Maybe it would. But that's a different question.


--
``The lyf so short, the craft so long to lerne.'' - Chaucer
 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    c...@case.edu    http://tiswww.cwru.edu/~chet/