Re: [Issue 8 drafts 0001564]: clarify on what (character/byte) strings pattern matching notation should work
Hey.

On Tue, 2022-04-19 at 01:52 +0100, Harald van Dijk wrote:
> Even I did not apply this to pattern matching. The lexical locale,
> the locale used for lexing, is only used for lexing, i.e. for
> recognising tokens, not for how those tokens are then interpreted
> later on. If locale comes into play for that, as it does in pattern
> matching, it is the then-current value of LC_CTYPE that comes into
> play, as it does in other shells.

So... how is it (as per the standard) intended to work?

My understanding was that if, during lexing, the shell sees a pattern
'*∈', it would store the binary representation of these characters (as
following from the lexical locale, in which the shell script/input is
in principle expected to be) for the pattern.
But when the actual pattern matching is done, it would interpret that
binary representation with respect to the current locale (LC_CTYPE).

So if by then the binary representation of the script's '*∈' would
mean '*z?' in the current locale, it would use that meaning as the
pattern.

Does that sound right?

'∈' not being a member of the portable character set would make it,
AFAIU, in principle valid for being mapped to 'z?' in another locale,
while changing the mapping of '*' would be possible but, according to
POSIX, produce unspecified results.
("If the encoded values associated with each member of the portable
character set are not invariant across all locales supported by the
implementation, if an application uses any pair of locales where the
character encodings differ, or accesses data from an application using
a locale which has different encodings from the locales used by the
application, the results are unspecified.")

> As for future directions, no opinion on that from me.

That would IMO only make sense if e.g. there was only one, and not even
well maintained, shell that behaves differently from all others.
The "future directions" would indicate to possible new implementers
where things may go and what they should do.
10 years later, one could revisit the topic, and if that one shell that
behaved differently from all others had died in the meantime, and any
possible new ones followed the future directions... one could
standardise it.
If not, one could simply leave everything as is and no one would get
into trouble.

Whether such an approach actually works out as intended is of course
not guaranteed.

> I would not think this should be a special case: «${foo%.}» should
> strip a trailing «.» in exactly those cases where the shell considers
> foo to match the pattern «*.». However, I can see value in doing some
> extra tests to verify that this matches what shells do.

Remember that it might not be enough to check whether such shells strip
correctly when one has the case of a group of invalid bytes followed by
a valid character; one also has to check the case where one or more
trailing bytes of the first group and the bytes of the valid character
form a new valid character.
While this wouldn't be possible if '.' is the character (because of its
special properties)... it can happen with other characters in some
special locales.

> Very well, I will post tests and test results as soon as I can make
> the time for it.

Thanks.

FYI: I think the outcome will also affect the current proposal for
#1561:
https://www.austingroupbugs.net/view.php?id=1561#c5795
specifically the part:
  On page 2321 line 74857 section 2.6.2 Parameter Expansion, change:

Thanks,
Chris.
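The trailing-byte case described above can be made concrete with a toy decoder. The sketch below assumes a hypothetical double-byte encoding (not the charmap of any real locale) in which a byte in 0x81..0xFE followed by a byte in 0x40..0x7E forms one character, and every other byte counts as a unit on its own. The names `toy_char_len` and `ends_with_char` are made up for this illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical toy encoding, for illustration only:
 *   0x00..0x7F                 single-byte character
 *   0x81..0xFE + 0x40..0x7E    double-byte character
 * Any other byte is invalid and treated here as a unit of length 1. */
static size_t toy_char_len(const unsigned char *s, size_t n)
{
    if (n >= 2 && s[0] >= 0x81 && s[0] <= 0xFE &&
        s[1] >= 0x40 && s[1] <= 0x7E)
        return 2;
    return 1;
}

/* Returns 1 if the last decoded unit of str is exactly the single byte c,
 * 0 otherwise. Decoding always starts at the beginning of the string. */
int ends_with_char(const char *str, char c)
{
    const unsigned char *s = (const unsigned char *)str;
    size_t n = strlen(str), i = 0, last = 0, len = 0;

    while (i < n) {
        len = toy_char_len(s + i, n - i);
        last = i;
        i += len;
    }
    return n > 0 && len == 1 && s[last] == (unsigned char)c;
}
```

Under this toy encoding, `ends_with_char("foo\x81" "A", 'A')` is 0 even though the final byte is 0x41: the stray 0x81 absorbs the 'A' into a new valid character. That is the situation a test of `${foo%A}` would have to cover in addition to the plain case.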
Re: [Issue 8 drafts 0001564]: clarify on what (character/byte) strings pattern matching notation should work
On 15/04/2022 04:57, Christoph Anton Mitterer wrote:
> On Fri, 2022-04-15 at 00:44 +0100, Harald van Dijk via austin-group-l
> at The Open Group wrote:
>> Hmm, I would.
>
> I like that :-D
> This would have been the preferred alternative I've asked for to look
> at, in the ticket.
>
>> Shells are not in agreement on whether such single bytes can be
>> matched with [...], nor in those shells where they can be, whether
>> multiple bracket expressions can be used to match the individual
>> bytes of a valid multi-byte character. The cases with [...] only
>> come up when scripts themselves use patterns that are not valid
>> character strings
>
> You mean in the lexical locale?

I do not, but interesting question. I am one of the few, if not the
only, shell authors that actually implemented the "Changing the value
of LC_CTYPE after the shell has started shall not affect the lexical
processing of shell commands in the current shell execution environment
or its subshells" rule. Even I did not apply this to pattern matching.
The lexical locale, the locale used for lexing, is only used for
lexing, i.e. for recognising tokens, not for how those tokens are then
interpreted later on. If locale comes into play for that, as it does in
pattern matching, it is the then-current value of LC_CTYPE that comes
into play, as it does in other shells.

>> they are unlikely to affect existing scripts and I imagine there is
>> not much harm in leaving those unspecified.
>
> It should however be clearly described that behaviour in this field
> is undefined, perhaps with some "future directions" that this might
> change some day.

I prefer explicit over implicit as well myself. Perhaps it does not
even need to be undefined though, perhaps unspecified with a few
limited options is good enough. I am not sure at this time whether that
is feasible. As for future directions, no opinion on that from me.

>> The cases with * and ? do come up in existing scripts, but if shells
>> are in agreement as they appear to be, there is no need to
>> coordinate with shell authors on whether they would be willing to
>> change their implementations, it is possible to change POSIX to
>> describe the shells' current behaviour.
>
> Well but it's not only * and ? ... it's also a single character
> matching that character in a byte string that contains bytes or
> sequences thereof which do not form any valid character ... both
> before or after that character to be matched.

Yes, I did mention those earlier on in my message but forgot to repeat
it here. It's where shells also appear to be in agreement, except in
the same corner case that also applies to [...] where an invalid byte
in a pattern is used to match part of a valid character in the string.

> And since pattern matching notation isn't just used for matching
> alone, but e.g. also for string manipulation in parameter expansion
> (e.g. "${foo%.}" case)... these shells would also need to agree how
> to handle that, wouldn't they?

I would not think this should be a special case: «${foo%.}» should
strip a trailing «.» in exactly those cases where the shell considers
foo to match the pattern «*.». However, I can see value in doing some
extra tests to verify that this matches what shells do.

>> If there is interest in getting this standardised, I can spend some
>> more time on creating some hopefully comprehensive tests for this to
>> confirm in what cases shells agree and disagree, and use that as a
>> basis for proposing wording to cover it.
>
> I'd love to see that and if you'd actually do so, I'd kindly ask
> Geoff to defer any changes in the ticket #1564 of mine, until it can
> be said whether it might be possible to get that standardised.

Very well, I will post tests and test results as soon as I can make the
time for it.

Cheers,
Harald van Dijk
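The rule that «${foo%.}» strips exactly when foo matches «*.» can be sketched in C with POSIX fnmatch(). This is only an illustration of the equivalence, run in the default "C" locale of a C program (no setlocale() call), and says nothing about how any particular shell implements it. `strip_trailing_dot` is a hypothetical helper name.

```c
#include <fnmatch.h>
#include <string.h>

/* Hypothetical helper: emulate ${foo%.} on a mutable buffer.
 * Removes one trailing '.' exactly when the whole string matches "*.".
 * Returns 1 if a byte was removed, 0 otherwise.
 * Assumption: the "C" locale, where every byte is one character. */
int strip_trailing_dot(char *foo)
{
    size_t n = strlen(foo);

    if (fnmatch("*.", foo, 0) != 0)
        return 0;          /* no match: leave the string alone */
    foo[n - 1] = '\0';     /* a match implies n >= 1 and foo[n-1] == '.' */
    return 1;
}
```

In a multi-byte locale the comment on the last assignment no longer holds automatically, which is the reason the extra tests discussed in this thread are of interest.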
Re: How do I get the buffered bytes in a FILE *?
Sigh, spam filter impounded this. (Gotta move off gmail...)

On 4/12/22 04:42, Geoff Clare via austin-group-l at The Open Group wrote:
> Rob Landley wrote, on 11 Apr 2022:
>>
>> A bunch of protocols (git, http, mbox, etc) start with lines of data
>> followed by a block of data, so it's natural to want to call
>> getline() and then handle the data block. But getline() takes a
>> FILE * and things like zlib and sendfile() take an integer file
>> descriptor.
>>
>> Posix lets me get the file descriptor out of a FILE * with fileno(),
>> but the point of FILE * is to readahead and buffer. How do I get the
>> buffered data out without reading more from the file descriptor?
>>
>> I can't find a portable way to do this?
>
> I tried this sequence of calls on a few systems, and it worked in the
> way you would expect:
>
>     fgets(buf, sizeof buf, fp);
>     int fd = dup(fileno(fp));
>     close(fileno(fp));
>     while ((ret = fread(buf, 1, sizeof buf, fp)) > 0) { ... }
>     read(fd, buf, sizeof buf);
>
> It relies on fread() not detecting EBADF until it tries to read more
> data from the underlying fd.

Hmmm. That's an interesting approach.

> It has some caveats:
>
> 1. It needs a file descriptor to be available.

Understood, but acceptable.

> 2. The close() will remove any fcntl() locks that the calling process
>    holds for the file.

Fine.

> 3. In a multi-threaded process it has the usual problem around fd
>    inheritance, but that's addressed in Issue 8 with the addition
>    of dup3().

Threads break everything anyway, but you could dup2(/dev/null) if you
cared about them.

> Also, for the standard to require it to work, I think we would need to
> tweak the EBADF error for fgetc() (which fread() references) to say:
>
>     The file descriptor underlying stream is not a valid file
>     descriptor open for reading and there is no buffered data
>     available to be returned.
>
> (adding the "and ..." part).

Sounds reasonable. I'll give this a try.

Thanks,

Rob
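For reference, the quoted sequence of calls can be wrapped into a function roughly as follows. This is a sketch only: `rest_after_line` and its parameters are hypothetical names, and it relies on the same assumption stated in the quoted message, namely that fread() returns already-buffered data before it notices EBADF on the closed descriptor. Current POSIX does not require that.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <unistd.h>

/* Read one line from fp into line, then copy everything that follows it
 * into out (at most cap bytes): first the bytes stdio has already
 * buffered, then the rest straight from the descriptor.
 * Returns the number of bytes stored in out, or -1 on error.
 * Assumption: fread() hands back buffered data before reporting EBADF. */
ssize_t rest_after_line(FILE *fp, char *line, int line_cap,
                        char *out, size_t cap)
{
    size_t used = 0, got;
    ssize_t r;
    int fd;

    if (fgets(line, line_cap, fp) == NULL)
        return -1;

    fd = dup(fileno(fp));   /* keep the open file description alive */
    if (fd < 0)
        return -1;
    close(fileno(fp));      /* stdio can no longer refill its buffer */

    /* Drain what stdio already holds; the refill attempt fails. */
    while (used < cap && (got = fread(out + used, 1, cap - used, fp)) > 0)
        used += got;

    /* Continue with unbuffered reads from the duplicate descriptor. */
    while (used < cap && (r = read(fd, out + used, cap - used)) > 0)
        used += (size_t)r;

    close(fd);
    return (ssize_t)used;
}
```

After the call, fp still wraps a descriptor number that has been closed, so the only sensible thing left to do with it is fclose(), and that should happen before anything else in the process opens a new descriptor. The caveats listed in the quoted message (a spare descriptor, lost fcntl() locks, threads) apply unchanged.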
Re: Re: How do I get the buffered bytes in a FILE *?
On 4/18/22 12:53 AM, Rob Landley wrote:
> On 4/17/22 18:10, Chet Ramey wrote:
>> On 4/16/22 2:58 PM, Rob Landley via austin-group-l at The Open Group
>> wrote:
>>> Q) "How do I switch from FILE * to fd via fileno() without losing
>>> data."
>>>
>>> A) "Don't use FILE *"
>>>
>>> That's not the question I asked?
>>
>> The answer is correct, but incomplete. The missing piece is that if
>> you want to use FILE *, the operation you want, and the information
>> you need to implement it, are not part of the public API.
>
> Which is a fixable problem.

Sure, everything's fixable. It's not what you asked, though. Other than
using a strategy like Geoff suggested early on, or trying something
like setvbuf to turn off buffering on the FILE * completely, the buffer
associated with a FILE * and the indexes into it that say how much data
you've consumed from the underlying source are opaque.

> https://github.com/coreutils/gnulib/blob/master/lib/freadahead.c

So the gnulib folks looked at a bunch of different stdio
implementations and used non-public (or at least non-standard) portions
of the implementation to augment the stdio API. If that's what you want
to do, propose adding freadahead to the standard. Or reimplement the
gnulib work and accept that the stdio implementation can potentially
change out from under you. Current POSIX provides no help here.

>> If you want to manipulate that information, or expose it to a
>> caller, you can't use FILE * (or, if you want a direct answer, "you
>> can't").
>
> The if/else staircase in m4 and gnulib and so on says I can.

Not in a way that protects you against changes to one of the underlying
stdio implementations. And isn't that the point? You can always offer
that functionality if you have stable access to stdio internals, but
it's not in the standard.

> I was just wondering if there was a _clean_ way to do it.

OK. Do you think you've gotten an answer to that?

> The C99 guys point out they haven't got file descriptors and thus
> this would logically belong in posix, for the same reason fileno()
> does.
>
> "But FILE * doesn't have a way to fetch the file descriptor" was
> answered by adding fileno(). That is ALSO grabbing an integer out of
> the guts of FILE *.

Sure. And adding that to the standard would require the usual things,
for which there's a process.

> This exists. It would be nice if it got standardized.

Maybe it would. But that's a different question.

--
``The lyf so short, the craft so long to lerne.'' - Chaucer
``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    c...@case.edu    http://tiswww.cwru.edu/~chet/
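The setvbuf alternative mentioned in this message could look roughly like the sketch below. `line_then_fd` is a hypothetical name. The sketch assumes that an unbuffered stream performs no read-ahead, so that fgets() consumes nothing past the newline; that is how common stdio implementations behave, but the standard does not promise it, and it costs one read() per byte while the line is being read.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <unistd.h>

/* Make fp unbuffered before its first read, read one line through
 * stdio, then read what follows directly from the descriptor.
 * Returns the byte count from read(), or -1 on error.
 * Assumption: with _IONBF the stream does not read ahead. */
ssize_t line_then_fd(FILE *fp, char *line, int line_cap,
                     char *out, size_t cap)
{
    /* setvbuf() must come before any other operation on the stream. */
    if (setvbuf(fp, NULL, _IONBF, 0) != 0)
        return -1;
    if (fgets(line, line_cap, fp) == NULL)
        return -1;
    return read(fileno(fp), out, cap);
}
```

Unlike the dup()/close() sequence, this leaves the stream usable afterwards, but it only works if the caller controls the stream from the moment it is opened.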