On 4/11/22 14:20, Another Person Who Did Not CC The List wrote:
> On Mon, Apr 11, 2022 at 01:48:06PM -0500, Rob Landley via austin-group-l at 
> The Open Group wrote:
>> A bunch of protocols (git, http, mbox, etc) start with lines of data followed
>> by a block of data, so it's natural to want to call getline() and then handle
>> the data block. But getline() takes a FILE * and things like zlib and
>> sendfile() take an integer file descriptor.
>> 
>> Posix lets me get the file descriptor out of a FILE * with fileno(), but the
>> point of FILE * is to readahead and buffer. How do I get the buffered data
>> out without reading more from the file descriptor?
>> 
>> I can't find a portable way to do this?
> 
> Can you forgo using stdio entirely and just use open? Most
> APIs that take a file descriptor are generally prepared to
> do large I/O requests rather than character-by-character I/O
> anyway. Modern kernels will buffer the file in the file/buffer
> cache.

That's what I did for years: toybox had a get_line() function, introduced in
2007, that did byte-at-a-time reads with realloc() instead of calling getline():

https://github.com/landley/toybox/blob/bc07865a504c/lib/lib.c#L594

But the Android guys complained it was really slow and ugly, which it was. (Sure,
Linux buffers it, but a system call per byte is still nuts, drives the scheduler
batty, makes your strace output impossible to follow...) Plus I was
reimplementing a libc API, which I try not to do without a really good reason, so
when getline() was added to POSIX in 2008 I switched MOST users over to the new
POSIX API...
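
For reference, a minimal sketch of what a byte-at-a-time line reader looks like.
This is not the toybox code linked above; slow_get_line() is a hypothetical name
and the growth policy is an assumption. The point is only that one read() per
byte never consumes anything past the newline, so the descriptor can be handed
to something else afterwards:

```c
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper (not toybox's get_line()): read one line from fd using
 * one read() call per byte, so no data after the newline is consumed.
 * Returns a malloc()ed string without the newline, or NULL on EOF with no
 * data, read error, or allocation failure. */
char *slow_get_line(int fd)
{
  char *line = NULL, c;
  size_t len = 0, cap = 0;

  for (;;) {
    ssize_t n = read(fd, &c, 1);

    if (n < 0) {
      free(line);
      return NULL;
    }
    if (n == 0) break;                 /* EOF */

    /* Keep room for this byte plus the terminating NUL. */
    if (len + 2 > cap) {
      char *tmp;

      cap = cap ? cap * 2 : 64;
      tmp = realloc(line, cap);
      if (!tmp) {
        free(line);
        return NULL;
      }
      line = tmp;
    }
    if (c == '\n') {
      line[len] = 0;
      return line;
    }
    line[len++] = c;
  }

  /* EOF: return the unterminated last line if there was one. */
  if (line) line[len] = 0;
  return line;
}
```

After slow_get_line(fd) returns, the next read(fd, ...) or sendfile() starts at
the first byte after the newline, at the cost of one system call per byte.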

But the byte-at-a-time get_line() can't QUITE go away:

https://github.com/landley/toybox/commit/15cbb92dffc8

(Hands off to sendfile() when it's out of hunks...)

And it keeps wanting to come back:

https://github.com/landley/toybox/commit/601828982a53

(Adding gzip/deflate support means http 1.1 data lines are followed by gzipped
data payload.)

Last week a contributor implementing a subset of "git" ran into the same
problem...

http://lists.landley.net/pipermail/toybox-landley.net/2022-April/012817.html

(Internally git's file format is a bunch of "keyword: value" text lines followed
by payload.)

It's come up other times over the years. Perennial problem. Unix has been doing
"keyword:value lines followed by payload" since the days of mbox files. Sure
http bodies started out as text, but they didn't stay that way...

It seems like a simple question to ask a FILE *: "what data do you have
buffered?" The reply "we're not capable of answering that question, therefore
the programmer shouldn't ever want to ask it" seems... fixable?
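
For contrast, a sketch of the fallback that is portable today: stay in stdio
and copy the payload out with fread(), which returns the already-buffered bytes
before reading more from the descriptor. copy_rest() is a hypothetical name.
This loses no data, but every byte goes through a userspace buffer, so it does
not help when the consumer wants the file descriptor itself (sendfile(), zlib's
fd-based calls):

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: drain everything remaining in `in` to outfd.
 * fread() hands back stdio's buffered bytes first, then keeps reading from
 * the underlying descriptor until EOF.
 * Returns the number of bytes copied, or -1 on error. */
long copy_rest(FILE *in, int outfd)
{
  char buf[4096];
  long total = 0;
  size_t n;

  while ((n = fread(buf, 1, sizeof(buf), in)) > 0) {
    size_t off = 0;

    /* write() may be partial, so loop until this chunk is out. */
    while (off < n) {
      ssize_t w = write(outfd, buf + off, n - off);

      if (w < 0) return -1;
      off += (size_t)w;
    }
    total += (long)n;
  }

  return ferror(in) ? -1 : total;
}
```

Typical use would be getline() calls on the FILE * until the header lines end,
then copy_rest(fp, outfd) for the payload.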

Rob

(Note that if I implement my own get_line() with extra leftover buffer handed
off between line reads to avoid the byte-at-a-time inefficiency, I've reinvented
FILE *. And none of the above use cases guarantee seekable input, so it can't put
the extra data it read BACK into the file descriptor.)
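
A rough sketch of that "leftover buffer" approach, to show how little it takes
and how much it resembles a cut-down FILE *. Every name here (struct linebuf,
lb_getline(), lb_leftover()) is hypothetical, and the fixed 4096-byte buffer is
an arbitrary assumption. The one thing it has that stdio does not is
lb_leftover(), which answers "what have you read but not consumed?":

```c
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical minimal buffered reader; all names are illustrative. */
struct linebuf {
  int fd;
  size_t start, end;     /* unconsumed bytes are buf[start..end) */
  char buf[4096];
};

/* Return the next line (without its newline) as a malloc()ed string.
 * On EOF or read error, returns whatever partial line was collected,
 * or NULL if there was none. Returns NULL on allocation failure. */
char *lb_getline(struct linebuf *lb)
{
  char *line = NULL;
  size_t len = 0;

  for (;;) {
    size_t avail, take;
    char *nl, *tmp;

    /* Refill only when the buffer is empty: one read() per block. */
    if (lb->start == lb->end) {
      ssize_t n = read(lb->fd, lb->buf, sizeof(lb->buf));

      if (n <= 0) return line;
      lb->start = 0;
      lb->end = (size_t)n;
    }

    avail = lb->end - lb->start;
    nl = memchr(lb->buf + lb->start, '\n', avail);
    take = nl ? (size_t)(nl - (lb->buf + lb->start)) : avail;

    tmp = realloc(line, len + take + 1);
    if (!tmp) {
      free(line);
      return NULL;
    }
    line = tmp;
    memcpy(line + len, lb->buf + lb->start, take);
    len += take;
    line[len] = 0;

    /* Consume the copied bytes, plus the newline if one was found. */
    lb->start += take + (nl != NULL);
    if (nl) return line;
  }
}

/* Expose the bytes already read from fd but not yet consumed, so the caller
 * can process them before handing fd to something else. */
size_t lb_leftover(struct linebuf *lb, char **data)
{
  *data = lb->buf + lb->start;
  return lb->end - lb->start;
}
```

The caller reads header lines with lb_getline(), feeds the lb_leftover() bytes
to the payload consumer, then passes lb->fd on for the rest of the payload.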
