On 4/11/22 14:20, Another Person Who Did Not CC The List wrote:
> On Mon, Apr 11, 2022 at 01:48:06PM -0500, Rob Landley via austin-group-l at
> The Open Group wrote:
>> A bunch of protocols (git, http, mbox, etc) start with lines of data
>> followed by a block of data, so it's natural to want to call getline() and
>> then handle the data block. But getline() takes a FILE * and things like
>> zlib and sendfile() take an integer file descriptor.
>>
>> Posix lets me get the file descriptor out of a FILE * with fileno(), but the
>> point of FILE * is to readahead and buffer. How do I get the buffered data
>> out without reading more from the file descriptor?
>>
>> I can't find a portable way to do this?
>
> Can you forgo using stdio entirely and just use open? Most
> API's that take a file descriptor are generally prepared to
> do large I/O requests rather than character-by-character I/O
> anyway. Modern kernels will buffer the file in the file/buffer
> cache.

That's what I did for years (toybox had a get_line() function instead of
getline() that did byte at a time reads with realloc()), introduced in 2007:

https://github.com/landley/toybox/blob/bc07865a504c/lib/lib.c#L594

But the Android guys complained it was really slow and ugly, which it was.
(Sure Linux buffers it, but a system call per byte is still nuts, drives the
scheduler batty, makes your strace output impossible to follow...)

Plus I was reimplementing a libc API, which I try not to do without a really
good reason, so when getline() was added to posix in 2008 I switched over MOST
users to the new posix api...

But the byte-at-a-time get_line() can't QUITE go away:

https://github.com/landley/toybox/commit/15cbb92dffc8

(Hands off to sendfile() when it's out of hunks...)

And it keeps wanting to come back:

https://github.com/landley/toybox/commit/601828982a53

(Adding gzip/deflate support means http 1.1 data lines are followed by a
gzipped data payload.)

Last week a contributor implementing a subset of "git" ran into the same
problem...

http://lists.landley.net/pipermail/toybox-landley.net/2022-April/012817.html

(Internally git's file format is a bunch of "keyword: value" text lines
followed by payload.)

It's come up other times over the years. Perennial problem. Unix has been doing
"keyword:value lines followed by payload" since the days of mbox files. Sure
http bodies started out as text, but they didn't stay that way...

It seems like a simple question to ask a FILE *: "what data do you have
buffered?" The reply "we're not capable of answering that question, therefore
the programmer shouldn't ever want to ask it" seems... fixable?

Rob

(Note that if I implement my own get_line() with an extra leftover buffer
handed off between line reads to avoid the byte-at-a-time inefficiency, I've
reinvented FILE *. And none of the above use cases guarantee seekable input, so
it can't put the extra data it read BACK into the file descriptor.)
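
For readers who haven't seen the byte-at-a-time approach described above, here
is a minimal sketch of the idea. It is not the toybox get_line() code linked
earlier: the name fd_get_line() and the details (growth policy, EINTR retry,
stripping the newline) are assumptions made for illustration. The point it
shows is that reading one byte per read() never consumes anything past the
newline, so the file descriptor can be handed to a descriptor-based consumer
afterwards, at the cost of one system call per byte.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper (illustration only, not toybox's get_line()).
 * Reads one line from fd a byte at a time, so no data beyond the newline
 * is consumed from the descriptor. Returns a malloc()ed string without
 * the trailing newline, or NULL on EOF-with-no-data or on error. */
char *fd_get_line(int fd)
{
  char *line = NULL, c;
  size_t len = 0, cap = 0;
  ssize_t n;

  assert(fd >= 0);
  for (;;) {
    n = read(fd, &c, 1);
    if (n < 0 && errno == EINTR) continue;  /* interrupted: retry */
    if (n <= 0) break;                      /* EOF or real error */

    /* Keep room for this byte plus the terminating NUL. */
    if (len + 1 >= cap) {
      size_t newcap = cap ? cap * 2 : 64;
      char *tmp = realloc(line, newcap);

      if (!tmp) {
        free(line);
        return NULL;
      }
      line = tmp;
      cap = newcap;
    }

    if (c == '\n') {
      line[len] = 0;
      return line;
    }
    line[len++] = c;
  }

  /* Error, or EOF before any data: nothing to return. */
  if (n < 0 || !len) {
    free(line);
    return NULL;
  }
  line[len] = 0;  /* final line with no trailing newline */
  return line;
}
```

A caller would loop on fd_get_line() until it returns an empty string (the
blank line that ends the header block in http-style input), then pass the same
fd to whatever handles the payload. Every byte of the payload is still waiting
in the descriptor, which is exactly what getline() on a FILE * can't promise.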
