lloda pushed a commit to branch main in repository guile. commit 7f1ee520de4a0a39e69d9823fab0e3b328afb44a Author: Rob Browning <r...@defaultvalue.org> AuthorDate: Fri Apr 11 12:55:10 2025 -0500
Convert srfi-207.html to texinfo and add to srfi-modules.texi * doc/ref/srfi-207.html: delete. * doc/ref/srfi-modules.texi: integrate texinfo conversion of html. --- doc/ref/srfi-207.html | 417 -------------------------------------- doc/ref/srfi-modules.texi | 499 ++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 499 insertions(+), 417 deletions(-) diff --git a/doc/ref/srfi-207.html b/doc/ref/srfi-207.html deleted file mode 100644 index 886abffe6..000000000 --- a/doc/ref/srfi-207.html +++ /dev/null @@ -1,417 +0,0 @@ -<!DOCTYPE html> -<html lang="en"> - <head> - <meta charset="utf-8"> - <title>SRFI 207: String-notated bytevectors</title> - <link href="/favicon.png" rel="icon" sizes="192x192" type="image/png"> - <link rel="stylesheet" href="https://srfi.schemers.org/srfi.css" type="text/css"> - <style>pre.example { margin-left: 2em; }</style> - <meta name="viewport" content="width=device-width, initial-scale=1"></head> - <body> - <h1><a href="https://srfi.schemers.org/"><img class="srfi-logo" src="https://srfi.schemers.org/srfi-logo.svg" alt="SRFI logo" /></a>207: String-notated bytevectors</h1> - -<p>by - Daphne Preston-Kendal (external notation), - John Cowan (procedure design), - Wolfgang Corcoran-Mathe (implementation) -</p> - -<h2 id="status">Status</h2> - -<p>This SRFI is currently in <em>final</em> status. Here is <a href="https://srfi.schemers.org/srfi-process.html">an explanation</a> of each status that a SRFI can hold. To provide input on this SRFI, please send email to <code><a href="mailto:srfi+minus+207+at+srfi+dotschemers+dot+org">srfi-207@<span class="antispam">nospam</span>srfi.schemers.org</a></code>. To subscribe to the list, follow <a href="https://srfi.schemers.org/srfi-list-subscribe.html">these instructions</a>. You can [...] 
-<ul> - <li>Received: 2020-08-15</li> - <li>Draft #1 published: 2020-08-15</li> - <li>Draft #2 published: 2020-08-17</li> - <li>Draft #3 published: 2020-09-09</li> - <li>Draft #4 published: 2020-10-05</li> - <li>Draft #5 published: 2020-10-12</li> - <li>Draft #6 published: 2020-10-15</li> - <li>Draft #7 published: 2020-10-24</li> - <li>Finalized: 2020-10-29</li> - <li>Revised to fix errata: - <ul> - <li>2021-03-10 (Fix <a href="#errata-1">description</a> - of <code>make-bytestring!</code>.)</li> - <li>2025-02-06 (Fix <a href="#errata-2">explanation</a> of how - to compare bytevectors for equality.)</li></ul></li> -</ul> - -<h2 id="abstract">Abstract</h2> - -<p>To ease the human reading and writing of Scheme code involving -binary data that for mnemonic reasons corresponds -as a whole or in part to ASCII-coded text, a notation -for bytevectors is defined which allows printable ASCII characters -to be used literally without being converted to their corresponding -integer forms. In addition, this SRFI provides a set of procedures -known as the bytestring library -for constructing a bytevector from a sequence of integers, -characters, strings, and/or bytevectors, and for manipulating -bytevectors as if they were strings as far as possible. - -<h2 id="rationale">Rationale</h2> - -<p>Binary file formats are usually not self-describing, and if they are, -the descriptive portion is itself binary, which makes it hard for human beings -to interpret. To assist with this problem, it is common to have a -human-readable section at the beginning of the file, or in some cases -at the beginning of each distinct section of the file. -For historical reasons and to avoid text encoding complications, it is usual -for this human-readable section to be expressed as ASCII text.</p> - -<p>For example, ZIP files begin with the hex bytes <code>50 4B</code> -which are the ASCII encoding for the characters "PK", the initials -of Phil Katz, the inventor of ZIP format. 
As another example, -the GIF image format begins with <code>47 49 46 38 39 61</code>, -the ASCII encoding for "GIF89a", where "89a" is the format version. -A third example is the PNG image format, where the file header -begins <code>89 50 4E 47</code>. The first byte is intentionally -non-ASCII, but the next three are "PNG". Furthermore, a PNG -file is divided into chunks, each of which contains a 4-byte -"chunk type" code. The letters in the chunk type are mnemonics -for its purpose, such as "PLTE" for a palette, "bKGD" for a -default background color, and "iTXt" for descriptive text in UTF-8.</p> - -<p>When bytevectors contain string data of this kind, it is much more tractable for -human programmers to deal with them in the form <code>#u8"recursion"</code> -than in the form <code>#u8(114 101 99 117 114 115 105 111 110)</code>. -This is true even when non-ASCII bytes are incorporated -into the bytevector: the complete 8-bit PNG file header can be written as -<code>#u8"\x89;PNG\r\n\x1A;\n"</code> -instead of <code>#u8(0x89 0x50 0x4E 0x47 0x0D 0x0A 0x1A 0x0A)</code>.</p> - -<p>In addition, this SRFI provides bytevectors with additional procedures that closely resemble those provided for strings. For example, bytevectors can be padded or trimmed, compared case-sensitively or case-insensitively, searched, joined, and split. - -<p>In this specification it is assumed that bytevectors are as defined in R7RS-small section 6.9. Implementations may also consider them equivalent to R6RS bytevectors (R6RS 4.3.4) or <a href="https://srfi.schemers.org/srfi-4/srfi-4.html">SRFI 4</a> <code>u8vector</code>s, depending which kind of homogeneous vectors of unsigned 8-bit integers an implementation supports. - -<h2 id="specification">Specification</h2> - -<p>Most of the procedures of this SRFI begin with <code>bytestring-</code> -in order to distinguish them from other bytevector procedures. 
-This does not mean that they accept or return a separate bytestring type: -bytestrings and bytevectors are exactly the same type.</p> - -<h3>External notation</h3> - -<p>The basic form of a string-notated bytevector is: - -<blockquote><code>#u8"</code> <var>content</var> <code>"</code></blockquote> - -<p>To avoid character encoding issues within string-notated bytevectors, only printable ASCII characters (that is, Unicode codepoints in the range from U+0020 to U+007E inclusive) are allowed to be used within the <var>content</var> of a string-notated bytevector. All other characters must be expressed through mnemonic or inline hex escapes, and <code>"</code> and <code>\</code> must also be escaped as in normal Scheme strings. - -<p>Within the <var>content</var> of a string-notated bytevector: - -<ul> - <li>the sequence <code>\"</code> represents the integer 34; - <li>the sequence <code>\\</code> represents the integer 92; - <li>the following mnemonic sequences represent the corresponding integers: - <table> - <tr><th>Seq. <th>Integer - <tr><td><code>\a</code> <td>7 - <tr><td><code>\b</code> <td>8 - <tr><td><code>\t</code> <td>9 - <tr><td><code>\n</code> <td>10 - <tr><td><code>\r</code> <td>13 - <tr><td><code>\|</code> <td>124 - </table> - <li>the sequence <code>\x</code> followed by zero or more <code>0</code> characters, followed by one or two hexadecimal digits, followed by <code>;</code> represents the integer specified by the hexadecimal digits; - <li>the sequence <code>\</code> followed by zero or more intraline whitespace characters, followed by a newline, followed by zero or more further intraline whitespace characters, is ignored and corresponds to no entry in the resulting bytevector; - <li>any other printable ASCII character represents the character number of that character in the ASCII/Unicode code chart; and - <li>it is an error to use any other character or sequence beginning with <code>\</code> within a string-notated bytevector. 
-</ul> - -<p>Note: The <code>\|</code> sequence is provided so that -string parsing, symbol parsing, and string-notated bytevector parsing -can all use the same sequences. -However, we give a complete definition of the valid lexical syntax -in this SRFI rather than inheriting the native syntax of strings, -so that it is clear that <code>#u8"ι"</code> and -<code>#u8"\xE000;"</code> are invalid.</p> -<p>When the Scheme reader encounters a string-notated bytevector, it produces a datum as if that bytevector had been written out in full. That is, <code>#u8"A"</code> is exactly equivalent to <code>#u8(65)</code>. - -<p>A Scheme implementation which supports string-notated bytevectors may not by default use this notation when any of the <code>write</code> family of procedures is called upon a bytevector or upon another datum containing a bytevector. A future SRFI is expected to add a configurable version of the <code>write</code> procedure which may enable the use of this notation in this context. - -<h3>Formal syntax</h3> - -<p>The formal syntax of Scheme (defined in R7RS-small 7.1) is amended as follows. - -<ul> -<li><p>In the definition of ⟨token⟩, after ‘| ⟨string⟩’, insert ‘| ⟨string-notated bytevector⟩’. 
-<li><p>After the definition of ⟨byte⟩ is inserted: - <blockquote> - <p>⟨string-notated bytevector⟩ → <code>#u8"</code> ⟨string-notated bytevector element⟩* <code>"</code><br> - ⟨string-notated bytevector element⟩ → ⟨any printable ASCII character other than <code>"</code> or <code>\</code>⟩<br> - <span style="margin-left:1em">| ⟨mnemonic escape⟩ | <code>\"</code> | <code>\\</code></span><br> - <span style="margin-left:1em">| <code>\</code>⟨intraline whitespace⟩*⟨line ending⟩⟨intraline whitespace⟩*</span><br> - <span style="margin-left:1em">| ⟨inline hex escape⟩</span> - </blockquote> -</ul> - -<h3>Constructors</h3> - -<p><code>(bytestring</code> <var>arg</var> …<code>)</code></p> -<p>Converts <var>args</var> into a sequence of small integers and returns them as a bytevector as follows:</p> -<ul> - <li> - <p>If <var>arg</var> is an exact integer in the range 0-255 inclusive, it is added to the result.</p> - </li> - <li> - <p>If <var>arg</var> is an ASCII character (that is, its codepoint is in the range 0-127 inclusive), it is converted to its codepoint and added to the result.</p> - </li> - <li> - <p>If <var>arg</var> is a bytevector, its elements are added to the result.</p> - </li> - <li> - <p>If <var>arg</var> is a string of ASCII characters, it is converted to a sequence of codepoints which are added to the result.</p> - </li> -</ul> -<p>Otherwise, an error satisfying <code>bytestring-error?</code> is signaled.</p> -<p>Examples:</p> -<pre class="example"><code>(bytestring "lo" #\r #x65 #u8(#x6d)) ⇒ #u8"lorem" -(bytestring "η" #\space #u8(#x65 #x71 #x75 #x69 #x76)) ⇒</code> <em>error</em> -</pre> - -<p><code>(make-bytestring</code> <var>list</var><code>)</code></p> -<p>If the elements of <var>list</var> are suitable arguments for -<code>bytestring</code>, returns the bytevector that would be the -result of applying <code>bytestring</code> to <var>list</var>. 
-Otherwise, an error satisfying <code>bytestring-error?</code> is signaled.</p> - -<p id="errata-1"><code>(make-bytestring!</code> <var>bytevector at list</var><code>)</code></p> -<p>If the elements of <var>list</var> are suitable arguments for -<code>bytestring</code>, writes the bytes of the bytevector that would be the -result of calling <code>make-bytestring</code> -into <var>bytevector</var> starting at index <var>at</var>.</p> -<pre class="example"><code>(define bstring (make-bytevector 10 #x20)) -(make-bytestring! bstring 2 '(#\s #\c "he" #u8(#x6d #x65))) -bstring ⇒ #u8" scheme "</code></pre> - -<h3>Conversion</h3> - -<p><code>(bytevector->hex-string</code> <var>bytevector</var><code>)</code><br> -<code>(hex-string->bytevector</code> <var>string</var><code>)</code></p> -<p>Converts between a bytevector and a string containing pairs of hexadecimal digits. -If <var>string</var> is not pairs of hexadecimal digits, an error satisfying <code>bytestring-error?</code> is raised.</p> -<pre class="example"><code>(bytevector->hex-string #u8"Ford") ⇒ "467f7264" -(hex-string->bytevector "5a6170686f64") ⇒ #u8"Zaphod"</code></pre> - -<p><code>(bytevector->base64</code> <var>bytevector</var> [<var>digits</var>]<code>)</code><br> -<code>(base64->bytevector</code> <var>string</var> [<var>digits</var>]<code>)</code></p> -<p>Converts between a bytevector and its base-64 encoding as a string. The 64 digits are represented by the characters 0-9, A-Z, a-z, and the symbols + and /. However, there are different variants of base-64 encoding which use different representations of the 62nd and 63rd digit. If the optional argument <var>digits</var> (a two-character string) is provided, those two characters will be used as the 62nd and 63rd digit instead. -Details can be found in -<a href="https://tools.ietf.org/html/rfc4648">RFC 4648</a>. -If <var>string</var> is not in base-64 format, an error satisfying <code>bytestring-error?</code> is raised. 
-However, characters that satisfy <code>char-whitespace?</code> -are silently ignored.</p> -<pre class="example"><code>(bytevector->base64 #u8(1 2 3 4 5 6)) ⇒ "AQIDBAUG" -(bytevector->base64 #u8"Arthur Dent") ⇒ "QXJ0aHVyIERlbnQ=" -(base64->bytevector "+/ /+") ⇒ #u8(#xfb #xff #xfe)</code></pre> - -<p><code>(bytestring->list</code> <var>bytevector</var> [ <var>start</var> [ <var>end</var> ] ]<code>)</code></p> -<p>Converts all or part of a bytevector -into a list of the same length containing -characters for elements in the range 32 to 127 -and exact integers for all other elements.</p> -<pre class="example"><code>(bytestring->list #u8(#x41 #x42 1 2) 1 3) ⇒ (#\B 1)</code></pre> - -<p><code>(make-bytestring-generator</code> <var>arg</var> …<code>)</code></p> -<p>Returns a generator that when invoked will return consecutive bytes -of the bytevector that <code>bytestring</code> would create when applied -to <var>args</var>, but without creating any bytevectors. -The <var>args</var> are validated before any bytes are generated; -if they are ill-formed, an error satisfying -<code>bytestring-error?</code> is raised.</p> -<pre class="example"><code>(generator->list (make-bytestring-generator "lorem")) - ⇒ (#x6c #x6f #x72 #x65 #x6d)</code></pre> -<h3>Selection</h3> - -<p><code>(bytestring-pad</code> <var>bytevector len char-or-u8</var><code>)</code><br> -<code>(bytestring-pad-right</code> <var>bytevector len char-or-u8</var><code>)</code></p> -<p>Returns a newly allocated bytevector with the contents of <var>bytevector</var> plus sufficient additional bytes at the beginning/end containing <var>char-or-u8</var> (which can be either an ASCII character or an exact integer in the range 0-255) such that the length of the result is at least <var>len</var>.</p> -<pre class="example"><code>(bytestring-pad #u8"Zaphod" 10 #\_) ⇒ #u8"____Zaphod" -(bytestring-pad-right #u8(#x80 #x7f) 8 0) ⇒ #u8(#x80 #x7f 0 0 0 0 0 0)</code></pre> - -<p><code>(bytestring-trim</code> <var>bytevector 
pred</var><code>)</code><br> -<code>(bytestring-trim-right</code> <var>bytevector pred</var><code>)</code><br> -<code>(bytestring-trim-both</code> <var>bytevector pred</var><code>)</code></p> -<p>Returns a newly allocated bytevector with the contents of <var>bytevector</var>, except that consecutive bytes at the beginning / the end / both the beginning and the end that satisfy <var>pred</var> are not included.</p> -<pre class="example"><code>(bytestring-trim #u8" Trillian" (lambda (b) (= b #x20))) - ⇒ #u8"Trillian" -(bytestring-trim-both #u8(0 0 #x80 #x7f 0 0 0) zero?) ⇒ #u8(#x80 #x7f)</code></pre> - -<h3>Replacement</h3> - -<p><code>(bytestring-replace</code> <var>bytevector1 bytevector2 start1 end1 [start2 end2]</var><code>)</code></p> -<p>Returns a newly allocated bytevector with the contents of <var>bytevector1</var>, except that the bytes indexed by <var>start1</var> and <var>end1</var> are not included but are replaced by the bytes of <var>bytevector2</var> indexed by <var>start2</var> and <var>end2</var>.</p> -<pre class="example"><code>(bytestring-replace #u8"Vogon torture" #u8"poetry" 6 13) - ⇒ #u8"Vogon poetry"</code></pre> - -<h3>Comparison</h3> - -<p id="errata-2">To compare bytevectors for equality, use the -procedure <code>bytevector=?</code> from -the R6RS library <code>(rnrs bytevectors)</code> or -<code>equal?</code> in R7RS. - -<p><code>(bytestring<?</code> <var>bytevector1 bytevector2</var><code>)</code><br> -<code>(bytestring>?</code> <var>bytevector1 bytevector2</var><code>)</code><br> -<code>(bytestring<=?</code> <var>bytevector1 bytevector2</var><code>)</code><br> -<code>(bytestring>=?</code> <var>bytevector1 bytevector2</var><code>)</code></p> -<p>Returns <code>#t</code> if <var>bytevector1</var> is less than / greater than / less than or equal to / greater than or equal to <var>bytevector2</var>. 
Comparisons are lexicographical: shorter bytevectors compare before longer ones, all elements being equal.</p> -<pre class="example"><code>(bytestring<? #u8"Heart Of Gold" #u8"Heart of Gold") ⇒ #t -(bytestring<=? #u8(#x81 #x95) #u8(#x80 #xa0)) ⇒ #f -(bytestring>? #u8(1 2 3) #u8(1 2)) ⇒ #t -</code></pre> - -<h3>Searching</h3> - -<p><code>(bytestring-index</code> <var>bytevector pred</var> [<var>start</var> [<var>end</var>]]<code>)</code><br> -<code>(bytestring-index-right</code> <var>bytevector pred</var> [<var>start</var> [<var>end</var>]]<code>)</code></p> -<p>Searches <var>bytevector</var> from <var>start</var> to <var>end</var> / from <var>end</var> to <var>start</var> for the first byte that satisfies <var>pred</var>, and returns the index into <var>bytevector</var> containing that byte. In either direction, <var>start</var> is inclusive and <var>end</var> is exclusive. If there are no such bytes, returns <code>#f</code>.</p> -<pre class="example"><code>(bytestring-index #u8(#x65 #x72 #x83 #x6f) (lambda (b) (> b #x7f))) ⇒ 2 -(bytestring-index #u8"Beeblebrox" (lambda (b) (> b #x7f))) ⇒ #f -(bytestring-index-right #u8"Zaphod" odd?) ⇒ 4 -</code></pre> - -<p><code>(bytestring-break</code> <var>bytevector pred</var><code>)</code><br> -<code>(bytestring-span</code> <var>bytevector pred</var><code>)</code></p> -<p>Returns two values, a bytevector containing the maximal sequence of characters (searching from the beginning of <var>bytevector</var> to the end) that do not satisfy / do satisfy <var>pred</var>, and another bytevector containing the remaining characters.</p> -<pre class="example"><code>(bytestring-break #u8(#x50 #x4b 0 0 #x1 #x5) zero?) 
- ⇒ #u8(#x50 #x4b) - #u8(0 0 #x1 #x5) -(bytestring-span #u8"ABCDefg" (lambda (b) (and (> b 40) (< b 91)))) - ⇒ #u8"ABCD" - #u8"efg" -</code></pre> - -<h3 id="joining-and-splitting">Joining and splitting</h3> - -<p><code>(bytestring-join</code> <var>bytevector-list delimiter</var> [<var>grammar</var>]<code>)</code></p> -<p>Pastes the bytevectors in <var>bytevector-list</var> together -using the <var>delimiter</var>, -which can be anything suitable as an argument to <code>bytestring</code>. -The <var>grammar</var> -argument is a symbol that determines how the delimiter is used, and -defaults to <code>infix</code>. It is an error for grammar to be -any symbol other than these four:</p> -<ul> - <li><code>infix</code> means an infix or separator grammar: inserts the delimiter between list elements. An empty list will produce an empty bytevector.</li> - <li><code>strict-infix</code> means the same as <code>infix</code> if the list is non-empty, but will signal an error satisfying <code>bytestring-error?</code> if given an empty list.</li> - <li><code>suffix</code> means a suffix or terminator grammar: inserts the delimiter after every list element.</li> - <li><code>prefix</code> means a prefix grammar: inserts the delimiter before every list element.</li> -</ul> -<pre class="example"><code>(bytestring-join '(#u8"Heart" #u8"of" #u8"Gold") #x20) ⇒ #u8"Heart of Gold" -(bytestring-join '(#u8(#xef #xbb) #u8(#xbf)) 0 'prefix) ⇒ #u8(0 #xef #xbb 0 #xbf) -(bytestring-join '() 0 'strict-infix) ⇒</code> <em>error</em></pre> - -<p><code>(bytestring-split</code> <var>bytevector delimiter</var> [<var>grammar</var>]<code>)</code></p> -<p>Divides the elements of <var>bytevector</var> and returns a list of newly allocated bytevectors using the <var>delimiter</var> (an ASCII character or exact integer in the range 0-255 inclusive). Delimiter bytes are not included in the result bytevectors.</p> -<p>The <var>grammar</var> argument is used to control how <var>bytevector</var> is divided. 
It has the same default and meaning as in <code>bytestring-join</code>, except that <code>infix</code> and <code>strict-infix</code> mean the same thing. That is, if <var>grammar</var> is <code>prefix</code> or <code>suffix</code>, then ignore any delimiter in the first or last position of <var>bytevector</var> respectively.</p> -<pre class="example"><code>(bytestring-split #u8"Beeblebrox" #x62) ⇒ (#u8"Bee" #u8"le" #u8"rox") -(bytestring-split #u8(1 0 2 0) 0 'suffix) ⇒ (#u8(1) #u8(2)) -</code></pre> - -<h3>I/O</h3> - -<code>(read-textual-bytestring</code> <var>prefix</var> [ <var>port</var> ]<code>)</code> -<p>Reads a string in the external format described in this SRFI -from <var>port</var> and return it as a bytevector. -If the <var>prefix</var> argument is false, this procedure assumes -that "<code>#u8</code>" has already been read from <var>port</var>. -If <var>port</var> is omitted, it defaults to the value of <code>(current-input-port)</code>. -If the characters read are not in the external format, -an error satisfying <code>bytestring-error?</code> is raised.</p> -<pre class="example"><code>(call-with-port (open-input-string "#u8\"AB\\xad;\\xf0;\\x0d;CD\"") - (lambda (port) - (read-textual-bytestring #t port))) - ⇒ #u8(#x41 #x42 #xad #xf0 #x0d #x43 #x44) -</code></pre> - -<p><code>(write-textual-bytestring</code> <var>bytevector</var> [ <var>port</var> ]<code>)</code></p> -<p>Writes <var>bytevector</var> in the external format described in this SRFI to <var>port</var>. -Bytes representing non-graphical ASCII characters are unencoded: -all other bytes are encoded with a single letter if possible, -otherwise with a <code>\x</code> escape. 
-If <var>port</var> is omitted, it defaults to the value of <code>(current-output-port)</code>.</p> -<pre class="example"><code>(call-with-port (open-output-string) - (lambda (port) - (write-textual-bytestring - #u8(#x9 #x41 #x72 #x74 #x68 #x75 #x72 #xa) - port) - (get-output-string port))) - ⇒ "#u8\"\\tArthur\\n\"" -</code></pre> - -<p><code>(write-binary-bytestring</code> <var>port arg</var> …<code>)</code></p> -<p>Outputs each <var>arg</var> to the binary output port <var>port</var> -using the same interpretations as <code>bytestring</code>, -but without creating any bytevectors. -The <var>args</var> are validated before any bytes are written to -<var>port</var>; if they are ill-formed, an error satisfying -<code>bytestring-error?</code> is raised.</p> -<pre class="example"><code>(call-with-port (open-output-bytevector) - (lambda (port) - (write-binary-bytestring port #\Z #x61 #x70 "hod") - (get-output-bytevector port))) - ⇒ #u8"Zaphod" -</code></pre> - -<h3>Exception</h3> - -<p><code>(bytestring-error?</code> <var>obj</var><code>)</code></p> -<p>Returns <code>#t</code> if <var>obj</var> is an object signaled by any of the -following procedures, in the circumstances described above:</p> -<ul> - <li><code>bytestring</code></li> - <li><code>hex-string->bytestring</code></li> - <li><code>base64->bytestring</code></li> - <li><code>make-bytestring</code></li> - <li><code>make-bytestring!</code></li> - <li><code>bytestring-join</code></li> - <li><code>read-textual-bytestring</code></li> - <li><code>write-binary-bytestring</code></li> - <li><code>make-bytestring-generator</code></li> -</ul> - -<h2 id="implementation">Implementation</h2> - -<p>There is a sample implementation of the procedures, -but not the notation, in the repository of this SRFI. 
- -<h2 id="acknowledgements">Acknowledgements</h2> - -<p>Daphne Preston-Kendal devised the string notation for bytevectors; John Cowan, the procedure library; Wolfgang Corcoran-Mathe, the sample implementation of the procedures. - -<p>The notation is inspired by the notation used in Python since version 2.6 for <code>bytes</code> objects, which are fundamentally similar in purpose to Scheme bytevectors, especially in R7RS. In addition, many of the procedures are closely analogous to those of <a href="https://srfi.schemers.org/srfi-152/srfi-152.html">SRFI 152</a>. - -<p>Thanks is also due to the participants in the SRFI mailing list. In particular: Lassi Kortela corrected an embarrassing technical error; Marc Nieper-Wißkirchen explained why the <code>write</code> procedure ought not to be allowed to use this notation by default. - -<h2 id="copyright">Copyright</h2> -<p>© 2020 Daphne Preston-Kendal, John Cowan, and Wolfgang Corcoran-Mathe.</p> - -<p> - Permission is hereby granted, free of charge, to any person - obtaining a copy of this software and associated documentation files - (the "Software"), to deal in the Software without restriction, - including without limitation the rights to use, copy, modify, merge, - publish, distribute, sublicense, and/or sell copies of the Software, - and to permit persons to whom the Software is furnished to do so, - subject to the following conditions:</p> - -<p> - The above copyright notice and this permission notice (including the - next paragraph) shall be included in all copies or substantial - portions of the Software.</p> -<p> - THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, - EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF - MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND - NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS - BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN - ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN - CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE - SOFTWARE.</p> - - <hr> - <address>Editor: <a href="mailto:srfi-editors+at+srfi+dot+schemers+dot+org">Arthur A. Gleckler</a></address></body></html> diff --git a/doc/ref/srfi-modules.texi b/doc/ref/srfi-modules.texi index 9e3fdb9cc..e003edbc0 100644 --- a/doc/ref/srfi-modules.texi +++ b/doc/ref/srfi-modules.texi @@ -5,6 +5,8 @@ @c Copyright (C) 2005-2006 Per Bothner @c See the file guile.texi for copying conditions. +@c REVIEW: include &bytestring-error in public docs/api? + @node SRFI Support @section SRFI Support Modules @cindex SRFI @@ -68,6 +70,7 @@ get the relevant SRFI documents from the SRFI home page * SRFI-119:: Wisp: simpler indentation-sensitive Scheme. * SRFI-171:: Transducers * SRFI-197:: Pipeline operators +* SRFI-207:: String-notated bytevectors @end menu @@ -7369,6 +7372,502 @@ Esterhazy for the (EPL licensed) threading macros documentation page, which was a source of inspiration and some of the examples in this document. + +@node SRFI-207 +@subsection SRFI-207 String-notated bytevectors +@cindex SRFI-207 +@cindex keyword objects + +@uref{http://srfi.schemers.org/srfi-207/srfi-207.html, SRFI-207} +provides a more human-friendly representation for binary-data via an +ASCII text notation for @pxref{Bytevectors}. It also provides +bytestring-oriented procedures for constructing bytevectors from +sequences of integers, characters, strings, and other bytevectors, and +procedures for manipulating bytevectors as if they were strings. + +Binary file formats are usually not self-describing, and if they are, +the descriptive portion is itself binary, which makes it hard for human +beings to interpret. 
To assist with this problem, it is common to have +a human-readable section at the beginning of the file, or in some cases +at the beginning of each distinct section of the file. For historical +reasons and to avoid text encoding complications, it is usual for this +human-readable section to be expressed as ASCII text. + +For example, ZIP files begin with the hex bytes @code{50 4B} which are +the ASCII encoding for the characters "PK", the initials of Phil Katz, +the inventor of ZIP format. As another example, the GIF image format +begins with @code{47 49 46 38 39 61}, the ASCII encoding for "GIF89a", +where "89a" is the format version. A third example is the PNG image +format, where the file header begins @code{89 50 4E 47}. The first byte +is intentionally non-ASCII, but the next three are "PNG". Furthermore, +a PNG file is divided into chunks, each of which contains a 4-byte +"chunk type" code. The letters in the chunk type are mnemonics for its +purpose, such as "PLTE" for a palette, "bKGD" for a default background +color, and "iTXt" for descriptive text in UTF-8. + +When bytevectors contain string data of this kind, it is much more +tractable for human programmers to deal with +@code{#u8"\x89;PNG\r\n\x1A;\n"} rather than @code{#u8(0x89 0x50 0x4E +0x47 0x0D 0x0A 0x1A 0x0A)}. + +In addition, this SRFI provides bytevectors with additional procedures +that closely resemble those provided for strings. For example, +bytevectors can be padded or trimmed, compared case-sensitively or +case-insensitively, searched, joined, and split. + +Most of the procedures of this SRFI begin with @code{bytestring-} in +order to distinguish them from other bytevector procedures. This does +not mean that they accept or return a separate bytestring type: +bytestrings and bytevectors are exactly the same type. + + +@node SRFI-207 External Notation +@subsubsection External Notation +@cindex bytestring notation + +The basic form of a string-notated bytevector is @code{#u8"CONTENT"}. 
+ +To avoid character encoding issues within string-notated bytevectors, +only printable ASCII characters (that is, Unicode codepoints in the +range from U+0020 to U+007E inclusive) are allowed to be used within the +@var{CONTENT} of a string-notated bytevector. All other characters must +be expressed through mnemonic or inline hex escapes, and @code{"} and +@code{\} must also be escaped as in normal Scheme strings. + +Within the @var{CONTENT} of a string-notated bytevector: + +@itemize @bullet + @item + @code{\a} @result{} 7 + @item + @code{\b} @result{} 8 + @item + @code{\t} @result{} 9 + @item + @code{\n} @result{} 10 + @item + @code{\r} @result{} 13 + @item + @code{\"} @result{} 34 + @item + @code{\\} @result{} 92 + @item + @code{\|} @result{} 124 +@item +the sequence @code{\x} followed by zero or more @code{0} characters, +followed by one or two hexadecimal digits, followed by @code{;} +represents the integer specified by the hexadecimal digits; +@item +the sequence @code{\} followed by zero or more intraline whitespace +characters, followed by a newline, followed by zero or more further +intraline whitespace characters, is ignored and corresponds to no entry +in the resulting bytevector; +@item +any other printable ASCII character represents the character number of +that character in the ASCII/Unicode code chart; and +@item +it is an error to use any other character or sequence beginning with +@code{\} within a string-notated bytevector. +@end itemize + +Note: The @code{\|} sequence is provided so that string parsing, symbol +parsing, and string-notated bytevector parsing can all use the same +sequences. However, we give a complete definition of the valid lexical +syntax in this SRFI rather than inheriting the native syntax of strings, +so that it is clear that @code{#u8"ι"} and @code{#u8"\xE000;"} are +invalid. + +When the Scheme reader encounters a string-notated bytevector, it +produces a datum as if that bytevector had been written out in +full. 
That is, @code{#u8"A"} is exactly equivalent to @code{#u8(65)}. + + +@node SRFI-207 Contructors +@subsubsection Constructors +@cindex bytestring constructors + +@deffn {Scheme Procedure} bytestring part @dots{} +Converts each @var{part} into a sequence of small integers and returns +a bytevector of the corresponding bytes as follows: +@itemize +@item +If @var{part} is an exact integer in the range 0-255 inclusive, it is +added to the result. +@item +If @var{part} is an ASCII character (that is, its codepoint is in the +range 0-127 inclusive), it is converted to its codepoint and added to +the result. +@item +If @var{part} is a bytevector, its elements are added to the result. +@item +If @var{part} is a string of ASCII characters, it is converted to a +sequence of codepoints which are added to the result. +@end itemize + +Otherwise, an error satisfying @code{bytestring-error?} is signaled. For +example: + +@example +(bytestring "lo" #\r #x65 #u8(#x6d)) @result{} #u8"lorem" +@end example + +@example +(bytestring "η" #\space #u8(#x65 #x71 #x75 #x69 #x76)) +@result{} raised &bytestring-error +@end example + +@end deffn + +@deffn {Scheme Procedure} make-bytestring parts +If the @var{parts} are suitable arguments for @code{bytestring}, returns +the bytevector that would result from applying @code{bytestring} to +them. Otherwise, an error satisfying @code{bytestring-error?} is +raised. +@end deffn + +@deffn {Scheme Procedure} make-bytestring! bytevector at parts +If the @var{parts} are suitable arguments for @code{bytestring}, writes +the bytes of the bytevector that would be the result of calling +@code{make-bytestring} into @var{bytevector} starting at index +@var{at}. For example: + +@example +(define bv (make-bytevector 10 #x20)) +(make-bytestring! 
bv 2 '(#\s #\c "he" #u8(#x6d #x65))) +bv +@result{} #u8"  scheme  " +@end example +@end deffn + + +@node SRFI-207 Conversion +@subsubsection Conversion +@cindex bytestring conversion + +@deffn {Scheme Procedure} bytevector->hex-string bytevector +@deffnx {Scheme Procedure} hex-string->bytevector string +Converts between a bytevector and a string containing pairs of +hexadecimal digits. If @var{string} does not consist of pairs of +hexadecimal digits, an error satisfying @code{bytestring-error?} is +raised. + +@example +@code{(bytevector->hex-string #u8"Ford")} @result{} @code{"466f7264"} +@code{(hex-string->bytevector "5a6170686f64")} @result{} @code{#u8"Zaphod"} +@end example +@end deffn + +@deffn {Scheme Procedure} bytevector->base64 bytevector [digits] +@deffnx {Scheme Procedure} base64->bytevector string [digits] +Converts between a bytevector and its base-64 encoding as a string. The +64 digits are represented by the characters 0-9, A-Z, a-z, and the +symbols + and /. However, there are different variants of base-64 +encoding which use different representations of the 62nd and 63rd +digit. If the optional argument @var{digits} (a two-character string) is +provided, those two characters will be used as the 62nd and 63rd digit +instead. Details can be found in +@url{https://tools.ietf.org/html/rfc4648, RFC 4648}. + +If @var{string} is not in base-64 format, an error satisfying +@code{bytestring-error?} is raised. However, characters that satisfy +@code{char-whitespace?} are silently ignored.
+ +@example +@code{(bytevector->base64 #u8(1 2 3 4 5 6))} @result{} @code{"AQIDBAUG"} +@code{(bytevector->base64 #u8"Arthur Dent")} @result{} @code{"QXJ0aHVyIERlbnQ="} +@code{(base64->bytevector "+/ /+")} @result{} @code{#u8(#xfb #xff #xfe)} +@end example +@end deffn + +@deffn {Scheme Procedure} bytestring->list bytevector [start [end]] +Converts all or part of a bytevector into a list of the same length +containing characters for elements in the range 32 to 127 and exact +integers for all other elements. + +@example +@code{(bytestring->list #u8(#x41 #x42 1 2) 1 3)} @result{} @code{(#\B 1)} +@end example +@end deffn + +@deffn {Scheme Procedure} make-bytestring-generator arg @dots{} +Returns a generator that when invoked will return consecutive bytes of +the bytevector that @code{bytestring} would create when applied to +the @var{arg}s, but without creating any bytevectors. The @var{arg}s are +validated before any bytes are generated; if they are ill-formed, an +error satisfying @code{bytestring-error?} is raised. + +@example +@code{(generator->list (make-bytestring-generator "lorem"))} +@result{} @code{(#x6c #x6f #x72 #x65 #x6d)} +@end example +@end deffn + + +@node SRFI-207 Selection +@subsubsection Selection +@cindex bytestring selection + +@deffn {Scheme Procedure} bytestring-pad bytevector len char-or-u8 +@deffnx {Scheme Procedure} bytestring-pad-right bytevector len char-or-u8 +Returns a newly allocated bytevector with the contents of +@var{bytevector} plus sufficient additional bytes at the beginning/end +containing @var{char-or-u8} (which can be either an ASCII character or +an exact integer in the range 0-255) such that the length of the result +is at least @var{len}.
+ +@example +@code{(bytestring-pad #u8"Zaphod" 10 #\_)} @result{} @code{#u8"____Zaphod"} +@code{(bytestring-pad-right #u8(#x80 #x7f) 8 0)} @result{} @code{#u8(#x80 #x7f 0 0 0 0 0 0)} +@end example +@end deffn + + +@deffn {Scheme Procedure} bytestring-trim bytevector pred +@deffnx {Scheme Procedure} bytestring-trim-right bytevector pred +@deffnx {Scheme Procedure} bytestring-trim-both bytevector pred +Returns a newly allocated bytevector with the contents of +@var{bytevector}, except that consecutive bytes at the beginning / the +end / both the beginning and the end that satisfy @var{pred} are not +included. + +@example +@code{(bytestring-trim #u8" Trillian" (lambda (b) (= b #x20)))} +@result{} @code{#u8"Trillian"} +@code{(bytestring-trim-both #u8(0 0 #x80 #x7f 0 0 0) zero?)} @result{} @code{#u8(#x80 #x7f)} +@end example +@end deffn + + +@node SRFI-207 Replacement +@subsubsection Replacement +@cindex bytestring replacement + +@deffn {Scheme Procedure} bytestring-replace bytevector1 bytevector2 start1 end1 [start2 end2] +Returns a newly allocated bytevector with the contents of +@var{bytevector1}, except that the bytes indexed by @var{start1} and +@var{end1} are not included but are replaced by the bytes of +@var{bytevector2} indexed by @var{start2} and @var{end2}. + +@example +@code{(bytestring-replace #u8"Vogon torture" #u8"poetry" 6 13)} +@result{} @code{#u8"Vogon poetry"} +@end example +@end deffn + + +@node SRFI-207 Comparison +@subsubsection Comparison +@cindex bytestring comparison + +To compare bytevectors for equality, use the @code{bytevector=?} +procedure from @code{(rnrs bytevectors)} (@pxref{Bytevectors}) or +@code{equal?}. + +@deffn {Scheme Procedure} bytestring<? bytevector1 bytevector2 +@deffnx {Scheme Procedure} bytestring>? bytevector1 bytevector2 +@deffnx {Scheme Procedure} bytestring<=? bytevector1 bytevector2 +@deffnx {Scheme Procedure} bytestring>=? 
bytevector1 bytevector2 +Returns @code{#t} if @var{bytevector1} is less than / greater than / +less than or equal to / greater than or equal to +@var{bytevector2}. Comparisons are lexicographical: shorter bytevectors +compare before longer ones, all elements being equal. + +@example +@code{(bytestring<? #u8"Heart Of Gold" #u8"Heart of Gold")} @result{} @code{#t} +@code{(bytestring<=? #u8(#x81 #x95) #u8(#x80 #xa0))} @result{} @code{#f} +@code{(bytestring>? #u8(1 2 3) #u8(1 2))} @result{} @code{#t} +@end example +@end deffn + + +@node SRFI-207 Searching +@subsubsection Searching +@cindex bytestring searching + +@deffn {Scheme Procedure} bytestring-index bytevector pred [start [end]] +@deffnx {Scheme Procedure} bytestring-index-right bytevector pred [start [end]] +Searches @var{bytevector} from @var{start} to @var{end} / from +@var{end} to @var{start} for the first byte that satisfies @var{pred}, +and returns the index into @var{bytevector} containing that byte. In +either direction, @var{start} is inclusive and @var{end} is +exclusive. If there are no such bytes, returns @code{#f}. + +@example +@code{(bytestring-index #u8(#x65 #x72 #x83 #x6f) (λ (b) (> b #x7f)))} @result{} @code{2} +@code{(bytestring-index #u8"Beeblebrox" (λ (b) (> b #x7f)))} @result{} @code{#f} +@code{(bytestring-index-right #u8"Zaphod" odd?)} @result{} @code{4} +@end example +@end deffn + +@deffn {Scheme Procedure} bytestring-break bytevector pred +@deffnx {Scheme Procedure} bytestring-span bytevector pred +Returns two values, a bytevector containing the maximal sequence of +characters (searching from the beginning of @var{bytevector} to the end) +that do not satisfy / do satisfy @var{pred}, and another bytevector +containing the remaining characters. 
+ +@example +@code{(bytestring-break #u8(#x50 #x4b 0 0 #x1 #x5) zero?)} + @result{} @code{#u8(#x50 #x4b)} @code{#u8(0 0 #x1 #x5)} +@code{(bytestring-span #u8"ABCDefg" (lambda (b) (and (> b 40) (< b 91))))} + @result{} @code{#u8"ABCD"} @code{#u8"efg"} +@end example +@end deffn + + +@node SRFI-207 Joining And Splitting +@subsubsection Joining And Splitting +@cindex bytestring joining and splitting + +@deffn {Scheme Procedure} bytestring-join bytevector-list delimiter [grammar] +Joins the bytevectors in @var{bytevector-list} together using the +@var{delimiter}, which can be anything suitable as an argument to +@code{bytestring}. The @var{grammar} argument is a symbol that +determines how the delimiter is used, and defaults to @code{infix}. It +is an error for @var{grammar} to be any symbol other than these four: + +@table @code +@item infix +means an infix or separator grammar: inserts the delimiter between list +elements. An empty list will produce an empty bytevector. +@item strict-infix +means the same as @code{infix} if the list is non-empty, but will signal +an error satisfying @code{bytestring-error?} if given an empty +list. +@item suffix +means a suffix or terminator grammar: inserts the delimiter after every list element. +@item prefix +means a prefix grammar: inserts the delimiter before every list element. +@end table + +For example: +@example +@code{(bytestring-join '(#u8"Heart" #u8"of" #u8"Gold") #x20)} + @result{} @code{#u8"Heart of Gold"} +@code{(bytestring-join '(#u8(#xef #xbb) #u8(#xbf)) 0 'prefix)} + @result{} @code{#u8(0 #xef #xbb 0 #xbf)} +@code{(bytestring-join '() 0 'strict-infix)} + @result{} raised @code{&bytestring-error} +@end example +@end deffn + +@deffn {Scheme Procedure} bytestring-split bytevector delimiter [grammar] +Divides the elements of @var{bytevector} and returns a list of newly +allocated bytevectors using the @var{delimiter} (an ASCII character or +exact integer in the range 0-255 inclusive).
Delimiter bytes are not +included in the result bytevectors. + +The @var{grammar} argument is used to control how @var{bytevector} is +divided. It has the same default and meaning as in +@code{bytestring-join}, except that @code{infix} and @code{strict-infix} +mean the same thing. That is, if @var{grammar} is @code{prefix} or +@code{suffix}, then any delimiter in the first or last position of +@var{bytevector}, respectively, is ignored. + +@example +@code{(bytestring-split #u8"Beeblebrox" #x62)} + @result{} @code{(#u8"Bee" #u8"le" #u8"rox")} +@code{(bytestring-split #u8(1 0 2 0) 0 'suffix)} + @result{} @code{(#u8(1) #u8(2))} +@end example +@end deffn + + +@node SRFI-207 I/O +@subsubsection I/O +@cindex bytestring I/O + +@deffn {Scheme Procedure} read-textual-bytestring prefix [port] +Reads a string in the external format described in this SRFI from +@var{port} and returns it as a bytevector. If the @var{prefix} argument +is false, this procedure assumes that "@code{#u8}" has already been read +from @var{port}. If @var{port} is omitted, it defaults to the value of +@code{(current-input-port)}. If the characters read are not in the +external format, an error satisfying @code{bytestring-error?} is raised. + +@example scheme +(call-with-port + (open-input-string "#u8\"AB\\xad;\\xf0;\\x0d;CD\"") + (lambda (port) (read-textual-bytestring #t port))) +@result{} #u8(#x41 #x42 #xad #xf0 #x0d #x43 #x44) +@end example +@end deffn + +@deffn {Scheme Procedure} write-textual-bytestring bytevector [port] +Writes @var{bytevector} in the external format described in this SRFI to +@var{port}. Bytes representing graphical ASCII characters are written +unencoded; all other bytes are encoded with a single-letter escape if +possible, otherwise with a @code{\x} escape. If @var{port} is omitted, it +defaults to the value of @code{(current-output-port)}.
+ +@example scheme +(call-with-port + (open-output-string) + (lambda (port) + (write-textual-bytestring + #u8(#x9 #x41 #x72 #x74 #x68 #x75 #x72 #xa) + port) + (get-output-string port))) +@result{} "#u8\"\\tArthur\\n\"" +@end example +@end deffn + +@deffn {Scheme Procedure} write-binary-bytestring port arg @dots{} +Outputs each @var{arg} to the binary output port @var{port} using the +same interpretations as @code{bytestring}, but without creating any +bytevectors. The @var{arg}s are validated before any bytes are written +to @var{port}; if they are ill-formed, an error satisfying +@code{bytestring-error?} is raised. + +@example scheme +(call-with-port + (open-output-bytevector) + (lambda (port) + (write-binary-bytestring port #\Z #x61 #x70 "hod") + (get-output-bytevector port))) +@result{} #u8"Zaphod" +@end example +@end deffn + + +@node SRFI-207 Exceptions +@subsubsection Exceptions +@cindex bytestring exceptions + +@deffn {Scheme Procedure} bytestring-error? obj +Returns @code{#t} if @var{obj} is a @code{&bytestring-error} signaled by +any of the following procedures, in the circumstances they describe: + +@itemize @w{} +@item @code{bytestring} +@item @code{hex-string->bytevector} +@item @code{base64->bytevector} +@item @code{make-bytestring} +@item @code{make-bytestring!} +@item @code{bytestring-join} +@item @code{read-textual-bytestring} +@item @code{write-binary-bytestring} +@item @code{make-bytestring-generator} +@end itemize +@end deffn + +@node SRFI-207 Acknowledgements +@subsubsection Acknowledgements +@cindex bytestring acknowledgements + +Daphne Preston-Kendal devised the string notation for bytevectors; John +Cowan, the procedure library; Wolfgang Corcoran-Mathe, the original +sample implementation of the procedures. + +The notation is inspired by the notation used in Python since version +2.6 for @code{bytes} objects, which are fundamentally similar in purpose +to Scheme bytevectors, especially in R7RS.
In addition, many of the +procedures are closely analogous to those of +@url{https://srfi.schemers.org/srfi-152/srfi-152.html, SRFI 152}. + +Thanks are also due to the participants in the SRFI mailing list. In +particular: Lassi Kortela corrected an embarrassing technical error; +Marc Nieper-Wißkirchen explained why the @code{write} procedure ought +not to be allowed to use this notation by default. + @c srfi-modules.texi ends here @c Local Variables: