Hello!

l...@gnu.org (Ludovic Courtès) writes:

> I’ve just pushed a ‘wip-iconv’ branch, which currently changes ports to
> use ‘iconv’ for input.  Remaining tasks include doing it for output, and
> finding a solution for ‘scm_{to,from}_stringn’ so that it behaves in the
> same way wrt. to escapes and error handling.

I just merged ‘wip-iconv’ into ‘master’.  It uses ‘iconv’ for
display/write and peek-char/read-char, but not yet for
‘scm_{to,from}_string’ and ‘read-line’.

Caveat: only tested on GNU/Linux.

Also, we should take advantage of this to improve error reporting, e.g.,
to include the location of a conversion failure.

Overall, it improves performance, except on Latin-1 ports since I chose
not to special-case them (i.e., I/O on Latin-1 ports goes through
iconv.)  The trick is that iconv conversion descriptors are opened once
and for all, and no heap allocation happens (‘u32_conv_from_encoding’
and friends typically malloc.)

Benchmark results:

--8<---------------cut here---------------start------------->8---
;; with iconv:
("ports.bm: peek-char: latin-1 port" 700000 total 0.38)
("ports.bm: peek-char: utf-8 port, ascii character" 700000 total 0.38)
("ports.bm: peek-char: utf-8 port, Korean character" 700000 total 0.68)
("ports.bm: read-char: latin-1 port" 10000000 total 3.34)
("ports.bm: read-char: utf-8 port, ascii character" 10000000 total 3.33)
("ports.bm: read-char: utf-8 port, Korean character" 10000000 total 3.31)
("ports.bm: char-ready?: latin-1 port" 10000000 total 3.02 user 3.01)
("ports.bm: char-ready?: utf-8 port, ascii character" 10000000 total 3.0)
("ports.bm: char-ready?: utf-8 port, Korean character" 10000000 total 3.01)

;; with libunistring:
("ports.bm: peek-char: latin-1 port" 700000 total 0.25)
("ports.bm: peek-char: utf-8 port, ascii character" 700000 total 2.65)
("ports.bm: peek-char: utf-8 port, Korean character" 700000 total 7.58)
("ports.bm: read-char: latin-1 port" 10000000 total 3.38)
("ports.bm: read-char: utf-8 port, ascii character" 10000000 total 3.31)
("ports.bm: read-char: utf-8 port, Korean character" 10000000 total 3.29)
("ports.bm: char-ready?: latin-1 port" 10000000 total 3.08 user 3.08)
("ports.bm: char-ready?: utf-8 port, ascii character" 10000000 total 3.08)
("ports.bm: char-ready?: utf-8 port, Korean character" 10000000 total 3.05)
--8<---------------cut here---------------end--------------->8---

So ‘peek-char’ is faster, whereas ‘read-char’ gives the same results (to
my surprise, I must say.)

The ‘peek-char’ improvement is beneficial to SSAX.  When loading a 4 MiB
XML file in UTF-8, it’s ~4 times faster than the old method:

--8<---------------cut here---------------start------------->8---
$ time guile -c '(use-modules (sxml simple))
                 (setlocale LC_ALL "")
                 (xml->sxml (open-input-file "chbouib.xml"))'

real    0m20.509s
user    0m20.437s
sys     0m0.064s

$ time ./meta/guile -c '(use-modules (sxml simple))
                        (setlocale LC_ALL "")
                        (xml->sxml (open-input-file "chbouib.xml"))'

real    0m5.676s
user    0m5.599s
sys     0m0.076s
--8<---------------cut here---------------end--------------->8---

For ‘write.bm’:

--8<---------------cut here---------------start------------->8---
;; with iconv:
("write.bm: write: string with escapes" 50 total 0.71)
("write.bm: write: string without escapes" 50 total 0.65)
("write.bm: display: string with escapes" 1000 total 3.39)
("write.bm: display: string without escapes" 1000 total 0.97)

;; with libunistring:
("write.bm: write: string with escapes" 50 total 7.06)
("write.bm: write: string without escapes" 50 total 7.51)
("write.bm: display: string with escapes" 1000 total 1.96)
("write.bm: display: string without escapes" 1000 total 1.46)
--8<---------------cut here---------------end--------------->8---

In the nominal case, ‘display’ is ~30% faster here, and ‘sxml->xml’ is
60% faster on this 4 MiB XML file:

--8<---------------cut here---------------start------------->8---
$ ./meta/guile -c '(use-modules (sxml simple) (ice-9 time))
                   (setlocale LC_ALL "")
                   (define s (xml->sxml (open-input-file "chbouib.xml")))
                   (time (with-output-to-file "/tmp/foo.xml"
                           (lambda () (sxml->xml s))))'
clock utime stime cutime cstime gctime
 2.48  2.44  0.02   0.00   0.00   0.00

$ guile -c '(use-modules (sxml simple) (ice-9 time))
            (setlocale LC_ALL "")
            (define s (xml->sxml (open-input-file "chbouib.xml")))
            (time (with-output-to-file "/tmp/foo.xml"
                    (lambda () (sxml->xml s))))'
clock utime stime cutime cstime gctime
 6.43  6.39  0.04   0.00   0.00   0.00
--8<---------------cut here---------------end--------------->8---

Thanks,
Ludo’.