Lazy sequence for reading binary blocks from an input stream

Michael Ashton Tue, 17 Aug 2010 06:34:26 -0700

I'm a few months into learning Clojure, and thought I'd put this
function out for comment.


I need to take a message digest of files on disk. I'm using a class in
java.security to do this. The class uses an update method which
accepts an array of bytes, and updates the hash. This calls for the
common read-update pattern, but in Clojure. So I decided to try my
hand at a lazy sequence of byte arrays:

(defn stream-block-seq
  "A lazy sequence of blocks read from the given input-stream.  Each
block is returned as a separately allocated Java byte array.  The
maximum block size is given as the optional second argument; the
default is 1024.  A returned block may be shorter than the blocksize.
Usually, the last block will be short.  If the stream is exhausted,
the result is nil."
  ([s blocksize]
     (let [buf (byte-array blocksize)
           readlen (.read s buf)]
       (if (>= readlen 0)
         (lazy-seq
          (let [newbuf (if (< readlen blocksize)
                         (copy-array buf (byte-array readlen) readlen)
                         buf)]
            (cons newbuf (stream-block-seq s blocksize)))))))
  ([s] (stream-block-seq s 1024)))
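
One thing to note about stream-block-seq as written: the first .read
runs when the function is called, and realizing each element triggers
the read for the next one, because the let sits outside lazy-seq. A
minimal sketch of a variant that defers every read until its block is
requested follows. The name lazy-block-seq is hypothetical,
java.util.Arrays/copyOf stands in for copy-array, and an exhausted
stream yields an empty sequence rather than nil.

```clojure
(import 'java.io.ByteArrayInputStream
        'java.io.InputStream
        'java.util.Arrays)

(defn lazy-block-seq
  "Hypothetical variant of stream-block-seq.  Nothing is read from s
until the corresponding element of the sequence is realized."
  ([^InputStream s blocksize]
     (lazy-seq
      ;; The read now happens inside lazy-seq, so it is deferred.
      (let [buf (byte-array blocksize)
            readlen (.read s buf)]
        (when (>= readlen 0)
          (cons (if (< readlen blocksize)
                  ;; Short read: return a trimmed copy of the buffer.
                  (Arrays/copyOf ^bytes buf (int readlen))
                  buf)
                (lazy-block-seq s blocksize))))))
  ([s] (lazy-block-seq s 1024)))

;; Usage with an in-memory stream of 10 zero bytes and 4-byte blocks:
(def block-sizes
  (map count (lazy-block-seq (ByteArrayInputStream. (byte-array 10)) 4)))
```

With a ByteArrayInputStream the blocks come out as 4, 4 and 2 bytes;
other streams may return short blocks at any point.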

Here's copy-array, which has to be defined before stream-block-seq:

(defn copy-array
  ([src srcpos dest destpos len]
     (System/arraycopy src srcpos dest destpos len)
     dest)
  ([src dest len]
     (copy-array src 0 dest 0 len)))

And here's the message-digest function that uses it:

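(import 'java.security.MessageDigest)

(defn message-digest
  "Generates a digest of the given input.  Input is handed to
input-stream, so it can be anything input-stream accepts.  Options are
keyword arguments: :hash names the algorithm and defaults to
\"SHA-256\"; :blocksize is the read size in bytes and defaults to
32768.  The result is a vector of bytes.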

See 
http://download.oracle.com/javase/1.5.0/docs/guide/security/CryptoSpec.html#AppA
for more information on the available hashes."
  ([input & opts]
     (let [opts (merge {:hash "SHA-256" :blocksize 32768}
                       (apply hash-map opts))
           hashname (opts :hash)
           blocksize (opts :blocksize)
           md (MessageDigest/getInstance hashname)]
       (doseq [buf (stream-block-seq (input-stream input) blocksize)]
         (.update md buf))
       (vec (.digest md)))))

This all seems to work, and the performance seems acceptable: with a
32k buffer size, on my Core 2 Duo MacBook it takes about 50ms to hash
a 1MiB file from disk, and 20ms from filesystem cache. However, I'm
sure there's plenty of room for improvement. Is there a cleaner or
more efficient way to do this?
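
For comparison, the same digest can be computed with a plain loop that
reuses a single buffer and passes the valid length to the
three-argument .update, which avoids allocating or trimming blocks.
This is only a sketch under assumptions: digest-stream and hex are
hypothetical names, the function takes an already opened InputStream,
and it closes that stream via with-open, which message-digest above
does not do.

```clojure
(import 'java.io.ByteArrayInputStream
        'java.io.InputStream
        'java.security.MessageDigest)

(defn digest-stream
  "Hypothetical eager counterpart to message-digest.  Reads s through
one reusable buffer, closes s when done, and returns the digest as a
vector of bytes."
  ([^InputStream s] (digest-stream s "SHA-256" 32768))
  ([^InputStream s hashname blocksize]
     (with-open [in s]
       (let [md  (MessageDigest/getInstance hashname)
             buf (byte-array blocksize)]
         (loop [n (.read in buf)]
           (when (>= n 0)
             ;; Only the first n bytes of buf are valid after a read.
             (.update md buf 0 n)
             (recur (.read in buf))))
         (vec (.digest md))))))

(defn hex
  "Hypothetical helper: renders a collection of bytes as lowercase hex."
  [bs]
  (apply str (map #(format "%02x" %) bs)))

;; Usage: SHA-256 of the ASCII string "abc", read from memory.
(def abc-digest
  (hex (digest-stream (ByteArrayInputStream. (.getBytes "abc" "UTF-8")))))
```

The trade-off is that the blocks are no longer available as a
sequence; the loop only suits callers that want the digest and nothing
else.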

I found two previous threads which deal with similar puzzles --

* Resource cleanup when lazy sequences are finalized:
http://groups.google.com/group/clojure/browse_thread/thread/caece062119de072/13c15c62c3397597?lnk=gst&q=lazy+buffered#13c15c62c3397597

* contrib mmap/duck_streams for binary data:
http://groups.google.com/group/clojure/browse_thread/thread/f5239c7e66e7fb54/813b70b68081456d?lnk=gst&q=lazy+binary+stream#813b70b68081456d

I must say that I find the lazy-sequence approach conceptually quite
attractive here, but my taste may not yet be properly formed :)

Comments welcome!

thanks
Michael Ashton.
