On Jan 21, 2014, at 07:11, Chris Perkins <chrisperkin...@gmail.com> wrote:

> This part: (some #{hashed} already-seen) is doing a linear lookup in 
> `already-seen`. Try (contains? already-seen hashed) instead.

Or just (already-seen hashed), given that OP's not trying to store nil hashes.
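
A quick sketch of the difference, assuming `already-seen` is a set (the sample 
values and the name `hashed` here are made up for illustration, not taken from 
OP's code):

```clojure
;; Hypothetical data: a set of hash strings that have already been encountered.
(def already-seen #{"a1b2" "c3d4"})
(def hashed "c3d4")

;; Linear scan: `some` walks the set as a seq and applies #{hashed}
;; to each element until one matches.
(some #{hashed} already-seen)      ; returns "c3d4"

;; Hash lookup: does not walk the elements.
(contains? already-seen hashed)    ; returns true

;; A set is also a function of its elements: it returns the element, or nil
;; on a miss. That only works as a membership test if nil/false is never stored.
(already-seen hashed)              ; returns "c3d4"
(already-seen "zzzz")              ; returns nil
```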

To OP: note that if you’re storing the hashes as strings (as it appears), 
you’re using at least 16 more bytes per hash than necessary, assuming a 128-bit 
digest written as 32 hex characters. If you’re really going to be dealing with 
so many URLs that you’d use too much memory by storing the unique URLs 
directly, then you should probably be storing the hashes as byte arrays.
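
One catch with raw byte arrays: on the JVM they compare and hash by identity, 
so two arrays with the same contents count as different set members. A minimal 
sketch of one way around that, assuming MD5 as the hash (OP's actual hash 
function isn't shown) and wrapping each digest in a java.nio.ByteBuffer, whose 
equals/hashCode are content-based. The name `url-digest` and the sample URLs 
are hypothetical:

```clojure
(import '(java.security MessageDigest)
        '(java.nio ByteBuffer))

(defn url-digest
  "Returns the MD5 digest of url (an assumed hash choice) as a ByteBuffer
  wrapping the 16 raw bytes, so equal digests compare equal in a set."
  [^String url]
  (let [md (MessageDigest/getInstance "MD5")]
    (ByteBuffer/wrap (.digest md (.getBytes url "UTF-8")))))

;; Hypothetical input with one duplicate URL.
(def urls ["http://example.com/a"
           "http://example.com/b"
           "http://example.com/a"])

(def seen (set (map url-digest urls)))

(count seen)                                          ; returns 2
(contains? seen (url-digest "http://example.com/a"))  ; returns true
```

The wrapper object has its own per-entry overhead, so it's worth measuring 
actual memory use before committing to this layout.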

Alternatively, if you’re going to be dealing with REALLY large files and are 
running on Linux/BSD, consider dumping just the URLs to a file and running 
`sort -u` on it. Unix sort can efficiently handle files that are too large to 
fit in memory, via external merge sort.
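
If you'd rather drive that from Clojure, here is a rough sketch that writes one 
URL per line and shells out to `sort -u`. It assumes a `sort` binary on the 
PATH; the function name `dedupe-with-sort` and the file paths are made up for 
illustration:

```clojure
(require '[clojure.java.io :as io]
         '[clojure.java.shell :refer [sh]])

(defn dedupe-with-sort
  "Writes urls to in-path, one per line, then runs `sort -u` with its output
  going to out-path. Returns the exit code of sort (0 on success)."
  [urls in-path out-path]
  (with-open [w (io/writer in-path)]
    (doseq [u urls]
      (.write w (str u "\n"))))
  ;; -u drops duplicate lines, -o names the output file.
  (:exit (sh "sort" "-u" "-o" out-path in-path)))

(defn read-lines
  "Reads all lines of path into a vector. Fine for checking small results;
  for huge outputs, process the lines inside with-open instead."
  [path]
  (with-open [r (io/reader path)]
    (vec (line-seq r))))
```

In a real run the URLs would be streamed to the file as they are extracted 
rather than held in a collection first.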
