Re: Exercise: words frequency ranking
Venlig hilsen and Timothy Prately,

Thanks so much.

Emeka

--~--~-~--~~~---~--~~
You received this message because you are subscribed to the Google Groups Clojure group.
To post to this group, send email to clojure@googlegroups.com
To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com
For more options, visit this group at http://groups.google.com/group/clojure?hl=en
-~--~~~~--~~--~--~---
Re: Exercise: words frequency ranking
Hehe, "venlig hilsen" is Danish for "kind regards" :)

On Sat, Jan 3, 2009 at 3:23 PM, Emeka emekami...@gmail.com wrote:
> Venlig hilsen and Timothy Prately
> Thanks so much.
>
> Emeka

--
Venlig hilsen / Kind regards,
Christian Vest Hansen.
Re: Exercise: words frequency ranking
Thanks. I have learnt something new.

Emeka
Re: Exercise: words frequency ranking
You could consider using a StreamTokenizer:

    (import '(java.io StreamTokenizer BufferedReader FileReader))

    (defn wordfreq [filename]
      (with-local-vars [words {}]
        (let [st (StreamTokenizer. (BufferedReader. (FileReader. filename)))]
          (loop [tt (.nextToken st)]
            (when (not= tt StreamTokenizer/TT_EOF)
              (if (= tt StreamTokenizer/TT_WORD)
                (let [w (.toLowerCase (.sval st))]
                  (var-set words (assoc @words w (inc (@words w 0))))))
              (recur (.nextToken st))))
          (println (reverse (sort (map (fn [[k v]] [v k]) @words)))))))

For me it was faster (even ignoring output):

    user=> (time (wordfreq "wordfreq.txt"))
    "Elapsed time: 444.171796 msecs"
    user=> (time (top-words "wordfreq.txt" "out.txt"))
    "Elapsed time: 618.196978 msecs"

Obviously, if you wanted to take this approach, you could rework it to use your existing printer for a better comparison.

Interestingly, when I compared 3 implementations:

1) by Chouser here: http://groups.google.com/group/clojure/browse_thread/thread/d03e75812de6c6e2/5c47c243474c999d?lnk=gstq=sort+by+value#5c47c243474c999d
2) top-words as described
3) using a StreamTokenizer

I get 3 different histograms using a test file! All very similar but slightly different. It is probably largely related to my test file having opposite-architecture newlines... which shows that word counting is not necessarily a cut-and-dried thing! Hahahaha, so just how many words are in this file??? :)

Regards,
Tim.
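One way to see how implementations can disagree on the histogram is to run two of the tokenizing rules from this thread on the same string. This is only a sketch: the sample text is made up, and it may not be the cause of the differences Tim saw on his file.

```clojure
;; Hypothetical sample text, chosen to contain apostrophes and punctuation.
(def ^String text "Don't stop; it's 2009!")

;; Rule used by top-words: every run of word characters is a word.
(def by-word-chars (re-seq #"\w+" (.toLowerCase text)))
;; => ("don" "t" "stop" "it" "s" "2009")

;; Rule used by the original topwords: split on runs of whitespace.
(def by-whitespace (seq (.split (.toLowerCase text) "\\s+")))
;; => ("don't" "stop;" "it's" "2009!")
```

The first rule yields six tokens and the second four, so any counts built on top of them will differ even though both are reasonable definitions of "word".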
Re: Exercise: words frequency ranking
Hello sir,

I would have asked this question in the thread, but I don't want to create noise over this issue. I have not been able to get my head around your code, or Clojure in general, and I need some support.

    (defn top-words-core [s]
      (reduce #(assoc %1 %2 (inc (%1 %2 0)))
              {}
              (re-seq #"\w+" (.toLowerCase s))))

My limited understanding of inc is that it returns a number one greater than its argument. That much is clear to me. However, which argument is passed here: (%1 %2 0)? Please simplify your explanation as much as you can, because this may help me make a great leap in learning Clojure.

Also, #(%)[3 4] works because of closure, so when you pass #(...) to reduce, does that suspend the closure capability of #()?

Emeka
Re: Exercise: words frequency ranking
    (defn top-words-core [s]
      (reduce #(assoc %1 %2 (inc (%1 %2 0)))
              {}
              (re-seq #"\w+" (.toLowerCase s))))

"Maps are functions of their keys" means:

    user=> ({:a 1, :b 2, :c 3} :a)
    1

Here we created a map {:a 1, :b 2, :c 3}, and then called it like a function with the argument :a, and it found the value associated with that key, which is 1. Maps and keys do this trick by delegating to get, which is a function that looks up stuff:

    user=> (get {:a 1, :b 2} :a)
    1

get also accepts an optional third argument, which is returned if the key is not found in the map:

    user=> (get {:a 1, :b 2} :e 0)
    0

But as we saw, you don't need to call get; you can just call the map:

    user=> ({:a 1, :b 2, :c 3} :e 0)
    0

Keys can be called in the same way (the reverse of what we did above):

    user=> (:e {:a 1, :b 2} 99)
    99

assoc stores a value in a map, associated with a key:

    user=> (assoc {:a 1, :b 2} :b 3)
    {:a 1, :b 3}

So this:

    #(assoc %1 %2 (inc (%1 %2 0)))

could be written as:

    (defn map-count [map key]
      (assoc map key (inc (get map key 0))))

i.e.: given a map and a key, find the value associated with the key in the map, or 0 if it is not in the map, and increment it. Return a map that has the key associated with this new value.

    (re-seq #"\w+" (.toLowerCase s))

turns the input string to lowercase, then applies a regular expression to split it into a sequence of words. Which leaves us with:

    (reduce map-count {} sequence-of-words)

The empty map {} is supplied as the initial map, and a whole sequence of keys (words) is fed in. reduce essentially does this:

    (map-count (map-count {} first-word) second-word)

and so on for every word, i.e. each result is used as the input to the next step.

Regards,
Tim.
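The reduce walk-through above can be traced step by step with reductions, which returns every intermediate result rather than only the last one. This is a minimal sketch: the three-word input is invented, and the parameters are named m and k here only to avoid shadowing clojure.core/map and key.

```clojure
;; Same logic as map-count above, with non-shadowing parameter names.
(defn map-count [m k]
  (assoc m k (inc (get m k 0))))

;; Hypothetical input sequence.
(def sample-words ["the" "cat" "the"])

;; Final result, as reduce computes it.
(def final-counts (reduce map-count {} sample-words))
;; => {"the" 2, "cat" 1}

;; Every intermediate map, starting with the initial empty map.
(def steps (reductions map-count {} sample-words))
;; => ({} {"the" 1} {"the" 1, "cat" 1} {"the" 2, "cat" 1})
```

Each element of steps is the map that gets passed as the first argument to the next call of map-count.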
Re: Exercise: words frequency ranking
Hello.

I can't seem to find 'spit'. I get a Java exception: unable to resolve symbol spit. I'm using Clojure Box rev1142. I tried using the clojure.jar from the 20081217 release of Clojure, but to no avail.

spit is not documented on the Clojure site's API page like slurp is, and I can't find it in clojure core, where slurp is. Why would there be a slurp (read) but not a spit (write)? Obviously, I am doing something wrong. Can anyone help me?

Thanks.
Re: Exercise: words frequency ranking
On Sun, Dec 28, 2008 at 9:22 AM, Boyd Brown boy...@gmail.com wrote:
> Hello. I can't seem to find 'spit'.

'spit' is in clojure-contrib:
http://code.google.com/p/clojure-contrib/source/browse/trunk/src/clojure/contrib/duck_streams.clj?r=325#177

Its inclusion in clojure.core is planned (search for spit):
http://richhickey.backpackit.com/pub/1597914

This is an excellent demonstration of why I think it's best to use 'require' over 'use': it encourages clearly indicating where non-built-in functions are coming from. The code could have looked like:

    (ns word-count
      (:require [clojure.contrib.duck-streams :as ds]))

...and later...

    (ds/spit ...)

--Chouser
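For readers on a later Clojure, here is a small sketch of the same 'require' with an alias style. It uses clojure.string instead of clojure.contrib.duck-streams so that it runs without clojure-contrib; the namespace name word-count.sketch and the function normalize are made up for illustration. (From Clojure 1.2 onward, spit is available in clojure.core, so it needs no require at all.)

```clojure
(ns word-count.sketch
  ;; The alias makes it obvious at each call site where a function lives.
  (:require [clojure.string :as string]))

(defn normalize
  "Trim surrounding whitespace and lower-case a line (hypothetical helper)."
  [line]
  (string/lower-case (string/trim line)))

(normalize "  The CAT ")
;; => "the cat"
```

With 'use', the calls would read (lower-case ...) and (trim ...), and a reader could not tell from the call site which namespace they came from.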
Re: Exercise: words frequency ranking
Thank you for all the improvements and suggestions.

Based on your feedback, here is my final version:

    (defn read-words
      "Given a file, return a seq of every word in the file, normalizing words by
      converting them to lower case and splitting on whitespace"
      [in-filepath]
      (re-seq #"\w+" (.toLowerCase (slurp in-filepath))))

    (defn count-words
      "Given a collection, return a mapping of unique elements in the collection
      to the number of times that the element appears"
      [coll]
      (reduce #(merge-with + %1 {%2 1}) {} coll))

    (defn format-words
      "Given a map from words to their frequencies, return a pretty string,
      sorted in descending order by number of appearances"
      [words]
      (apply str
             (map #(format "%20s : %5d \r\n" (key %) (val %))
                  (sort-by #(- (val %)) words))))

    (defn top-words
      "Compute the frequencies of each word in in-filepath. Output the results to
      out-filepath"
      [in-filepath out-filepath]
      (spit out-filepath
            (format-words (count-words (read-words in-filepath)))))

Some robustness notes:

On a 5.2MB file, it takes 9s, compared to 7s for the improved Mibu version and 7s for my initial one.

On a 38MB file, it takes 53s and about 270MB of memory. The initial one and the Mibu version both take 39s and also about 270MB of memory.

I also like lpetit's code, except that it needs 60s and 530MB of RAM.

regards,
Piotrek
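The count-words step can be tried on its own, without any file, by feeding it a small collection. The sketch below uses an invented sample vector and also compares the result against frequencies, which was added to clojure.core in Clojure 1.2 (after this thread) and computes the same mapping.

```clojure
;; count-words as posted above.
(defn count-words [coll]
  (reduce #(merge-with + %1 {%2 1}) {} coll))

;; Hypothetical sample input.
(def sample ["a" "b" "a" "c" "a"])

(def counted (count-words sample))
;; => {"a" 3, "b" 1, "c" 1}

;; On Clojure 1.2 or later, the built-in gives the same map.
(def built-in (frequencies sample))
```

Whether frequencies is faster than the merge-with version on large inputs is not something this sketch measures.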
Re: Exercise: words frequency ranking
And the nice pastie version: http://pastie.org/347369

regards,
Piotrek
Re: Exercise: words frequency ranking
Instead of #(- (val %)), one could also use the compose function: (comp - val)

My 0,02 EURO,

--
Laurent
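A quick sketch of why the two forms are interchangeable as a sort key: (comp - val) builds a function that first applies val and then negates the result, which is what #(- (val %)) does by hand. The counts map is invented for the example.

```clojure
;; Hypothetical word counts.
(def counts {"the" 3, "and" 2, "of" 5})

;; Anonymous-function form used in the thread.
(def sorted-with-fn (sort-by #(- (val %)) counts))

;; Composed form suggested above.
(def sorted-with-comp (sort-by (comp - val) counts))
;; both => (["of" 5] ["the" 3] ["and" 2])
```

Another option, if negating the key feels indirect, is to pass a comparator: (sort-by val > counts) also sorts in descending order of count.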
Re: Exercise: words frequency ranking
What would you think of this form of coding?

- The rationale is to separate functions that deal with system boundaries from core algorithmic functions. So you should have at least two functions, one of which does not deal with input/output formats and only deals with Clojure/Java constructs.
- Don't expose too early functions that are just there to simplify the algorithm: there's already the possibility to use defn-, but there's also the possibility to embed functions in the principal function by using let and inner functions.
- I also tried to write the core algorithmic function in as functional a style as I can. Do you think the functional version is more or less obfuscated?

Here would be the core function (taking a string as an input, and outputting the sorted sequence of [word count] vectors):

    (defn topwords
      "Takes a string as an input, and returns a sequence of vectors of pairs
      [word nb-of-word-occurrences]"
      [str]
      (let [words (let [ls (System/getProperty "line.separator")]
                    #(.split % ls))
            freqs (partial reduce #(merge-with + %1 {%2 1}) {})
            sort  (partial sort-by (comp - val))]
        (-> str words freqs sort)))

HTH,

--
Laurent
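The separation described in this message can be sketched as a pure core function plus a thin I/O wrapper. This is an illustration under assumptions, not Laurent's code: the names rank-words and rank-file are made up, and the core splits on runs of word characters rather than on the line separator.

```clojure
(defn rank-words
  "Pure core: takes a string, returns [word count] pairs, most frequent first.
  The helpers live in a let, so nothing extra is exposed in the namespace."
  [text]
  (let [words #(re-seq #"\w+" (.toLowerCase ^String %))
        freqs (partial reduce #(merge-with + %1 {%2 1}) {})
        rank  (partial sort-by (comp - val))]
    (-> text words freqs rank)))

(defn rank-file
  "System boundary: reads the file, then delegates to the pure core."
  [path]
  (rank-words (slurp path)))

(rank-words "The cat the")
;; => (["the" 2] ["cat" 1])
```

Because rank-words never touches the file system, it can be exercised at the REPL or in a test with a plain string.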
Re: Exercise: words frequency ranking
On Dec 25, 4:58 pm, Mibu mibu.cloj...@gmail.com wrote:
> My version:
>
> (defn top-words [input-filename result-filename]
>   (spit result-filename
>         (apply str
>                (map #(format "%s : %d\n" (first %) (second %))
>                     (sort-by #(- (val %))
>                              (reduce #(conj %1 {%2 (inc (%1 %2 0))})
>                                      {}
>                                      (map #(.toLowerCase %)
>                                           (re-seq #"\w+" (slurp input-filename)))))))))
>
> Mibu

Once you move .toLowerCase so that it is applied right after slurp, it gets 3 times faster.

regards,
Piotrek
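The change Piotrek describes replaces one .toLowerCase call per word with a single call on the whole text. A minimal sketch of the two orderings on an invented ASCII sample follows; it only checks that they produce the same words and does not attempt to reproduce the timing.

```clojure
;; Hypothetical sample text (ASCII only).
(def ^String text "The quick brown fox. THE END")

;; Mibu's order: find the words first, then lower-case each one.
(def per-word (map #(.toLowerCase ^String %) (re-seq #"\w+" text)))

;; Piotrek's order: lower-case the whole string once, then find the words.
(def up-front (re-seq #"\w+" (.toLowerCase text)))
;; both => ("the" "quick" "brown" "fox" "the" "end")
```

For text outside ASCII, it is worth checking that both orderings still agree before relying on the faster one.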
Re: Exercise: words frequency ranking
My version:

    (defn top-words [input-filename result-filename]
      (spit result-filename
            (apply str
                   (map #(format "%s : %d\n" (first %) (second %))
                        (sort-by #(- (val %))
                                 (reduce #(conj %1 {%2 (inc (%1 %2 0))})
                                         {}
                                         (map #(.toLowerCase %)
                                              (re-seq #"\w+" (slurp input-filename)))))))))

Mibu

On Dec 25, 2:16 pm, Piotr 'Qertoip' Włodarek qert...@gmail.com wrote:
> Given the input text file, the program should write to disk a ranking
> of words sorted by frequency, like:
>
>   the : 52483
>   and : 32558
>    of : 23477
>     a : 22486
>    to : 21993
>
> My first implementation:
>
> (defn topwords [in-filepath, out-filepath]
>   (def words (.split (.toLowerCase (slurp in-filepath)) "\\s+"))
>   (spit out-filepath
>         (apply str
>                (concat
>                 (map (fn [pair] (format "%20s : %5d \r\n" (key pair) (val pair)))
>                      (sort-by #(- (val %))
>                               (reduce (fn [counted-words word]
>                                         (assoc counted-words word (inc (get counted-words word 0))))
>                                       {}
>                                       words)))
>                 ["\r\n"]))))
>
> Somehow I feel it's far from optimal. Could you please advise and
> improve? What is the best, idiomatic implementation of this simple
> problem?
>
> regards,
> Piotrek
Re: Exercise: words frequency ranking
Hi,

On 25.12.2008 at 17:24, wwmorgan wrote:
> A better implementation would split the different steps of the program
> into separate functions. This increases readability and testability of
> the source code, and encourages the reuse of code in new programs.

Yes. One can think of the data flowing through the different functions, being modified as it goes. I have also found that normalising the input for one function in another one can drastically reduce complexity in the final worker.

    (defn topwords
      "Compute the frequencies of each word in in-filepath. Output the results to
      out-filepath"
      [in-filepath out-filepath]
      (spit out-filepath
            (str (freq-format (freqs (word-seq in-filepath))) "\r\n")))

Here you see this data flow: the output of one function is the input of another. Clojure has a very nice macro that also shows this visually: ->. (And it saves a lot of parens...) :)

    (defn topwords
      "Compute ..."
      [in-filepath out-filepath]
      (let [nl     (System/getProperty "line.separator")
            result (-> in-filepath word-seq freqs freq-format (str nl))]
        (spit out-filepath result)))

Of course this has no functional effect, just cosmetics. (Although I replaced the ugly \r\n...)

Similarly for word-seq:

    (defn word-seq
      "Given ..."
      [file]
      (-> file slurp .toLowerCase (.split "\\s+") seq))

Sincerely
Meikel
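To make the -> rewriting concrete, here is a small sketch that writes the same pipeline twice, once nested and once threaded, on an in-memory string instead of a file. The function names words-nested and words-threaded are invented for the comparison.

```clojure
;; Nested form: read from the inside out.
(defn words-nested [^String text]
  (seq (.split (.toLowerCase text) "\\s+")))

;; Threaded form: read left to right. Each step receives the previous
;; result as its first argument, so (.split "\\s+") becomes
;; (.split <previous-result> "\\s+").
(defn words-threaded [^String text]
  (-> text .toLowerCase (.split "\\s+") seq))

(words-threaded "The cat THE")
;; => ("the" "cat" "the")
```

Both functions compile to the same calls; the threaded version only changes the order in which the steps are written.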