Re: Exercise: words frequency ranking

2009-01-03 Thread Emeka
Venlig hilsen and Timothy Prately

Thanks so much.

Emeka

--~--~-~--~~~---~--~~
You received this message because you are subscribed to the Google Groups 
Clojure group.
To post to this group, send email to clojure@googlegroups.com
To unsubscribe from this group, send email to 
clojure+unsubscr...@googlegroups.com
For more options, visit this group at 
http://groups.google.com/group/clojure?hl=en
-~--~~~~--~~--~--~---



Re: Exercise: words frequency ranking

2009-01-03 Thread Christian Vest Hansen

Hehe, "venlig hilsen" is Danish for "kind regards" :)

On Sat, Jan 3, 2009 at 3:23 PM, Emeka <emekami...@gmail.com> wrote:
> Venlig hilsen and Timothy Prately
>
> Thanks so much.
>
> Emeka


 




-- 
Venlig hilsen / Kind regards,
Christian Vest Hansen.




Re: Exercise: words frequency ranking

2009-01-03 Thread Emeka
Thanks. I have learnt something new.

Emeka




Re: Exercise: words frequency ranking

2008-12-29 Thread Timothy Pratley

You could consider using a StreamTokenizer:

(import '(java.io StreamTokenizer BufferedReader FileReader))
(defn wordfreq [filename]
  (with-local-vars [words {}]
    (let [st (StreamTokenizer. (BufferedReader. (FileReader. filename)))]
      ;; pull tokens until end of file, counting only word tokens
      (loop [tt (.nextToken st)]
        (when (not= tt StreamTokenizer/TT_EOF)
          (if (= tt StreamTokenizer/TT_WORD)
            (let [w (.toLowerCase (.sval st))]
              (var-set words (assoc @words w (inc (@words w 0))))))
          (recur (.nextToken st))))
      ;; print [count word] pairs, highest count first
      (println (reverse (sort (map (fn [[k v]] [v k]) @words)))))))


For me it was faster (even ignoring output):
user=> (time (wordfreq "wordfreq.txt"))
"Elapsed time: 444.171796 msecs"
user=> (time (top-words "wordfreq.txt" "out.txt"))
"Elapsed time: 618.196978 msecs"

Obviously if you wanted to take this approach you could rework to
apply your existing printer for a better comparison.

Interestingly when I compared 3 implementations:

1) by Chouser here:
http://groups.google.com/group/clojure/browse_thread/thread/d03e75812de6c6e2/5c47c243474c999d?lnk=gstq=sort+by+value#5c47c243474c999d
2) top-words as described
3) Using a StreamTokenizer

I get 3 different histograms using a test file! All very similar but
slightly different. It is probably largely related to my test file
using the other platform's newline convention... which shows that word
counting is not necessarily a cut and dried thing! Hahahaha, so just how
many words are in this file??? :)
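
One way such histograms can diverge, sketched on a made-up string: the tokenizing rule itself decides what counts as a word. The names sample, words-by-regex and words-by-whitespace are assumptions for illustration, not taken from the three implementations compared above, and frequencies assumes a Clojure release that has it in clojure.core.

```clojure
;; Sketch: two tokenizing rules applied to the same made-up text.
(def sample "Don't stop. Stop!")

(defn words-by-regex
  "Words are runs of word characters, as in top-words."
  [s]
  (re-seq #"\w+" (.toLowerCase s)))

(defn words-by-whitespace
  "Words are whatever sits between runs of whitespace."
  [s]
  (seq (.split (.toLowerCase s) "\\s+")))

(frequencies (words-by-regex sample))
;; => {"don" 1, "t" 1, "stop" 2}
(frequencies (words-by-whitespace sample))
;; => {"don't" 1, "stop." 1, "stop!" 1}
```

The same three "words" of input give four tokens under one rule and three under the other, so neither total is wrong; they answer slightly different questions.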

Regards,
Tim.





Re: Exercise: words frequency ranking

2008-12-29 Thread Emeka
Hello sir,

I would have asked this question in the thread, but I don't want to create
noise over this issue.
I have not been able to get my head around your code or Clojure. I need some
support.


(defn top-words-core [s]
  (reduce #(assoc %1 %2 (inc (%1 %2 0))) {}
          (re-seq #"\w+"
                  (.toLowerCase s))))

My limited understanding of inc is that it returns a number one greater than
its argument. That's clear to me; however, which argument is passed here:
(%1 %2 0)? Please simplify your explanation as much as you can, because this
may help me make a great leap in learning Clojure. Also, #(%)[3 4] works
because of closure, so when you pass #(...) to reduce, does it suspend the
closure capability of #()?

Emeka




Re: Exercise: words frequency ranking

2008-12-29 Thread Timothy Pratley

> (defn top-words-core [s]
>   (reduce #(assoc %1 %2 (inc (%1 %2 0))) {}
>           (re-seq #"\w+"
>                   (.toLowerCase s))))

"Maps are functions of their keys" means:
user=> ({:a 1, :b 2, :c 3} :a)
1
Here we created a map {:a 1, :b 2, :c 3}, and then called it like a
function with the argument :a, and it found the value associated with
that key, which is 1.
Maps and keys do this trick by delegating to get, which is a function
that looks up stuff:
user=> (get {:a 1, :b 2} :a)
1

get also accepts an optional third argument, which is returned if the key
is not found in the map:
user=> (get {:a 1, :b 2} :e 0)
0

But as we saw you don't need to call get, you can just call the map:
user=> ({:a 1, :b 2, :c 3} :e 0)
0

Keys can be called in the same way (reverse of what we did above):
user=> (:e {:a 1, :b 2} 99)
99

assoc stores a value in a map associated with a key:
user=> (assoc {:a 1, :b 2} :b 3)
{:a 1, :b 3}

So this:
#(assoc %1 %2 (inc (%1 %2 0)))

Could be written as:
(defn map-count [map key]
  (assoc map key (inc (get map key 0))))

ie: Given a map and a key, find the value associated with key in map,
or 0 if not in the map, and increment it. Return a map that has the
key associated with this value.


(re-seq #"\w+" (.toLowerCase s)) turns the input string to lowercase,
then applies a regular expression to split it into a sequence of words.
Which leaves us with:

(reduce map-count {} sequence-of-words)

{} (the empty map) is supplied as the initial map and a whole sequence of
keys (words) is fed in.
reduce essentially does this: (map-count (map-count {} first-word)
second-word) etc etc for every word
ie: the result of each call is used as the input to the next.
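
To watch that feeding-back happen, here is a small sketch using reductions, which returns every intermediate result that reduce would pass along. It assumes a Clojure release where reductions lives in clojure.core; the three-word input is made up, and the parameters are named m and k only to avoid shadowing the core functions map and key.

```clojure
(defn map-count
  "Return m with the count stored under k incremented (starting from 0)."
  [m k]
  (assoc m k (inc (get m k 0))))

;; reductions is like reduce, but keeps every intermediate map.
(reductions map-count {} ["the" "cat" "the"])
;; => ({} {"the" 1} {"the" 1, "cat" 1} {"the" 2, "cat" 1})

;; reduce returns only the last of those.
(reduce map-count {} ["the" "cat" "the"])
;; => {"the" 2, "cat" 1}
```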


Regards,
Tim.



Re: Exercise: words frequency ranking

2008-12-28 Thread Boyd Brown

Hello.  I can't seem to find 'spit'.

java exception: unable to resolve symbol spit.

I'm using Clojure Box rev1142.  I tried using the clojure.jar from the
20081217 release of Clojure, but to no avail.

spit is not documented on the clojure site API page like slurp is.  I can't
find it in clojure core where slurp is.  Why would there be a slurp (read)
but not a spit (write)?

Obviously, I am doing something wrong.  Can anyone help me?

Thanks.




Re: Exercise: words frequency ranking

2008-12-28 Thread Chouser

On Sun, Dec 28, 2008 at 9:22 AM, Boyd Brown <boy...@gmail.com> wrote:

> Hello.  I can't seem to find 'spit'.

'spit' is in clojure-contrib:
http://code.google.com/p/clojure-contrib/source/browse/trunk/src/clojure/contrib/duck_streams.clj?r=325#177

Its inclusion in clojure.core is planned (search for spit):
http://richhickey.backpackit.com/pub/1597914

This is an excellent demonstration of why I think it's best to use
'require' over 'use' -- it encourages clearly indicating where
non-built-in functions are coming from.  The code could have looked
like:

(ns word-count
  (:require [clojure.contrib.duck-streams :as ds]))

...and later...

(ds/spit ...)
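
A runnable sketch of the same aliasing pattern follows. clojure.contrib.duck-streams may not be on the classpath of a given install, so this swaps in clojure.string purely as an assumption for illustration; the namespace name word-count-sketch and the function words are made up as well.

```clojure
(ns word-count-sketch
  (:require [clojure.string :as string]))

;; The string/ prefix shows at a glance that lower-case and split
;; come from clojure.string rather than from clojure.core.
(defn words [s]
  (string/split (string/lower-case s) #"\s+"))

(words "The cat THE hat")
;; => ["the" "cat" "the" "hat"]
```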

--Chouser




Re: Exercise: words frequency ranking

2008-12-27 Thread Piotr 'Qertoip' Włodarek

Thank you for all improvements and suggestions. Based on your
feedback, here is my final version:


(defn read-words
  "Given a file, return a seq of every word in the file, normalizing words by
  converting them to lower case and splitting on whitespace"
  [in-filepath]
  (re-seq #"\w+"
          (.toLowerCase (slurp in-filepath))))

(defn count-words
  "Given a collection, return a mapping of unique elements in the collection
  to the number of times that the element appears"
  [coll]
  (reduce #(merge-with + %1 {%2 1}) {} coll))

(defn format-words
  "Given a map from words to their frequencies, return a pretty string,
  sorted in descending order by number of appearances"
  [words]
  (apply str
         (map #(format "%20s : %5d \r\n" (key %) (val %))
              (sort-by #(- (val %))
                       words))))

(defn top-words
  "Compute the frequencies of each word in in-filepath. Output the results to
  out-filepath"
  [in-filepath out-filepath]
  (spit out-filepath
        (format-words (count-words (read-words in-filepath)))))
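
For trying the counting and sorting steps at the REPL without a file, here is a minimal sketch that works on a string. It assumes a Clojure release that has frequencies in clojure.core, which does the same job as count-words; the name rank-words and the sample sentences are made up.

```clojure
(defn rank-words
  "Return [word count] pairs from string s, most frequent first."
  [s]
  (sort-by (comp - val)
           (frequencies (re-seq #"\w+" (.toLowerCase s)))))

;; "the" appears three times and "and" twice, so they lead the ranking.
(take 2 (rank-words "the cat and the hat and the bat"))
;; => (["the" 3] ["and" 2])

;; The same format string as format-words turns the pairs into text.
(apply str (map #(format "%20s : %5d \r\n" (key %) (val %))
                (rank-words "a b a")))
```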


Some robustness notes:

On a 5.2MB file, it takes 9s, compared to 7s for the improved Mibu version
and 7s for my initial one.

On a 38MB file, it takes 53s and about 270MB of memory. By comparison, the
initial version and the Mibu version take 39s and also about 270MB of
memory. I also like lpetit's code, except it needs 60s and 530MB of RAM.


regards,
Piotrek



Re: Exercise: words frequency ranking

2008-12-27 Thread Piotr 'Qertoip' Włodarek

And the nice pastie version: http://pastie.org/347369


regards,
Piotrek



Re: Exercise: words frequency ranking

2008-12-26 Thread lpetit

Instead of #(- (val %)), one could also use function composition:
(comp - val)
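
A quick sketch of the equivalence, using a made-up counts map: (comp - val) builds a function that first applies val and then negates the result, which is what the anonymous function does by hand.

```clojure
(def counts {"the" 3, "a" 1, "of" 2})

(sort-by #(- (val %)) counts)
;; => (["the" 3] ["of" 2] ["a" 1])

(sort-by (comp - val) counts)
;; => (["the" 3] ["of" 2] ["a" 1])
```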

My 0,02 EURO,

--
Laurent

On Dec 25, 4:58 pm, Mibu <mibu.cloj...@gmail.com> wrote:
> My version:
>
> (defn top-words [input-filename result-filename]
>   (spit result-filename
>         (apply str
>                (map #(format "%s : %d\n" (first %) (second %))
>                     (sort-by #(- (val %))
>                              (reduce #(conj %1 {%2 (inc (%1 %2 0))}) {}
>                                      (map #(.toLowerCase %)
>                                           (re-seq #"\w+"
>                                                   (slurp input-filename)))))))))
>
> Mibu
>
> On Dec 25, 2:16 pm, Piotr 'Qertoip' Włodarek <qert...@gmail.com>
> wrote:
>
> > Given the input text file, the program should write to disk a ranking
> > of words sorted by frequency, like:
> >
> >                  the : 52483
> >                  and : 32558
> >                   of : 23477
> >                    a : 22486
> >                   to : 21993
> >
> > My first implementation:
> >
> > (defn topwords [in-filepath, out-filepath]
> >   (def words (.split (.toLowerCase (slurp in-filepath)) "\\s+"))
> >
> >   (spit out-filepath
> >         (apply str
> >                (concat
> >                  (map (fn [pair] (format "%20s : %5d \r\n" (key pair) (val pair)))
> >                       (sort-by #( -(val %) )
> >                                (reduce
> >                                  (fn [counted-words word]
> >                                    (assoc counted-words
> >                                           word
> >                                           (inc (get counted-words word 0))))
> >                                  {}
> >                                  words)))
> >                  ["\r\n"]))))
> >
> > Somehow I feel it's far from optimal. Could you please advise and
> > improve? What is the best, idiomatic implementation of this simple
> > problem?
> >
> > regards,
> > Piotrek



Re: Exercise: words frequency ranking

2008-12-26 Thread lpetit

What would you think of this form of coding?
- The rationale is to separate functions that deal with system
boundaries from core algorithmic functions.
So you should have at least two functions, one of which does not deal
with input/output formats and only deals with Clojure/Java
constructs.
- Don't expose too early functions that are only there to simplify
the algorithm: there's already the possibility of using defn-, but
there's also the possibility of embedding functions in the principal
function by using let and inner functions.
- And I also tried to write the core algorithmic function in as
functional a style as I can.
Do you think the functional version is more or less obfuscated?

Here would be the core function (taking a string as an input, and
outputting the sorted sequence of [word 2] vectors):

(defn topwords
  "Takes a string as an input, and returns a sequence of vectors of
  pairs [word nb-of-word-occurences]"
  [str]
  (let [words (let [ls (System/getProperty "line.separator")]
                #(.split % ls))
        freqs (partial reduce #(merge-with + %1 {%2 1}) {})
        sort (partial sort-by (comp - val))]
    (-> str words freqs sort)))


HTH,
--
Laurent




Re: Exercise: words frequency ranking

2008-12-26 Thread Piotr 'Qertoip' Włodarek

On Dec 25, 4:58 pm, Mibu <mibu.cloj...@gmail.com> wrote:
> My version:
>
> (defn top-words [input-filename result-filename]
>   (spit result-filename
>         (apply str
>                (map #(format "%s : %d\n" (first %) (second %))
>                     (sort-by #(- (val %))
>                              (reduce #(conj %1 {%2 (inc (%1 %2 0))}) {}
>                                      (map #(.toLowerCase %)
>                                           (re-seq #"\w+"
>                                                   (slurp input-filename)))))))))
>
> Mibu

Once you move .toLowerCase right after slurp, it gets 3 times faster.
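
A sketch of the two orderings on a made-up string, to show that they produce the same words. The names text, words-per-word and words-once are assumptions. A likely (but here unmeasured) reason for the speed difference is that the first form makes one un-hinted .toLowerCase call per word, while the second makes a single call on the whole text.

```clojure
(def text "The THE the Cat")

;; Lowercase each word after splitting: one .toLowerCase call per word.
(defn words-per-word [s]
  (map #(.toLowerCase %) (re-seq #"\w+" s)))

;; Lowercase the whole string once, then split.
(defn words-once [s]
  (re-seq #"\w+" (.toLowerCase s)))

(= (words-per-word text) (words-once text))
;; => true
```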


regards,
Piotrek



Re: Exercise: words frequency ranking

2008-12-25 Thread Mibu

My version:

(defn top-words [input-filename result-filename]
  (spit result-filename
        (apply str
               (map #(format "%s : %d\n" (first %) (second %))
                    (sort-by #(- (val %))
                             (reduce #(conj %1 {%2 (inc (%1 %2 0))}) {}
                                     (map #(.toLowerCase %)
                                          (re-seq #"\w+"
                                                  (slurp input-filename)))))))))

Mibu




Re: Exercise: words frequency ranking

2008-12-25 Thread Meikel Brandmeyer

Hi,

On 25.12.2008 at 17:24, wwmorgan wrote:

> A better implementation would split the different steps of the program
> into separate functions. This increases readability and testability of
> the source code, and encourages the reuse of code in new programs.


Yes. One can think of the data flowing through the different
functions, being modified as it goes. I have also found
that normalising the input for one function in another one can
drastically reduce complexity in the final worker.


> (defn topwords
>   "Compute the frequencies of each word in in-filepath. Output the results to
>   out-filepath"
>   [in-filepath out-filepath]
>   (spit out-filepath
>         (str (freq-format (freqs (word-seq in-filepath)))
>              "\r\n")))


Here you see this data flow: the output of one function is the
input of another. Clojure has a very nice macro that also shows this
visually: ->. (And it saves a lot of parens...) :)

(defn topwords
  "Compute ..."
  [in-filepath out-filepath]
  (let [nl     (System/getProperty "line.separator")
        result (-> in-filepath word-seq freqs freq-format (str nl))]
    (spit out-filepath result)))

Of course this has no functional effect, just cosmetics.
(Although I replaced the ugly "\r\n"...)

Similar for word-seq:

(defn word-seq
  "Given ..."
  [file]
  (-> file slurp .toLowerCase (.split "\\s+") seq))
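
To see what -> does without needing a file, here is a sketch of the same chain on a literal string (the string is made up). Each result is inserted as the first argument of the next form, so the threaded version reads in the order the data flows.

```clojure
;; Nested form: read from the inside out.
(seq (.split (.toLowerCase "The Cat") "\\s+"))
;; => ("the" "cat")

;; Threaded form: read left to right.
(-> "The Cat" .toLowerCase (.split "\\s+") seq)
;; => ("the" "cat")
```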

Sincerely
Meikel


