Thank you for all of your advice.

1. I used the iota library for convenience.
2. Even when I updated a Clojure hash-map through transients, it was very slow, because the final HashMap holds about 300,000,000 entries. So I had no choice but to use a mutable Java data structure, HashMap.
3. I could calculate the occurrence counts in parallel, and merge pairs of HashMaps in parallel, by using core.async.
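To make points 2 and 3 concrete, here is a minimal sketch of the idea, not the code we actually ran. It assumes plain futures instead of core.async, and the names add-count!, count-patterns, merge-counts! and parallel-count are hypothetical, chosen only for this illustration. Each chunk of lines is counted into its own mutable HashMap, and the per-chunk maps are then folded into one.

```clojure
(import '(java.util HashMap Map$Entry)
        '(java.util.function BiFunction))

;; Sums the old and new count when a key is already present.
(def ^BiFunction add-counts
  (reify BiFunction
    (apply [_ old-count new-count] (+ old-count new-count))))

(defn add-count!
  "Adds n to the count stored under k in the mutable map m."
  [^HashMap m k n]
  (.merge m k n add-counts)
  m)

(defn count-patterns
  "Counts every substring of length 1..max-len in lines (hypothetical helper)."
  [lines max-len]
  (let [m (HashMap.)]
    (doseq [^String line lines
            :let [len (count line)]
            i (range len)
            j (range 1 (inc max-len))
            :let [end (+ i j)]
            :while (<= end len)]
      (add-count! m (subs line i end) 1))
    m))

(defn merge-counts!
  "Adds every count in right to left, mutating and returning left."
  [^HashMap left ^HashMap right]
  (doseq [^Map$Entry e (.entrySet right)]
    (add-count! left (.getKey e) (.getValue e)))
  left)

(defn parallel-count
  "Counts each chunk in its own future, then merges the results one by one."
  [lines chunk-size max-len]
  (let [chunks  (partition-all chunk-size lines)
        futures (mapv #(future (count-patterns % max-len)) chunks)]
    (reduce merge-counts! (HashMap.) (map deref futures))))
```

For example, (count-patterns ["abab"] 2) yields a map in which "a", "b" and "ab" each have count 2 and "ba" has count 1. In this sketch the final merge is sequential; the sample code in this message merges pairs of maps concurrently instead.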
The results are as follows:

Java version: 9 minutes 30 seconds (single core)
Clojure version: 3 minutes 45 seconds (20 cores)

We see good potential here, so our team will continue prototyping with Clojure.

This is the sample code we tested.

(ns parallel-test.core
  (:require [clojure.core.async :as async]
            [iota :as iota])
  (:import (java.util HashMap Map$Entry))
  (:gen-class))

(def corpus-file-url "resources/korean.txt")
(def OC (atom nil))
(def MPL 12) ; patterns of length 1 through MPL are counted
(def cpu-core-num 16)
(def corpus-file-vec (iota/vec corpus-file-url))
(def corpus-lines-num (count corpus-file-vec))
;; number of lines handed to each worker
(def each-size (-> (/ corpus-lines-num cpu-core-num) Math/ceil int))

;; Adds ptn-oc to the count stored for ptn in the mutable map.
(defn add-pattern-to-hashmap
  [^HashMap h-map ^String ptn ^Integer ptn-oc]
  (let [h-ptn-oc (.get h-map ptn)
        n-ptn-oc (if (nil? h-ptn-oc)
                   ptn-oc
                   (+ h-ptn-oc ptn-oc))]
    (.put h-map ptn n-ptn-oc)))

;; Counts every substring of length 1..MPL in the given lines.
(defn cal-lines-oc
  [lines]
  (let [r-map (HashMap.)]
    (doseq [line lines]
      (let [line-length (count line)]
        (doseq [i (range line-length)]
          (doseq [j (range 1 (inc MPL))
                  :let [end-index (+ i j)]
                  :while (<= end-index line-length)]
            (let [pattern (subs line i end-index)]
              (add-pattern-to-hashmap r-map pattern 1))))))
    r-map))

;; Merges r-map into l-map in place and returns l-map.
(defn merge-hashmap
  [^HashMap l-map ^HashMap r-map]
  (println "Merged map size: " [(count l-map) (count r-map)])
  (doseq [^Map$Entry entry (.entrySet r-map)]
    (add-pattern-to-hashmap l-map (.getKey entry) (.getValue entry)))
  l-map)

;; Starts one future per chunk; each puts its result map on the pipeline channel.
(defn parallel-cal-oc
  [pipeline process-fn input-vec]
  (doall
   (->> (map list (range) (partition-all each-size input-vec))
        (map (fn [[index lines]]
               (println (* (inc index) each-size) " lines processing!")
               (future (async/>!! pipeline (process-fn lines))))))))

;; Takes two maps at a time off the pipeline, merges them in a future, and
;; puts the merged map back; after batch-num - 1 merges the last map goes to out.
(defn parallel-merge-hashmap
  [pipeline batch-num out merge-hashmap]
  (async/go-loop [m-count 1]
    (if (>= m-count batch-num)
      (do (async/>! out (async/<! pipeline))
          (async/close! pipeline))
      (let [l-map (async/<! pipeline)
            r-map (async/<! pipeline)]
        (println "current m-count: " m-count)
        (future (async/>!! pipeline (merge-hashmap l-map r-map)))
        (recur (inc m-count))))))

(defn -main
  [& args]
  (let [start-time (System/currentTimeMillis)
        pipeline (async/chan cpu-core-num)
        out (async/chan 1)
        batch-num (-> (/ corpus-lines-num each-size) Math/ceil int)]
    (println "start time: " start-time)
    (parallel-cal-oc pipeline cal-lines-oc corpus-file-vec)
    (parallel-merge-hashmap pipeline batch-num out merge-hashmap)
    (reset! OC (async/<!! out))
    (async/close! out)
    (let [end-time (System/currentTimeMillis)
          elapsed-time (double (/ (- end-time start-time) 60000))
          minute (int elapsed-time)
          second (* (rem elapsed-time 1) 60)
          elapsed-time-str (str "Elapsed time " minute ":" second)]
      (println "OC hashmap size: " (count @OC))
      (println "end time: " end-time)
      (println elapsed-time-str))))

On Wednesday, September 13, 2017 at 1:43:58 AM UTC+9, darren...@gmail.com wrote:
>
> Hi,
>
> I am a researcher in Natural Language Processing.
> My team wants to know how well Clojure parallelizes and how much time
> it saves compared with a single-threaded Java version.
>
> The problem we want to solve is this:
> there is a big corpus file (currently 500MB).
> Reading sentences line by line, we find all patterns of length 1 through 12
> and their occurrence counts.
>
> It is a very simple problem, and the order of processing doesn't matter.
> We just want to build one big hash-map. (The key is a pattern string, the value is its
> occurrence count.)
> Ex) { "father" 10000000 "mother" 10000000 … }
>
> We are comparing performance between Java and Clojure. If the Clojure version is
> better than the Java one,
> we'll change our code base to Clojure; if not, we'll have to stay with
> Java.
>
> Anyway, my first prototype is very, very slow. I'm a novice. :(
>
> Please give me some advice.
> Thanks.
>
> (ns parallel-test.core
>   (:require [clojure.java.io :as jio]
>             [clojure.core.reducers :refer [fold]])
>   (:gen-class))
>
> (def corpus-file-url "resources/korean.txt")
> (def OC (atom nil))
> (def MPL 12)
> (def each-size 10000)
>
> (defn add-pattern-to-hashmap
>   [h-map ^String ptn ^Integer ptn-oc]
>   (let [h-ptn-oc (get h-map ptn)
>         n-ptn-oc (if (nil? h-ptn-oc)
>                    ptn-oc
>                    (+ h-ptn-oc ptn-oc))]
>     (assoc h-map ptn n-ptn-oc)))
>
> (defn merge-hash-map
>   ([] (hash-map))
>   ([& hs]
>    (reduce (fn [l-map r-map]
>              ;; fold each [pattern count] entry of r-map into l-map;
>              ;; the reducing fn takes the accumulator and one entry
>              (reduce (fn [acc [ptn ptn-oc]]
>                        (add-pattern-to-hashmap acc ptn ptn-oc))
>                      l-map
>                      r-map))
>            hs)))
>
> (defn cal-line-oc
>   ([] (hash-map))
>   ([h-map ^String line]
>    (let [line-length (count line)]
>      (loop [i 0
>             i-map h-map]
>        (if (>= i line-length)
>          i-map
>          (recur (inc i)
>                 (loop [j 1
>                        j-map i-map]
>                   (let [end-index (+ i j)]
>                     (if (or (> j MPL) (> end-index line-length))
>                       j-map
>                       (recur (inc j)
>                              (add-pattern-to-hashmap j-map (subs line i end-index) 1))))))))))))
>
> (defn parallel-process
>   [combine-fn reduce-fn input-file]
>   (with-open [rdr (jio/reader input-file)]
>     (fold each-size
>           combine-fn
>           reduce-fn
>           (line-seq rdr))))
>
> (defn -main [& args]
>   (println "start")
>   (reset! OC (parallel-process merge-hash-map cal-line-oc corpus-file-url))
>   (println "end"))
>

--
You received this message because you are subscribed to the Google Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your first post.
To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com
For more options, visit this group at http://groups.google.com/group/clojure?hl=en