Thank for all of your advices.

1. I used iota library for convenience.
2. Even though I updated Clojure hash-map by using the transient function, 
it was very slow. Because the final HashMap size is about 300,000,000.  So 
I had no choice but to using mutable Java data structure, HashMap.
3. I could calculate occurrence counts in parallel, and merge the pairs of 
HashMaps in parallel by using core.async.

The results are as follows :
Java version: 9 minutes 30 seconds (Single core)
Clojure version: 3 minutes 45 seconds (20 cores)

We saw a good chance. 
So our team will continue to try prototyping with Clojure.

This was sample code we were testing. 

(ns parallel-test.core
  (:require [clojure.core.async :as async]
            [iota :as iota])
  (:import (java.util HashMap Map$Entry))
  (:gen-class))

(def corpus-file-url "resources/korean.txt")
(def OC (atom nil))
(def MPL 12)

(def cpu-core-num 16)
(def corpus-file-vec (iota/vec corpus-file-url))
(def corpus-lines-num (count corpus-file-vec))
(def each-size (-> (/ corpus-lines-num cpu-core-num)
                   Math/ceil
                   int))

(defn add-pattern-to-hashmap
  [^HashMap h-map ^String ptn ^Integer ptn-oc]
  (let [h-ptn-oc (.get h-map ptn)
        n-ptn-oc (if (nil? h-ptn-oc)
                   ptn-oc
                   (+ h-ptn-oc ptn-oc))]
    (.put h-map ptn n-ptn-oc)))

(defn cal-lines-oc
  [lines]
  (let [r-map (HashMap.)]
    (doseq [line lines]
      (let [line-length (count line)]
        (doseq [i (range line-length)]
          (doseq [j (range 1 (inc MPL))
                  :let [end-index (+ i j)]
                  :while (<= end-index line-length)]
            (let [pattern (subs line i end-index)]
              (add-pattern-to-hashmap r-map pattern 1))))))
    r-map))

(defn merge-hashmap
  [^HashMap l-map ^HashMap r-map]
  (println "Merged map size: " [(count l-map) (count r-map)])
  (doseq [^Map$Entry entry (.entrySet r-map)]
    (add-pattern-to-hashmap l-map (.getKey entry) (.getValue entry)))
  l-map)

(defn parallel-cal-oc
  [pipeline process-fn input-vec]
  (doall
    (->> (map list (range) (partition-all each-size input-vec))
         (map (fn [[index lines]]
                (println (* (inc index) each-size) " lines processing!")
                (future (async/>!! pipeline (process-fn lines))))))))

(defn parallel-merge-hashmap
  [pipeline batch-num out merge-hashmap]
  (async/go-loop [m-count 1]
    (if (>= m-count batch-num)
      (do
        (async/>! out (async/<! pipeline))
        (async/close! pipeline))
      (let [l-map (async/<! pipeline)
            r-map (async/<! pipeline)]
        (println "current m-count: " m-count)
        (future (async/>!! pipeline (merge-hashmap l-map r-map)))
        (recur (inc m-count))))))

(defn -main [& args]
  (let [start-time (System/currentTimeMillis)
        pipeline (async/chan cpu-core-num)
        out (async/chan 1)
        batch-num (-> (/ corpus-lines-num each-size)
                      Math/ceil
                      int)]

    (println "start time: " start-time)
    (parallel-cal-oc pipeline cal-lines-oc corpus-file-vec)
    (parallel-merge-hashmap pipeline batch-num out merge-hashmap)
    (reset! OC (async/<!! out))
    (async/close! out)
    (let [end-time (System/currentTimeMillis)
          elapsed-time (double (/ (- end-time start-time) 60000))
          minute (int elapsed-time)
          second (* (rem elapsed-time 1) 60)
          elapsed-time-str (str "Elapsed time " minute ":" second)]
      (println "OC hashmap size: " (count @OC))
      (println "end time: " end-time)
      (println elapsed-time-str))))


2017년 9월 13일 수요일 오전 1시 43분 58초 UTC+9, darren...@gmail.com 님의 말:
>
> Hi, 
>
> I am a researcher of Natural Language Processing.
> My team want to know how well does Clojure parallelize and how much time 
> is reduced compared by Java single thread version.
>
> The problem we want to solve is, 
> there is a big corpus file (just now 500MB).
> Reading sentences line by line, find all patterns and their occurrence 
> count on length 1 through 12.
>
> It is a very simple problem and It doesn’t care of order of processing.
> We want to make just a big hash-map. (Key is a pattern string, Value is a 
> occurrence count.)
> Ex) { “father” 10000000 “mother” 10000000 … }
>
> Comparing performance between Java and Clojure, if Clojure version is 
> better than Java, 
> then we’ll change our code base to Clojure, if not, we cannot help staying 
> Java. 
>
> Anyway my first prototype is very very slow. I’m a novice.  :( 
>
> Please give me some advices.
> Thanks.  
>
> (ns parallel-test.core
>  (:require [clojure.java.io :as jio]
>            [clojure.core.reducers :refer [fold]])
>  (:gen-class))
>
> (def corpus-file-url "resources/korean.txt")
> (def OC (atom nil))
> (def MPL 12)
> (def each-size 10000)
>
> (defn add-pattern-to-hashmap
>  [h-map ^String ptn ^Integer ptn-oc]
>  (let [h-ptn-oc (get h-map ptn)
>        n-ptn-oc (if (nil? h-ptn-oc)
>                  ptn-oc
>                  (+ h-ptn-oc ptn-oc))]
>   (assoc h-map ptn n-ptn-oc)))
>
> (defn merge-hash-map
>  ([] (hash-map))
>  ([& hs]
>   (reduce (fn [l-map r-map]
>            (reduce (fn [[ptn ptn-oc]]
>                     (add-pattern-to-hashmap l-map ptn ptn-oc))
>                    r-map))
>           hs)))
>
> (defn cal-line-oc
>  ([] (hash-map))
>  ([h-map ^String line]
>   (let [line-length (count line)]
>    (loop [i 0
>           i-map h-map]
>     (if (>= i line-length)
>      i-map
>      (recur (inc i)
>             (loop [j 1
>                    j-map i-map]
>              (let [end-index (+ i j)]
>               (if (or (> j MPL) (> end-index line-length))
>                j-map
>                (recur (inc j)
>                       (add-pattern-to-hashmap j-map (subs line i end-index) 
> 1)))))))))))
>
> (defn parallel-process
>  [combine-fn reduce-fn input-file]
>  (with-open [rdr (jio/reader input-file)]
>   (fold each-size
>         combine-fn
>         reduce-fn
>         (line-seq rdr))))
>
> (defn -main [& args]
>  (println "start")
>  (reset! OC (parallel-process merge-hash-map cal-line-oc corpus-file-url))
>  (println "end"))
>
>

-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
--- 
You received this message because you are subscribed to the Google Groups 
"Clojure" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to clojure+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Reply via email to