I am processing a large XML file (13MB) using clojure.xml/parse
and clojure.contrib.zip-filter.xml with Clojure 1.0.0.

The XML file contains information on 13000 Japanese characters, and
I'm extracting about 200 of them.

At its core it extracts a very small subset of elements using:

(xml-> kdic :character [:literal #(contains? kcset (text %))] node)

Where kcset is a set of desired characters.

My understanding is that this returns a lazy seq, so calling count on
it should return about 200 (not 13000).  In practice, though, it
throws a StackOverflowError.
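
To make that expectation concrete, here is a minimal sketch using only
clojure.core.  The names entries, wanted and visited are hypothetical
stand-ins (integers instead of XML nodes); the point is only that
count on a lazily filtered seq returns the number of matches, while
still having to examine every source element to get there.

```clojure
;; Stand-in data: 13000 integers play the role of the :character
;; entries, and `wanted` plays the role of kcset. Both are hypothetical.
(def entries (range 13000))
(def wanted #{5 10 15})

;; Counts how many source elements the predicate is called on.
(def visited (atom 0))

;; filter is lazy: nothing is examined until the seq is consumed.
(def matches
  (filter (fn [x]
            (swap! visited inc)
            (contains? wanted x))
          entries))

(println (count matches)) ; prints 3, the number of matching elements
(println @visited)        ; prints 13000, every source element was examined
```

So the laziness does not let count skip the non-matching entries; it
only delays the work until count walks the seq.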

At the end of this post is a relatively short version of the program
that reproduces the stack overflow.  In that version a (count ...)
call triggers it.  In the full program I tried a few variations
like so:

    (dorun (for [knode knodes] (print-kinfo knode)))

This was an attempt to get the information to print, but it also
throws a stack overflow before reaching the end of the list.
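
For reference, a side-effecting loop like this is often written with
doseq rather than dorun over for.  The sketch below assumes a
hypothetical stand-in for print-kinfo (the real one is not shown in
this post) and hypothetical nodes shaped like clojure.xml/parse
output.  It still walks the whole seq, so it would not by itself be
expected to avoid the overflow.

```clojure
;; Hypothetical stand-in for the real print-kinfo, which is not shown.
(defn print-kinfo [knode]
  (println (:tag knode) (:content knode)))

;; Hypothetical nodes shaped like the maps clojure.xml/parse produces.
(def knodes
  [{:tag :character :attrs nil :content ["a"]}
   {:tag :character :attrs nil :content ["b"]}])

;; doseq runs the body once per element for its side effects and
;; returns nil, without building an intermediate result seq.
(doseq [knode knodes]
  (print-kinfo knode))
```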

The stack trace is at the end as well.

Thanks!


Here's the short version of the program:

(ns kanji.prkanji
    (:use clojure.xml )
    (:use [clojure.zip :only (xml-zip node)])
    (:use clojure.contrib.zip-filter.xml)
    (:import java.lang.Character$UnicodeBlock)
    (:import java.io.File))

(def CJK Character$UnicodeBlock/CJK_UNIFIED_IDEOGRAPHS)

(defn filter-for-kanji
    [chars]
    (filter #(= CJK (Character$UnicodeBlock/of %)) chars))

(defn get-unique-kanji
    [chars]
    (let [kchars (filter-for-kanji chars)]
             (set kchars)))

(defn print-kinfos
    [knodes]
    (count knodes))
;; this is what I would normally do:
;; (dorun (for [knode knodes] (print-kinfo knode)))

(defn get-kdic-info
    [kdic kchars]
    (let [kcset (set (map str kchars))]
        (xml-> kdic :character [:literal #(contains? kcset (text %))] node)))

(defn load-kdic
    [fname]
    (xml-zip (parse (File. fname))))

(defn process-file
    [file]
    (let [kchars (get-unique-kanji (slurp file))]
        (print-kinfos
            (get-kdic-info
                (load-kdic "kanji/kdic-data.xml") kchars))))

(process-file (second *command-line-args*))

And here's the top of the stack trace:

Exception in thread "main" java.lang.StackOverflowError (prkanji.clj:0)
        at clojure.lang.Compiler.eval(Compiler.java:4543)
        at clojure.lang.Compiler.load(Compiler.java:4857)
        at clojure.lang.Compiler.loadFile(Compiler.java:4824)
        at clojure.main$load_script__5833.invoke(main.clj:206)
        at clojure.main$init_opt__5836.invoke(main.clj:211)
        at clojure.main$initialize__5846.invoke(main.clj:239)
        at clojure.main$null_opt__5868.invoke(main.clj:264)
        at clojure.main$legacy_script__5883.invoke(main.clj:295)
        at clojure.lang.Var.invoke(Var.java:346)
        at clojure.main.legacy_script(main.java:34)
        at clojure.lang.Script.main(Script.java:20)
Caused by: java.lang.StackOverflowError
        at clojure.lang.Cons.next(Cons.java:37)
        at clojure.lang.RT.boundedLength(RT.java:1117)
        at clojure.lang.AFn.applyToHelper(AFn.java:168)
        at clojure.lang.RestFn.applyTo(RestFn.java:137)
        at clojure.core$apply__3243.doInvoke(core.clj:390)
        at clojure.lang.RestFn.invoke(RestFn.java:443)
        at clojure.core$mapcat__3842.doInvoke(core.clj:1528)
        at clojure.lang.RestFn.invoke(RestFn.java:428)
        at clojure.contrib.zip_filter$descendants__48$fn__50.invoke(zip_filter.clj:63)
        at clojure.lang.LazySeq.seq(LazySeq.java:41)
        at clojure.lang.RT.seq(RT.java:436)
        at clojure.lang.LazySeq.seq(LazySeq.java:41)
        at clojure.lang.RT.seq(RT.java:436)
        at clojure.core$seq__3133.invoke(core.clj:103)
        at clojure.core$map__3815$fn__3817.invoke(core.clj:1502)
        at clojure.lang.LazySeq.seq(LazySeq.java:41)
        at clojure.lang.Cons.next(Cons.java:37)
        at clojure.lang.RT.boundedLength(RT.java:1117)
        at clojure.lang.RestFn.applyTo(RestFn.java:135)
        at clojure.core$apply__3243.doInvoke(core.clj:390)
        at clojure.lang.RestFn.invoke(RestFn.java:428)
        at clojure.core$mapcat__3842.doInvoke(core.clj:1528)
        at clojure.lang.RestFn.invoke(RestFn.java:428)
        at clojure.contrib.zip_filter$mapcat_chain__65$fn__67.invoke(zip_filter.clj:88)
        at clojure.lang.ArraySeq.reduce(ArraySeq.java:116)
        at clojure.core$reduce__3319.invoke(core.clj:536)
        at clojure.contrib.zip_filter$mapcat_chain__65.invoke(zip_filter.clj:89)
        at clojure.contrib.zip_filter.xml$xml__GT___119.doInvoke(xml.clj:75)
        at clojure.lang.RestFn.invoke(RestFn.java:460)
        at clojure.contrib.zip_filter.xml$text__102.invoke(xml.clj:43)
        at kanji.prkanji$get_kdic_info__147$fn__149.invoke(prkanji.clj:36)
        at clojure.contrib.zip_filter$fixup_apply__60.invoke(zip_filter.clj:76)
        at clojure.contrib.zip_filter$mapcat_chain__65$fn__67$fn__69.invoke(zip_filter.clj:88)
        at clojure.core$map__3815$fn__3817.invoke(core.clj:1503)
        at clojure.lang.LazySeq.seq(LazySeq.java:41)
        at clojure.lang.RT.seq(RT.java:436)
        at clojure.core$seq__3133.invoke(core.clj:103)
        at clojure.core$spread__3240.invoke(core.clj:383)
        at clojure.core$apply__3243.doInvoke(core.clj:390)
        at clojure.lang.RestFn.invoke(RestFn.java:428)
        at clojure.core$mapcat__3842.doInvoke(core.clj:1528)
        at clojure.lang.RestFn.invoke(RestFn.java:428)
        at clojure.contrib.zip_filter$mapcat_chain__65$fn__67.invoke(zip_filter.clj:88)
        at clojure.lang.APersistentVector$Seq.reduce(APersistentVector.java:476)
        at clojure.core$reduce__3319.invoke(core.clj:536)
        at clojure.contrib.zip_filter$mapcat_chain__65.invoke(zip_filter.clj:89)
        at clojure.contrib.zip_filter.xml$xml__GT___119.doInvoke(xml.clj:75)
        at clojure.lang.RestFn.applyTo(RestFn.java:144)
        at clojure.core$apply__3243.doInvoke(core.clj:390)
        at clojure.lang.RestFn.invoke(RestFn.java:443)
        at clojure.contrib.zip_filter.xml$seq_test__111$fn__113.invoke(xml.clj:55)
        at clojure.contrib.zip_filter$fixup_apply__60.invoke(zip_filter.clj:76)
        at clojure.contrib.zip_filter$mapcat_chain__65$fn__67$fn__69.invoke(zip_filter.clj:88)
        at clojure.core$map__3815$fn__3817.invoke(core.clj:1503)
        at clojure.lang.LazySeq.seq(LazySeq.java:41)
        at clojure.lang.Cons.next(Cons.java:37)
        at clojure.lang.RT.next(RT.java:560)
        at clojure.core$next__3117.invoke(core.clj:50)
        at clojure.core$concat__3255$cat__3269$fn__3270.invoke(core.clj:428)
        at clojure.lang.LazySeq.seq(LazySeq.java:41)
        at clojure.lang.RT.seq(RT.java:436)
        at clojure.lang.LazySeq.seq(LazySeq.java:41)
        at clojure.lang.RT.seq(RT.java:436)
        at clojure.lang.LazySeq.seq(LazySeq.java:41)
        at clojure.lang.RT.seq(RT.java:436)
        at clojure.lang.LazySeq.seq(LazySeq.java:41)
        at clojure.lang.RT.seq(RT.java:436)
        at clojure.lang.LazySeq.seq(LazySeq.java:41)
        at clojure.lang.RT.seq(RT.java:436)
        at clojure.lang.LazySeq.seq(LazySeq.java:41)
        at clojure.lang.RT.seq(RT.java:436)
        at clojure.lang.LazySeq.seq(LazySeq.java:41)
        at clojure.lang.RT.seq(RT.java:436)
        at clojure.lang.LazySeq.seq(LazySeq.java:41)
