Hi Alex,
On Wed, Jan 6, 2010 at 9:06 AM, Alex Ott wrote:
> Hello all
>
> I have a question about processing big XML files with lazy-xml. I'm trying to
> analyze StackOverflow dumps with Clojure, and when analyzing a 1.6 GB XML file
> of posts, I get a Java stack overflow, although I give Java enough memory
> (1 GB of heap).
Someone asked this question a while back, and the suggestion given then
was to use Mark Triggs' XOM wrapper:
http://github.com/marktriggs/xml-picker-seq
Thread:
http://groups.google.com/group/clojure/browse_thread/thread/365ca7aaaf8d55b7?pli=1
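
If pulling in another library is not an option, the JDK's own StAX pull
parser (javax.xml.stream) can also be driven directly from Clojure. The
following is only a minimal sketch and is not taken from the thread above:
the function name count-elements and the sample document are made up for
illustration, and it assumes Clojure 1.2 or later (for the ^ type hints)
on Java 6 or later.

```clojure
(import '(java.io ByteArrayInputStream InputStream)
        '(javax.xml.stream XMLInputFactory XMLStreamConstants XMLStreamReader))

(defn count-elements
  "Counts the start tags named `tag` in the XML read from `in`.
  Only the current parser event is inspected at each step, so memory
  use should not grow with the size of the document."
  [^InputStream in ^String tag]
  (let [^XMLStreamReader rdr (.createXMLStreamReader
                               (XMLInputFactory/newInstance) in)]
    (try
      (loop [n 0]
        (if (.hasNext rdr)
          ;; .next advances the parser and returns the event type as an int
          (if (and (= (.next rdr) XMLStreamConstants/START_ELEMENT)
                   (= (.getLocalName rdr) tag))
            (recur (inc n))
            (recur n))
          n))
      (finally
        (.close rdr)))))

;; Small in-memory document standing in for posts.xml (made up for this sketch).
(def sample-xml "<posts><row Id=\"1\"/><row Id=\"2\"/><other/></posts>")

(count-elements (ByteArrayInputStream. (.getBytes sample-xml "UTF-8")) "row")
;; => 2
```

For the dump itself, the same function could be called as
(with-open [in (java.io.FileInputStream. posts-file)] (count-elements in "row")),
with posts-file as defined in your code.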
Cheers,
Graham
>
> My code looks like this:
>
>
> (ns stackoverflow
>   (:import java.io.File)
>   (:use clojure.contrib.lazy-xml))
>
> (def so-base "/data-sets/stack-overflow/2009-12/122009 SO")
>
> (def posts-file (File. (str so-base "/posts.xml")))
>
> (defn count-post-entries [xml]
>   (loop [counter 0
>          lst xml]
>     ;; `rest` returns an empty seq, never nil, so test with `empty?`
>     (if (empty? lst)
>       counter
>       (let [elem (first lst)
>             rst (rest lst)]
>         (if (and (= (:type elem) :start-element) (= (:name elem) :row))
>           (recur (+ 1 counter) rst)
>           (recur counter rst))))))
>
> and I run it with
>
> (stackoverflow/count-post-entries (clojure.contrib.lazy-xml/parse-seq
> stackoverflow/posts-file))
>
> I don't collect any real data here, so I expect Clojure to discard
> data that has already been processed.
>
> The same stack overflow happens when I use reduce:
>
> (reduce (fn [counter elem]
>           (if (and (= (:type elem) :start-element) (= (:name elem) :row))
>             (+ 1 counter)
>             counter))
>         0 (clojure.contrib.lazy-xml/parse-seq stackoverflow/posts-file))
>
> So the question remains open: how can I process big XML files in constant
> space (given that I don't collect much data during processing)?
>
> --
> With best wishes, Alex Ott, MBA
> http://alexott.blogspot.com/ http://xtalk.msk.su/~ott/
> http://alexott-ru.blogspot.com/
>
> --
> You received this message because you are subscribed to the Google
> Groups "Clojure" group.
> To post to this group, send email to clojure@googlegroups.com
> Note that posts from new members are moderated - please be patient with your
> first post.
> To unsubscribe from this group, send email to
> clojure+unsubscr...@googlegroups.com
> For more options, visit this group at
> http://groups.google.com/group/clojure?hl=en
>