Re: parsing/processing of big xml files...

2010-01-06 Thread Graham Fawcett
Hi Alex,

On Wed, Jan 6, 2010 at 9:06 AM, Alex Ott  wrote:
> Hello all
>
> I have a question about processing big XML files with lazy-xml. I'm
> trying to analyze the StackOverflow dumps with Clojure, and when
> analyzing the 1.6 GB XML file of posts, I get a Java stack overflow,
> even though I give the JVM plenty of memory (1 GB of heap).

Someone asked this question a while back, and one suggestion was to use
Mark Triggs' XOM wrapper, xml-picker-seq:

http://github.com/marktriggs/xml-picker-seq

Thread:
http://groups.google.com/group/clojure/browse_thread/thread/365ca7aaaf8d55b7?pli=1
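
If you'd rather not pull in another dependency, the JDK's StAX pull parser
can do the same kind of counting in constant space. A rough, untested
sketch (count-rows-stax is just an illustrative name, not part of
xml-picker-seq):

;; Count <row> start elements with the JDK's StAX pull parser.
;; Nothing from the event stream is retained between iterations,
;; so memory use stays flat no matter how big posts.xml is.
(import '(javax.xml.stream XMLInputFactory XMLStreamConstants)
        '(java.io FileInputStream))

(defn count-rows-stax [file]
  (with-open [in (FileInputStream. file)]
    (let [xsr (.createXMLStreamReader (XMLInputFactory/newInstance) in)]
      (loop [n 0]
        (if (.hasNext xsr)
          (if (and (= (.next xsr) XMLStreamConstants/START_ELEMENT)
                   (= (.getLocalName xsr) "row"))
            (recur (inc n))
            (recur n))
          n)))))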

Cheers,
Graham


parsing/processing of big xml files...

2010-01-06 Thread Alex Ott
Hello all

I have a question about processing big XML files with lazy-xml.  I'm trying
to analyze the StackOverflow dumps with Clojure, and when analyzing the
1.6 GB XML file of posts, I get a Java stack overflow, even though I give
the JVM plenty of memory (1 GB of heap).

My code looks like this:


(ns stackoverflow
  (:import java.io.File)
  (:use clojure.contrib.lazy-xml))

(def so-base "/data-sets/stack-overflow/2009-12/122009 SO")

(def posts-file (File. (str so-base "/posts.xml")))

(defn count-post-entries [xml]
  (loop [counter 0
         lst xml]
    (if (empty? lst)   ; rest returns (), never nil, so test for an empty seq
      counter
      (let [elem (first lst)
            rst (rest lst)]
        (if (and (= (:type elem) :start-element) (= (:name elem) :row))
          (recur (+ 1 counter) rst)
          (recur counter rst))))))

and I run it with:

(stackoverflow/count-post-entries
  (clojure.contrib.lazy-xml/parse-seq stackoverflow/posts-file))

I don't collect any real data here, so I expect Clojure to discard the
already-processed part of the seq.

The same stack overflow happens when I use reduce:

(reduce (fn [counter elem]
          (if (and (= (:type elem) :start-element) (= (:name elem) :row))
            (+ 1 counter)
            counter))
        0 (clojure.contrib.lazy-xml/parse-seq stackoverflow/posts-file))

So the question remains: how can I process big XML files in constant space
(assuming I don't accumulate much data during processing)?
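
For comparison, here is roughly what a constant-space version of the same
count looks like when driving the JDK's SAX push parser directly instead of
using lazy-xml (just a sketch; count-rows-sax is an illustrative name):

;; Count <row> elements via a SAX callback; the parser never builds a tree,
;; so memory use stays constant regardless of file size.
(import '(javax.xml.parsers SAXParserFactory)
        '(org.xml.sax.helpers DefaultHandler))

(defn count-rows-sax [file]
  (let [counter (atom 0)
        handler (proxy [DefaultHandler] []
                  (startElement [uri local-name q-name attrs]
                    (when (= q-name "row")
                      (swap! counter inc))))]
    (.parse (.newSAXParser (SAXParserFactory/newInstance)) file handler)
    @counter))

;; e.g. (count-rows-sax posts-file)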

-- 
With best wishes, Alex Ott, MBA
http://alexott.blogspot.com/   http://xtalk.msk.su/~ott/
http://alexott-ru.blogspot.com/