[Haskell-cafe] Re: XML parser recommendation?

2007-10-24 Thread apfelmus

Uwe Schmidt wrote:

The hashtable approach would of course reduce memory usage,


Note that hashtables are evil :) I'm all for tries instead.


but this
would require a global change of the processing model: A document then
does not longer consist of a single tree, it alway consists of a pair of a tree 
and a map.


Ah! I got struck by a trick: it's possible to store every tag name in 
full only once but still present a simple tree with full tag names to 
the user. You simply share all the tag names. For instance, in


  let x = mytagname in Tree (Tag x) [Tree (Tag x) [Text foobar]]

the string mytagname is stored in memory only once although there are 
two tags with this name.


When parsing an XML-file, you look up every parsed tag name in a finite 
map (i.e. the trie). If it's already in, you don't store the parsed tag 
name in the XML tree but the one stored at the leaf in the trie. Of 
course, these two strings are equal. But they're not (pointer-) 
identical! After parsing, all equal tag names now are pointers to one 
and the same leaf in the finite map. You can drop the finite map 
structure afterwards, the pointers to the leaves will remain.


That would be quite the memory reduction for large XML trees which are 
likely to have many many identical tag names.


Regards,
apfelmus

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Re: XML parser recommendation?

2007-10-24 Thread Uwe Schmidt
apfelmus wrote:

 Ah! I got struck by a trick: it's possible to store every tag name in
 full only once but still present a simple tree with full tag names to
 the user. You simply share all the tag names. For instance, in

let x = mytagname in Tree (Tag x) [Tree (Tag x) [Text foobar]]

 the string mytagname is stored in memory only once although there are
 two tags with this name.

 When parsing an XML-file, you look up every parsed tag name in a finite
 map (i.e. the trie). If it's already in, you don't store the parsed tag
 name in the XML tree but the one stored at the leaf in the trie. Of
 course, these two strings are equal. But they're not (pointer-)
 identical! After parsing, all equal tag names now are pointers to one
 and the same leaf in the finite map. You can drop the finite map
 structure afterwards, the pointers to the leaves will remain.

 That would be quite the memory reduction for large XML trees which are
 likely to have many many identical tag names.

I've also thought about this approach. It sounds a bit weired,
to built a map (or a trie) for the identity function. But it would
solve a part of the space problem, at least when building XML
documents by parsing.
So I guess, there is a new project:
A simple, small and lazy parser (no parsec), at least for the content
part of XML.

Uwe
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


[Haskell-cafe] Re: XML parser recommendation?

2007-10-23 Thread Rene de Visser
Uwe Schmidt [EMAIL PROTECTED] schrieb im Newsbeitrag 
news:[EMAIL PROTECTED]
it into HXT.

 This still does not solve the processing of very very large
 XML document. I doubt, whether we can do this with a DOM
 like approach, as in HXT or HaXml. Lazy input does not solve all problems.
 A SAX like parser could be a more useful choice for very large documents.

 Uwe

I think a step towards support medium size documents in HXT would be to 
store the tags and content more efficiently.
If I undertand the coding correctly every tag is stored as a seperate 
Haskell string. As each byte of a string under GHC takes 12 bytes this alone 
leads to high memory usage. Tags tend to repeat. You could store them 
uniquely using a hash table. Content could be stored in compressed byte 
strings.

As I mentioned in an earlier post 2GB memory is not enough to process a 35MB 
XML document in HXT as we have

30 x 2 x 12 = 720 MB for starters to just store the string data (once in the 
parser and once in the DOM).

(Well a machine with 2GB memory). I guess I had somewhere around 1GB free 
for the program. Other overheads most likely used up the ramaining 300 MB.

Rene. 



___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Re: XML parser recommendation?

2007-10-23 Thread Ketil Malde
Rene de Visser [EMAIL PROTECTED] writes:

 If I undertand the coding correctly every tag is stored as a seperate 
 Haskell string. As each byte of a string under GHC takes 12 bytes this alone 
 leads to high memory usage.

Not that it detracts from your point, but I guess that is 24 bytes per
character on 64 bit machinery? :-)

-k
-- 
If I haven't seen further, it is by standing in the footprints of giants
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


[Haskell-cafe] Re: XML parser recommendation?

2007-10-22 Thread Rene de Visser
Yitzchak Gale [EMAIL PROTECTED] schrieb im Newsbeitrag 
news:[EMAIL PROTECTED]
 Henning Thielemann wrote:
 HXT uses Parsec, which is strict.
I had a look at using HXT awhile ago. Parsec is the least of the problems.
HXT stores the XML as an explicit tree in memory, where the head has explict 
references to the children.
This means that the whole XML tree is stored in memory until the last child 
is processed. Also this tree is stored ineffeciently. Everything as non 
shared Haskell strings. My experience is that a 30MB file (which is quite 
small for an XML file) can NOT be processed with 2GB memory.

Rene. 



___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Re: XML parser recommendation?

2007-10-22 Thread Justin Bailey
On 10/22/07, Rene de Visser [EMAIL PROTECTED] wrote:

 I had a look at using HXT awhile ago. Parsec is the least of the problems.
 HXT stores the XML as an explicit tree in memory, where the head has
 explict
 references to the children.


What did you end up using? I've started building an app based on HXT and
that worries me. The fact that it produces a 10MB library file is also
worrisome.

Justin
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe