Re: [Ohrrpgce] Reload.SerializeXML

Mike Caron Sun, 30 May 2010 00:30:34 -0700

On 30/05/2010 2:53 AM, Ralph Versteegen wrote:

On 28 May 2010 13:04, Mike Caron<[email protected]>  wrote:

On 27/05/2010 8:38 PM, Ralph Versteegen wrote:


On 28 May 2010 11:25, James Paige<[email protected]>    wrote:


On Thu, May 27, 2010 at 07:07:38PM -0400, Mike Caron wrote:


On 27/05/2010 6:38 PM, James Paige wrote:


Mike, was there any special reason why Reload.SerializeXML uses print
statements rather than writing to a file?

---
James


I wrote it as a debugging function. If it wrote to a file, then I
wouldn't be able to see it on screen in reloadtest! :)
--
Mike


I guess what I am really looking for is a reload2xml command-line tool
so I can easily debug reload files on disk..

and actually it doesn't matter that SerializeXML prints to the console,
because I could just do

  reload2xml somefile.reld > somefile.xml

---
James


Writing to standard output is the Unix way anyway!
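A minimal sketch of that pattern, in Python rather than the project's own language: the serializer writes to whatever text stream it is handed and defaults to standard output, so the shell decides whether the result lands on screen or in a file. The `serialize_xml` function and the `(name, value, children)` tuple layout are assumptions made up for this illustration; they are not the Reload.SerializeXML API.

```python
import sys
from xml.sax.saxutils import escape


def serialize_xml(node, out=sys.stdout, indent=0):
    """Write a hypothetical (name, value, children) tuple as XML to a text stream."""
    name, value, children = node
    pad = "  " * indent
    if not children:
        # Leaf node: emit the escaped value on a single line.
        out.write(f"{pad}<{name}>{escape(str(value))}</{name}>\n")
        return
    out.write(f"{pad}<{name}>\n")
    for child in children:
        serialize_xml(child, out, indent + 1)
    out.write(f"{pad}</{name}>\n")


if __name__ == "__main__":
    # Prints to the console; redirect with "> somefile.xml" to get a file.
    doc = ("hero", "", [("name", "Bob", []), ("hp", 10, [])])
    serialize_xml(doc)
```

Because the stream is a parameter, the same function can also write into a file object or an in-memory buffer without any change.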

Speaking of xml, Mike mentioned a couple weeks ago that the Reload
code grinds to a halt when processing some translated 64MB xml
document. Since I enjoy optimisation to a rather evil degree, I'd like
to look at it sometime. What are some good testcases?


Right at this moment, I don't feel like touching any of that stuff, so go
nuts.

A few tips:

- I don't think the private heap has anything to do with potential
performance issues. They're the same calls being done by the runtime, just
in a different memory block.
- One thing I never thought about doing is compiling with the -profile
switch.
- I never did any formal performance tuning, being as I subscribe to the
"make it work, then make it fast" camp. All that ZString crap was to
alleviate my fears of memory corruption caused by having Strings in UDTs.

This is the document I mentioned: (warning: ~5 Megs compressed, 64 Megs
uncompressed)

http://taleotc.com/medline08n0059.zip

I haven't looked too closely at the structure of the document, but it seems
to be a fairly average, if large, dataset.

Other than that, Google isn't being very friendly. Querying "xml test
documents" lists a bunch of XML Tutorials and Unit testing stuff, while
"large xml test documents" seems to focus on *really* big documents
(like, > 1 Gb), for which I suspect the RELOAD file format would break down :)

--
Mike


So here are my results (my machine is a 7 year old 3GHz pentium 4 with
1GB of RAM):

Beforehand:

(I believe that this version of xml2reload did not include the
attributes, which add about 30% to the .reld size)

bash-3.1$ xml2reload ../medline08n0059.xml medline.reld
Loaded XML document in 4839 ms
Parsed XML document in 74792 ms
Optimised document in 9859 ms
Serialized document in 1022907 ms
Tore down memory in 3891 ms
Finished in 1116305 ms


bash-3.1$ time reload2reload plotdict.reld plotdict2.reld
Loaded document in 30 ms
Serialized document in 2851 ms
Tore down memory in 1 ms
Finished in 2899 ms

real    0m2.919s
user    0m0.112s
sys     0m0.188s

(where reload2reload is obviously just a 10 liner)

============Afterwards:=======

bash-3.1$ time xml2reload ../medline08n0059.xml medline.reld
Loaded XML document in 4207 ms
Parsed XML document in 6023 ms
Optimised document in 10631 ms
Serialized document in 2623 ms
Tore down memory in 3097 ms
Finished in 26596 ms

real    0m26.699s
user    0m24.018s
sys     0m1.100s

(Also, running reload2reload on medline.reld (which is 31MB) required
about 142MB of memory)

bash-3.1$ reload2reload plotdict.reld plotdict2.reld
Loaded document in 13 ms
Serialized document in 21 ms
Tore down memory in 8 ms
Finished in 60 ms
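To put the two xml2reload runs side by side, the per-phase speedup is just the old time divided by the new time. The sketch below does that arithmetic on the millisecond figures copied from the logs above; the dictionary keys and the `speedups` helper are names invented for this example. Keep in mind the caveat already noted: the "before" build probably did not include attributes, so the comparison is only approximate.

```python
# Timings in milliseconds, copied from the two xml2reload logs above.
before = {
    "Loaded": 4839,
    "Parsed": 74792,
    "Optimised": 9859,
    "Serialized": 1022907,
    "Tore down": 3891,
    "Finished": 1116305,
}
after = {
    "Loaded": 4207,
    "Parsed": 6023,
    "Optimised": 10631,
    "Serialized": 2623,
    "Tore down": 3097,
    "Finished": 26596,
}


def speedups(old, new):
    """Return old/new per phase; values above 1.0 mean the phase got faster."""
    return {phase: old[phase] / new[phase] for phase in old}


if __name__ == "__main__":
    for phase, factor in speedups(before, after).items():
        print(f"{phase:<10} {before[phase]:>8} ms -> {after[phase]:>6} ms  ({factor:.1f}x)")
```

On these numbers the total run comes out roughly 42 times faster, with nearly all of the gain in the serialize phase (about 390x) and the parse phase (about 12x); loading, optimising and teardown are essentially unchanged.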

After running a few tests, I discovered the major bottleneck in the private heap implementation: a tiny bit of debugging code which didn't do a whole lot... except enumerate the entire heap on every allocation.


When I got rid of it, everything sped up enormously!
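Why such a small check hurts so much: if every allocation walks all existing blocks, the total work grows with the square of the number of allocations. The toy model below illustrates that effect only; `ToyHeap`, its fields and `visits_for` are hypothetical names, and nothing here reflects how the actual private heap or its debugging code was written.

```python
class ToyHeap:
    """Hypothetical allocator model; not the RELOAD private heap."""

    def __init__(self, debug_check=False):
        self.blocks = []            # sizes of the blocks handed out so far
        self.debug_check = debug_check
        self.blocks_visited = 0     # work done by the debug walk

    def _validate(self):
        # Walk every block, the way a heap-consistency check would.
        for size in self.blocks:
            assert size > 0
            self.blocks_visited += 1

    def allocate(self, size):
        if self.debug_check:
            self._validate()        # O(n) walk on every single allocation
        self.blocks.append(size)
        return len(self.blocks) - 1  # index stands in for a pointer


def visits_for(n, debug_check):
    """Allocate n blocks and report how many blocks the debug walk touched."""
    heap = ToyHeap(debug_check)
    for _ in range(n):
        heap.allocate(16)
    return heap.blocks_visited


if __name__ == "__main__":
    for n in (1_000, 2_000, 4_000):
        print(n, visits_for(n, debug_check=True), visits_for(n, debug_check=False))
```

With the check enabled, n allocations visit n*(n-1)/2 blocks in total, so doubling the number of allocations roughly quadruples the work; with it disabled the walk costs nothing.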

With private heap:
Loading XML document...
Loaded XML document in 5139 ms
Starting memory usage: 301
Parsing document...
Parsed XML document in 6765 ms
Memory usage: 237104215
Optimised document in 11849 ms
Memory usage: 157105826
Serialized document in 1138 ms
Tore down memory in 1974 ms
Finished in 28314 ms

Without private heap:
Loading XML document...
Loaded XML document in 4974 ms
Starting memory usage: 0
Parsing document...
Parsed XML document in 6989 ms
Memory usage: 0
Optimised document in 12147 ms
Memory usage: 0
Serialized document in 1207 ms
Tore down memory in 3257 ms
Finished in 28577 ms

I ran each a few times, and they all seem to be about the same. The only big difference I can see is the "Tore down memory" phase, which is indeed faster with the private heap (by 33%!) as intended. I suspect, though, that it's not enough to save it. Oh well, it was fun to get working, now it'll be fun to excise!


Anyway, the whole process used ~650 Megs of RAM, which is broken up as such:
- Approximately 423 Megs used by libxml2 to load the XML document

- Exactly 226.12 Megs (as seen above) used by RELOAD to load the unoptimized RELOAD document

I am going to assume that half of the difference is accounted for by the nodes' name strings alone.
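For reference, the 226.12 figure is the "Memory usage" byte count from the private-heap log converted to megabytes, assuming "Megs" here means 2^20 bytes (that assumption reproduces the number exactly). A short sketch of the conversion, with variable names chosen for this example:

```python
MIB = 1024 * 1024  # assumption: "Megs" means 2**20 bytes

# Byte counts copied from the "With private heap" log above.
unoptimized_bytes = 237104215  # after "Parsed XML document"
optimized_bytes = 157105826    # after "Optimised document"
libxml2_mib = 423              # approximate figure quoted above


def to_mib(n_bytes):
    """Convert a byte count to mebibytes."""
    return n_bytes / MIB


reload_mib = to_mib(unoptimized_bytes)
total_mib = libxml2_mib + reload_mib

if __name__ == "__main__":
    print(f"unoptimized RELOAD tree: {reload_mib:.2f} MiB")
    print(f"optimized RELOAD tree:   {to_mib(optimized_bytes):.2f} MiB")
    print(f"libxml2 + RELOAD:        {total_mib:.2f} MiB")
```

The sum of the two parts lands just under 650 MiB, which matches the ~650 Megs observed for the whole process.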

Anyway, mission accomplished, I guess. Now to find a 1Gig document, and see how long that takes :D

--
Mike
_______________________________________________
Ohrrpgce mailing list
[email protected]
http://lists.motherhamster.org/listinfo.cgi/ohrrpgce-motherhamster.org
