Re: [Ohrrpgce] Reload.SerializeXML

Ralph Versteegen Sun, 30 May 2010 01:03:50 -0700

On 30 May 2010 19:30, Mike Caron <[email protected]> wrote:
> On 30/05/2010 2:53 AM, Ralph Versteegen wrote:
>>
>> On 28 May 2010 13:04, Mike Caron <[email protected]> wrote:
>>>
>>> On 27/05/2010 8:38 PM, Ralph Versteegen wrote:
>>>>
>>>> On 28 May 2010 11:25, James Paige <[email protected]> wrote:
>>>>>
>>>>> On Thu, May 27, 2010 at 07:07:38PM -0400, Mike Caron wrote:
>>>>>>
>>>>>> On 27/05/2010 6:38 PM, James Paige wrote:
>>>>>>>
>>>>>>> Mike, was there any special reason why Reload.SerializeXML uses print
>>>>>>> statements rather than writing to a file?
>>>>>>>
>>>>>>> ---
>>>>>>> James
>>>>>>
>>>>>> I wrote it as a debugging function. If it wrote to a file, then I
>>>>>> wouldn't be able to see it on screen in reloadtest! :)
>>>>>> --
>>>>>> Mike
>>>>>
>>>>> I guess what I am really looking for is a reload2xml command-line tool
>>>>> so I can easily debug reload files on disk..
>>>>>
>>>>> and actually it doesn't matter that SerializeXML prints to the console,
>>>>> because I could just do
>>>>>
>>>>>  reload2xml somefile.reld > somefile.xml
>>>>>
>>>>> ---
>>>>> James
>>>>
>>>> Writing to standard output is the Unix way anyway!
>>>>
>>>> Speaking of xml, Mike mentioned a couple weeks ago that the Reload
>>>> code grinds to a halt when processing some translated 64MB xml
>>>> document. Since I enjoy optimisation to a rather evil degree, I'd like
>>>> to look at it sometime. What are some good testcases?
>>>
>>> Right at this moment, I don't feel like touching any of that stuff, so go
>>> nuts.
>>>
>>> A few tips:
>>>
>>> - I don't think the private heap has anything to do with potential
>>> performance issues. They're the same calls being done by the runtime,
>>> just in a different memory block.
>>> - One thing I never thought about doing is compiling with the -profile
>>> switch.
>>> - I never did any formal performance tuning, being as I subscribe to the
>>> "make it work, then make it fast" camp. All that ZString crap was to
>>> alleviate my fears of memory corruption caused by having Strings in UDTs.
>>>
>>> This is the document I mentioned: (warning: ~5 Megs compressed, 64 Megs
>>> uncompressed)
>>>
>>> http://taleotc.com/medline08n0059.zip
>>>
>>> I haven't looked too closely at the structure of the document, but it
>>> seems to be a fairly average, if large, dataset.
>>>
>>> Other than that, Google isn't being very friendly. Querying "xml test
>>> documents" lists a bunch of XML Tutorials and Unit testing stuff, while
>>> "large xml test documents" seems to focus on *really* big documents
>>> (like, > 1 Gb), for which I suspect the RELOAD file format would break
>>> down :)
>>>
>>> --
>>> Mike
>>
>> So here are my results (my machine is a 7 year old 3GHz pentium 4 with
>> 1GB of RAM):
>>
>> Beforehand:
>>
>> (I believe that this version of xml2reload did not include the
>> attributes, which add about 30% to the .reld size)
>>
>> bash-3.1$ xml2reload ../medline08n0059.xml medline.reld
>> Loaded XML document in 4839 ms
>> Parsed XML document in 74792 ms
>> Optimised document in 9859 ms
>> Serialized document in 1022907 ms
>> Tore down memory in 3891 ms
>> Finished in 1116305 ms
>>
>>
>> bash-3.1$ time reload2reload plotdict.reld plotdict2.reld
>> Loaded document in 30 ms
>> Serialized document in 2851 ms
>> Tore down memory in 1 ms
>> Finished in 2899 ms
>>
>> real    0m2.919s
>> user    0m0.112s
>> sys     0m0.188s
>>
>> (where reload2reload is obviously just a 10 liner)
>>
>> Afterwards:
>>
>> bash-3.1$ time xml2reload ../medline08n0059.xml medline.reld
>> Loaded XML document in 4207 ms
>> Parsed XML document in 6023 ms
>> Optimised document in 10631 ms
>> Serialized document in 2623 ms
>> Tore down memory in 3097 ms
>> Finished in 26596 ms
>>
>> real    0m26.699s
>> user    0m24.018s
>> sys     0m1.100s
>>
>> (Also, running reload2reload on medline.reld (which is 31MB) required
>> about 142MB of memory)
>>
>> bash-3.1$ reload2reload plotdict.reld plotdict2.reld
>> Loaded document in 13 ms
>> Serialized document in 21 ms
>> Tore down memory in 8 ms
>> Finished in 60 ms
>
> After running a few tests, I discovered the major bottleneck in the private
> heap implementation: A tiny bit of debugging code which didn't do a whole
> lot... except enumerate the entire heap on every allocation.


Ah, I was wondering why you said that in tests it took 20 min to
optimise the parsed XML when it only took me 12 seconds.
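
To illustrate why a heap walk on every allocation is so costly, here is a
minimal Python sketch. It is a toy model, not the RELOAD private heap code:
ToyHeap, debug_walk and visits_for are hypothetical names made up for the
example. With the debug check enabled, the number of blocks visited grows
roughly with the square of the number of allocations.

```python
class ToyHeap:
    """Toy allocator (hypothetical, not the RELOAD private heap)."""

    def __init__(self, debug_walk=False):
        self.blocks = []            # sizes of live blocks
        self.debug_walk = debug_walk
        self.blocks_visited = 0     # work done by the debug check

    def alloc(self, size):
        if self.debug_walk:
            # Debug check: enumerate every live block before allocating.
            for _ in self.blocks:
                self.blocks_visited += 1
        self.blocks.append(size)
        return len(self.blocks) - 1  # index used as a fake handle


def visits_for(n, debug_walk):
    """Allocate n small blocks and report how many blocks the check visited."""
    heap = ToyHeap(debug_walk)
    for _ in range(n):
        heap.alloc(16)
    return heap.blocks_visited


if __name__ == "__main__":
    for n in (1000, 2000, 4000):
        print(n, visits_for(n, debug_walk=True), visits_for(n, debug_walk=False))
```

Doubling the number of allocations roughly quadruples the work done by the
check, which would explain a slowdown that only shows up on large documents.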

> When I got rid of it, everything sped up enormously!
>
> With private heap:
> Loading XML document...
> Loaded XML document in 5139 ms
> Starting memory usage: 301
> Parsing document...
> Parsed XML document in 6765 ms
> Memory usage: 237104215
> Optimised document in 11849 ms
> Memory usage: 157105826
> Serialized document in 1138 ms
> Tore down memory in 1974 ms
> Finished in 28314 ms
>
> Without private heap:
> Loading XML document...
> Loaded XML document in 4974 ms
> Starting memory usage: 0
> Parsing document...
> Parsed XML document in 6989 ms
> Memory usage: 0
> Optimised document in 12147 ms
> Memory usage: 0
> Serialized document in 1207 ms
> Tore down memory in 3257 ms
> Finished in 28577 ms

Interesting that in those runs it took less than half as long to
serialise as on my machine, when everything else is near equal.
Difference in C stdio implementation or kernel file system, I'd guess.

> I ran each a few times, and they all seem to be about the same. The only
> big difference I can see is the "Tore down memory" phase, which is indeed
> faster with the private heap (by 33%!) as intended. I suspect, though, that
> it's not enough to save it. Oh well, it was fun to get working, now it'll be
> fun to excise!

Hang on, "Tore down memory" includes freeing the XML document too. The
RELOAD doc teardown time could still be pretty significant. Try
instead timing opening a .reld and closing it.
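
A rough sketch of that measurement, in Python for brevity. The load_document
and free_document functions below are placeholders (assumptions for the
example, not the RELOAD API); the point is only that the open and close
phases are timed separately, with no XML document involved.

```python
import time


def time_phase(label, func, *args):
    """Run func(*args), print the elapsed time, return (result, elapsed_ms)."""
    start = time.perf_counter()
    result = func(*args)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    print(f"{label} in {elapsed_ms:.0f} ms")
    return result, elapsed_ms


def load_document(path):
    # Placeholder: builds an in-memory stand-in instead of reading 'path'.
    return {"path": path, "nodes": [str(i) for i in range(100000)]}


def free_document(doc):
    # Placeholder for the document teardown being measured.
    doc["nodes"].clear()


def run_timed_open_close(path):
    """Time opening and closing one document, nothing else."""
    doc, load_ms = time_phase("Loaded document", load_document, path)
    _, teardown_ms = time_phase("Tore down memory", free_document, doc)
    return load_ms, teardown_ms, doc


if __name__ == "__main__":
    run_timed_open_close("medline.reld")
```

Running this once with the private heap and once without would isolate the
RELOAD teardown cost from the libxml2 teardown.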

> Anyway, the whole process used ~650 Megs of RAM, which is broken up as such:
> - Approximately 423 Megs used by libxml2 to load the XML document
> - Exactly 226.12 Megs (as seen above) used by RELOAD to load the unoptimized
> RELOAD document

That's bizarre; as I said, it took reload2reload 140MB to load the
resulting RELOAD document for me.
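
For reference, the two non-zero "Memory usage" figures from the private heap
run convert as follows. This assumes they are byte counts and that "Megs"
means 2^20 bytes, which is consistent with the 226.12 figure quoted above;
to_megs is just a helper name for the example.

```python
def to_megs(num_bytes):
    """Convert a byte count to megabytes of 2**20 bytes, to 2 decimals."""
    return round(num_bytes / 2**20, 2)


after_parsing = 237104215     # "Memory usage" printed after parsing
after_optimising = 157105826  # "Memory usage" printed after optimising

if __name__ == "__main__":
    print(to_megs(after_parsing))     # 226.12
    print(to_megs(after_optimising))  # 149.83
```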

> I am going to assume that half of the difference is due to the nodes' name
> strings alone.
>
> Anyway, mission accomplished, I guess. Now to find a 1Gig document, and see
> how long that takes :D
> --
> Mike

I would try it, but unfortunately libxml is obviously going to exhaust
the 2GB address space if you load in the whole thing. xml2reload on
that 64MB file requires 666MB of virtual memory here.
_______________________________________________
Ohrrpgce mailing list
[email protected]
http://lists.motherhamster.org/listinfo.cgi/ohrrpgce-motherhamster.org
