Re: [Ohrrpgce] Reload.SerializeXML

Ralph Versteegen Sun, 30 May 2010 01:33:42 -0700

On 30 May 2010 20:23, Mike Caron <[email protected]> wrote:
> On 30/05/2010 4:19 AM, Ralph Versteegen wrote:
>>
>> On 30 May 2010 20:10, Mike Caron <[email protected]> wrote:
>>>
>>> On 30/05/2010 4:03 AM, Ralph Versteegen wrote:
>>>>
>>>> On 30 May 2010 19:30, Mike Caron <[email protected]> wrote:
>>>>>
>>>>> On 30/05/2010 2:53 AM, Ralph Versteegen wrote:
>>>>>>
>>>>>> On 28 May 2010 13:04, Mike Caron <[email protected]> wrote:
>>>>>>>
>>>>>>> On 27/05/2010 8:38 PM, Ralph Versteegen wrote:
>>>>>>>>
>>>>>>>> On 28 May 2010 11:25, James Paige <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> On Thu, May 27, 2010 at 07:07:38PM -0400, Mike Caron wrote:
>>>>>>>>>>
>>>>>>>>>> On 27/05/2010 6:38 PM, James Paige wrote:
>>>>>>>>>>>
>>>>>>>>>>> Mike, was there any special reason why Reload.SerializeXML uses
>>>>>>>>>>> print statements rather than writing to a file?
>>>>>>>>>>>
>>>>>>>>>>> ---
>>>>>>>>>>> James
>>>>>>>>>>
>>>>>>>>>> I wrote it as a debugging function. If it wrote to a file, then I
>>>>>>>>>> wouldn't be able to see it on screen in reloadtest! :)
>>>>>>>>>> --
>>>>>>>>>> Mike
>>>>>>>>>
>>>>>>>>> I guess what I am really looking for is a reload2xml command-line
>>>>>>>>> tool so I can easily debug reload files on disk.
>>>>>>>>>
>>>>>>>>> And actually it doesn't matter that SerializeXML prints to the
>>>>>>>>> console, because I could just do
>>>>>>>>>
>>>>>>>>>  reload2xml somefile.reld > somefile.xml
>>>>>>>>>
>>>>>>>>> ---
>>>>>>>>> James
>>>>>>>>
>>>>>>>> Writing to standard output is the Unix way anyway!
>>>>>>>>
>>>>>>>> Speaking of XML, Mike mentioned a couple of weeks ago that the Reload
>>>>>>>> code grinds to a halt when processing some translated 64MB XML
>>>>>>>> document. Since I enjoy optimisation to a rather evil degree, I'd
>>>>>>>> like to look at it sometime. What are some good test cases?
>>>>>>>
>>>>>>> Right at this moment, I don't feel like touching any of that stuff,
>>>>>>> so go nuts.
>>>>>>>
>>>>>>> A few tips:
>>>>>>>
>>>>>>> - I don't think the private heap has anything to do with potential
>>>>>>> performance issues. They're the same calls being done by the runtime,
>>>>>>> just in a different memory block.
>>>>>>> - One thing I never thought about doing is compiling with the
>>>>>>> -profile switch.
>>>>>>> - I never did any formal performance tuning, since I subscribe to
>>>>>>> the "make it work, then make it fast" camp. All that ZString crap was
>>>>>>> to alleviate my fears of memory corruption caused by having Strings in
>>>>>>> UDTs.
>>>>>>>
>>>>>>> This is the document I mentioned (warning: ~5 Megs compressed, 64
>>>>>>> Megs uncompressed):
>>>>>>>
>>>>>>> http://taleotc.com/medline08n0059.zip
>>>>>>>
>>>>>>> I haven't looked too closely at the structure of the document, but it
>>>>>>> seems to be a fairly average, if large, dataset.
>>>>>>>
>>>>>>> Other than that, Google isn't being very friendly. Querying "xml test
>>>>>>> documents" lists a bunch of XML tutorials and unit testing stuff,
>>>>>>> while "large xml test documents" seems to focus on *really* big
>>>>>>> documents (like, > 1 GB), for which I suspect the RELOAD file format
>>>>>>> would break down :)
>>>>>>>
>>>>>>> --
>>>>>>> Mike
>>>>>>
>>>>>> So here are my results (my machine is a 7-year-old 3GHz Pentium 4 with
>>>>>> 1GB of RAM):
>>>>>>
>>>>>> Beforehand:
>>>>>>
>>>>>> (I believe that this version of xml2reload did not include the
>>>>>> attributes, which add about 30% to the .reld size)
>>>>>>
>>>>>> bash-3.1$ xml2reload ../medline08n0059.xml medline.reld
>>>>>> Loaded XML document in 4839 ms
>>>>>> Parsed XML document in 74792 ms
>>>>>> Optimised document in 9859 ms
>>>>>> Serialized document in 1022907 ms
>>>>>> Tore down memory in 3891 ms
>>>>>> Finished in 1116305 ms
>>>>>>
>>>>>>
>>>>>> bash-3.1$ time reload2reload plotdict.reld plotdict2.reld
>>>>>> Loaded document in 30 ms
>>>>>> Serialized document in 2851 ms
>>>>>> Tore down memory in 1 ms
>>>>>> Finished in 2899 ms
>>>>>>
>>>>>> real    0m2.919s
>>>>>> user    0m0.112s
>>>>>> sys     0m0.188s
>>>>>>
>>>>>> (where reload2reload is obviously just a 10 liner)
>>>>>>
>>>>>> Afterwards:
>>>>>>
>>>>>> bash-3.1$ time xml2reload ../medline08n0059.xml medline.reld
>>>>>> Loaded XML document in 4207 ms
>>>>>> Parsed XML document in 6023 ms
>>>>>> Optimised document in 10631 ms
>>>>>> Serialized document in 2623 ms
>>>>>> Tore down memory in 3097 ms
>>>>>> Finished in 26596 ms
>>>>>>
>>>>>> real    0m26.699s
>>>>>> user    0m24.018s
>>>>>> sys     0m1.100s
>>>>>>
>>>>>> (Also, running reload2reload on medline.reld (which is 31MB) required
>>>>>> about 142MB of memory)
>>>>>>
>>>>>> bash-3.1$ reload2reload plotdict.reld plotdict2.reld
>>>>>> Loaded document in 13 ms
>>>>>> Serialized document in 21 ms
>>>>>> Tore down memory in 8 ms
>>>>>> Finished in 60 ms
>>>>>
>>>>> After running a few tests, I discovered the major bottleneck in the
>>>>> private heap implementation: a tiny bit of debugging code which didn't
>>>>> do a whole lot... except enumerate the entire heap on every allocation.
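To illustrate why a check like that hurts so much, here is a toy Python
model. It is not the actual RELOAD heap code; the ToyHeap class, its fields
and the visits_for helper are all hypothetical. If every allocation first
walks all live blocks, then n allocations cost roughly n^2/2 block visits:

```python
class ToyHeap:
    """Hypothetical stand-in for a private heap; it only counts block visits."""

    def __init__(self, debug_check=False):
        self.blocks = []              # sizes of the live blocks
        self.debug_check = debug_check
        self.visits = 0               # blocks touched by the debug walk

    def alloc(self, size):
        if self.debug_check:
            # Debug-only sanity pass: enumerate the whole heap first.
            for block_size in self.blocks:
                assert block_size > 0
                self.visits += 1
        self.blocks.append(size)
        return len(self.blocks) - 1   # handle of the new block


def visits_for(n, debug_check):
    """Allocate n small blocks and report how many blocks the walk touched."""
    heap = ToyHeap(debug_check)
    for _ in range(n):
        heap.alloc(16)
    return heap.visits


print(visits_for(1000, debug_check=False))  # 0
print(visits_for(1000, debug_check=True))   # 499500
```

With millions of nodes in a parsed document, that quadratic term would
plausibly dominate everything else, which fits the speedup reported once
the check was removed.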
>>>>
>>>> Ah, I was wondering why you said that in tests it took 20 min to
>>>> optimise the parsed XML when it only took me 12 seconds.
>>>>
>>>>> When I got rid of it, everything sped up enormously!
>>>>>
>>>>> With private heap:
>>>>> Loading XML document...
>>>>> Loaded XML document in 5139 ms
>>>>> Starting memory usage: 301
>>>>> Parsing document...
>>>>> Parsed XML document in 6765 ms
>>>>> Memory usage: 237104215
>>>>> Optimised document in 11849 ms
>>>>> Memory usage: 157105826
>>>>> Serialized document in 1138 ms
>>>>> Tore down memory in 1974 ms
>>>>> Finished in 28314 ms
>>>>>
>>>>> Without private heap:
>>>>> Loading XML document...
>>>>> Loaded XML document in 4974 ms
>>>>> Starting memory usage: 0
>>>>> Parsing document...
>>>>> Parsed XML document in 6989 ms
>>>>> Memory usage: 0
>>>>> Optimised document in 12147 ms
>>>>> Memory usage: 0
>>>>> Serialized document in 1207 ms
>>>>> Tore down memory in 3257 ms
>>>>> Finished in 28577 ms
>>>>
>>>> Interesting that in those runs it took less than half as long to
>>>> serialise as on my machine, when everything else is near equal.
>>>> Difference in C stdio implementation or kernel file system, I'd guess.
>>>
>>> Hmm, didn't notice that. Mildly strange!
>>>
>>>>> I ran each a few times, and they all seem to be about the same. The
>>>>> only big difference I can see is the "Tore down memory" phase, which is
>>>>> indeed faster with the private heap (by 33%!) as intended. I suspect,
>>>>> though, that it's not enough to save it. Oh well, it was fun to get
>>>>> working, now it'll be fun to excise!
>>>>
>>>> Hang on, "Tore down memory" includes freeing the XML document too. The
>>>> RELOAD doc teardown time could still be pretty significant. Try
>>>> instead timing opening a .reld and closing it.
>>>
>>> Hmm, you're right. Really, the XML document should be freed as soon as
>>> possible. I will investigate.
>>>
>>> That said, I've already pulled everything. I was just in the middle of
>>> testing to make sure I didn't break anything. But, I'll try this before I
>>> commit it.
>
>
> Standard heap, proper timing of RELOAD:
>
> D:\ohrrpgce>xml2reload medline08n0059.xml medline.rld
> Loading XML document...
> Loaded XML document in 5189 ms
> Parsing document...
> Parsed XML document in 7267 ms
> Freeing XML document...
> Freed XML document in 1696 ms
> Optimised document in 12783 ms
> Serialized document in 1265 ms
> Tore down memory in 1399 ms
> Finished in 29612 ms
>
> And, private heap, proper timing of RELOAD:
>
> D:\ohrrpgce>xml2reload medline08n0059.xml medline.rld
> Loading XML document...
> Loaded XML document in 5285 ms
> Parsing document...
> Parsed XML document in 7468 ms
> Freeing XML document...
> Freed XML document in 1857 ms
> Optimised document in 12148 ms
> Serialized document in 1087 ms
> Tore down memory in 84 ms
> Finished in 27944 ms
>
> Okay, to quote my favourite TV personality, "Now THAT was a result!"
>
> But the question is whether a speedup of more than an order of magnitude
> is worth keeping it for?


Well, the only numbers which are relevant are

 Serialized document in 1265 ms
 Tore down memory in 1399 ms

to

 Serialized document in 1087 ms
 Tore down memory in 84 ms

If it took ~1s to serialise, it might take <2s to load that .reld file
based on what I'm seeing, and in that case saving 1.3s on freeing the
document would be a pretty good saving.
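
As a quick check of that arithmetic, here is a small Python sketch using
only the four timings quoted above (the variable names are mine, and the
load-time estimate remains just an estimate):

```python
# Timings in ms, copied from the two xml2reload runs quoted above.
standard = {"serialize": 1265, "teardown": 1399}
private = {"serialize": 1087, "teardown": 84}

# Absolute saving and speedup factor for the teardown phase.
teardown_saved_ms = standard["teardown"] - private["teardown"]
teardown_speedup = standard["teardown"] / private["teardown"]

# Serialisation barely changes between the two heaps.
serialize_saved_ms = standard["serialize"] - private["serialize"]

print(f"teardown: {teardown_saved_ms} ms saved, {teardown_speedup:.1f}x faster")
print(f"serialize: {serialize_saved_ms} ms saved")
```

So the 1.3s is 1315 ms, about a 17x reduction in teardown time, while
serialisation differs by under 200 ms.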

>>>>> Anyway, the whole process used ~650 Megs of RAM, which is broken up as
>>>>> follows:
>>>>> - Approximately 423 Megs used by libxml2 to load the XML document
>>>>> - Exactly 226.12 Megs (as seen above) used by RELOAD to load the
>>>>> unoptimized RELOAD document
>>>>
>>>> That's bizarre; as I said, it took reload2reload 140MB to load the
>>>> resulting RELOAD document for me.
>>>
>>> Yes, that's the optimized version:
>>>
>>> Optimised document in 11849 ms
>>> Memory usage: 157105826 [ == 149.8 Megs]
>>
>> Whoops, misread.
>>
>>>>> I am going to assume that half of the difference is accounted for by
>>>>> the nodes' name strings alone.
>>>>>
>>>>> Anyway, mission accomplished, I guess. Now to find a 1Gig document, and
>>>>> see how long that takes :D
>>>>> --
>>>>> Mike
>>>>
>>>> I would try it, but unfortunately libxml is obviously going to exhaust
>>>> the 2GB address space if you load in the whole thing. xml2reload on
>>>> that 64MB file requires 666MB of virtual memory here.
>>>
>>> Hmm... We need someone with a 64-bit processor. And, a 64-bit version of
>>> libxml (which I assume, without looking, exists).
>>
>> I have those, but I do not have a 64-bit FreeBasic compiler, wrecking
>> the experiment.
>
> That is in fact a problem.
>
> However, to work around this, we just need to rewrite RELOAD in C, and we're
> good to go! Isn't there a FB -> C++ compiler that will do the trick? ;)

Not quite! You'll have better luck with fbc -gen gcc. I have yet to try
it out and it probably isn't totally working, but I wouldn't expect much
hand editing to be needed to get the output into a compilable state.

>>
>>> --
>>> Mike
>>> _______________________________________________
>>> Ohrrpgce mailing list
>>> [email protected]
>>> http://lists.motherhamster.org/listinfo.cgi/ohrrpgce-motherhamster.org
>>>
>
>
> --
> Mike
>
