Re: [Ohrrpgce] Reload.SerializeXML

Ralph Versteegen Sun, 30 May 2010 01:19:38 -0700

On 30 May 2010 20:10, Mike Caron <[email protected]> wrote:
> On 30/05/2010 4:03 AM, Ralph Versteegen wrote:
>>
>> On 30 May 2010 19:30, Mike Caron<[email protected]>  wrote:
>>>
>>> On 30/05/2010 2:53 AM, Ralph Versteegen wrote:
>>>>
>>>> On 28 May 2010 13:04, Mike Caron<[email protected]>    wrote:
>>>>>
>>>>> On 27/05/2010 8:38 PM, Ralph Versteegen wrote:
>>>>>>
>>>>>> On 28 May 2010 11:25, James Paige<[email protected]>      wrote:
>>>>>>>
>>>>>>> On Thu, May 27, 2010 at 07:07:38PM -0400, Mike Caron wrote:
>>>>>>>>
>>>>>>>> On 27/05/2010 6:38 PM, James Paige wrote:
>>>>>>>>>
>>>>>>>>> Mike, was there any special reason why Reload.SerializeXML uses
>>>>>>>>> print statements rather than writing to a file?
>>>>>>>>>
>>>>>>>>> ---
>>>>>>>>> James
>>>>>>>>
>>>>>>>> I wrote it as a debugging function. If it wrote to a file, then I
>>>>>>>> wouldn't be able to see it on screen in reloadtest! :)
>>>>>>>> --
>>>>>>>> Mike
>>>>>>>
>>>>>>> I guess what I am really looking for is a reload2xml command-line
>>>>>>> tool so I can easily debug reload files on disk.
>>>>>>>
>>>>>>> And actually it doesn't matter that SerializeXML prints to the
>>>>>>> console, because I could just do
>>>>>>>
>>>>>>>  reload2xml somefile.reld > somefile.xml
>>>>>>>
>>>>>>> ---
>>>>>>> James
>>>>>>
>>>>>> Writing to standard output is the Unix way anyway!
>>>>>>
>>>>>> Speaking of xml, Mike mentioned a couple weeks ago that the Reload
>>>>>> code grinds to a halt when processing some translated 64MB xml
>>>>>> document. Since I enjoy optimisation to a rather evil degree, I'd like
>>>>>> to look at it sometime. What are some good testcases?
>>>>>
>>>>> Right at this moment, I don't feel like touching any of that stuff, so
>>>>> go nuts.
>>>>>
>>>>> A few tips:
>>>>>
>>>>> - I don't think the private heap has anything to do with potential
>>>>> performance issues. They're the same calls being done by the runtime,
>>>>> just in a different memory block.
>>>>> - One thing I never thought about doing is compiling with the -profile
>>>>> switch.
>>>>> - I never did any formal performance tuning, being as I subscribe to
>>>>> the "make it work, then make it fast" camp. All that ZString crap was to
>>>>> alleviate my fears of memory corruption caused by having Strings in
>>>>> UDTs.
>>>>>
>>>>> This is the document I mentioned: (warning: ~5 Megs compressed, 64 Megs
>>>>> uncompressed)
>>>>>
>>>>> http://taleotc.com/medline08n0059.zip
>>>>>
>>>>> I haven't looked too closely at the structure of the document, but it
>>>>> seems to be a fairly average, if large, dataset.
>>>>>
>>>>> Other than that, Google isn't being very friendly. Querying "xml test
>>>>> documents" lists a bunch of XML tutorials and unit testing stuff, while
>>>>> "large xml test documents" seems to focus on *really* big documents
>>>>> (like, > 1 GB), for which I suspect the RELOAD file format would break
>>>>> down :)
>>>>>
>>>>> --
>>>>> Mike
>>>>
>>>> So here are my results (my machine is a 7 year old 3GHz pentium 4 with
>>>> 1GB of RAM):
>>>>
>>>> Beforehand:
>>>>
>>>> (I believe that this version of xml2reload did not include the
>>>> attributes, which add about 30% to the .reld size)
>>>>
>>>> bash-3.1$ xml2reload ../medline08n0059.xml medline.reld
>>>> Loaded XML document in 4839 ms
>>>> Parsed XML document in 74792 ms
>>>> Optimised document in 9859 ms
>>>> Serialized document in 1022907 ms
>>>> Tore down memory in 3891 ms
>>>> Finished in 1116305 ms
>>>>
>>>>
>>>> bash-3.1$ time reload2reload plotdict.reld plotdict2.reld
>>>> Loaded document in 30 ms
>>>> Serialized document in 2851 ms
>>>> Tore down memory in 1 ms
>>>> Finished in 2899 ms
>>>>
>>>> real    0m2.919s
>>>> user    0m0.112s
>>>> sys     0m0.188s
>>>>
>>>> (where reload2reload is obviously just a 10 liner)
>>>>
>>>> Afterwards:
>>>>
>>>> bash-3.1$ time xml2reload ../medline08n0059.xml medline.reld
>>>> Loaded XML document in 4207 ms
>>>> Parsed XML document in 6023 ms
>>>> Optimised document in 10631 ms
>>>> Serialized document in 2623 ms
>>>> Tore down memory in 3097 ms
>>>> Finished in 26596 ms
>>>>
>>>> real    0m26.699s
>>>> user    0m24.018s
>>>> sys     0m1.100s
>>>>
>>>> (Also, running reload2reload on medline.reld (which is 31MB) required
>>>> about 142MB of memory)
>>>>
>>>> bash-3.1$ reload2reload plotdict.reld plotdict2.reld
>>>> Loaded document in 13 ms
>>>> Serialized document in 21 ms
>>>> Tore down memory in 8 ms
>>>> Finished in 60 ms
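
The before/after xml2reload timings quoted above can be turned into per-phase speedup ratios with a few lines of Python. This is only a sketch of the arithmetic: the dictionary names are made up for illustration, the numbers are copied from the two runs above, and the comparison is not strictly like-for-like because the earlier run is believed not to have included attributes.

```python
# Timings in milliseconds, copied from the two xml2reload runs quoted above.
# The dict and key names are illustrative only.
before_ms = {
    "load": 4839,
    "parse": 74792,
    "optimise": 9859,
    "serialise": 1022907,
    "teardown": 3891,
    "total": 1116305,
}
after_ms = {
    "load": 4207,
    "parse": 6023,
    "optimise": 10631,
    "serialise": 2623,
    "teardown": 3097,
    "total": 26596,
}


def speedups(before, after):
    """Return before/after ratio per phase (values above 1 mean faster)."""
    return {phase: before[phase] / after[phase] for phase in before}


ratios = speedups(before_ms, after_ms)
for phase, ratio in ratios.items():
    print(f"{phase:10s} {ratio:8.2f}x")
```

On these figures the overall run is roughly 42 times faster, almost all of it from serialisation (roughly 390 times) and parsing (roughly 12 times); the optimise phase is essentially unchanged.
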
>>>
>>> After running a few tests, I discovered the major bottleneck in the
>>> private heap implementation: a tiny bit of debugging code which didn't do
>>> a whole lot... except enumerate the entire heap on every allocation.
>>
>> Ah, I was wondering why you said that in tests it took 20 min to
>> optimise the parsed XML when it only took me 12 seconds.
>>
>>> When I got rid of it, everything sped up enormously!
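
To illustrate why a heap walk on every allocation hurts so much, here is a toy model in Python. It is not the RELOAD private heap: ToyHeap, its fields and visits_for are hypothetical names, and the "debug check" is just a stand-in for any pass that touches every live block. The point is only that n allocations then cost on the order of n*n/2 block visits instead of n.

```python
# Toy model only: ToyHeap is hypothetical and unrelated to the real
# RELOAD private heap code. It counts the work done by a debug walk.
class ToyHeap:
    def __init__(self, debug_check=False):
        self.blocks = []           # sizes of live allocations
        self.debug_check = debug_check
        self.blocks_visited = 0    # total blocks touched by debug walks

    def _walk(self):
        # Stand-in for a debugging pass that enumerates the whole heap.
        total = 0
        for size in self.blocks:
            total += size
            self.blocks_visited += 1
        return total

    def alloc(self, size):
        if self.debug_check:
            self._walk()           # cost grows with the number of live blocks
        self.blocks.append(size)
        return len(self.blocks) - 1  # index used as a fake handle


def visits_for(n, debug_check):
    """Allocate n blocks and report how many blocks the debug walks touched."""
    heap = ToyHeap(debug_check)
    for _ in range(n):
        heap.alloc(16)
    return heap.blocks_visited


for n in (1000, 2000, 4000):
    print(n, visits_for(n, True), visits_for(n, False))
```

Doubling the number of allocations roughly quadruples the debug work, which is the kind of growth that would turn seconds into many minutes on a document with millions of nodes.
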
>>>
>>> With private heap:
>>> Loading XML document...
>>> Loaded XML document in 5139 ms
>>> Starting memory usage: 301
>>> Parsing document...
>>> Parsed XML document in 6765 ms
>>> Memory usage: 237104215
>>> Optimised document in 11849 ms
>>> Memory usage: 157105826
>>> Serialized document in 1138 ms
>>> Tore down memory in 1974 ms
>>> Finished in 28314 ms
>>>
>>> Without private heap:
>>> Loading XML document...
>>> Loaded XML document in 4974 ms
>>> Starting memory usage: 0
>>> Parsing document...
>>> Parsed XML document in 6989 ms
>>> Memory usage: 0
>>> Optimised document in 12147 ms
>>> Memory usage: 0
>>> Serialized document in 1207 ms
>>> Tore down memory in 3257 ms
>>> Finished in 28577 ms
>>
>> Interesting that in those runs it took less than half as long to
>> serialise as on my machine, when everything else is near equal.
>> Difference in C stdio implementation or kernel file system, I'd guess.
>
> Hmm, didn't notice that. Mildly strange!
>
>>> I ran each a few times, and they all seem to be about the same. The only
>>> big difference I can see is the "Tore down memory" phase, which is indeed
>>> faster with the private heap (by 33%!) as intended. I suspect, though,
>>> that it's not enough to save it. Oh well, it was fun to get working, now
>>> it'll be fun to excise!
>>
>> Hang on, "Tore down memory" includes freeing the XML document too. The
>> RELOAD doc teardown time could still be pretty significant. Try
>> instead timing opening a .reld and closing it.
>
> Hmm, you're right. Really, the XML document should be freed as soon as
> possible. I will investigate.
>
> That said, I've already pulled everything. I was just in the middle of
> testing to make sure I didn't break anything. But, I'll try this before I
> commit it.
>
>>> Anyway, the whole process used ~650 Megs of RAM, which is broken up as
>>> such:
>>> - Approximately 423 Megs used by libxml2 to load the XML document
>>> - Exactly 226.12 Megs (as seen above) used by RELOAD to load the
>>> unoptimized RELOAD document
>>
>> That's bizarre; as I said, it took reload2reload 140MB to load the
>> resulting RELOAD document for me.
>
> Yes, that's the optimized version:
>
> Optimised document in 11849 ms
> Memory usage: 157105826 [ == 149.8 Megs]


Whoops, misread.
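
For reference, the two "Memory usage" figures in Mike's private-heap run convert to megabytes as sketched below. The helper name to_mib is made up, and the sketch assumes "Megs" means 2^20 bytes, which is consistent with the 226.12 and 149.8 figures quoted above.

```python
MIB = 1024 * 1024  # assumption: "Megs" in this thread means 2**20 bytes


def to_mib(n_bytes):
    """Convert a byte count to mebibytes."""
    return n_bytes / MIB


unoptimised = to_mib(237104215)  # "Memory usage" after parsing
optimised = to_mib(157105826)    # "Memory usage" after optimising

print(f"unoptimised: {unoptimised:.2f} MiB")
print(f"optimised:   {optimised:.2f} MiB")
print(f"saved:       {unoptimised - optimised:.2f} MiB")
```

So the 226.12 Megs is the unoptimised document and the roughly 150 Megs is the optimised one, a reduction of about a third.
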

>>> I am going to assume that half of the difference is by the nodes' name
>>> strings alone.
>>>
>>> Anyway, mission accomplished, I guess. Now to find a 1Gig document, and
>>> see how long that takes :D
>>> --
>>> Mike
>>
>> I would try it, but unfortunately libxml is obviously going to exhaust
>> the 2GB address space if you load in the whole thing. xml2reload on
>> that 64MB file requires 666MB of virtual memory here.
>
> Hmm... We need someone with a 64-bit processor. And, a 64-bit version of
> libxml (which I assume, without looking, exists).

I have those, but I do not have a 64-bit FreeBASIC compiler, which wrecks
the experiment.
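
A back-of-envelope sketch of why a 1 GB document will not fit in a 32-bit process: it assumes, purely for illustration, that xml2reload's virtual memory use scales linearly with the size of the XML file at the ratio observed here (666MB for the 64MB file). The function and constant names are made up.

```python
# Assumption: memory use scales linearly with input size at the ratio
# observed in this thread (666 MB of virtual memory for a 64 MB XML file).
OBSERVED_RATIO = 666 / 64
ADDRESS_SPACE_MB = 2 * 1024  # the 2GB address space mentioned above


def estimated_peak_mb(xml_mb, ratio=OBSERVED_RATIO):
    """Estimate peak virtual memory (MB) for an XML file of xml_mb megabytes."""
    return xml_mb * ratio


estimate_1gb = estimated_peak_mb(1024)
largest_input_mb = ADDRESS_SPACE_MB / OBSERVED_RATIO

print(f"estimated peak for a 1 GB file: {estimate_1gb:.0f} MB")
print(f"largest input that fits in 2 GB: about {largest_input_mb:.0f} MB")
```

Under that assumption a 1 GB file would want around 10 GB of address space, and anything much over about 200MB of XML would already hit the limit.
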

> --
> Mike
> _______________________________________________
> Ohrrpgce mailing list
> [email protected]
> http://lists.motherhamster.org/listinfo.cgi/ohrrpgce-motherhamster.org
>
