On 30 May 2010 20:10, Mike Caron <[email protected]> wrote:
> On 30/05/2010 4:03 AM, Ralph Versteegen wrote:
>>
>> On 30 May 2010 19:30, Mike Caron <[email protected]> wrote:
>>>
>>> On 30/05/2010 2:53 AM, Ralph Versteegen wrote:
>>>>
>>>> On 28 May 2010 13:04, Mike Caron <[email protected]> wrote:
>>>>>
>>>>> On 27/05/2010 8:38 PM, Ralph Versteegen wrote:
>>>>>>
>>>>>> On 28 May 2010 11:25, James Paige <[email protected]> wrote:
>>>>>>>
>>>>>>> On Thu, May 27, 2010 at 07:07:38PM -0400, Mike Caron wrote:
>>>>>>>>
>>>>>>>> On 27/05/2010 6:38 PM, James Paige wrote:
>>>>>>>>>
>>>>>>>>> Mike, was there any special reason why Reload.SerializeXML uses
>>>>>>>>> print statements rather than writing to a file?
>>>>>>>>>
>>>>>>>>> ---
>>>>>>>>> James
>>>>>>>>
>>>>>>>> I wrote it as a debugging function. If it wrote to a file, then I
>>>>>>>> wouldn't be able to see it on screen in reloadtest! :)
>>>>>>>> --
>>>>>>>> Mike
>>>>>>>
>>>>>>> I guess what I am really looking for is a reload2xml command-line
>>>>>>> tool so I can easily debug reload files on disk..
>>>>>>>
>>>>>>> and actually it doesn't matter that SerializeXML prints to the
>>>>>>> console, because I could just do
>>>>>>>
>>>>>>> reload2xml somefile.reld > somefile.xml
>>>>>>>
>>>>>>> ---
>>>>>>> James
>>>>>>
>>>>>> Writing to standard output is the Unix way anyway!
>>>>>>
>>>>>> Speaking of xml, Mike mentioned a couple weeks ago that the Reload
>>>>>> code grinds to a halt when processing some translated 64MB xml
>>>>>> document. Since I enjoy optimisation to a rather evil degree, I'd like
>>>>>> to look at it sometime. What are some good testcases?
>>>>>
>>>>> Right at this moment, I don't feel like touching any of that stuff, so
>>>>> go nuts.
>>>>>
>>>>> A few tips:
>>>>>
>>>>> - I don't think the private heap has anything to do with potential
>>>>> performance issues. They're the same calls being done by the runtime,
>>>>> just in a different memory block.
>>>>> - One thing I never thought about doing is compiling with the -profile
>>>>> switch.
>>>>> - I never did any formal performance tuning, being as I subscribe to
>>>>> the "make it work, then make it fast" camp. All that ZString crap was
>>>>> to alleviate my fears of memory corruption caused by having Strings in
>>>>> UDTs.
>>>>>
>>>>> This is the document I mentioned: (warning: ~5 Megs compressed, 64 Megs
>>>>> uncompressed)
>>>>>
>>>>> http://taleotc.com/medline08n0059.zip
>>>>>
>>>>> I haven't looked too closely at the structure of the document, but it
>>>>> seems to be a fairly average, if large, dataset.
>>>>>
>>>>> Other than that, Google isn't being very friendly. Querying "xml test
>>>>> documents" lists a bunch of XML Tutorials and Unit testing stuff, while
>>>>> "large xml test documents" seems to focus on *really* big documents
>>>>> (like, > 1 Gb), for which I suspect the RELOAD file format would break
>>>>> down :)
>>>>>
>>>>> --
>>>>> Mike
>>>>
>>>> So here are my results (my machine is a 7 year old 3GHz pentium 4 with
>>>> 1GB of RAM):
>>>>
>>>> Beforehand:
>>>>
>>>> (I believe that this version of xml2reload did not include the
>>>> attributes, which add about 30% to the .reld size)
>>>>
>>>> bash-3.1$ xml2reload ../medline08n0059.xml medline.reld
>>>> Loaded XML document in 4839 ms
>>>> Parsed XML document in 74792 ms
>>>> Optimised document in 9859 ms
>>>> Serialized document in 1022907 ms
>>>> Tore down memory in 3891 ms
>>>> Finished in 1116305 ms
>>>>
>>>>
>>>> bash-3.1$ time reload2reload plotdict.reld plotdict2.reld
>>>> Loaded document in 30 ms
>>>> Serialized document in 2851 ms
>>>> Tore down memory in 1 ms
>>>> Finished in 2899 ms
>>>>
>>>> real 0m2.919s
>>>> user 0m0.112s
>>>> sys 0m0.188s
>>>>
>>>> (where reload2reload is obviously just a 10 liner)
>>>>
>>>> ============Afterwards:=======
>>>>
>>>> bash-3.1$ time xml2reload ../medline08n0059.xml medline.reld
>>>> Loaded XML document in 4207 ms
>>>> Parsed XML document in 6023 ms
>>>> Optimised document in 10631 ms
>>>> Serialized document in 2623 ms
>>>> Tore down memory in 3097 ms
>>>> Finished in 26596 ms
>>>>
>>>> real 0m26.699s
>>>> user 0m24.018s
>>>> sys 0m1.100s
>>>>
>>>> (Also, running reload2reload on medline.reld (which is 31MB) required
>>>> about 142MB of memory)
>>>>
>>>> bash-3.1$ reload2reload plotdict.reld plotdict2.reld
>>>> Loaded document in 13 ms
>>>> Serialized document in 21 ms
>>>> Tore down memory in 8 ms
>>>> Finished in 60 ms
>>>
>>> After running a few tests, I discovered the major bottleneck in the
>>> private heap implementation: A tiny bit of debugging code which didn't
>>> do a whole lot... except enumerate the entire heap on every allocation.
>>
>> Ah, I was wondering why you said that in tests it took 20 min to
>> optimise the parsed XML when it only took me 12 seconds.
>>
>>> When I got rid of it, everything sped up enormously!
>>>
>>> With private heap:
>>> Loading XML document...
>>> Loaded XML document in 5139 ms
>>> Starting memory usage: 301
>>> Parsing document...
>>> Parsed XML document in 6765 ms
>>> Memory usage: 237104215
>>> Optimised document in 11849 ms
>>> Memory usage: 157105826
>>> Serialized document in 1138 ms
>>> Tore down memory in 1974 ms
>>> Finished in 28314 ms
>>>
>>> Without private heap:
>>> Loading XML document...
>>> Loaded XML document in 4974 ms
>>> Starting memory usage: 0
>>> Parsing document...
>>> Parsed XML document in 6989 ms
>>> Memory usage: 0
>>> Optimised document in 12147 ms
>>> Memory usage: 0
>>> Serialized document in 1207 ms
>>> Tore down memory in 3257 ms
>>> Finished in 28577 ms
>>
>> Interesting that in those runs it took less than half as long to
>> serialise as on my machine, when everything else is near equal.
>> Difference in C stdio implementation or kernel file system, I'd guess.
>
> Hmm, didn't notice that. Mildly strange!
>
>>> I ran each a few times, it seems that they're all about the same. The
>>> only big difference I can see is the "Tore down memory" phase, which is
>>> indeed faster with the private heap (by 33%!) as intended. I suspect,
>>> though, that it's not enough to save it. Oh well, it was fun to get
>>> working, now it'll be fun to excise!
>>
>> Hang on, "Tore down memory" includes freeing the XML document too. The
>> RELOAD doc teardown time could still be pretty significant. Try
>> instead timing opening a .reld and closing it.
>
> Hmm, you're right. Really, the XML document should be freed as soon as
> possible. I will investigate.
>
> That said, I've already pulled everything. I was just in the middle of
> testing to make sure I didn't break anything. But, I'll try this before I
> commit it.
>
>>> Anyway, the whole process used ~650 Megs of RAM, which is broken up as
>>> such:
>>> - Approximately 423 Megs used by libxml2 to load the XML document
>>> - Exactly 226.12 Megs (as seen above) used by RELOAD to load the
>>> unoptimized RELOAD document
>>
>> That's bizarre; as I said, it took reload2reload 140MB to load the
>> resulting RELOAD document for me.
>
> Yes, that's the optimized version:
>
> Optimised document in 11849 ms
> Memory usage: 157105826 [ == 149.8 Megs]

Whoops, misread.

>>> I am going to assume that half of the difference is by the nodes' name
>>> strings alone.
>>>
>>> Anyway, mission accomplished, I guess. Now to find a 1Gig document, and
>>> see how long that takes :D
>>> --
>>> Mike
>>
>> I would try it, but unfortunately libxml is obviously going to exhaust
>> the 2GB address space if you load in the whole thing. xml2reload on
>> that 64MB file requires 666MB of virtual memory here.
>
> Hmm... We need someone with a 64-bit processor. And, a 64-bit version of
> libxml (which I assume, without looking, exists).

I have those, but I do not have a 64-bit FreeBasic compiler, wrecking
the experiment.

> --
> Mike
> _______________________________________________
> Ohrrpgce mailing list
> [email protected]
> http://lists.motherhamster.org/listinfo.cgi/ohrrpgce-motherhamster.org
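
For anyone reading this thread later: the bottleneck Mike describes (debugging code that enumerates the entire heap on every allocation) can be illustrated with a toy model. The sketch below is in Python, not FreeBasic, and it is not the RELOAD private heap. ToyHeap, walk_cost and the fixed 16-byte block size are all hypothetical names and values invented for illustration. It only counts how many blocks a per-allocation validation pass would touch, assuming each pass visits every live block once.

```python
class ToyHeap:
    """Hypothetical stand-in for a private heap; NOT the RELOAD allocator."""

    def __init__(self, debug_check):
        self.blocks = []            # sizes of the live blocks
        self.debug_check = debug_check
        self.blocks_visited = 0     # total work done by the debug walk

    def _validate(self):
        # Debug-only sanity pass: touches every live block once.
        for size in self.blocks:
            assert size > 0
            self.blocks_visited += 1

    def allocate(self, size):
        # The (assumed) debug hook runs before each allocation.
        if self.debug_check:
            self._validate()
        self.blocks.append(size)
        return len(self.blocks) - 1  # handle = index of the new block


def walk_cost(allocations, debug_check):
    """Return how many blocks the debug walk visited in total."""
    heap = ToyHeap(debug_check)
    for _ in range(allocations):
        heap.allocate(16)            # arbitrary illustrative block size
    return heap.blocks_visited


if __name__ == "__main__":
    for n in (1_000, 10_000):
        print(n, walk_cost(n, True), walk_cost(n, False))
```

Under these assumptions the k-th allocation walks k-1 blocks, so n allocations cost n(n-1)/2 visits in total: ten times as many nodes means roughly a hundred times as much checking. That would be consistent with a check that "didn't do a whole lot" still dominating the run time on a 64MB document.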
