stevedlawrence commented on pull request #674: URL: https://github.com/apache/daffodil/pull/674#issuecomment-969063293
I did some testing on three different approaches with the PCAP schema, which I think is a good representation of a fairly complex schema with a decent number of PoUs and some variable usage. Note that this version does not use any of the layering stuff, which relies on variables quite a bit more. The three approaches were:

1. Deep copy the variable map every time we enter a PoU
2. Deep copy the variable map the first time a variable is modified inside a PoU
3. Copy instance information the first time a variable instance is modified inside a PoU

Each one is a bit more complex than the previous. The first one is simple to implement, but potentially adds a lot of overhead. The second one is a bit more complex, but only has overhead if a variable is modified. But if the variable map grows large, the deep copy might be noticeable. The last one is more complex again, but minimizes copying if there are lots of variables or new variable instances.

The numbers below are for parsing 20,000 PCAP files. The null infoset outputter was used to avoid the overhead of writing the infoset. The three stats gathered are total time (smaller is better) and maximum and average rate in files/second (larger is better). The numbers are also compared against the current 3.1.0 release, with percent change in parentheses.

| |3.1.0 |Deep Copy Every PoU|Deep Copy On Write|Copy Instance On Write|
|:--- | ---: | ---: | ---: | ---: |
|Total time (seconds) |39.97 |41.42 (3.62%) |36.59 (-8.45%) |36.55 (-8.55%) |
|Max rate (files/second) |637.72 |570.15 (-10.60%) |628.87 (-1.39%) |639.01 (0.20%) |
|Avg rate (files/second) |500.40 |483.79 (-3.32%) |546.65 (9.24%) |547.21 (9.35%) |

So deep copy on write and copy instance on write are virtually the same, with maybe a slight advantage to copy instance on write, but the differences could easily be in the noise of the JVM. Deep copying every PoU is definitely a bad idea.
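To make the second approach (deep copy on write) concrete, here is a minimal sketch. It is not the code from this PR: `VarInstance`, `PouMark`, `VariableState`, and all of their methods are hypothetical names made up for illustration, and variables are simplified to plain `Int` values. The idea is that entering a PoU allocates no snapshot, the first write inside a PoU deep copies the map, and backtracking restores that copy if one exists.

```scala
// Hypothetical stand-ins for illustration; these are not Daffodil's classes.
final class VarInstance(var value: Int) {
  def deepCopy: VarInstance = new VarInstance(value)
}

// A mark for one point of uncertainty (PoU). Entering a PoU allocates
// no snapshot; one is filled in only if a variable is written later.
final class PouMark {
  var snapshot: Option[Map[String, VarInstance]] = None
}

final class VariableState(initial: Map[String, Int]) {
  private var vars: Map[String, VarInstance] =
    initial.map { case (name, v) => name -> new VarInstance(v) }
  private var openPous: List[PouMark] = Nil // innermost PoU first
  var copiesMade: Int = 0                   // counter for demonstration only

  private def copyVars(): Map[String, VarInstance] = {
    copiesMade += 1
    vars.map { case (name, inst) => name -> inst.deepCopy }
  }

  def enterPou(): PouMark = {
    val mark = new PouMark
    openPous = mark :: openPous
    mark
  }

  def get(name: String): Int = vars(name).value

  def set(name: String, v: Int): Unit = {
    // Before mutating in place, give every open PoU that has no
    // snapshot yet its own deep copy of the current state.
    openPous.foreach { m =>
      if (m.snapshot.isEmpty) m.snapshot = Some(copyVars())
    }
    vars(name).value = v
  }

  // Backtrack: restore the snapshot if one was taken, then close the PoU.
  def resetPou(mark: PouMark): Unit = {
    require(openPous.headOption.exists(_ eq mark), "PoUs must close innermost first")
    mark.snapshot.foreach(s => vars = s)
    openPous = openPous.tail
  }

  // Success: keep the current values and drop the snapshot.
  def discardPou(mark: PouMark): Unit = {
    require(openPous.headOption.exists(_ eq mark), "PoUs must close innermost first")
    openPous = openPous.tail
  }
}
```

In this sketch a PoU that never touches a variable costs only the small `PouMark` allocation, while repeated writes inside the same PoU trigger a single copy. Nested PoUs each get their own copy here to keep the sketch simple; a real implementation could likely share or avoid some of those copies.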
Also, note that I think the copy-on-write approaches are faster than 3.1.0 because we currently allocate an empty `Map` every time we enter a PoU just in case we have to do variable tracking stuff. With the copy-on-write changes, that `Map` is only created if a variable changes. Creating an empty map can be expensive, even if it's never used.

So I think this is a clear win for the copy-on-write stuff. Whether we add extra complexity to handle per-instance copy on write or stick with entire-map copy on write doesn't seem to matter, at least not with this format. I think if a schema had lots of variables or lots of nested new variable instances, then maybe the variable map could grow pretty big and make the deep copy expensive, but I'm not sure how likely that is.

I think I'm leaning towards the deep copy on write. It's definitely simpler than the copy instance on write, and seems to have comparable performance.
