[GitHub] [daffodil] stevedlawrence commented on pull request #674: Work around issue resetting predefined vars

GitBox Mon, 15 Nov 2021 08:07:56 -0800


stevedlawrence commented on pull request #674:
URL: https://github.com/apache/daffodil/pull/674#issuecomment-969063293



   I did some testing on three different approaches with the PCAP schema, which 
I think is a good representation of a fairly complex schema with a decent number 
of PoUs and some variable usage. Note that this version does not use any of the 
layering features, which rely on variables quite a bit more.
   
   The three approaches were:
   1. Deep copy the variable map every time we enter a PoU
   2. Deep copy the variable map the first time a variable is modified inside a 
PoU
   3. Copy instance information the first time a variable instance is modified 
inside a PoU
   
   Each one is a bit more complex than the previous. The first one is simple to 
implement, but potentially adds a lot of overhead. The second one is a bit more 
complex, but only has overhead if a variable is modified. But if the variable 
map grows large, the deep copy might be noticeable. The last one is yet again 
more complex, but minimizes copying if there are lots of variables or new 
variable instances.
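
To make the second approach concrete, here is a minimal sketch of deep copy on write. It is not Daffodil's implementation: `ParseState`, `enter_pou`, `set_variable`, and `resolve_pou` are hypothetical names, and the variable map is modeled as a plain dict. Entering a PoU only records a reference to the current map; the deep copy happens on the first write while a saved reference still points at the live map.

```python
import copy


class ParseState:
    """Hypothetical parser state; not Daffodil's actual classes."""

    def __init__(self, variables):
        self.variables = variables  # name -> value
        self._saved = []            # one saved map reference per open PoU
        self.copies_made = 0        # counts deep copies, for illustration

    def enter_pou(self):
        # Entering a PoU is cheap: remember the current map, copy nothing.
        self._saved.append(self.variables)

    def set_variable(self, name, value):
        # If an open PoU still shares the live map, copy before writing
        # so the saved snapshot stays untouched.
        if any(saved is self.variables for saved in self._saved):
            self.variables = copy.deepcopy(self.variables)
            self.copies_made += 1
        self.variables[name] = value

    def resolve_pou(self, success):
        saved = self._saved.pop()
        if not success:
            # Backtrack: discard any changes made inside the PoU.
            self.variables = saved


state = ParseState({"x": 1})
state.enter_pou()
state.set_variable("x", 2)   # first write inside the PoU: one deep copy
state.set_variable("x", 3)   # second write: no further copy
state.resolve_pou(success=False)
```

A PoU that never touches a variable pays only for the list append, which is the property that separates this approach from copying on every PoU entry.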
   
   The numbers below are for parsing 20,000 PCAP files. The null infoset 
outputter was used to avoid the overhead of writing the infoset.
   
   The three stats gathered are total time (smaller is better) and maximum and 
average rate in files/second (larger is better). The numbers are also compared 
against the current 3.1.0 release, with percent change in parentheses.
   
   |                        |3.1.0    |Deep Copy Every PoU|Deep Copy On Write|Copy Instance On Write|
   |:---                    | ---:    | ---:              | ---:             | ---:                 |
   |Total time (seconds)    |39.97    |41.42 (3.62%)      |36.59 (-8.45%)    |36.55 (-8.55%)        |
   |Max rate (files/second) |637.72   |570.15 (-10.60%)   |628.87 (-1.39%)   |639.01 (0.20%)        |
   |Avg rate (files/second) |500.40   |483.79 (-3.32%)    |546.65 (9.24%)    |547.21 (9.35%)        |
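
For reference, the percentages in parentheses are relative changes against the 3.1.0 column. The sketch below recomputes two of them from the rounded values in the table; `percent_change` is a helper name introduced here for illustration. Because the inputs are already rounded, a recomputed value can differ from the table in the last digit.

```python
def percent_change(baseline, measured):
    """Relative change of `measured` against `baseline`, in percent."""
    return (measured - baseline) / baseline * 100.0


# Values taken from the table; 3.1.0 is the baseline.
total_time_change = percent_change(39.97, 36.59)   # deep copy on write, total time
avg_rate_change = percent_change(500.40, 546.65)   # deep copy on write, avg rate

print(f"total time: {total_time_change:.2f}%")
print(f"avg rate:   {avg_rate_change:.2f}%")
```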
   
   So deep copy on write and copy instance on write are virtually the same, 
with maybe a slight advantage to the copy instance on write, but the 
differences could easily be in the noise of the JVM. Deep copying every PoU is 
definitely a bad idea.
   
   Also, note that I think the copy-on-write approaches are faster than 3.1.0 
because we currently allocate an empty `Map` every time we enter a PoU just in 
case we have to do variable tracking. With the copy-on-write changes, that `Map` 
is only created if a variable changes. Creating an empty map can be expensive, 
even if it's never used.
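
The lazy allocation described above can be sketched as follows. This only illustrates the pattern, not Daffodil's code: `PouMark`, `record_change`, and `changes` are hypothetical names, and allocation costs in Python differ from those on the JVM.

```python
from types import MappingProxyType

# Shared read-only empty mapping, so reading from an untouched mark allocates nothing.
_EMPTY = MappingProxyType({})


class PouMark:
    """Hypothetical per-PoU bookkeeping for variable changes."""

    def __init__(self):
        # No dict is created on PoU entry; most PoUs never change a variable.
        self._changed = None

    def record_change(self, name, old_value):
        if self._changed is None:
            self._changed = {}  # allocated only on the first change
        # Keep the value the variable had when the PoU was entered.
        self._changed.setdefault(name, old_value)

    def changes(self):
        return self._changed if self._changed is not None else _EMPTY


mark = PouMark()
untouched = mark.changes()      # the shared empty mapping
mark.record_change("x", 1)
mark.record_change("x", 2)      # later changes keep the original old value
```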
   
   So I think this is a clear win for the copy-on-write stuff. Whether or not 
we add extra complexity to handle per-instance copy on write or entire map 
copy-on-write doesn't seem to matter, at least not with this format. I think if 
a schema had lots of variables or lots of nested new variable instances, then 
maybe the variable map could grow pretty big and make the deep copy expensive, 
but I'm not sure how likely that is.
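
To show why the per-instance variant scales differently, here is a minimal sketch of copy instance on write. It is an illustration under assumptions, not the PR's code: `InstanceCowState` and its methods are hypothetical, and each variable is modeled as a list of instance values. Only the instances of the variable being written are copied, so the cost of a write does not depend on how many other variables the map holds.

```python
class InstanceCowState:
    """Hypothetical parser state with per-variable undo logs."""

    def __init__(self, variables):
        self.variables = variables  # name -> list of instance values
        self._undo = []             # one undo log (or None) per open PoU

    def enter_pou(self):
        self._undo.append(None)     # the undo log is created lazily

    def set_variable(self, name, value):
        if self._undo:
            log = self._undo[-1]
            if log is None:
                log = self._undo[-1] = {}
            if name not in log:
                # First write to this variable in this PoU: save only its instances.
                log[name] = list(self.variables[name])
        self.variables[name][-1] = value

    def resolve_pou(self, success):
        log = self._undo.pop()
        if log is None:
            return
        if not success:
            # Backtrack: restore just the variables that were written.
            for name, saved in log.items():
                self.variables[name] = saved
        elif self._undo:
            # Hand saved values to the enclosing PoU so it can still backtrack.
            outer = self._undo[-1]
            if outer is None:
                outer = self._undo[-1] = {}
            for name, saved in log.items():
                outer.setdefault(name, saved)


state = InstanceCowState({"x": [1], "y": [10]})
state.enter_pou()
state.enter_pou()
state.set_variable("x", 2)        # copies only x's instances
state.resolve_pou(success=True)   # inner PoU succeeds
after_inner = dict(state.variables)
state.resolve_pou(success=False)  # outer PoU fails, x is restored
```

The extra bookkeeping on success (merging the inner log into the outer one) is part of the added complexity this approach carries compared with copying the whole map.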
   
   I think I'm leaning towards the deep copy on write. It's definitely simpler 
than the copy instance on write, and seems to have comparable performance. 


