Re: [PR] Remove absolute URI from SchemaFileLocation to improve reproducibility [daffodil]

via GitHub Fri, 30 Aug 2024 06:58:53 -0700


stevedlawrence commented on PR #1286:
URL: https://github.com/apache/daffodil/pull/1286#issuecomment-2321350663


   After a number of runs, I was able to reproduce the issue of different saved 
parsers in GitHub twice (I'm still unable to do it locally) and download the 
differing saved parsers. I was able to dump the serialized processors to text 
(using 
[NickstaDB/SerializationDumper](https://github.com/NickstaDB/SerializationDumper))
 and do a diff, and found the only difference in both cases was this:
   ```diff
     delimiters
       (array)
   -     TC_ARRAY - 0x75
   -       TC_REFERENCE - 0x71
   -         Handle - 8258633 - 0x00 7e 04 49
   -       newHandle 0x00 7e 04 c6
   -       Array size - 0 - 0x00 00 00 00
   -       Values
   +     TC_REFERENCE - 0x71
   +       Handle - 8258634 - 0x00 7e 04 4a
   ```
   The `delimiters` member is the `Array[DelimiterParseEv]` in the 
`DelimiterStackParser`. I'm not 100% sure how to interpret this, but I think in 
the first case the delimiter array is a uniquely allocated zero-length array, 
where the reference is to the type (i.e. the class description for the 
`DelimiterParseEv` array). In the second case, it just points to an already 
existing zero-length array. So in one case the zero-length array is shared, in 
the other it's not. The two are functionally identical; one just has extra 
allocations.
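
As a rough illustration of the effect described above (not Daffodil code), the following Java sketch serializes the same shape of object twice: once with two fields pointing at one shared zero-length array, and once with two separately allocated zero-length arrays. `Holder`, its fields, and the use of `String[]` are hypothetical stand-ins for `DelimiterStackParser` and `Array[DelimiterParseEv]`. The shared case should write the second field as a back-reference, while the distinct case writes a second full array record, so the byte streams differ even though the objects behave the same.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Arrays;

public class EmptyArraySharing {

    // Hypothetical stand-in for an object with two zero-length array members.
    static class Holder implements Serializable {
        private static final long serialVersionUID = 1L;
        final String[] first;
        final String[] second;

        Holder(String[] first, String[] second) {
            this.first = first;
            this.second = second;
        }
    }

    // Serialize an object graph with standard Java serialization.
    static byte[] serialize(Object o) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(o);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        String[] shared = new String[0];

        // Both fields reference the same array instance.
        byte[] sharedBytes = serialize(new Holder(shared, shared));
        // Each field gets its own freshly allocated zero-length array.
        byte[] distinctBytes = serialize(new Holder(new String[0], new String[0]));

        System.out.println("shared:   " + sharedBytes.length + " bytes");
        System.out.println("distinct: " + distinctBytes.length + " bytes");
        System.out.println("identical streams: " + Arrays.equals(sharedBytes, distinctBytes));
    }
}
```

If something upstream only sometimes shares the zero-length array, the saved parser would flip between these two encodings from run to run.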
   
   I wonder if maybe there is some optimization somewhere that detects 
zero-length arrays and shares them instead of allocating new arrays? And 
sometimes that happens and sometimes it doesn't, which leads to differences in 
serialization?
   
   I imagine we could maybe do something like manually detect when an array is 
zero-length and make sure we use `Array.empty()` or some globally allocated 
zero-length array, but I imagine we're unlikely to ever find all cases so 
reproducibility will still be an issue.
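
A minimal sketch of that normalization idea, written in Java rather than against Daffodil's actual classes: every zero-length array is swapped for one globally allocated instance before it is stored, so the serialized form no longer depends on how the caller allocated it. `EMPTY`, `canonical`, and `Holder` are hypothetical names, and the sketch only covers the one place where it is applied, which is exactly the limitation noted above.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Arrays;

public class CanonicalEmptyArrays {

    // Single globally allocated zero-length array (hypothetical name).
    static final String[] EMPTY = new String[0];

    // Replace any zero-length array with the shared instance; leave others untouched.
    static String[] canonical(String[] a) {
        return (a != null && a.length == 0) ? EMPTY : a;
    }

    // Hypothetical holder that normalizes its array members on construction.
    static class Holder implements Serializable {
        private static final long serialVersionUID = 1L;
        final String[] first;
        final String[] second;

        Holder(String[] first, String[] second) {
            this.first = canonical(first);
            this.second = canonical(second);
        }
    }

    // Serialize an object graph with standard Java serialization.
    static byte[] serialize(Object o) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(o);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        String[] shared = new String[0];
        byte[] fromShared = serialize(new Holder(shared, shared));
        byte[] fromDistinct = serialize(new Holder(new String[0], new String[0]));

        // With normalization, both inputs produce the same object graph shape.
        System.out.println("identical streams: " + Arrays.equals(fromShared, fromDistinct));
    }
}
```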
   
   On a positive note, I have confirmed the difference has nothing to do with 
paths, so that issue preventing reproducibility is definitely fixed. But I 
wonder if we should disable this test to avoid random CI failures? Maybe in the 
future we can switch to some serialization format that is more reliably 
consistent?
   
   Any thoughts?


