stevedlawrence commented on PR #1286: URL: https://github.com/apache/daffodil/pull/1286#issuecomment-2321350663
After a number of runs, I was able to reproduce the issue of different saved parsers in GitHub twice (I'm still unable to do it locally) and download the differing saved parsers. I was able to dump the serialized processors to text (using [NickstaDB/SerializationDumper](https://github.com/NickstaDB/SerializationDumper)) and do a diff, and found the only difference in both cases was this: ```diff delimiters (array) - TC_ARRAY - 0x75 - TC_REFERENCE - 0x71 - Handle - 8258633 - 0x00 7e 04 49 - newHandle 0x00 7e 04 c6 - Array size - 0 - 0x00 00 00 00 - Values + TC_REFERENCE - 0x71 + Handle - 8258634 - 0x00 7e 04 4a ``` The `delimiters` member is the `Array[DelimiterParseEv]` in the `DelimiterStackParser`. I'm not 100% sure how to interpret this, but I think in the first case the delimier array is a uniquely allocated zero-length array, where the reference is to the type (i.e. the class description for `DelimiterParseEv)`. In the second case, it just points to an already existing array zero-length array. So in one case the zero-length array is shared, in the other it's not. So functionally exactly the same, just one has extra allocations. I wonder if maybe there is some optimization somewhere something detects zero-length arrays and can share them instead of allocating new arrays? And sometimes that happens and sometimes it doesn't, which leads to differences in serialization? I imagine we could maybe do something like manually detect when an array is zero length and make sure we use Array.empty() or some globally allocated zero-length array, but I imagine we're unlikely to ever find all cases so reproducibility will still be an issue. On a positive note, I have confirmed the difference has nothing to do with paths, so that issue preventing reproducibility is definitely fixed. But I wonder if we should disable this test to avoid random CI failures? Maybe in the future we can switch to some serialization format that is more reliably consistent? Any thoughts? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
