And here's where we do some logic and a more detailed comment about it:
https://github.com/apache/daffodil/blob/main/daffodil-runtime1/src/main/scala/org/apache/daffodil/runtime1/processors/parsers/PState.scala#L346-L362
So I think we do already do copy-on-write for variables when parsing.
On 2024-01-09 05:28 PM, Steve Lawrence wrote:
There's actually a comment in the PState captureFrom() method used to
capture state during PoUs:
// Note that this is intentionally a shallow copy. This normally would
// not work because the variable map is mutable so other state changes
// could mutate this snapshot. This is avoided by carefully changing the
// PState variable map to a deep copy of this variable map right before a
// change is made. This essentially makes the PState variable map behave
// as copy-on-write.
this.variableMap = ps.variableMap
Assuming that is all true and done correctly, we might actually already
do what you suggest, at least for variables. But there might be other
parts of PState whose performance would improve if we changed them to
copy-on-write. We may want to do some profiling on formats with lots of
PoUs to see if anything shows up.
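A minimal sketch of the mechanism that comment describes, not the actual PState code. The names VarMap, State, snapshot, setVariable, and restore are hypothetical and exist only for illustration; the real variable map and PState have much more in them.

```scala
import scala.collection.mutable

object CowSketch {
  // Hypothetical stand-in for the variable map: a thin wrapper over a mutable map.
  final class VarMap(private val vars: mutable.Map[String, String]) {
    def get(name: String): Option[String] = vars.get(name)
    def set(name: String, value: String): Unit = vars.update(name, value)
    // Copies the underlying mutable map so the two instances no longer alias.
    def deepCopy(): VarMap = new VarMap(vars.clone())
  }

  // Hypothetical stand-in for the parser state.
  final class State(var variableMap: VarMap) {
    // True while variableMap may also be referenced by a snapshot.
    private var shared = false

    // Taken at a point of uncertainty: a shallow copy, so nothing is copied here.
    def snapshot(): VarMap = { shared = true; variableMap }

    // Every write first tests whether the map is shared; if so it swaps in
    // a private deep copy before mutating, leaving the snapshot untouched.
    def setVariable(name: String, value: String): Unit = {
      if (shared) { variableMap = variableMap.deepCopy(); shared = false }
      variableMap.set(name, value)
    }

    // Backtracking: reinstate the snapshot. The caller still holds it,
    // so it counts as shared again.
    def restore(snap: VarMap): Unit = { variableMap = snap; shared = true }
  }
}
```

In this sketch a PoU that never writes a variable pays only for a reference assignment and a flag update; the deep copy happens at most once per snapshot, on the first write after it.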
On 2024-01-09 04:03 PM, Mike Beckerle wrote:
Actually, I haven't measured it, but there are 4 built-in variables, so
even if a schema introduces no new variables of its own there is overhead
to deal with copying the state of 4 variables just in case you need to
backtrack them, and this overhead occurs for every point of uncertainty.
Also more and more schemas are using variables. We're finding them very
very useful.
Nevertheless I think the vast bulk of points of uncertainty will come and
go with no variables being touched. They tend to get used for specific
things, but not all over the place.
For example, several schemas have a feature to capture bad data into a
hexBinary Blob element so as to be able to keep parsing a large file,
instead of failing on the first bad data item.
Whether they do this or just fail is controlled by a variable. But that
variable is not touched unless legal parsing fails. So one would hope the
vast bulk of the data processing would never touch that variable, yet every
single record in the data file is a point of uncertainty.
On Tue, Jan 9, 2024 at 1:49 PM Larry Barber <larry.bar...@nteligen.com>
wrote:
Seems like the benefit would only be significant if you were dealing with
lots of variables.
-----Original Message-----
From: Mike Beckerle <mbecke...@apache.org>
Sent: Tuesday, January 9, 2024 1:39 PM
To: dev@daffodil.apache.org
Subject: Thoughts on on demand copying of parser state
Right now we copy the state of the parser as every point of uncertainty is
reached.
I am speculating that we could copy on demand. So, for example, if no
variable modifying operation occurs then there would be no overhead to
copy the variable state.
This comes at the cost of each variable-modifying operation doing an
additional test of whether the variable state needs to be copied first.
Thoughts?
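To make the trade-off concrete, here is a rough cost model, not a measurement. It assumes a flat (non-nested) run of points of uncertainty, and the names OnDemandCopyCount, copies, and flagTests are hypothetical.

```scala
object OnDemandCopyCount {
  // writesPerPoU(i) is the number of variable-modifying operations that
  // occur inside the i-th point of uncertainty (assumed non-nested).
  // Returns (copies when copying eagerly, copies when copying on demand).
  def copies(writesPerPoU: Seq[Int]): (Int, Int) = {
    val eager = writesPerPoU.length          // one copy at every PoU
    val onDemand = writesPerPoU.count(_ > 0) // one copy at the first write in a PoU
    (eager, onDemand)
  }

  // The added cost of copying on demand: one "do I need to copy?" test per write.
  def flagTests(writesPerPoU: Seq[Int]): Int = writesPerPoU.sum
}
```

For example, OnDemandCopyCount.copies(Seq(0, 0, 3, 0, 1)) gives (5, 2): five copies when copying at every PoU versus two when copying on demand, at the price of four extra tests.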