Steve Lawrence created DAFFODIL-3065:
----------------------------------------
Summary: Unparse support for infoset prefetching--reduce
suspensions
Key: DAFFODIL-3065
URL: https://issues.apache.org/jira/browse/DAFFODIL-3065
Project: Daffodil
Issue Type: Improvement
Components: Performance, Unparsing
Reporter: Steve Lawrence
Fix For: 4.2.0
The current unparse implementation is written with streaming in mind, but it
does so quite conservatively, which can lead to performance slowdowns.
At a high level, unparsing currently works as follows: an individual unparser
asks for the next unparse event. If that event does not exist yet, we get the
next event from the infoset inputter, update the internal infoset accordingly,
and then return the associated event. This effectively reads only one event at
a time.
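As a rough illustration of the one-event-at-a-time flow described above, here
is a minimal sketch. It is written in Python only for brevity (Daffodil itself
is Scala), and every name in it (Event, Node, OneAtATimeSource, pulls) is
hypothetical and not Daffodil's actual API. Arrays are left out to keep it
short.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional


@dataclass
class Event:
    kind: str  # "startElement" or "endElement" (arrays omitted for brevity)
    name: str


@dataclass
class Node:
    name: str
    children: List["Node"] = field(default_factory=list)


class OneAtATimeSource:
    """Hypothetical event source that pulls from the inputter only on demand."""

    def __init__(self, inputter_events: Iterable[Event]):
        self._inputter = iter(inputter_events)
        self._pending: Optional[Event] = None  # at most one buffered event
        self._stack: List[Node] = []           # currently open elements
        self.root: Optional[Node] = None
        self.pulls = 0                          # events read from the inputter

    def _pull(self) -> Optional[Event]:
        ev = next(self._inputter, None)
        if ev is None:
            return None
        self.pulls += 1
        # Update the internal infoset to reflect the event just read.
        if ev.kind == "startElement":
            node = Node(ev.name)
            if self._stack:
                self._stack[-1].children.append(node)
            else:
                self.root = node
            self._stack.append(node)
        elif ev.kind == "endElement":
            self._stack.pop()
        return ev

    def inspect(self) -> Optional[Event]:
        """Return the next event without consuming it."""
        if self._pending is None:
            self._pending = self._pull()
        return self._pending

    def advance(self) -> Optional[Event]:
        """Return the next event and consume it."""
        ev = self.inspect()
        self._pending = None
        return ev
```

In this model, after an unparser has consumed the start of a parent element and
looked at its first child, later siblings are simply absent from the infoset,
which is the situation that leads to the suspensions discussed next.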
This approach minimizes memory usage since we only store a single event and the
infoset is no larger than what the unparser is currently working on. The
downside to this approach is it is not uncommon for an unparser to reference a
future part of the infoset that does not exist yet, such as accessing the value
of an element or asking how many children are in an array. When we need to
query part of the infoset that does not exist yet, we clone the UState and
create a Suspension that is evaluated later once the required infoset
element(s) exist. This clone and Suspension can create additional overhead.
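The clone-and-suspend pattern can be sketched roughly as follows. This is a toy
model under stated assumptions, not Daffodil's implementation: State stands in
for UState, the Suspension class here is a simplified stand-in, and ArrayNode,
NotYetAvailable, array_count and run_or_suspend are hypothetical names. The
state is copied shallowly so that the copy still points at the shared infoset,
which is assumed here so the suspension can see elements added later.

```python
import copy
from dataclasses import dataclass, field
from typing import Any, Callable, List


class NotYetAvailable(Exception):
    """Raised when a query touches infoset content that is not there yet."""


@dataclass
class ArrayNode:
    items: List[Any] = field(default_factory=list)
    is_final: bool = False  # True once no more children can be added


@dataclass
class State:
    infoset: ArrayNode  # shared between the original state and its clones
    position: int = 0   # stand-in for the rest of the per-unparser state


class Suspension:
    """Deferred task holding a snapshot of the state it was created in."""

    def __init__(self, snapshot: State, task: Callable[[State], Any]):
        self.snapshot = snapshot
        self.task = task
        self.result: Any = None
        self.done = False

    def try_run(self) -> bool:
        try:
            self.result = self.task(self.snapshot)
            self.done = True
        except NotYetAvailable:
            pass  # still blocked; the caller retries later
        return self.done


def array_count(state: State) -> int:
    """Example query: how many children are in the array?"""
    if not state.infoset.is_final:
        raise NotYetAvailable()
    return len(state.infoset.items)


def run_or_suspend(state: State, task: Callable[[State], Any],
                   suspended: List[Suspension]) -> Any:
    """Run the task now if possible; otherwise clone the state and defer it."""
    try:
        return task(state)
    except NotYetAvailable:
        suspended.append(Suspension(copy.copy(state), task))
        return None
```

Every deferred query costs a state copy, a Suspension object, and at least one
retry, which is the overhead that building more of the infoset up front aims to
avoid.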
Instead of creating the infoset one element at a time when an unparser needs
it, we should update the inputter logic so it can read a large number of
infoset events and build a larger section of the infoset, within some tunable
limit, essentially allowing something like infoset prefetching. This tunable
limit could default to a reasonably large value such that, for small infosets,
the infoset could be entirely built prior to any unparsing--this could
eliminate a large number of suspensions and speed up unparsing.
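A minimal sketch of what prefetching within a limit might look like is shown
below. The class name PrefetchingBuilder, the prefetch_limit parameter, and the
choice to count the limit in events are all assumptions made for illustration;
the issue does not say how the tunable would be named or measured. Events are
modeled as plain (kind, name) tuples.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Event = Tuple[str, str]  # ("start" | "end", element name)


class PrefetchingBuilder:
    """Hypothetical builder that reads events in batches of prefetch_limit."""

    def __init__(self, events: Iterable[Event], prefetch_limit: int):
        self._events = iter(events)
        self._limit = prefetch_limit
        self._stack: List[Dict] = []       # currently open elements
        self.root: Optional[Dict] = None
        self.exhausted = False              # True once the inputter is drained

    def fill(self) -> int:
        """Read up to prefetch_limit more events; return how many were read."""
        read = 0
        while read < self._limit:
            ev = next(self._events, None)
            if ev is None:
                self.exhausted = True
                break
            kind, name = ev
            if kind == "start":
                node = {"name": name, "children": []}
                if self._stack:
                    self._stack[-1]["children"].append(node)
                else:
                    self.root = node
                self._stack.append(node)
            else:
                self._stack.pop()
            read += 1
        return read
```

With a limit larger than the number of events, a single fill() call builds the
whole infoset before any unparser runs. With a small limit, fill() would need
to be called again as unparsing progresses, and anything beyond what has been
read so far is still unavailable.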
Note that part of the current logic allocates InfosetAccessor events and adds
them to a queue for each event. This queue holds only 2 items, so it takes very
little memory. But if we did not change anything else, we would likely need to
increase this queue size so that it could store all the events for all parts of
the currently built infoset, which could require a significant amount of
memory.
So an additional change to avoid this memory usage could be similar to how the
parse InfosetWalker works. Instead of maintaining a large queue of events, we
could maintain a single piece of state containing a pointer to the current
infoset element and what the current event is (i.e.
startElement/endElement/startArray/endArray). The various advance/inspect
functions would then read this current state or update the state to the next
element/event. So instead of allocating and maintaining a queue of accessor
events, we just query what the unparse infoset walker is currently looking at.
Note that we can still maintain the Cursor API; it would just iterate over and
query the actual infoset instead of storing a queue of events--essentially the
infoset becomes its queue.
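The queue-free cursor idea can be sketched as below. This is an illustrative
model only: Node and InfosetCursor are hypothetical names, the real
InfosetWalker and Cursor API are not reproduced here, array events are omitted,
and the tree is assumed to be fully built before walking (the sketch does not
handle a next sibling that has not been prefetched yet).

```python
from typing import List, Optional, Tuple


class Node:
    """Infoset element that knows its parent and its index within it."""

    def __init__(self, name: str, parent: Optional["Node"] = None):
        self.name = name
        self.parent = parent
        self.children: List["Node"] = []
        self.index = 0
        if parent is not None:
            self.index = len(parent.children)
            parent.children.append(self)


class InfosetCursor:
    """Cursor whose only state is (current node, current event kind)."""

    def __init__(self, root: Node):
        self._node: Optional[Node] = root
        self._kind = "startElement"

    def inspect(self) -> Optional[Tuple[str, str]]:
        """Return the current event without moving."""
        if self._node is None:
            return None
        return (self._kind, self._node.name)

    def advance(self) -> Optional[Tuple[str, str]]:
        """Return the current event, then step the state to the next one."""
        current = self.inspect()
        if current is None:
            return None
        node = self._node
        if self._kind == "startElement":
            if node.children:
                self._node = node.children[0]    # descend; kind stays start
            else:
                self._kind = "endElement"        # leaf: its end comes next
        else:
            parent = node.parent
            if parent is None:
                self._node = None                # end of the root: finished
            elif node.index + 1 < len(parent.children):
                self._node = parent.children[node.index + 1]
                self._kind = "startElement"      # move on to the next sibling
            else:
                self._node = parent              # last child: end the parent
        return current
```

No event objects are queued; each call derives the event from the tree and two
fields of state, so the memory cost is the infoset itself plus a constant.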
With these changes, we can potentially avoid a large number of suspensions
without a significant change in memory usage, aside from the additionally
prefetched infoset.
Note that this approach does not remove suspensions entirely. For example, some
unparsers need to know the content length of a field. This requires the
relevant element to actually be unparsed, not just the element to exist in the
infoset, so this would still need a suspension. This also will not eliminate
suspensions if an unparser accesses a field beyond the prefetch limit. So all
the suspension logic must still exist as is; the goal is simply to minimize how
often suspensions are needed.