This memo is a contribution to an overall design note/discussion needed for 
true SAX-style event streaming for large data streams of small items.


The unparser doesn't "stream" as yet, because all suspensions are queued until 
the end of all unparsing.


For streaming to truly work, we must trim from the infoset those nodes that are 
already unparsed and no longer needed, and we must evaluate suspended 
expressions soon after the data they require becomes available.


While not written up in a design note, the prior design being considered was to 
enhance the infoset so that suspensions could be attached to the infoset node 
that is blocking them; they would then be re-activated only when that infoset 
node's state changes.


This, however, is a lot of complexity for possibly small gain.


An alternative design is to evaluate suspensions at the same times that we 
determine infoset nodes can be detached, i.e., trimmed because they are no 
longer needed.


To achieve streaming behavior when parsing or unparsing, one must trim infoset 
nodes no longer needed from the infoset.


A heuristic is that when you are done with array element N, you can delete 
array element N-1. The assumption is that expressions will not refer back to 
element N-2, but only to N-1. Similarly, forward-referencing expressions when 
unparsing will look forward at most from element N to N+1, not beyond to N+2.
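This heuristic amounts to keeping a small sliding window of live array elements. A minimal sketch (hypothetical names, not Daffodil's actual classes; real elements would be infoset nodes, not plain values):

```python
from collections import deque

def stream_array(elements, window=1):
    """Walk array elements keeping only a small look-behind window live.

    `window` is the maximum backward-reference distance assumed by the
    heuristic (1 means expressions look back at most to element N-1).
    Returns (still-live elements, trimmed elements) for illustration.
    """
    live = deque()  # elements still retained in the infoset
    dropped = []    # record of what was trimmed
    for elem in elements:
        live.append(elem)
        # Once we are done with element N, anything before N-window
        # can no longer be referenced and is trimmable.
        while len(live) > window + 1:
            dropped.append(live.popleft())
    return list(live), dropped
```

With window=1, walking five elements leaves only the last two live; everything earlier has been trimmed.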


If the suspensions are queued on the nearest enclosing array element's infoset 
node, then when that array index is advanced to position N+1, position N-1's 
suspensions can be evaluated, and it is very likely that they will complete. 
Those suspensions which do not complete would be moved upward to the next 
enclosing array element, and eventually to the root element, where they are 
evaluated after unparsing completes, as in the current (v2.2.0, i.e., 
2018-10-02) version.


After evaluating the expressions for position N-1, then (if desired) the 
infoset node can also be dropped.
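A sketch of this queue-on-the-array-node scheme, with hypothetical names; a suspension is modeled as a zero-arg callable returning True when it completes, and unfinished suspensions are promoted to the enclosing node rather than retried immediately:

```python
class ArrayNode:
    """Hypothetical infoset array node holding per-index suspension queues."""

    def __init__(self, parent=None):
        self.parent = parent      # next enclosing array node, or None at root
        self.suspensions = {}     # index -> list of pending suspensions
        self.index = 0            # current array position

    def queue(self, index, suspension):
        self.suspensions.setdefault(index, []).append(suspension)

    def advance(self):
        """Advance to the next index, then retry suspensions queued at N-1.

        Suspensions that are still blocked are pushed up to the enclosing
        array node (or kept here if this is the root).
        """
        retry_index = self.index - 1
        self.index += 1
        for s in self.suspensions.pop(retry_index, []):
            if not s():  # still blocked: promote to the enclosing node
                target = self.parent if self.parent is not None else self
                target.queue(retry_index, s)
```

After the retry pass for position N-1, that element's infoset node is also a candidate for dropping, as described above.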


This dropping/pruning of infoset nodes requires a new state for infoset node 
references indicating that the node was present, but has been dropped. This 
state would be used to issue a proper diagnostic if the element is subsequently 
referenced.
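The new state might look like the following sketch (hypothetical names): a dropped node keeps just enough identity to produce a meaningful diagnostic instead of a confusing "no such element" error.

```python
from enum import Enum, auto

class NodeState(Enum):
    """Hypothetical states for an infoset node reference."""
    PRESENT = auto()
    DROPPED = auto()   # node existed, but has been trimmed for streaming

class InfosetNode:
    def __init__(self, name, value):
        self.name = name
        self.value = value
        self.state = NodeState.PRESENT

    def drop(self):
        """Trim the node's contents but remember that it once existed."""
        self.value = None
        self.state = NodeState.DROPPED

    def get_value(self):
        if self.state is NodeState.DROPPED:
            # Proper diagnostic: the element existed but was trimmed.
            raise RuntimeError(
                "Element %s was present but has been dropped; an expression "
                "referenced it after it was trimmed." % self.name)
        return self.value
```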


This algorithm is conservative in that any suspensions that cannot be evaluated 
are pushed up, eventually all the way to the root element, where any deadlock 
then becomes visible. The root element's suspensions work exactly as they do 
now: all are evaluated repeatedly until either all complete, or the list stops 
changing and a deadlock exists.
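The root-level fixpoint behavior can be sketched as follows (again with suspensions modeled as zero-arg callables returning True on completion; this is illustrative, not Daffodil's actual code):

```python
def evaluate_root_suspensions(suspensions):
    """Retry suspensions until all complete or no progress is made.

    Returns the list of suspensions that remain blocked; a non-empty
    result means the list stopped changing, i.e., a deadlock exists.
    """
    pending = list(suspensions)
    while pending:
        still_blocked = [s for s in pending if not s()]
        if len(still_blocked) == len(pending):
            return still_blocked  # no progress this pass: deadlock
        pending = still_blocked
    return []
```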


A variation (for parsing) is to simply mark the infoset nodes that can be 
dropped as "droppable", but leave the actual dropping to the consuming 
application. That way the consuming application can discard what it doesn't 
need as soon as possible, while keeping around as much of the infoset as it 
wants. A callback-style API is also possible, where nodes are dropped by a call 
to a method of an API-registered callback object. That method can then do 
whatever the application requires.
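A sketch of such a callback-style API (hypothetical class and method names, not an actual Daffodil interface): the parser signals that a node is droppable, and the registered callback decides its fate.

```python
class DropNotifier:
    """Hypothetical API object: the application registers a callback that
    is invoked whenever the parser decides a node is droppable."""

    def __init__(self, on_droppable):
        # on_droppable(node) returns True to discard the node immediately,
        # or False to keep it in the retained infoset.
        self.on_droppable = on_droppable
        self.retained = []

    def node_droppable(self, node):
        # Called by the parser when `node` will no longer be referenced.
        if not self.on_droppable(node):
            self.retained.append(node)
```

For example, an application that only cares about non-temporary nodes could register a callback that discards everything whose name starts with "tmp".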


Also: On LIFO/FIFO queues.


The queues used for suspensions are LIFO, which is perhaps not as helpful as a 
FIFO order would be.


What we want is a LIFO/FIFO combination. When we pull items off, we want FIFO 
order. When an item must be re-inserted because it is still blocked, we want to 
add it to the "end", so that it will not be reconsidered until every other 
suspension has been attempted. This combination enables short-lived suspensions 
to typically be evaluated once, rather than repeatedly attempted until they 
eventually succeed.
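This pull-from-the-front, requeue-at-the-back behavior can be sketched with a deque (illustrative only; suspensions are again zero-arg callables returning True on completion):

```python
from collections import deque

def drain_suspensions(suspensions):
    """Evaluate suspensions in FIFO order, re-inserting blocked ones at the
    back so each is attempted once per round rather than spun on.

    Returns (number of evaluation attempts, suspensions still blocked).
    """
    q = deque(suspensions)
    attempts = 0
    stall = 0  # consecutive failures; a full round of failures = no progress
    while q and stall < len(q):
        s = q.popleft()           # FIFO: take the oldest suspension
        attempts += 1
        if s():
            stall = 0
        else:
            q.append(s)           # back of the queue, not the front
            stall += 1
    return attempts, list(q)
```

With this ordering a suspension that is ready when first reached costs exactly one attempt, whereas a pure LIFO queue could retry a still-blocked suspension immediately and repeatedly.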

