[GitHub] [daffodil] stevedlawrence commented on a diff in pull request #908: Refactor and document SAX unparse implementation

GitBox Thu, 12 Jan 2023 08:57:57 -0800


stevedlawrence commented on code in PR #908:
URL: https://github.com/apache/daffodil/pull/908#discussion_r1068371163



##########
daffodil-runtime1/src/main/scala/org/apache/daffodil/processors/DaffodilUnparseContentHandler.scala:
##########
@@ -28,156 +28,224 @@ import org.xml.sax.Locator
 import org.apache.daffodil.api.DFDL
 import org.apache.daffodil.api.DFDL.DaffodilUnhandledSAXException
 import org.apache.daffodil.api.DFDL.DaffodilUnparseErrorSAXException
-import org.apache.daffodil.api.DFDL.SAXInfosetEvent
 import org.apache.daffodil.exceptions.Assert
 import org.apache.daffodil.infoset.InfosetInputterEventType.EndDocument
 import org.apache.daffodil.infoset.InfosetInputterEventType.EndElement
 import org.apache.daffodil.infoset.InfosetInputterEventType.StartDocument
 import org.apache.daffodil.infoset.InfosetInputterEventType.StartElement
+import org.apache.daffodil.infoset.SAXInfosetEvent
 import org.apache.daffodil.infoset.SAXInfosetInputter
 import org.apache.daffodil.util.MStackOf
+import org.apache.daffodil.util.MainCoroutine
 import org.apache.daffodil.util.Maybe
 import org.apache.daffodil.util.Maybe.Nope
 import org.apache.daffodil.util.Maybe.One
 import org.apache.daffodil.util.Misc
 
 /**
- * DaffodilUnparseContentHandler produces SAXInfosetEvent objects for the SAXInfosetInputter to
- * consume and convert to events that the Dataprocessor unparse can use. The SAXInfosetEvent object
- * is built from information that is passed to the ContentHandler from an XMLReader parser. In
- * order to receive the uri and prefix information from the XMLReader, the XMLReader must have
- * support for XML Namespaces
+ * Handle and unparse XMLReader SAX events using a provided DataProcessor and
+ * OutputChannel
  *
- * This class, together with the SAXInfosetInputter, uses coroutines to ensure that a batch of events
- * (based on the tunable saxUnparseEventBatchSize) can be passed from the former to the latter.
- * The following is the general process:
+ * Note: XMLReaders using this as their ContentHandler must have support for XML
+ * namespaces so that we are provided the namespace URI and prefix information that Daffodil
+ * requires to unparse.
  *
- * - an external call is made to parse an XML Document
- * - this class receives a StartDocument call, which is the first SAXInfosetEvent that should be
- * sent to the SAXInfosetInputter. That event is put onto an array of SAXInfosetEvents of size the
- * saxUnparseEventBatchSize tunable. Once the array is full, it is put on the inputter's queue,
- * this thread is paused, and that inputter's thread is run
- * - when the SAXInfosetInputter is done processing that batch and is ready for a new batch, it
- * sends a 1 element array with the last completed event via the coroutine system, which loads it on
- * the contentHandler's queue, which restarts this thread and pauses that one. In the expected case,
- * the single element array will contain no new information until the unparse complete. In the case of
- * an unexpected error though, it will contain error information
- * - this process continues until the EndDocument SAXInfosetEvent is loaded into the batch.
- * Once that SAXInfosetEvent is processed by the SAXInfosetInputter, it signals the end of batched
- * events coming from the contentHandler. This ends the unparseProcess and returns the event with
- * the unparseResult and/or any error
- * information
+ * The SAX ContentHandler API is push-based, but the Daffodil InfosetInputter unparse
+ * API is pull-based, so these two APIs are at odds with one another. To link the
+ * two, we create two classes that implement a coroutine-like API to communicate and
+ * ensure that the push and pull sides of the two APIs never run at the same time
+ * (see Coroutine.scala for implementation details). The main coroutine or "event
+ * queuer" is this DaffodilUnparseContentHandler and runs on the same thread as an
+ * XMLReader to handle and batch SAX events. The worker coroutine or "event puller" is
+ * an instance of the SAXInfosetInputter which calls the actual unparse() function to
+ * query these batched SAX events and unparse data.
  *
- * @param dp dataprocessor object that will be used to call the parse
- * @param output outputChannel of choice where the unparsed data is stored
+ * Below is a description of how these two classes are used and how they communicate to unparse
+ * using the SAX XMLReader API:
+ *
+ * 1. A DaffodilUnparseContentHandler instance is created, which initializes a
+ *    SAXInfosetInputter instance.
+ *
+ * 2. An XMLReader instance is created and configured to use the
+ *    DaffodilUnparseContentHandler to handle its SAX events.
+ *
+ * 3. The DaffodilUnparseContentHandler handles events from the XMLReader, gathers
+ *    the necessary information from those events, and fills out an array of
+ *    SAXInfosetEvent objects, called a "batch".
+ *
+ * 4. When a full batch has been gathered, or an endDocument() event is handled, the
+ *    DaffodilUnparseContentHandler calls resume() to pause execution and start the
+ *    SAXInfosetInputter coroutine, sending it the batch of events.
+ *
+ * 5. Since this is the first time the SAXInfosetInputter has been given control, its
+ *    coroutine Thread is started and its run() function called. This calls
+ *    waitResume() to receive the first batch of events. It then calls the
+ *    DataProcessor unparse() function to begin the unparse.
+ *
+ * 6. The SAXInfosetInputter functions are called by the unparse, which reads the
+ *    batched events and provides the necessary information to unparse the infoset.
+ *    Once the SAXInfosetInputter has unparsed all batched events and needs a new
+ *    event in the next() function, it calls resume() to pause execution and resume
+ *    the DaffodilUnparseContentHandler coroutine, sending it Nope to signify it
+ *    needs more events.
+ *
+ * 7. The DaffodilUnparseContentHandler resumes and again continues to handle SAX
+ *    events and fill out the batched events array until it is full or endDocument()
+ *    is handled, at which point it calls resume() to pause and send the new batch of
+ *    events to the SAXInfosetInputter where it resumes control.
+ *
+ * 8. Steps 6 and 7 repeat until the SAXInfosetInputter signals back to the
+ *    DaffodilUnparseContentHandler that it is complete by calling resumeFinal() and
+ *    sending a One object containing either an UnparseResult or an Exception. At
+ *    this point, the SAXInfosetInputter is complete and provides control back to the
+ *    DaffodilUnparseContentHandler--the SAXInfosetInputter coroutine is done.
+ *
+ * 9. The DaffodilUnparseContentHandler resumes, receives and examines the result
+ *    from the SAXInfosetInputter, and either makes the UnparseResult available to
+ *    the XMLReader or throws a SAXException if there was an error.
+ *
+ * @param dp DataProcessor object that will be used to start the unparse
+ * @param output OutputChannel where the unparsed data is written
  */
 class DaffodilUnparseContentHandler(
   dp: DFDL.DataProcessor,
   output: DFDL.Output)
-  extends DFDL.DaffodilUnparseContentHandler {
-  private lazy val inputter = new SAXInfosetInputter(this, dp, output)
-  private var unparseResult: DFDL.UnparseResult = _
+  extends MainCoroutine[Maybe[Either[Exception, DFDL.UnparseResult]]]
+  with DFDL.DaffodilUnparseContentHandler {

Review Comment:
   Yeah, I didn't know this either until I started digging more deeply into this SAX and coroutine stuff. I'll make sure it's clear somewhere (maybe the Coroutine.scala documentation) that the type parameter is the type the coroutine expects to receive from its peer when it calls resume, and that the two coroutines' type parameters are allowed to be different.
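
   To make the type-parameter point concrete, here is a rough, self-contained sketch. It is not the code in Coroutine.scala: `SketchCoroutine`, `CoroutineTypesSketch`, `run`, and the string "events" are hypothetical names made up for illustration, the method signatures are simplified guesses, and `Option` stands in for `Maybe`. The only point it tries to show is that each side's type parameter names what that side receives, so the main side (typed by the result) and the worker side (typed by the batch) can differ.

```scala
import java.util.concurrent.ArrayBlockingQueue

// Hypothetical, simplified stand-in for a coroutine trait (NOT Daffodil's API).
// T is the type THIS coroutine expects to receive from its peer.
trait SketchCoroutine[T] {
  // One-slot inbox: the peer puts a value here to hand control to us.
  val inbox = new ArrayBlockingQueue[T](1)

  // Send msg (typed by the peer's parameter S), then block until the peer replies with a T.
  final def resume[S](peer: SketchCoroutine[S], msg: S): T = {
    peer.inbox.put(msg)
    inbox.take()
  }

  // Block until the peer hands us our first value.
  final def waitResume(): T = inbox.take()

  // Hand a last value to the peer without waiting for a reply.
  final def resumeFinal[S](peer: SketchCoroutine[S], msg: S): Unit =
    peer.inbox.put(msg)
}

object CoroutineTypesSketch {
  // What the main ("event queuer") side receives: None means "send more events".
  type Result = Option[Either[Exception, String]]
  // What the worker ("event puller") side receives: a batch of events.
  type Batch = Vector[String]

  def run(events: Vector[String], batchSize: Int): Either[Exception, String] = {
    // Assumption of this sketch: the stream always ends with an "endDocument" marker.
    require(events.lastOption.contains("endDocument"), "events must end with endDocument")

    val main = new SketchCoroutine[Result] {}
    val worker = new SketchCoroutine[Batch] {}

    val workerThread = new Thread(() => {
      val seen = Vector.newBuilder[String]
      var batch = worker.waitResume()
      while (!batch.contains("endDocument")) {
        seen ++= batch
        // Sends a Result to main, but receives a Batch back.
        batch = worker.resume[Result](main, None)
      }
      seen ++= batch
      // Final handoff: the worker is done and does not wait for a reply.
      worker.resumeFinal[Result](main, Some(Right(seen.result().mkString(","))))
    })
    workerThread.start()

    var result: Result = None
    val batches = events.grouped(batchSize)
    while (result.isEmpty && batches.hasNext) {
      // Sends a Batch to the worker, but receives a Result back.
      result = main.resume[Batch](worker, batches.next())
    }
    workerThread.join()
    result.getOrElse(Left(new IllegalStateException("worker never finished")))
  }
}
```

   In this sketch only one side runs at a time, because each `resume` blocks on its own inbox right after filling the peer's. Whether the real trait takes the peer as an argument like this is something the Coroutine.scala documentation should spell out.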


