Re: [PR] Add testingParseUnparseAPIMode Tunable [daffodil]

via GitHub Wed, 06 Nov 2024 09:05:06 -0800


stevedlawrence commented on code in PR #1365:
URL: https://github.com/apache/daffodil/pull/1365#discussion_r1830791041



##########
daffodil-propgen/src/main/resources/org/apache/daffodil/xsd/dafext.xsd:
##########
@@ -579,6 +579,15 @@
             </xs:documentation>
           </xs:annotation>
         </xs:element>
+        <xs:element name="testingParseUnparseAPIMode" 
type="daf:TunableTestingParseUnparseAPIMode" default="limited"  minOccurs="0">

Review Comment:
   I'm wondering if a tunable isn't the right thing for this?
   
   This will apimode flag is really only going to be used for regression 
testing, and when we do regression testing we want it enabled for all tests. We 
don't really want to have to modify config files or something for all tests to 
enable this, which is how tunables are generally set.
   
   I'm thinking we either want this to be a Java system property (e.g. 
`daffodil.tdml.apimode`) or an environment variable (e.g. 
`DAFFODIL_TDML_APIMODE`). And I think an environment variable has advantages 
over a Java system property. For one, using system properties is different 
depending on how you spawn the JVM. For example, the sbt and daffodil CLI 
specify system properties differently:
   
   ```bash
   sbt -Ddaffodil.tdml.apimode=full test 
   DAFFODIL_JAVA_OPTS="-Ddaffodil.tdml.apimode=full" daffodil test foo.tdml
   ```
   
   So with sbt you can use `-D` directly, but with the CLI you have to specify 
it in DAFFODIL_JAVA_OPTS. I also am not sure if SBT propagates system 
properties if it forks. So if it forks a process (like some of our integration 
tests do), I imagine it won't get the system property and the forked processes 
won't run with full apimode.
   
   But if we go with an environment variable instead, it's the same for both:
   
   ```bash
   export DAFFODIL_TDML_APIMODE=full
   sbt test
   daffodil test foo.tdml
   ```
   
   This is the same regardless if we are testing via sbt or via the daffodil 
CLI. And if something else ever runs the TDML runner (e.g. a test harness that 
directly calls the TDML runner), it's also the same.  I also believe if sbt or 
an integration test forks, it copies the existing environment variables, so 
forked processes should also enable full api mode.
   
   Also, for CI we can just add something like this to our GitHub actions 
config:
   ```yaml
   env:
     DAFFODIL_TDML_APIMODE=full
   ```
   And all CI will run with all APIs, which would be good for regression 
testing.
   
   



##########
daffodil-tdml-processor/src/main/scala/org/apache/daffodil/processor/tdml/DaffodilTDMLDFDLProcessor.scala:
##########
@@ -252,21 +250,47 @@ class DaffodilTDMLDFDLProcessor private (private var dp: 
DataProcessor)
     infosetXML: scala.xml.Node,
     outStream: java.io.OutputStream
   ): TDMLUnparseResult = {
-    val bos = new ByteArrayOutputStream()
-    val osw = new OutputStreamWriter(bos, StandardCharsets.UTF_8)
-    scala.xml.XML.write(osw, infosetXML, "UTF-8", xmlDecl = true, null)
-    osw.flush()
-    osw.close()
-    val saxInstream = new ByteArrayInputStream(bos.toByteArray)
-    doUnparseWithBothApis(inputter, saxInstream, outStream)
+    if (dp.tunables.testingParseUnparseAPIMode == Full) {
+      val bos = new ByteArrayOutputStream()
+      val osw = new OutputStreamWriter(bos, StandardCharsets.UTF_8)
+      scala.xml.XML.write(osw, infosetXML, "UTF-8", xmlDecl = true, null)
+      osw.flush()
+      osw.close()
+      val saxInstream = new ByteArrayInputStream(bos.toByteArray)
+      doUnparseWithBothApis(inputter, saxInstream, outStream)
+    } else {
+      doUnparseWithNonSaxApi(inputter, outStream)
+    }
+  }
+
+  def doParseWithNonSaxAPI(

Review Comment:
   Feels like there's some duplicate code in these doParseWith*API functions 
since they both implement the non-sax API. Can they be combined into one 
function, e.g.
   
   ```scala
   private def doParse(...) {
     // always do stuff with non-sax API
   
     if (apiMode = full) {
       // do sax API
       // verify SAX results match non-sax results
     }
   }
   ```
   Same with the unparse functions?



##########
daffodil-tdml-processor/src/main/scala/org/apache/daffodil/processor/tdml/DaffodilTDMLDFDLProcessor.scala:
##########
@@ -206,17 +207,14 @@ class DaffodilTDMLDFDLProcessor private (private var dp: 
DataProcessor)
   ): DaffodilTDMLDFDLProcessor =
     copy(dp = dp.withExternalVariables(externalVarBindings))
 
-  def parse(uri: java.net.URI, lengthLimitInBits: Long): TDMLParseResult = {
-    val url = uri.toURL
-    val dpInputStream = url.openStream()
-    val saxInputStream = url.openStream()
-    doParseWithBothApis(dpInputStream, saxInputStream, lengthLimitInBits)
-  }
-
   def parse(arr: Array[Byte], lengthLimitInBits: Long): TDMLParseResult = {
     val dpInputStream = new ByteArrayInputStream(arr)
-    val saxInputStream = new ByteArrayInputStream(arr)
-    doParseWithBothApis(dpInputStream, saxInputStream, lengthLimitInBits)
+    if (dp.tunables.testingParseUnparseAPIMode == Full) {
+      val saxInputStream = new ByteArrayInputStream(arr)
+      doParseWithBothApis(dpInputStream, saxInputStream, lengthLimitInBits)
+    } else {
+      doParseWithNonSaxAPI(dpInputStream, lengthLimitInBits)
+    }
   }
 
   override def parse(is: java.io.InputStream, lengthLimitInBits: Long): 
TDMLParseResult = {

Review Comment:
   Below this we call `IOUtils.toByteArray`, which reads the input stream into 
a byte array. But then it calls the above parse function which converts the 
byte array back to an input stream. We need to do that when parsing with both 
APIs since each API needs it's own input stream, but if we are only using the 
non-sax API we don't really need to do that.
   
   I'm wondering if it would be possible and not too messy to only do the byte 
array stuff iwe are are doing both APIs. Otherwise we just pass in the original 
input stream to doParseWithNonSaxAPI? Avoids unnecessarily duplicating the 
input stream. 



##########
daffodil-propgen/src/main/resources/org/apache/daffodil/xsd/dafext.xsd:
##########
@@ -749,6 +758,13 @@
     </xs:list>
   </xs:simpleType>
 
+  <xs:simpleType name="TunableTestingParseUnparseAPIMode">
+    <xs:restriction base="xs:token">
+      <xs:enumeration value="limited" />
+      <xs:enumeration value="full" />

Review Comment:
   I wonder if we should consider something other than limited/full. I think we 
determined those probably weren't the best names for our different kinds of 
validation, so we should learn from that.
   
   Maybe instead of "full", we do "all"? And maybe instead of "limited" we do 
"scala" (implying scala infoset?). If we ever wanted, this could then be 
extended to be a list of infosets/api's to use. For example, we could in theory 
support something like `DAFFODIL_TDML_APIMODE="scala json sax"`, which would 
only enable those three kinds of infosets instead of all of them. I'm not sure 
the extra complexity is wroth the flexibility for now, but it leaves the option 
open. scala/all is probably sufficient for now.
   
   Also, makes me wonder if the env variable wants to be something like 
`DAFFODIL_TDML_INFOSETS` or `DAFFODIL_TDML_API_INFOSETS`? Not sure I like 
those, but something that it expects an infoset type might be an improvement?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] Add testingParseUnparseAPIMode Tunable [daffodil]

Reply via email to