stevedlawrence commented on code in PR #1365:
URL: https://github.com/apache/daffodil/pull/1365#discussion_r1830791041
##########
daffodil-propgen/src/main/resources/org/apache/daffodil/xsd/dafext.xsd:
##########
@@ -579,6 +579,15 @@
</xs:documentation>
</xs:annotation>
</xs:element>
+ <xs:element name="testingParseUnparseAPIMode"
type="daf:TunableTestingParseUnparseAPIMode" default="limited" minOccurs="0">
Review Comment:
I'm wondering if a tunable isn't the right thing for this?
This will apimode flag is really only going to be used for regression
testing, and when we do regression testing we want it enabled for all tests. We
don't really want to have to modify config files or something for all tests to
enable this, which is how tunables are generally set.
I'm thinking we either want this to be a Java system property (e.g.
`daffodil.tdml.apimode`) or an environment variable (e.g.
`DAFFODIL_TDML_APIMODE`). And I think an environment variable has advantages
over a Java system property. For one, using system properties is different
depending on how you spawn the JVM. For example, the sbt and daffodil CLI
specify system properties differently:
```bash
sbt -Ddaffodil.tdml.apimode=full test
DAFFODIL_JAVA_OPTS="-Ddaffodil.tdml.apimode=full" daffodil test foo.tdml
```
So with sbt you can use `-D` directly, but with the CLI you have to specify
it in DAFFODIL_JAVA_OPTS. I also am not sure if SBT propagates system
properties if it forks. So if it forks a process (like some of our integration
tests do), I imagine it won't get the system property and the forked processes
won't run with full apimode.
But if we go with an environment variable instead, it's the same for both:
```bash
export DAFFODIL_TDML_APIMODE=full
sbt test
daffodil test foo.tdml
```
This is the same regardless if we are testing via sbt or via the daffodil
CLI. And if something else ever runs the TDML runner (e.g. a test harness that
directly calls the TDML runner), it's also the same. I also believe if sbt or
an integration test forks, it copies the existing environment variables, so
forked processes should also enable full api mode.
Also, for CI we can just add something like this to our GitHub actions
config:
```yaml
env:
DAFFODIL_TDML_APIMODE=full
```
And all CI will run with all APIs, which would be good for regression
testing.
##########
daffodil-tdml-processor/src/main/scala/org/apache/daffodil/processor/tdml/DaffodilTDMLDFDLProcessor.scala:
##########
@@ -252,21 +250,47 @@ class DaffodilTDMLDFDLProcessor private (private var dp:
DataProcessor)
infosetXML: scala.xml.Node,
outStream: java.io.OutputStream
): TDMLUnparseResult = {
- val bos = new ByteArrayOutputStream()
- val osw = new OutputStreamWriter(bos, StandardCharsets.UTF_8)
- scala.xml.XML.write(osw, infosetXML, "UTF-8", xmlDecl = true, null)
- osw.flush()
- osw.close()
- val saxInstream = new ByteArrayInputStream(bos.toByteArray)
- doUnparseWithBothApis(inputter, saxInstream, outStream)
+ if (dp.tunables.testingParseUnparseAPIMode == Full) {
+ val bos = new ByteArrayOutputStream()
+ val osw = new OutputStreamWriter(bos, StandardCharsets.UTF_8)
+ scala.xml.XML.write(osw, infosetXML, "UTF-8", xmlDecl = true, null)
+ osw.flush()
+ osw.close()
+ val saxInstream = new ByteArrayInputStream(bos.toByteArray)
+ doUnparseWithBothApis(inputter, saxInstream, outStream)
+ } else {
+ doUnparseWithNonSaxApi(inputter, outStream)
+ }
+ }
+
+ def doParseWithNonSaxAPI(
Review Comment:
Feels like there's some duplicate code in these doParseWith*API functions
since they both implement the non-sax API. Can they be combined into one
function, e.g.
```scala
private def doParse(...) {
// always do stuff with non-sax API
if (apiMode = full) {
// do sax API
// verify SAX results match non-sax results
}
}
```
Same with the unparse functions?
##########
daffodil-tdml-processor/src/main/scala/org/apache/daffodil/processor/tdml/DaffodilTDMLDFDLProcessor.scala:
##########
@@ -206,17 +207,14 @@ class DaffodilTDMLDFDLProcessor private (private var dp:
DataProcessor)
): DaffodilTDMLDFDLProcessor =
copy(dp = dp.withExternalVariables(externalVarBindings))
- def parse(uri: java.net.URI, lengthLimitInBits: Long): TDMLParseResult = {
- val url = uri.toURL
- val dpInputStream = url.openStream()
- val saxInputStream = url.openStream()
- doParseWithBothApis(dpInputStream, saxInputStream, lengthLimitInBits)
- }
-
def parse(arr: Array[Byte], lengthLimitInBits: Long): TDMLParseResult = {
val dpInputStream = new ByteArrayInputStream(arr)
- val saxInputStream = new ByteArrayInputStream(arr)
- doParseWithBothApis(dpInputStream, saxInputStream, lengthLimitInBits)
+ if (dp.tunables.testingParseUnparseAPIMode == Full) {
+ val saxInputStream = new ByteArrayInputStream(arr)
+ doParseWithBothApis(dpInputStream, saxInputStream, lengthLimitInBits)
+ } else {
+ doParseWithNonSaxAPI(dpInputStream, lengthLimitInBits)
+ }
}
override def parse(is: java.io.InputStream, lengthLimitInBits: Long):
TDMLParseResult = {
Review Comment:
Below this we call `IOUtils.toByteArray`, which reads the input stream into
a byte array. But then it calls the above parse function which converts the
byte array back to an input stream. We need to do that when parsing with both
APIs since each API needs it's own input stream, but if we are only using the
non-sax API we don't really need to do that.
I'm wondering if it would be possible and not too messy to only do the byte
array stuff iwe are are doing both APIs. Otherwise we just pass in the original
input stream to doParseWithNonSaxAPI? Avoids unnecessarily duplicating the
input stream.
##########
daffodil-propgen/src/main/resources/org/apache/daffodil/xsd/dafext.xsd:
##########
@@ -749,6 +758,13 @@
</xs:list>
</xs:simpleType>
+ <xs:simpleType name="TunableTestingParseUnparseAPIMode">
+ <xs:restriction base="xs:token">
+ <xs:enumeration value="limited" />
+ <xs:enumeration value="full" />
Review Comment:
I wonder if we should consider something other than limited/full. I think we
determined those probably weren't the best names for our different kinds of
validation, so we should learn from that.
Maybe instead of "full", we do "all"? And maybe instead of "limited" we do
"scala" (implying scala infoset?). If we ever wanted, this could then be
extended to be a list of infosets/api's to use. For example, we could in theory
support something like `DAFFODIL_TDML_APIMODE="scala json sax"`, which would
only enable those three kinds of infosets instead of all of them. I'm not sure
the extra complexity is wroth the flexibility for now, but it leaves the option
open. scala/all is probably sufficient for now.
Also, makes me wonder if the env variable wants to be something like
`DAFFODIL_TDML_INFOSETS` or `DAFFODIL_TDML_API_INFOSETS`? Not sure I like
those, but something that it expects an infoset type might be an improvement?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]