stevedlawrence commented on a change in pull request #394: URL: https://github.com/apache/incubator-daffodil/pull/394#discussion_r455040538
########## File path: daffodil-core/src/test/scala/org/apache/daffodil/dsom/walker/TestWalker.scala ##########
@@ -0,0 +1,335 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.daffodil.dsom.walker
+
+import scala.collection.mutable.ListBuffer
+
+/**
+ * This represents all possible types a Field could be; the View Mixin is pattern matched to one
+ * of these in TestNode
+ */
+object FieldType extends Enumeration {
+  type FieldType = Value
+  val RecordType, ChoiceType, ArrayType, DateTimeType, DateType, TimeType, BooleanType, ByteType,
+    ShortType, IntType, LongType, BigIntType, FloatType, DoubleType, StringType = Value
+}
+
+import org.apache.daffodil.dsom.walker.FieldType._
+
+/**
+ * Represents the DataType of a TestField; note that all of the Simple types manifest as
+ * a SimpleDataType with a single FieldType containing the FieldType enum for the simple.
+ * @param fieldType the particular FieldType Enum that this DataType backs.
+ */
+class DataType(val fieldType: FieldType.Value)
+case class RecordDataType(var childSchema: TestSchema) extends DataType(RecordType)
+case class ChoiceDataType(var possibleSubTypes: List[DataType]) extends DataType(ChoiceType)
+case class ArrayDataType(var elementType: DataType) extends DataType(ArrayType)
+case class SimpleDataType(simpleType: FieldType) extends DataType(simpleType)
+
+/**
+ * A field within a TestSchema
+ * @param fieldName name of the field
+ * @param dataType dataType of the field
+ */
+class TestField(val fieldName: String, val dataType: DataType)
+
+/**
+ * Base Schema object for this test class. The only part that is really used is the field list;
+ * there are some helper methods based around the list
+ * @param name the name for this TestSchema
+ * @param namespace the target namespace of this TestSchema
+ * @param fields the list of fields that define this TestSchema
+ */
+class TestSchema(val name: String, val namespace: String, val fields: List[TestField]) {
+  def getField(name: String): Option[TestField] =
+    fields.filter(field => field.fieldName == name) match {
+      case List(result) => Some(result)
+      case _ => None
+    }
+  val fieldCount: Int = fields.size
+  val fieldNames: List[String] = fields.map(field => field.fieldName)
+}
+
+/**
+ * @param fieldName the name for the RecordField
+ * @param dataType the DataType of this RecordField
+ */
+class OptionalTestField(fieldName: String, dataType: DataType)
+  extends TestField(fieldName, dataType) {
+  // this is the easiest way to construct it; essentially a "copy constructor" except to denote
+  // that we want to treat it as this special optional subclass.
+  def this(recordField: TestField) = this(recordField.fieldName, recordField.dataType)
+}
+
+/**
+ * Concrete implementation of the AbstractDSOMWalker abstract class.
+ * This class produces a TestSchema that is intended to match the original DFDL file.
+ *
+ * The TestSchema is built in 3 primary stages:
+ *   1) A tree of TestNodes is created as the DFDL file is walked; this walk is performed
+ *      through the various event handlers defined in the parent abstract class.
+ *   2) The tree of TestNodes undergoes some post-processing, mainly to remove redundant
+ *      Record wrappers.
+ *   3) The tree of TestNodes is converted into a TestSchema; it is walked recursively
+ *      within this class.
+ */
+class TestWalker extends AbstractDSOMWalker {
+
+  // this is the critical data structure for managing the temporary TestNodes that are created
+  // when the Schema is initially walked. This will then be converted to the actual TestSchema.
+  private var objectStack: List[TestNode] = List()
+
+  // this will contain the final TestSchema after step 3 as described above is complete.
+  private var result: Option[TestSchema] = None
+  def getResult: TestSchema = result.orNull
+  def stringRep: String = if (result.isDefined) result.get.toString else ""
+
+  /**
+   * Every event handler for onXBegin is routed to this helper method. Individual properties of
+   * a given element are not actually handled here; these properties are extracted when a
+   * TestNode is instantiated.
+   * @param dsomElement the particular element that was encountered. Could be essentially any Term element.
+   */
+  private def onNodeBegin(dsomElement: TermView): Unit = {
+    val newNode: TestNode = new TestNode(dsomElement)
+    // we need to add the new node as a new child of whatever is currently at the top of the stack
+    objectStack.head.addChild(newNode)
+    // however, we also add the node to the stack itself! We need to be able to add children to it
+    // if it is, say, another record or array.
+    objectStack = newNode +: objectStack
+  }
+
+  /**
+   * Every event handler for onXEnd is routed to this helper method. We don't need to do anything other
+   * than remove the top element of the stack; this should be the current element.
+   */
+  private def onNodeEnd(): Unit = objectStack = objectStack.tail
+
+  override def onTermBegin(termElement: TermView): Unit = termElement match {
+    case _: SequenceView | _: ChoiceView | _: ElementBaseView => onNodeBegin(termElement)
+    case _ =>
+  }
+
+  override def onTermEnd(termElement: TermView): Unit = termElement match {
+    case _: SequenceView | _: ChoiceView | _: ElementBaseView => onNodeEnd()
+    case _ =>
+  }
+
+  override def onTypeBegin(typeElement: TypeView): Unit = {}
+
+  override def onTypeEnd(typeElement: TypeView): Unit = {}
+
+  override def onWalkBegin(root: RootView): Unit = objectStack = List(new TestNode(root))
+
+  override def onWalkEnd(root: RootView): Unit = {
+    // After the walk is over, we perform postProcessing and then convert the TestNode tree
+    // into a TestSchema. Also, if we are in dev. mode, we print out the TestNode tree
+    if (!RecordWalker.PRODUCTION_MODE) println(objectStack.head)
+    postProcessing()
+    if (!RecordWalker.PRODUCTION_MODE) println(objectStack.head)
+    result = Some(getRecordSchema(objectStack.head))
+  }
+
+  /**
+   * Determines if a Record is "extra"; that is, if it should be replaced with
+   * its list of children SchemaNodes within whatever parent TestNode it's a part of
+   * @param childNode the node to be considered
+   * @return true if this node is a Record and has no name, false otherwise
+   */
+  private def isExtraRecord(parentNode: TestNode, childNode: TestNode): Boolean = {
+    // any no-name nodes with no children are immediately removed
+    (childNode.name.isEmpty && childNode.children.isEmpty) || {
+      parentNode.fieldType match {
+        case RecordType | ArrayType =>
+          childNode.fieldType match {
+            // This removes extra wrapper records around children of records or arrays,
+            // usually used to remove things like the DFDL complexType, simpleType elements
+            case RecordType => childNode.name.isEmpty
+            case _ => false
+          }
+        // Currently, all double choices are removed. This was mainly done to make GroupRefs work
+        // for the JPEG Schema, but may not be the correct approach for all cases.
+        case ChoiceType =>
+          childNode.fieldType match {
+            case ChoiceType => true
+            case _ => false
+          }
+        case _ => false
+      }
+    }
+  }
+
+  /**
+   * Recursively replace any SchemaNodes that are of type record and do not have
+   * a name attribute with their children. These usually represent unnecessary wrapper nodes or empty
+   * records with no elements.
+   *
+   * Given a TestNode, we copy over all the children to a list, and initialize another empty list.
+   * Then, for every Node in the first list, if it is considered "extra",
+   * then its *children* are added to the second list.
+   * Otherwise, the Node itself is added.
+   *
+   * Then, we let the first list equal the second and we repeat this process
+   * until none of the children are "extra".
+   * This is necessary because sometimes we can have towers of extra Nodes that would
+   * never get resolved if we just took care of 1 or 2 layers; all must be dealt with at once.
+   *
+   * Then, the current TestNode's children list is replaced with this filtered one,
+   * and this method is called on all the items of that list.
+   * @param schemaNode the current Node undergoing the algorithm described above
+   */
+  private def removeExtraRecords(schemaNode: TestNode): Unit = {
+    schemaNode.fieldType match {
+      // Only SchemaNodes that are these 3 types should be able to have child nodes.
+      case RecordType | ChoiceType | ArrayType =>
+        if (schemaNode.children.exists(child => isExtraRecord(schemaNode, child))) {
+          var currChildren: ListBuffer[TestNode] = new ListBuffer[TestNode]()
+          for (child <- schemaNode.children) currChildren += child
+          while (currChildren.exists(child => isExtraRecord(schemaNode, child))) {
+            val toCopy: ListBuffer[TestNode] = new ListBuffer[TestNode]()
+            for (child <- currChildren) {
+              if (isExtraRecord(schemaNode, child)) {
+                for (subChild <- child.children) toCopy += subChild
+              } else toCopy += child
+            }
+            currChildren = toCopy
+          }
+          schemaNode.children = currChildren.toList
+        }
+      case _ =>
+    }
+    // call this helper method on each of this Node's children. There is no infinite recursion because
+    // eventually a sub-child will not have any more children. This of course assumes that we are dealing
+    // with a rooted tree (there shouldn't be any cycles), which should be the case.
+    for (child <- schemaNode.children) removeExtraRecords(child)
+  }
+
+  /**
+   * Perform postProcessing; this happens *after* the TestNode tree is created but *before* that tree
+   * gets converted to a TestSchema.
+   */
+  private def postProcessing(): Unit = {
+    removeExtraRecords(objectStack.head)
+  }
+
+  private def getSimpleType(schemaNode: TestNode): FieldType = {
+    if (!schemaNode.isSimple) throw new IllegalArgumentException(s"Node $schemaNode is not a simple type!")
+    schemaNode.fieldType
+  }
+
+  private def nodeToField(schemaNode: TestNode): TestField = {
+    // by default, if this node doesn't have a name, its data type is used as the field name.
+    // This should only ever be the case for anonymous choices.
+    val recordField: TestField = new TestField(
+      schemaNode.name.getOrElse(schemaNode.fieldType.toString),
+      schemaNodeToDataType(schemaNode)
+    )
+    if (schemaNode.isOptional) new OptionalTestField(recordField) else recordField
+  }
+
+  private def newRecordSchema(schemaNode: TestNode, fieldList: List[TestField]): TestSchema = {
+    new TestSchema(schemaNode.name.getOrElse(""), schemaNode.namespace, fieldList)
+  }
+
+  /**
+   * Helper method to specifically convert a TestNode known to be a Record into a TestSchema.
+   * Each child of the TestNode becomes a Field of the TestSchema.
+   * @param schemaNode the node to convert to a TestSchema
+   * @return the finalized TestSchema wrapped in its corresponding Record Data Type
+   */
+  private def schemaNodeToRecordType(schemaNode: TestNode): RecordDataType = {
+    val fieldList: ListBuffer[TestField] = ListBuffer()
+    for (child <- schemaNode.children) fieldList += nodeToField(child)
+    val testSchema: TestSchema = newRecordSchema(schemaNode, fieldList.toList)
+    RecordDataType(testSchema)
+  }
+
+  /**
+   * Helper method to convert a TestNode known to be a choice into a Choice data type.
+   * This is able to handle a DFDL schema in which named elements are either directly sub-members of
+   * DFDL choices or embedded in another element (which corresponds to being in a Record).
+   * In the end, if they are not already inside a Record, then they are put there. Choices cannot
+   * have fields, only possible sub-types, so anything that would be a "field" has a wrapper Record
+   * put around it.
+   * @param schemaNode the node to convert to a Choice data type
+   * @return a Choice data type as described above
+   */
+  private def choiceNodeToChoiceSchema(schemaNode: TestNode): DataType = {
+    val childList: ListBuffer[DataType] = new ListBuffer()
+    for (child <- schemaNode.children) {
+      if (child.name.isEmpty) {
+        childList += schemaNodeToRecordType(child)
+      } else {
+        // In the case where the child has a name, we embed it in an additional TestSchema,
+        // since we must add some DataType rather than a Field.
+        val fieldList: ListBuffer[TestField] = ListBuffer()
+        fieldList += nodeToField(child)
+        val choiceOption: TestSchema = newRecordSchema(child, fieldList.toList)
+        childList += RecordDataType(choiceOption)
+      }
+    }
+    ChoiceDataType(childList.toList)
+  }
+
+  /**
+   * Local helper method to convert a TestNode into an appropriate DataType.
+   * Records and Choices get routed to other helper methods, and Arrays are handled in this method.
+   * @param schemaNode the node to convert to a Data Type
+   * @return the finalized Data Type
+   */
+  private def schemaNodeToDataType(schemaNode: TestNode): DataType = {
+    schemaNode.fieldType match {
+      case ArrayType =>
+        if (schemaNode.isSimple) {
+          ArrayDataType(new DataType(getSimpleType(schemaNode)))
+        } else if (schemaNode.children.nonEmpty) {
+          ArrayDataType(schemaNodeToRecordType(schemaNode))
+        } else {
+          throw new IllegalStateException(
+            s"Array Node $schemaNode either needs to be simple or have child nodes!"
+          )
+        }
+      case RecordType => schemaNodeToRecordType(schemaNode)
+      case ChoiceType => choiceNodeToChoiceSchema(schemaNode)
+      case _ => schemaNode.simpleType.get
+    }
+  }
+
+  /**
+   * In general, this method can be used to extract the TestSchema node from the result
+   * of schemaNodeToRecordType. However, it is really only used to kick off the conversion from
+   * the root TestNode.
+   * @param schemaNode the node from which a TestSchema will be extracted
+   * @return the processed TestSchema represented by the input
+   */
+  private def getRecordSchema(schemaNode: TestNode): TestSchema = {
+    schemaNodeToRecordType(schemaNode).childSchema
+  }
+
+}
+
+/**
+ * The sole purpose of this Companion Object is to contain this one static variable.
+ */
+object RecordWalker {
+  /**
+   * Used to suppress all of the extra printing from running test cases
+   */
+  final val PRODUCTION_MODE: Boolean = true

Review comment:

I'm concerned that this test code is fairly complex, and that it's pretty easy for there to be a small bug in here which could lead to us not realizing there's an issue with the walker. I'm wondering if there's a less complex way to test the DSOM walker to verify that it's doing the right thing?

We really just want to validate that the DSOM walker creates the correct events in the correct order. We don't really need to use these events to build anything to test that it's walking correctly. Perhaps it would be simpler if the TestDSOMWalker just received an ordered list of expected events and properties of those events. As the TestDSOMWalker walks the tree and receives events, it just ensures that they match the expected list of events.

There's probably multiple approaches, but the first thing that comes to my mind is to implement custom walker Views that have a custom comparison function to compare the things we want to test. Note that we don't necessarily need to check all properties; for example, it might make sense to ignore group, groupMembers, simpleType, complexType, etc., since these are all implied correct if we get the correct events. So one of these might look something like this:

```scala
case class TestRootView(
  override val isSimpleType: Boolean,
  override val isComplexType: Boolean,
  ...)
  extends walker.RootView {

  override def equals(that: Any): Boolean = that match {
    case rv: RootView =>
      this.isSimpleType == rv.isSimpleType &&
      this.isComplexType == rv.isComplexType &&
      ...
    case _ => false
  }
}
```

We'd have a concrete TestView for each walker View we could get from the walker. Then the TestDSOMWalker could look something like this:

```scala
class TestDSOMWalker(expectedEvents: Seq[TestView]) {
  var events = expectedEvents

  def onSomeEvent(view: walker.SomeView): Unit = {
    val head +: tail = events
    assertTrue(head == view)
    events = tail
  }
}
```

Then to test, this becomes:

```scala
def testWalker1() = {
  val expectedEvents = Seq(
    TestRootView(....),
    TestComplexTypeView(...),
    TestSequenceView(...),
    ...
  )
  ...
  val testWalker = new TestDSOMWalker(expectedEvents)
  testWalker.walk(pf.rootView)
}
```

There's still a little complexity here: we need to create these concrete views and ensure we're testing the right properties, including start/end events. But I think it's a little easier to verify that the walker implementation is doing the right thing with a simpler approach.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
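As a postscript, the expected-events pattern the reviewer sketches can be illustrated in a fully self-contained form. The following is a minimal sketch only, using hypothetical, simplified event types (`TestElementView`, `TestSequenceStartView`, etc.) that are NOT the actual daffodil walker Views; it shows just the core idea of consuming received events and checking them, in order, against an expected list:

```scala
// Hypothetical, simplified stand-ins for the walker's View mixins; the real
// daffodil views carry many more properties. Only the fields we compare are modeled.
sealed trait TestView
case class TestElementView(name: String, isSimpleType: Boolean) extends TestView
case object TestSequenceStartView extends TestView
case object TestSequenceEndView extends TestView

// A walker stub that asserts each received event matches the next expected
// event, consuming the expected list as it goes.
class ExpectedEventWalker(expected: Seq[TestView]) {
  private var remaining = expected

  def onEvent(actual: TestView): Unit = {
    assert(remaining.nonEmpty, s"unexpected extra event: $actual")
    assert(remaining.head == actual, s"expected ${remaining.head} but got $actual")
    remaining = remaining.tail
  }

  // After the walk, no expected events may be left over.
  def assertDone(): Unit =
    assert(remaining.isEmpty, s"missing expected events: $remaining")
}

object ExpectedEventDemo extends App {
  val walker = new ExpectedEventWalker(Seq(
    TestSequenceStartView,
    TestElementView("x", isSimpleType = true),
    TestSequenceEndView
  ))
  // Simulate the events a walk over a sequence containing one simple element might emit.
  walker.onEvent(TestSequenceStartView)
  walker.onEvent(TestElementView("x", isSimpleType = true))
  walker.onEvent(TestSequenceEndView)
  walker.assertDone()
  println("all events matched in order")
}
```

The design point is the same one the review makes: the checker never builds a schema from the events, so a bug in the test harness can only cause a spurious failure, not silently mask a walker bug.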
