Re: Remove concept of slots from InfosetImpl Design Discussion

Steve Lawrence Tue, 03 Oct 2017 07:07:00 -0700

On 09/22/2017 04:25 PM, Taylor Wise wrote:
> Goal: Remove concept of slots from InfosetImpl
> 
> Reasons:
>     1. So that element/node lookup can be name based.
>     2. Ease re-enabling of Unordered Sequences.
>     3. Ease enabling of 'query-style' expressions that return more than
> one node.
> 
> An Infoset is a document (DIDocument) and the document contains exactly
> one element (DIElement) as its root.  These elements can be simple
> (DISimple) or complex (DIComplex).  Simple elements do not contain
> children, complex elements do.  DIArray represents an array (maxOccurs
>>1). A DIArray shall always be a child of DIComplex.  All of these
> inherit from DINode.
> 
> Currently DIComplex models its children as an Array of DINodes. The
> problem with the current implementation is that there's no way for us to
> just ask for an element by name without walking the structure.
> 
> The solution involves modifying DIComplex to represent its children both
> as a mutable.ArrayBuffer[DINode] and a HashMap[String,
> mutable.Seq[DINode]].  The ArrayBuffer shall give us the necessary
> insertion ordering information we need regarding the children as well as
> constant time indexing access to the child in question.  The ArrayBuffer
> also has the added benefit of growing and shrinking as objects are added
> and removed.  While the HashMap will allow us to associate a name
> (String) with DINodes that have that name thus providing a fast named
> lookup.
> 
> This named lookup can be optimized further by only inserting into the
> HashMap those elements that we know are used in expressions. These are
> the elements that would most likely benefit from a fast named lookup and
> will allow us to avoid additional overhead of hashing strings and
> allocating/appending to a sequence.
> 
> The reason for having the HashMap contain a mutable.Seq[DINode] is that
> it's possible for the following:
> 
>     <element name="a" .../>
>     <element name="a" maxOccurs="unbounded"..../>
>     <element name="b".../>
>     <element name="a" ..../>
> 
> Each element named 'a' would have an associated DINode.  Because there
> are multiple DINodes named 'a', we need to represent them in the HashMap
> using the same key 'a'.  To do this, we add all instances of 'a' to a
> Seq[DINode].  In this particular case, the HashMap would look as follows:
> 
>     "a" -> mutable.Seq(DIElement, DIArray, DIElement)
>     "b" -> mutable.Seq(DIElement)
> 
> The mutable.ArrayBuffer maintains the insertion order and would look as
> follows:
> 
>     mutable.ArrayBuffer(DIElement, DIArray, DIElement, DIElement)
> 
> Back tracking shall be accomplished by going to the last item in the
> DIComplex ArrayBuffer and removing it.
>     1. If the last item is a DIElement, we simply delete it from the
> ArrayBuffer and then remove it also from the HashMap.
>     2. If the last item is a DIArray, we then have to look at the
> contents of the DIArray.
>         a. If the DIArray has more than one entry, remove the last
> entry.  Then update the associated entry in the ArrayBuffer and HashMap
> with the modified DIArray.


I think the second part of this is actually unnecessary. Since the
DIArray is mutable, we only need to remove its last child. The entry in
the ArrayBuffer and HashMap will still point to the same DIArray. The
only difference is the contents of the DIArray will be one smaller.

>         b. If the DIArray has a single entry then we simply remove
> DIArray from ArrayBuffer and the HashMap.
> 
>     <element name="root>
>          <sequence ../>
>          <element name="a" .../>
>          <element name="a" maxOccurs="unbounded">
>            ...
>            <element name="b" .../>
>            <element name="c" .../>
>            ...
>          </element>
>        </sequence>
>     </element>
> 
> If the HashMap of DIComplex-root looks like the following:
>      "a" -> mutable.Seq(DIElement, DIArray)
> 
> ArrayBuffer of DIComplex-root looks like the following:
>      0 -> DIElement("a")
>      1 -> DIArray("a")
> 
> Where DIArray children ArrayBuffer looks like:
>      0 -> DIElement("b")
>      1 -> DIElement("c")
> 
> Given the initial example if we backtrack(1):  // purge one item
>      DIArray:
>          0 -> DIElement("b")  // removed one item from DIArray
> 
>      HashMap DIComplex-root:
>          "a" -> mutable.Seq(DIElement, DIArray)
> 
>      ArrayBuffer DIComplex-root:
>          0 -> DIElement("a")
>          1 -> DIArray("a")
> 
> Given the initial example if we backtrack(2): // purge two items
>      DIArray:
>          Empty
> 
>      HashMap DIComplex-root:
>          "a" -> mutable.Seq(DIElement)  // Because DIArray had two items
>                                         // removed and is now empty
> 
>      ArrayBuffer DIComplex-root:
>          0 -> DIElement("a")
> 
> ===
> 
> class DIComplex {
>   def nSlots = erd.nChildSlots

I think the concept of nSlots and nChildSlots should go away completely.
The children ArrayBuffer should just grow as elements are added and
shrink as they are backtracked. It might make sense to have a tunable
the determines the initial size of the children Array, but the Scala
defaults for initializing and growing a mutable array might already be
reasonable.

>   val children: mutable.ArrayBuffer[DINode](nSlots) // insertion order
>   lazy val fastLookup: HashMap[String, mutable.Seq[DINode]]

One thing that might make this change easier is to not worry about the
mutable.Seq, and just make this a HashMap[String, DINode]. The Seq is
really only necessary to support query style expressions, but that comes
with a whole other slew of concerns (e.g. fn:count, same name but
different types, etc.). There currently exist restrictions that do not
allow query style expressions, so this Seq should always be a size of 1
until remove those restrictions.

> 
>   def addChild(node: DINode) {
>     children += node
>     fastLookup(node.name).append(node)
>   }
> 
>   def getChild(name: String): immutable.Seq[DINode] {
>     fastLookup.getOrElse(name, mutable.Seq.empty) // fastLookup(name)
>   }
> 
>   def backtrack(int n) {
>     for (i = 0; i < n; i++) {
>       val endIdx = children.size -1
>       val child = children.remove(endIdx) // children.pop()
> 
>       fastLookup.get(child.name) match {  // fastLookup(child.name).pop()
>         case None => // do nothing
>     case Some(nodesSeq) if nodesSeq.size == 0 => // Shouldn't happen?
>         case Some(nodesSeq) => {
>       val lastEntry = nodesSeq.last
>           lastEntry match {
>         case arr: DIArray => {
>               // If the array has >= 2 items, remove last one and update
> nodesSeq
>               // with this array:
>               //  val lastIdx = nodesSeq.length - 1
>           //  arr.remove(arr.size - 1)
>               //  fastLookup.update(child.name, nodesSeq.update(lastIdx,
> arr))
>               //
>               // If the array has 1 item, remove the array from nodesSeq
>               //  fastLookup.update(child.name, nodesSeq.dropRight(1))
>         }
>             case _ => fastLookup.update(child.name, nodesSeq.dropRight(1))
>           }
> 
>         }
>       }
>     }
>   }
> 
> }
> 
> class DIArray {
>     private val initialSize =
> parent.tunable.initialElementOccurrencesHint.toInt // set to 10
>     protected final val _contents = new ArrayBuffer[DIElement](initialSize)
> 
>     def append(ie: InfosetElement): Unit = {
>         _contents.append(ie.asInstanceOf[DIElement])
>         ie.setArray(this)
>     }
> 
>     //
>     // For backtracking purposes
>     //
>     def remove(n: Int): DIElement = {
>         _contents.remove(n)
>     }
> }
> 
> ===
> 
> To facillitate a breadth-first ordering scheme we shall have to

That this is only necessary for unordered sequences, where the insertion
order in the ArrayBuffer differs from the display order, correct? If
that's true, we definitely want to make sure to only do the below
sorting when an element contains an unordred sequence.

> construct a 'global' that lives on the Element Runtime Data (ERD).  It
> shall be called childElementDeclPosition and pertains to the static
> order of the element children of a complex type element.  When sorting
> DINodes, childElementDeclPosition can be obtained by way of reaching
> from the DINode to the coresponding ERD.  Any array shall be a single
> entry in the sequence that needs to be sorted.  All elements in this
> array will have the same childElementDeclPosition.
> 
> The difficulty in implementing this is that an Element's
> dsom.ComplexType can have groups within the groups before finding actual
> element children.  Therefore in the Document Schema Object Model (DSOM)
> the model group object shall have a startingChildElementDeclPosition and
> an endingChildElementDeclPosition.
> 
> 1. The startingChildElementPosition shall be passed to the constructor
> for the Term.
> 2. If the term is an element, the endingChildElementDeclPosition is
> equal to the startingChildElementDeclPosition.  3. If the term is a
> model group, then its first child term's
> startingChildElementDeclPosition is equal to its own
> startingChildElementDeclPosition, the 2nd to Nth child term's
> startingChildElementDeclPosition is the prior term's
> endingChildElementDeclPosition + 1.
> 4. The endingChildElementDeclPosition for a model-group shall be the
> endingChildElmentDeclPosition of its last child term.
> 

Looks good to me. +1. This should definitely make implementation of
unordered sequences and query style expressions easier. Removal of the
slot concept should also help remove concepts that prevent faster schema
compilation.

- Steve

Re: Remove concept of slots from InfosetImpl Design Discussion

Reply via email to