Re: Remove concept of slots from InfosetImpl Design Discussion

Taylor Wise Thu, 12 Oct 2017 16:01:26 -0700

Alright, currently debugging an out of bounds exception.  Yay... xD

________________________________
From: Steve Lawrence <[email protected]>
Sent: Wednesday, October 11, 2017 4:07:51 PM
To: [email protected]; Mike Beckerle; Taylor Wise
Subject: Re: Remove concept of slots from InfosetImpl Design Discussion


On 10/11/2017 01:01 PM, Mike Beckerle wrote:
> You have to hash lookup the name to append to an array if you're just 
> appending one thing at a time starting from the slot/hashID. The current 
> Infoset API hides the construction of the DIArray, and only provides access 
> via the slot/hashID, so there's no way to avoid the hash per each append.

The hashmap and hashlookups should really only be needed for elements
that are used in expressions. We came up with that optimization so that
we could remove the vast majority of hash lookups. Due to how
getChildArray works, that optimization can't occur. I'm not sure how
much this will affect performance, but there are ways to avoid the need
for a hash lookup for arrays not used in expressions. The removal of
DIArray was one. Maybe API/parser changes is better.

> But that's fixable if we want. Probably the API to infoset needs to allow 
> appending to an array more efficently, so that a loop like a repeat 
> combinator can be directly holding onto the DIArray's arrayBuffer appending 
> objects as the parser returns them. But I would not make this change as yet. 
> First things first. Don't optimize without profiling info, etc.
>
>
> I'm very reluctant to remove DIArray because XML's data model isn't the only 
> game in town. The XML decision to not have arrays, just adjacent things that 
> happen to have the same name, has all sorts of negative performance 
> implications.

The removal of a DIArray doesn't necessarily mean we don't have a
concept of an arrays. Every element added to Complex child arrayBuf
still knows if it part of an array via ERD.isArray. And comparing the
current DIElements ERD to the previous and next is enough to determine
where array starts and ends. From there, we can support any model that
requires a concept of an array. I'm not saying we should remove the
concept of arrays, just that our InfosetImpl doesn't necessarily need to
have a special DIArray. I'll agree that it does make the logic easier to
get it to a different model, but other things are more complicated.

> Consider this: we could create DIArrays of elements of simple type that don't 
> even create a DISimple node for each array element. The values could be 
> stored in the DIArray's arrayBuffer directly, eliminating all that overhead 
> of a DISimple node. (This was the original intent for DIArray, but was punted 
> to get it working sooner.) The opportunity to do this is entirely lost if you 
> don't have DIArray, but just a bunch of adjacent DIElements that happen to 
> share ERD.

Fair point. Seems a good argument for keeping DIArray, but if we don't
get around to doing this optimization, we're taking multiple hits. We're
still allocating many DISimples plus a DIArray, plus the hit from the
hashmap.

> More speculatively, when we someday implement incremental array trimming so 
> that streaming parse/unparse can drop most array elements once they've been 
> delivered to consumer, well then we will definitely want a DIArray object to 
> abstract the complexity of this trim behavior and the underlying 
> representation it requires.

It might simplify this, but I'm not entirely convinced. I think the
trimming of unused Infoset elements is going to have some complex
heuristics regardless. Adding an extra heuristic for arrays won't
necessarily make this easier. But this is just speculation. And again,
we would still know elements are part of an array, it just takes a
couple extra comparisons to determine start/end. But I do agree that
removal of an an entire array is much easier with a DIArray concept.



It seems to me these are good enough arguments for keeping DIArray
around, but it might be worth thinking a little bit about what changes
would be necessary to the API/parser to not need this hash lookup. If
it's a minor change, it could reduce a lot of hash lookups, especially
for large arrays, which are not uncommon.


>
> ________________________________
> From: Steve Lawrence <[email protected]>
> Sent: Wednesday, October 11, 2017 12:21 PM
> To: Taylor Wise; [email protected]
> Subject: Re: Remove concept of slots from InfosetImpl Design Discussion
>
> On 10/11/2017 12:11 PM, Taylor Wise wrote:
>>
>> What you'll notice is that the current implementation will construct the 
>> DIArray
>> if it doesn't already exist and append to it if it does already exist.  But 
>> it
>> needs the 'slot' number in order to check this.  The slot number is obtained
>> from DPathCompileInfo.slotIndexInParent. If we're removing this concept of 
>> slots
>> entirely, how can we accomplish this?  Will we have to do a named lookup then
>> every time we want to append to a DIArray?
> I think you're right. The way things work right now, anytime we add to a
> a DIArray we would have to do a hash lookup on the DIComplex. It's not
> limited to just DPath.
>
> Maybe this is another argument for removing the concept of a DIArray? We
> still need to know if a DIElement is part of an array for the
> InfosetOutputter (erd.isArray), but maybe adding an DIElement a DIArray
> element should just means creating a new DISimple/DIComplex and adding
> it to the DIComplex child ArrayBuffer and appending to the HashLookup
> Seq? This would also simplify some query style logic, e.g. fn:count.
>
> Alternatively, we could modify the parser logic so that when it's
> dealing with arrays it stores the DIArray locally somewhere (PState or
> Parser) and appends straight to that without having to ask the parent
> DIComplex for the DIArray? Not sure how big of a change that would be
> though.
>
>>
>>
>> Current Implementation:
>>
>> override def addChild(e: InfosetElement): Unit = {
>>
>>   if (e.runtimeData.isArray)  {
>>
>>    // Make sure there is an array to accept the child
>>
>>    val arr = getChildArray(e.runtimeData) // creates if doesn't exist
>>
>>      Assert.invariant(arr ne null)
>>
>>      arr.append(e)
>>
>>    } else {
>>
>>      _slots(e.runtimeData.slotIndexInParent) = e.asInstanceOf[DINode]
>>
>>      _lastSlotAdded = e.runtimeData.slotIndexInParent
>>
>>    }
>>
>>    e.setParent(this)
>>
>> }
>>
>>
>>    final def getChildArray(childERD: ElementRuntimeData): InfosetArray = {
>>      Assert.usage(childERD.isArray)
>>      getChildArray(childERD.dpathElementCompileInfo)
>>    }
>>
>>    final def getChildArray(info: DPathElementCompileInfo): InfosetArray = {
>>      Assert.usage(info.isArray)
>>      val slot = info.slotIndexInParent
>>      getChildArray(slot)
>>    }
>>
>>    private def getChildArray(slot: Int): InfosetArray = {
>>      val slotVal = _slots(slot)
>>      if (slotVal ne null)
>>        slotVal match {
>>          case arr: DIArray => slotVal.asInstanceOf[DIArray]
>>          case _ => Assert.usageError("not an array")
>>        }
>>      else {
>>        val arrayERD = erd.childERDs(slot)
>>        // slot is null. There isn't even an array object yet.
>>        // create one (it will have zero entries)
>>        val ia = new DIArray(arrayERD, this)
>>        // no array there yet. So we have to create one.
>>        setChildArray(slot, ia)
>>        ia
>>      }
>>    }
>>
>>
>>
>> --------------------------------------------------------------------------------
>> *From:* Steve Lawrence <[email protected]>
>> *Sent:* Tuesday, October 3, 2017 10:05:43 AM
>> *To:* [email protected]; Taylor Wise
>> *Subject:* Re: Remove concept of slots from InfosetImpl Design Discussion
>> On 09/22/2017 04:25 PM, Taylor Wise wrote:
>>> Goal: Remove concept of slots from InfosetImpl
>>>
>>> Reasons:
>>>     1. So that element/node lookup can be name based.
>>>     2. Ease re-enabling of Unordered Sequences.
>>>     3. Ease enabling of 'query-style' expressions that return more than
>>> one node.
>>>
>>> An Infoset is a document (DIDocument) and the document contains exactly
>>> one element (DIElement) as its root.  These elements can be simple
>>> (DISimple) or complex (DIComplex).  Simple elements do not contain
>>> children, complex elements do.  DIArray represents an array (maxOccurs
>>>> 1). A DIArray shall always be a child of DIComplex.  All of these
>>> inherit from DINode.
>>>
>>> Currently DIComplex models its children as an Array of DINodes. The
>>> problem with the current implementation is that there's no way for us to
>>> just ask for an element by name without walking the structure.
>>>
>>> The solution involves modifying DIComplex to represent its children both
>>> as a mutable.ArrayBuffer[DINode] and a HashMap[String,
>>> mutable.Seq[DINode]].  The ArrayBuffer shall give us the necessary
>>> insertion ordering information we need regarding the children as well as
>>> constant time indexing access to the child in question.  The ArrayBuffer
>>> also has the added benefit of growing and shrinking as objects are added
>>> and removed.  While the HashMap will allow us to associate a name
>>> (String) with DINodes that have that name thus providing a fast named
>>> lookup.
>>>
>>> This named lookup can be optimized further by only inserting into the
>>> HashMap those elements that we know are used in expressions. These are
>>> the elements that would most likely benefit from a fast named lookup and
>>> will allow us to avoid additional overhead of hashing strings and
>>> allocating/appending to a sequence.
>>>
>>> The reason for having the HashMap contain a mutable.Seq[DINode] is that
>>> it's possible for the following:
>>>
>>>     <element name="a" .../>
>>>     <element name="a" maxOccurs="unbounded"..../>
>>>     <element name="b".../>
>>>     <element name="a" ..../>
>>>
>>> Each element named 'a' would have an associated DINode.  Because there
>>> are multiple DINodes named 'a', we need to represent them in the HashMap
>>> using the same key 'a'.  To do this, we add all instances of 'a' to a
>>> Seq[DINode].  In this particular case, the HashMap would look as follows:
>>>
>>>     "a" -> mutable.Seq(DIElement, DIArray, DIElement)
>>>     "b" -> mutable.Seq(DIElement)
>>>
>>> The mutable.ArrayBuffer maintains the insertion order and would look as
>>> follows:
>>>
>>>     mutable.ArrayBuffer(DIElement, DIArray, DIElement, DIElement)
>>>
>>> Back tracking shall be accomplished by going to the last item in the
>>> DIComplex ArrayBuffer and removing it.
>>>     1. If the last item is a DIElement, we simply delete it from the
>>> ArrayBuffer and then remove it also from the HashMap.
>>>     2. If the last item is a DIArray, we then have to look at the
>>> contents of the DIArray.
>>>         a. If the DIArray has more than one entry, remove the last
>>> entry.  Then update the associated entry in the ArrayBuffer and HashMap
>>> with the modified DIArray.
>>
>> I think the second part of this is actually unnecessary. Since the
>> DIArray is mutable, we only need to remove its last child. The entry in
>> the ArrayBuffer and HashMap will still point to the same DIArray. The
>> only difference is the contents of the DIArray will be one smaller.
>>
>>>         b. If the DIArray has a single entry then we simply remove
>>> DIArray from ArrayBuffer and the HashMap.
>>>
>>>     <element name="root>
>>>          <sequence ../>
>>>          <element name="a" .../>
>>>          <element name="a" maxOccurs="unbounded">
>>>            ...
>>>            <element name="b" .../>
>>>            <element name="c" .../>
>>>            ...
>>>          </element>
>>>        </sequence>
>>>     </element>
>>>
>>> If the HashMap of DIComplex-root looks like the following:
>>>      "a" -> mutable.Seq(DIElement, DIArray)
>>>
>>> ArrayBuffer of DIComplex-root looks like the following:
>>>      0 -> DIElement("a")
>>>      1 -> DIArray("a")
>>>
>>> Where DIArray children ArrayBuffer looks like:
>>>      0 -> DIElement("b")
>>>      1 -> DIElement("c")
>>>
>>> Given the initial example if we backtrack(1):  // purge one item
>>>      DIArray:
>>>          0 -> DIElement("b")  // removed one item from DIArray
>>>
>>>      HashMap DIComplex-root:
>>>          "a" -> mutable.Seq(DIElement, DIArray)
>>>
>>>      ArrayBuffer DIComplex-root:
>>>          0 -> DIElement("a")
>>>          1 -> DIArray("a")
>>>
>>> Given the initial example if we backtrack(2): // purge two items
>>>      DIArray:
>>>          Empty
>>>
>>>      HashMap DIComplex-root:
>>>          "a" -> mutable.Seq(DIElement)  // Because DIArray had two items
>>>                                         // removed and is now empty
>>>
>>>      ArrayBuffer DIComplex-root:
>>>          0 -> DIElement("a")
>>>
>>> ===
>>>
>>> class DIComplex {
>>>   def nSlots = erd.nChildSlots
>>
>> I think the concept of nSlots and nChildSlots should go away completely.
>> The children ArrayBuffer should just grow as elements are added and
>> shrink as they are backtracked. It might make sense to have a tunable
>> the determines the initial size of the children Array, but the Scala
>> defaults for initializing and growing a mutable array might already be
>> reasonable.
>>
>>>   val children: mutable.ArrayBuffer[DINode](nSlots) // insertion order
>>>   lazy val fastLookup: HashMap[String, mutable.Seq[DINode]]
>>
>> One thing that might make this change easier is to not worry about the
>> mutable.Seq, and just make this a HashMap[String, DINode]. The Seq is
>> really only necessary to support query style expressions, but that comes
>> with a whole other slew of concerns (e.g. fn:count, same name but
>> different types, etc.). There currently exist restrictions that do not
>> allow query style expressions, so this Seq should always be a size of 1
>> until remove those restrictions.
>>
>>>
>>>   def addChild(node: DINode) {
>>>     children += node
>>>     fastLookup(node.name).append(node)
>>>   }
>>>
>>>   def getChild(name: String): immutable.Seq[DINode] {
>>>     fastLookup.getOrElse(name, mutable.Seq.empty) // fastLookup(name)
>>>   }
>>>
>>>   def backtrack(int n) {
>>>     for (i = 0; i < n; i++) {
>>>       val endIdx = children.size -1
>>>       val child = children.remove(endIdx) // children.pop()
>>>
>>>       fastLookup.get(child.name) match {  // fastLookup(child.name).pop()
>>>         case None => // do nothing
>>>     case Some(nodesSeq) if nodesSeq.size == 0 => // Shouldn't happen?
>>>         case Some(nodesSeq) => {
>>>       val lastEntry = nodesSeq.last
>>>           lastEntry match {
>>>         case arr: DIArray => {
>>>               // If the array has >= 2 items, remove last one and update
>>> nodesSeq
>>>               // with this array:
>>>               //  val lastIdx = nodesSeq.length - 1
>>>           //  arr.remove(arr.size - 1)
>>>               //  fastLookup.update(child.name, nodesSeq.update(lastIdx,
>>> arr))
>>>               //
>>>               // If the array has 1 item, remove the array from nodesSeq
>>>               //  fastLookup.update(child.name, nodesSeq.dropRight(1))
>>>         }
>>>             case _ => fastLookup.update(child.name, nodesSeq.dropRight(1))
>>>           }
>>>
>>>         }
>>>       }
>>>     }
>>>   }
>>>
>>> }
>>>
>>> class DIArray {
>>>     private val initialSize =
>>> parent.tunable.initialElementOccurrencesHint.toInt // set to 10
>>>     protected final val _contents = new ArrayBuffer[DIElement](initialSize)
>>>
>>>     def append(ie: InfosetElement): Unit = {
>>>         _contents.append(ie.asInstanceOf[DIElement])
>>>         ie.setArray(this)
>>>     }
>>>
>>>     //
>>>     // For backtracking purposes
>>>     //
>>>     def remove(n: Int): DIElement = {
>>>         _contents.remove(n)
>>>     }
>>> }
>>>
>>> ===
>>>
>>> To facillitate a breadth-first ordering scheme we shall have to
>>
>> That this is only necessary for unordered sequences, where the insertion
>> order in the ArrayBuffer differs from the display order, correct? If
>> that's true, we definitely want to make sure to only do the below
>> sorting when an element contains an unordred sequence.
>>
>>> construct a 'global' that lives on the Element Runtime Data (ERD).  It
>>> shall be called childElementDeclPosition and pertains to the static
>>> order of the element children of a complex type element.  When sorting
>>> DINodes, childElementDeclPosition can be obtained by way of reaching
>>> from the DINode to the coresponding ERD.  Any array shall be a single
>>> entry in the sequence that needs to be sorted.  All elements in this
>>> array will have the same childElementDeclPosition.
>>>
>>> The difficulty in implementing this is that an Element's
>>> dsom.ComplexType can have groups within the groups before finding actual
>>> element children.  Therefore in the Document Schema Object Model (DSOM)
>>> the model group object shall have a startingChildElementDeclPosition and
>>> an endingChildElementDeclPosition.
>>>
>>> 1. The startingChildElementPosition shall be passed to the constructor
>>> for the Term.
>>> 2. If the term is an element, the endingChildElementDeclPosition is
>>> equal to the startingChildElementDeclPosition.  3. If the term is a
>>> model group, then its first child term's
>>> startingChildElementDeclPosition is equal to its own
>>> startingChildElementDeclPosition, the 2nd to Nth child term's
>>> startingChildElementDeclPosition is the prior term's
>>> endingChildElementDeclPosition + 1.
>>> 4. The endingChildElementDeclPosition for a model-group shall be the
>>> endingChildElmentDeclPosition of its last child term.
>>>
>>
>> Looks good to me. +1. This should definitely make implementation of
>> unordered sequences and query style expressions easier. Removal of the
>> slot concept should also help remove concepts that prevent faster schema
>> compilation.
>>
>> - Steve
>>
>
>

Re: Remove concept of slots from InfosetImpl Design Discussion

Reply via email to