Re: Remove concept of slots from InfosetImpl Design Discussion

Mike Beckerle Fri, 13 Oct 2017 11:15:34 -0700

I think the key has to change to include name and namespace of element, and 
also an additional integer to disambiguate the first a element from the 2nd, 
from the Nth. This integer need not be 1, 2, 3. It can be any value at all, so 
long as they are distinct.



Let's make sure another case works. E.g.,  ordered sequence like the one you 
have here, another element to separate it from  an unordered sequence that also 
contains elements named a.


However, an unordered sequence can't have more than 1 element named a, so if 
every element a declaration gets an additional integer, the one in the 
unordered sequence will also get one, and that will behave properly.


Is this integer the same breadth-first ordering thing we discussed elsewhere? I 
expect it is.

________________________________
From: Taylor Wise
Sent: Friday, October 13, 2017 1:55:22 PM
To: Steve Lawrence; [email protected]; Mike Beckerle
Subject: Re: Remove concept of slots from InfosetImpl Design Discussion


I'm down to a single error: test_dfdlPosition3


java.lang.Exception:
Comparison failed.
Expected
          
<f><cnt1>2</cnt1><a><v>1</v></a><a><v>2</v></a><cnt2>2</cnt2><a><v>1</v></a><a><v>2</v></a></f>
Actual
          
<f><cnt1>2</cnt1><a><v>1</v></a><a><v>2</v></a><a><v>1</v></a><a><v>2</v></a><cnt2>2</cnt2></f>
Differences were (path, expected, actual):
 (f,<cnt2>2</cnt2>,<a><v>1</v></a>)
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:192)

    <xs:element name="f">
      <xs:complexType>
        <xs:sequence>
          <xs:element name="cnt1" type="xs:int"
            dfdl:lengthKind="delimited" dfdl:terminator=";" />
          <xs:element name="a" minOccurs="0" maxOccurs="unbounded"
            dfdl:occursCountKind="expression" dfdl:occursCount="{ 
xs:int(../ex:cnt1) }">
            <xs:complexType>
              <xs:sequence>
                <xs:element name="v" type="xs:int"
                  dfdl:inputValueCalc="{ dfdl:occursIndex() }" />
              </xs:sequence>
            </xs:complexType>
          </xs:element>
          <xs:element name="cnt2" type="xs:int"
            dfdl:lengthKind="delimited" />
          <xs:element name="a" minOccurs="0" maxOccurs="unbounded"
            dfdl:occursCountKind="expression" dfdl:occursCount="{ 
xs:int(../ex:cnt2) }">
            <xs:complexType>
              <xs:sequence>
                <xs:element name="v" type="xs:int"
                  dfdl:inputValueCalc="{ dfdl:occursIndex() }" />
              </xs:sequence>
            </xs:complexType>
          </xs:element>
        </xs:sequence>
      </xs:complexType>
    </xs:element>



I believe what's happening, from stepping through, is that the getChildArray 
code is confused as to where to add the second set of <a/> elements.  Below are 
the values of the structures at the point where <cnt2/> has been added:

ChildNodes aka Insertion Order
====
ArrayBuffer(DISimple(name='cnt1' 2 valueLen(One(0)unk, One(8)unk)), 
DIArray({http://example.com}a,ArrayBuffer(DIComplex(name='a'), 
DIComplex(name='a'))), DISimple(name='cnt2' 2 valueLen(One(0)unk, One(8)unk)))

FastLookup
====
Map(cnt2 -> ArrayBuffer(DISimple(name='cnt2' 2 valueLen(One(0)unk, 
One(8)unk))), cnt1 -> ArrayBuffer(DISimple(name='cnt1' 2 valueLen(One(0)unk, 
One(8)unk))), a -> 
ArrayBuffer(DIArray({http://example.com}a,ArrayBuffer(DIComplex(name='a'), 
DIComplex(name='a')))))

So when we go to add the next array 'a', it sees that 'a' is already in the 
fastLookup and appends it there.  This adds it to the existing DIArray in the 
insertion order list.  Really, it should create a new DIArray inserting it 
after <cnt2/> in the insertion order list.  And inserting it after the DIArray 
in the fastLookup Hash: a -> Seq(DIArray(a, ...), DIArray(a, ...))

How do we fix this?

Because this is an 'ordered' sequence, can we look at the slotIndexInParent?  
Isn't this the position in the sequence definition as defined in the schema 
above?  This should differentiate between the two 'a' arrays.  And then for 
'unordered' sequences, this shouldn't matter.  We shouldn't run into two or 
more members with the same name and namespace.  Correct?

This also means that the 'key' will have to change to include the namespace in 
addition to the element name.



________________________________
From: Steve Lawrence <[email protected]>
Sent: Friday, October 13, 2017 8:45:06 AM
To: [email protected]; Taylor Wise; Mike Beckerle
Subject: Re: Remove concept of slots from InfosetImpl Design Discussion

On 10/12/2017 08:19 PM, Taylor Wise wrote:
> Been debugging using the TestSimpleTypes tests.  I'm down to just 1 failure.  
> Expected results are slightly off.  I've got an additional <text/> in the 
> result.  So, getting closer.  Stopping for the night.
>
>
> I'm not so sure that we need the _lastSlotAdded variable anymore.  I changed 
> it to be just a def that reflects the current length of childNodes 
> (ArrayBuffer).
> z
>
> I'm betting my current issue is within the restore functions I modified, but 
> I'll continue looking at that in the morning.  I would think that 'restoring' 
> would simply be copying the previous state... unless I'm misunderstanding 
> something.
>
>
>   abstract override def restoreInto(e: DIElement) {
>     val c = e.asInstanceOf[DIComplex]
>     c.restoreFrom(this)
>     //c._lastSlotAdded = this._lastSlotAdded
>     super.restoreInto(e)
>   }
>
> final def restoreFrom(cs: DIComplexSharedImplMixin){
>   // Shouldn't I be able to just say something like...
>  //
>  childNodes.clear()
>  childNodes.append(cs.childNodes)
>  fastLookup.clear()
>  cs.fastLookup.toSeq.foreach( (name, seqDINodes) =>
>        fastLookup.put(name, seqDINodes) )
> }

This is going to be somewhat inefficient. This essentially deletes and
recreates the entire childNodes array and fastLookup hashmap everytime
we backtrack. This could be pretty back if we childNodes had 1000
elements and we just wanted to backtrack 1. Instead we just want to
remove the elements that are being backtracked.

So I think you still need a concept of _lastSlotAdded. But rather than
the lastSlotAdded, it's just the length of the children (maybe call it
_numChildren?). I would expect restoreFrom for DIComplex to look
something like this (very similar to the existing restoreFrom):

final def restoreFrom(cs: DIComplexSharedImplMixin) {
  var i = childNodes.length - 1

  // for each child we will remove, remove it from the fastLookup map.
  // this should always be the last element in the hashmap seq
  while (i >= cs._numChildren) {
    var childToRemove = childNodes(i)
    var fastSeq = fastLookup(childToRemove.name)
    if (fastSeq.length == 1)
      // last one, remove the whole key entry
      fastLookup.remove(childToRemove.name)
    else
      // not the last one, just drop the end
      fastSeq.dropRight(1)
    i -= 1
  }

  // now just quickly remove all those children
  childNodes.reduceToSize(cs._numChildren)
  _numChildren = cs._numChildren

  if (cs._arraySize.isDefined) {
    childNodes(_numChildren - 1).reduceToSize(cs._arraySize.get)
  }
}

The logic here is very similar to the CURRENT code, just is a little
more simple since it can make certain assumptions. Like removing
something form the fastSeq just means removing the last one
(dropRight(1)), and the existing _arraySize handles the special case for
DIArrays to simplify that logic.

Note that we might want to rethink about the type fastLookup value.
Rather than the value being a Seq, it probably wants to be some sort of
mutable Seq that has constant time append dropRight. Maybe another
ArrayBuf is the right data structure here? That would allow something like

  fastLookup.reduceToSize(fastSeq.length - 1)




> // ORIGINAL
>   //    final def restoreFrom(cs: DIComplexSharedImplMixin) {
>   //    Assert.invariant(_arraySize == MaybeInt.Nope)
>   //
>   //    var i = _lastSlotAdded
>   //    while (i > cs._lastSlotAdded) {
>   //      _slots(i) = null
>   //      i -= 1
>   //    }
>   //    _lastSlotAdded = cs._lastSlotAdded
>   //
>   //    // check invariant. All slots after last slot added
>   //    // should be null.
>   //    {
>   //      var i = _lastSlotAdded + 1
>   //      while (i < _slots.length) {
>   //        Assert.invariant(_slots(i) eq null)
>   //        i = i + 1
>   //      }
>   //    }
>   //
>   //    if (cs._arraySize.isDefined) {
>   //      
> _slots(_lastSlotAdded).asInstanceOf[DIArray].reduceToSize(cs._arraySize.get)
>   //    }
>   //  }
>
> // CURRENT, lot of logic here.  Wondering if it's even necessary, thus the 
> question above
> //
>   final def restoreFrom(cs: DIComplexSharedImplMixin) {
>     Assert.invariant(_arraySize == MaybeInt.Nope)
>
>     var i = _lastSlotAdded
>     while (i > cs._lastSlotAdded) {
>       val lastNode = childNodes.last
>       val name = lastNode.namedQName.toQNameString
>
>       lastNode match {
>         case arr: DIArray if (arr.length > 1) => {
>           // No need to delete the array, just the last DINode in it
>           arr.pop()
>         }
>         case arr: DIArray if (arr.length == 1) => {
>           // Last DINode in array, remove the array entirely
>           //
>           arr.pop()
>           //
>           // Update insertion order list
>           //
>           childNodes.remove(childNodes.length - 1)
>           //
>           // Update Fast Lookup
>           //
>           val seq = fastLookup(name)
>           if (seq.length <= 1) {
>             // Last DINode with THIS name, remove it entirely.
>             //
>             fastLookup.remove(name)
>           } else {
>             // Not the last DINode with THIS name, just remove this 
> particular one.
>             //
>             val idx = seq.lastIndexWhere(node => node.isInstanceOf[DIArray])
>             val result = seq diff Seq(lastNode)
>             fastLookup.update(name, result)
>           }
>         }
>         case _ => {
>           //
>           // Update insertion order list
>           //
>           childNodes.remove(childNodes.length - 1)
>           //
>           // Update Fast Lookup
>           //
>           val seq = fastLookup(name)
>           if (seq.length <= 1) {
>             // Last DINode with THIS name, remove it entirely
>             //
>             fastLookup.remove(name)
>           } else {
>             // Not the last DINode with THIS name.  Just remove this one!
>             //
>             val idx = seq.lastIndexWhere(node => !node.isInstanceOf[DIArray])
>             val result = seq diff Seq(lastNode)
>             fastLookup.update(name, result)
>           }
>         }
>       }
>     }
>
>   }
>
> The last failure in SimpleTypes, I haven't run the rest of the suite yet 
> though.  Baby steps.
> java.lang.Exception:
> Comparison failed.
> Expected
>           
> <hb_root01><text>57652063616E277420626520636F6E73756D6564206279206F757220706574747920646966666572656E63657320616E79206D6F7265</text><text>2057652077696C6C20626520756E6974656420696E206F757220636F6D6D6F6E20696E746572657374</text><text>205065726861707320697427732066617465207468617420746F6461792069732074686520347468206F66204A756C79</text><text>20616E6420796F752077696C6C206F6E636520616761696E206265206669676874696E6720666F72206F75722066726565646F6D</text></hb_root01>
> Actual
>           
> <hb_root01><text>57652063616E277420626520636F6E73756D6564206279206F757220706574747920646966666572656E63657320616E79206D6F7265</text><text>2057652077696C6C20626520756E6974656420696E206F757220636F6D6D6F6E20696E746572657374</text><text>205065726861707320697427732066617465207468617420746F6461792069732074686520347468206F66204A756C79</text><text>20616E6420796F752077696C6C206F6E636520616761696E206265206669676874696E6720666F72206F75722066726565646F6D</text><text/></hb_root01>
> Differences were (path, expected, actual):
>  (hb_root01,List(),List(<text/>))
> at 
> edu.illinois.ncsa.daffodil.xml.XMLUtils$.compareAndReport(XMLUtils.scala:678)
> at 
> edu.illinois.ncsa.daffodil.tdml.VerifyTestCase$.verifyParserTestData(TDMLRunner.scala:1074)
> at 
> edu.illinois.ncsa.daffodil.tdml.ParserTestCase.runParseExpectSuccess(TDMLRunner.scala:776)
> at 
> edu.illinois.ncsa.daffodil.tdml.ParserTestCase$$anonfun$runProcessor$3.apply(TDMLRunner.scala:675)
> at 
> edu.illinois.ncsa.daffodil.tdml.ParserTestCase$$anonfun$runProcessor$3.apply(TDMLRunner.scala:673)
> at scala.util.Either$RightProjection.foreach(Either.scala:468)
> at 
> edu.illinois.ncsa.daffodil.tdml.ParserTestCase.runProcessor(TDMLRunner.scala:673)
> at edu.illinois.ncsa.daffodil.tdml.TestCase.run(TDMLRunner.scala:623)
> at 
> edu.illinois.ncsa.daffodil.tdml.DFDLTestSuite.runOneTestWithDataVolumes(TDMLRunner.scala:340)
> at 
> edu.illinois.ncsa.daffodil.tdml.DFDLTestSuite.runOneTest(TDMLRunner.scala:328)
> at edu.illinois.ncsa.daffodil.tdml.Runner.runOneTest(RunnerFactory.scala:122)
> at 
> edu.illinois.ncsa.daffodil.section05.simple_types.TestSimpleTypes.test_hexBinary_Delimited_01(TestSimpleTypes.scala:63)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
> at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
> at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
> at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
> at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
> at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
> at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
> at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
> at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
> at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
> at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
> at 
> org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:86)
> at 
> org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
> at 
> org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:459)
> at 
> org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:678)
> at 
> org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:382)
> at 
> org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:192)
>
>
>
>
>
> ________________________________
> From: Steve Lawrence <[email protected]>
> Sent: Wednesday, October 11, 2017 4:07:51 PM
> To: [email protected]; Mike Beckerle; Taylor Wise
> Subject: Re: Remove concept of slots from InfosetImpl Design Discussion
>
> On 10/11/2017 01:01 PM, Mike Beckerle wrote:
>> You have to hash lookup the name to append to an array if you're just 
>> appending one thing at a time starting from the slot/hashID. The current 
>> Infoset API hides the construction of the DIArray, and only provides access 
>> via the slot/hashID, so there's no way to avoid the hash per each append.
>
> The hashmap and hashlookups should really only be needed for elements
> that are used in expressions. We came up with that optimization so that
> we could remove the vast majority of hash lookups. Due to how
> getChildArray works, that optimization can't occur. I'm not sure how
> much this will affect performance, but there are ways to avoid the need
> for a hash lookup for arrays not used in expressions. The removal of
> DIArray was one. Maybe API/parser changes is better.
>
>> But that's fixable if we want. Probably the API to infoset needs to allow 
>> appending to an array more efficently, so that a loop like a repeat 
>> combinator can be directly holding onto the DIArray's arrayBuffer appending 
>> objects as the parser returns them. But I would not make this change as yet. 
>> First things first. Don't optimize without profiling info, etc.
>>
>>
>> I'm very reluctant to remove DIArray because XML's data model isn't the only 
>> game in town. The XML decision to not have arrays, just adjacent things that 
>> happen to have the same name, has all sorts of negative performance 
>> implications.
>
> The removal of a DIArray doesn't necessarily mean we don't have a
> concept of an arrays. Every element added to Complex child arrayBuf
> still knows if it part of an array via ERD.isArray. And comparing the
> current DIElements ERD to the previous and next is enough to determine
> where array starts and ends. From there, we can support any model that
> requires a concept of an array. I'm not saying we should remove the
> concept of arrays, just that our InfosetImpl doesn't necessarily need to
> have a special DIArray. I'll agree that it does make the logic easier to
> get it to a different model, but other things are more complicated.
>
>> Consider this: we could create DIArrays of elements of simple type that 
>> don't even create a DISimple node for each array element. The values could 
>> be stored in the DIArray's arrayBuffer directly, eliminating all that 
>> overhead of a DISimple node. (This was the original intent for DIArray, but 
>> was punted to get it working sooner.) The opportunity to do this is entirely 
>> lost if you don't have DIArray, but just a bunch of adjacent DIElements that 
>> happen to share ERD.
>
> Fair point. Seems a good argument for keeping DIArray, but if we don't
> get around to doing this optimization, we're taking multiple hits. We're
> still allocating many DISimples plus a DIArray, plus the hit from the
> hashmap.
>
>> More speculatively, when we someday implement incremental array trimming so 
>> that streaming parse/unparse can drop most array elements once they've been 
>> delivered to consumer, well then we will definitely want a DIArray object to 
>> abstract the complexity of this trim behavior and the underlying 
>> representation it requires.
>
> It might simplify this, but I'm not entirely convinced. I think the
> trimming of unused Infoset elements is going to have some complex
> heuristics regardless. Adding an extra heuristic for arrays won't
> necessarily make this easier. But this is just speculation. And again,
> we would still know elements are part of an array, it just takes a
> couple extra comparisons to determine start/end. But I do agree that
> removal of an an entire array is much easier with a DIArray concept.
>
>
>
> It seems to me these are good enough arguments for keeping DIArray
> around, but it might be worth thinking a little bit about what changes
> would be necessary to the API/parser to not need this hash lookup. If
> it's a minor change, it could reduce a lot of hash lookups, especially
> for large arrays, which are not uncommon.
>
>
>>
>> ________________________________
>> From: Steve Lawrence <[email protected]>
>> Sent: Wednesday, October 11, 2017 12:21 PM
>> To: Taylor Wise; [email protected]
>> Subject: Re: Remove concept of slots from InfosetImpl Design Discussion
>>
>> On 10/11/2017 12:11 PM, Taylor Wise wrote:
>>>
>>> What you'll notice is that the current implementation will construct the 
>>> DIArray
>>> if it doesn't already exist and append to it if it does already exist.  But 
>>> it
>>> needs the 'slot' number in order to check this.  The slot number is obtained
>>> from DPathCompileInfo.slotIndexInParent. If we're removing this concept of 
>>> slots
>>> entirely, how can we accomplish this?  Will we have to do a named lookup 
>>> then
>>> every time we want to append to a DIArray?
>> I think you're right. The way things work right now, anytime we add to a
>> a DIArray we would have to do a hash lookup on the DIComplex. It's not
>> limited to just DPath.
>>
>> Maybe this is another argument for removing the concept of a DIArray? We
>> still need to know if a DIElement is part of an array for the
>> InfosetOutputter (erd.isArray), but maybe adding an DIElement a DIArray
>> element should just means creating a new DISimple/DIComplex and adding
>> it to the DIComplex child ArrayBuffer and appending to the HashLookup
>> Seq? This would also simplify some query style logic, e.g. fn:count.
>>
>> Alternatively, we could modify the parser logic so that when it's
>> dealing with arrays it stores the DIArray locally somewhere (PState or
>> Parser) and appends straight to that without having to ask the parent
>> DIComplex for the DIArray? Not sure how big of a change that would be
>> though.
>>
>>>
>>>
>>> Current Implementation:
>>>
>>> override def addChild(e: InfosetElement): Unit = {
>>>
>>>   if (e.runtimeData.isArray)  {
>>>
>>>    // Make sure there is an array to accept the child
>>>
>>>    val arr = getChildArray(e.runtimeData) // creates if doesn't exist
>>>
>>>      Assert.invariant(arr ne null)
>>>
>>>      arr.append(e)
>>>
>>>    } else {
>>>
>>>      _slots(e.runtimeData.slotIndexInParent) = e.asInstanceOf[DINode]
>>>
>>>      _lastSlotAdded = e.runtimeData.slotIndexInParent
>>>
>>>    }
>>>
>>>    e.setParent(this)
>>>
>>> }
>>>
>>>
>>>    final def getChildArray(childERD: ElementRuntimeData): InfosetArray = {
>>>      Assert.usage(childERD.isArray)
>>>      getChildArray(childERD.dpathElementCompileInfo)
>>>    }
>>>
>>>    final def getChildArray(info: DPathElementCompileInfo): InfosetArray = {
>>>      Assert.usage(info.isArray)
>>>      val slot = info.slotIndexInParent
>>>      getChildArray(slot)
>>>    }
>>>
>>>    private def getChildArray(slot: Int): InfosetArray = {
>>>      val slotVal = _slots(slot)
>>>      if (slotVal ne null)
>>>        slotVal match {
>>>          case arr: DIArray => slotVal.asInstanceOf[DIArray]
>>>          case _ => Assert.usageError("not an array")
>>>        }
>>>      else {
>>>        val arrayERD = erd.childERDs(slot)
>>>        // slot is null. There isn't even an array object yet.
>>>        // create one (it will have zero entries)
>>>        val ia = new DIArray(arrayERD, this)
>>>        // no array there yet. So we have to create one.
>>>        setChildArray(slot, ia)
>>>        ia
>>>      }
>>>    }
>>>
>>>
>>>
>>> --------------------------------------------------------------------------------
>>> *From:* Steve Lawrence <[email protected]>
>>> *Sent:* Tuesday, October 3, 2017 10:05:43 AM
>>> *To:* [email protected]; Taylor Wise
>>> *Subject:* Re: Remove concept of slots from InfosetImpl Design Discussion
>>> On 09/22/2017 04:25 PM, Taylor Wise wrote:
>>>> Goal: Remove concept of slots from InfosetImpl
>>>>
>>>> Reasons:
>>>>     1. So that element/node lookup can be name based.
>>>>     2. Ease re-enabling of Unordered Sequences.
>>>>     3. Ease enabling of 'query-style' expressions that return more than
>>>> one node.
>>>>
>>>> An Infoset is a document (DIDocument) and the document contains exactly
>>>> one element (DIElement) as its root.  These elements can be simple
>>>> (DISimple) or complex (DIComplex).  Simple elements do not contain
>>>> children, complex elements do.  DIArray represents an array (maxOccurs
>>>>> 1). A DIArray shall always be a child of DIComplex.  All of these
>>>> inherit from DINode.
>>>>
>>>> Currently DIComplex models its children as an Array of DINodes. The
>>>> problem with the current implementation is that there's no way for us to
>>>> just ask for an element by name without walking the structure.
>>>>
>>>> The solution involves modifying DIComplex to represent its children both
>>>> as a mutable.ArrayBuffer[DINode] and a HashMap[String,
>>>> mutable.Seq[DINode]].  The ArrayBuffer shall give us the necessary
>>>> insertion ordering information we need regarding the children as well as
>>>> constant time indexing access to the child in question.  The ArrayBuffer
>>>> also has the added benefit of growing and shrinking as objects are added
>>>> and removed.  While the HashMap will allow us to associate a name
>>>> (String) with DINodes that have that name thus providing a fast named
>>>> lookup.
>>>>
>>>> This named lookup can be optimized further by only inserting into the
>>>> HashMap those elements that we know are used in expressions. These are
>>>> the elements that would most likely benefit from a fast named lookup and
>>>> will allow us to avoid additional overhead of hashing strings and
>>>> allocating/appending to a sequence.
>>>>
>>>> The reason for having the HashMap contain a mutable.Seq[DINode] is that
>>>> it's possible for the following:
>>>>
>>>>     <element name="a" .../>
>>>>     <element name="a" maxOccurs="unbounded"..../>
>>>>     <element name="b".../>
>>>>     <element name="a" ..../>
>>>>
>>>> Each element named 'a' would have an associated DINode.  Because there
>>>> are multiple DINodes named 'a', we need to represent them in the HashMap
>>>> using the same key 'a'.  To do this, we add all instances of 'a' to a
>>>> Seq[DINode].  In this particular case, the HashMap would look as follows:
>>>>
>>>>     "a" -> mutable.Seq(DIElement, DIArray, DIElement)
>>>>     "b" -> mutable.Seq(DIElement)
>>>>
>>>> The mutable.ArrayBuffer maintains the insertion order and would look as
>>>> follows:
>>>>
>>>>     mutable.ArrayBuffer(DIElement, DIArray, DIElement, DIElement)
>>>>
>>>> Back tracking shall be accomplished by going to the last item in the
>>>> DIComplex ArrayBuffer and removing it.
>>>>     1. If the last item is a DIElement, we simply delete it from the
>>>> ArrayBuffer and then remove it also from the HashMap.
>>>>     2. If the last item is a DIArray, we then have to look at the
>>>> contents of the DIArray.
>>>>         a. If the DIArray has more than one entry, remove the last
>>>> entry.  Then update the associated entry in the ArrayBuffer and HashMap
>>>> with the modified DIArray.
>>>
>>> I think the second part of this is actually unnecessary. Since the
>>> DIArray is mutable, we only need to remove its last child. The entry in
>>> the ArrayBuffer and HashMap will still point to the same DIArray. The
>>> only difference is the contents of the DIArray will be one smaller.
>>>
>>>>         b. If the DIArray has a single entry then we simply remove
>>>> DIArray from ArrayBuffer and the HashMap.
>>>>
>>>>     <element name="root>
>>>>          <sequence ../>
>>>>          <element name="a" .../>
>>>>          <element name="a" maxOccurs="unbounded">
>>>>            ...
>>>>            <element name="b" .../>
>>>>            <element name="c" .../>
>>>>            ...
>>>>          </element>
>>>>        </sequence>
>>>>     </element>
>>>>
>>>> If the HashMap of DIComplex-root looks like the following:
>>>>      "a" -> mutable.Seq(DIElement, DIArray)
>>>>
>>>> ArrayBuffer of DIComplex-root looks like the following:
>>>>      0 -> DIElement("a")
>>>>      1 -> DIArray("a")
>>>>
>>>> Where DIArray children ArrayBuffer looks like:
>>>>      0 -> DIElement("b")
>>>>      1 -> DIElement("c")
>>>>
>>>> Given the initial example if we backtrack(1):  // purge one item
>>>>      DIArray:
>>>>          0 -> DIElement("b")  // removed one item from DIArray
>>>>
>>>>      HashMap DIComplex-root:
>>>>          "a" -> mutable.Seq(DIElement, DIArray)
>>>>
>>>>      ArrayBuffer DIComplex-root:
>>>>          0 -> DIElement("a")
>>>>          1 -> DIArray("a")
>>>>
>>>> Given the initial example if we backtrack(2): // purge two items
>>>>      DIArray:
>>>>          Empty
>>>>
>>>>      HashMap DIComplex-root:
>>>>          "a" -> mutable.Seq(DIElement)  // Because DIArray had two items
>>>>                                         // removed and is now empty
>>>>
>>>>      ArrayBuffer DIComplex-root:
>>>>          0 -> DIElement("a")
>>>>
>>>> ===
>>>>
>>>> class DIComplex {
>>>>   def nSlots = erd.nChildSlots
>>>
>>> I think the concept of nSlots and nChildSlots should go away completely.
>>> The children ArrayBuffer should just grow as elements are added and
>>> shrink as they are backtracked. It might make sense to have a tunable
>>> the determines the initial size of the children Array, but the Scala
>>> defaults for initializing and growing a mutable array might already be
>>> reasonable.
>>>
>>>>   val children: mutable.ArrayBuffer[DINode](nSlots) // insertion order
>>>>   lazy val fastLookup: HashMap[String, mutable.Seq[DINode]]
>>>
>>> One thing that might make this change easier is to not worry about the
>>> mutable.Seq, and just make this a HashMap[String, DINode]. The Seq is
>>> really only necessary to support query style expressions, but that comes
>>> with a whole other slew of concerns (e.g. fn:count, same name but
>>> different types, etc.). There currently exist restrictions that do not
>>> allow query style expressions, so this Seq should always be a size of 1
>>> until remove those restrictions.
>>>
>>>>
>>>>   def addChild(node: DINode) {
>>>>     children += node
>>>>     fastLookup(node.name).append(node)
>>>>   }
>>>>
>>>>   def getChild(name: String): immutable.Seq[DINode] {
>>>>     fastLookup.getOrElse(name, mutable.Seq.empty) // fastLookup(name)
>>>>   }
>>>>
>>>>   def backtrack(int n) {
>>>>     for (i = 0; i < n; i++) {
>>>>       val endIdx = children.size -1
>>>>       val child = children.remove(endIdx) // children.pop()
>>>>
>>>>       fastLookup.get(child.name) match {  // fastLookup(child.name).pop()
>>>>         case None => // do nothing
>>>>     case Some(nodesSeq) if nodesSeq.size == 0 => // Shouldn't happen?
>>>>         case Some(nodesSeq) => {
>>>>       val lastEntry = nodesSeq.last
>>>>           lastEntry match {
>>>>         case arr: DIArray => {
>>>>               // If the array has >= 2 items, remove last one and update
>>>> nodesSeq
>>>>               // with this array:
>>>>               //  val lastIdx = nodesSeq.length - 1
>>>>           //  arr.remove(arr.size - 1)
>>>>               //  fastLookup.update(child.name, nodesSeq.update(lastIdx,
>>>> arr))
>>>>               //
>>>>               // If the array has 1 item, remove the array from nodesSeq
>>>>               //  fastLookup.update(child.name, nodesSeq.dropRight(1))
>>>>         }
>>>>             case _ => fastLookup.update(child.name, nodesSeq.dropRight(1))
>>>>           }
>>>>
>>>>         }
>>>>       }
>>>>     }
>>>>   }
>>>>
>>>> }
>>>>
>>>> class DIArray {
>>>>     private val initialSize =
>>>> parent.tunable.initialElementOccurrencesHint.toInt // set to 10
>>>>     protected final val _contents = new ArrayBuffer[DIElement](initialSize)
>>>>
>>>>     def append(ie: InfosetElement): Unit = {
>>>>         _contents.append(ie.asInstanceOf[DIElement])
>>>>         ie.setArray(this)
>>>>     }
>>>>
>>>>     //
>>>>     // For backtracking purposes
>>>>     //
>>>>     def remove(n: Int): DIElement = {
>>>>         _contents.remove(n)
>>>>     }
>>>> }
>>>>
>>>> ===
>>>>
>>>> To facillitate a breadth-first ordering scheme we shall have to
>>>
>>> That this is only necessary for unordered sequences, where the insertion
>>> order in the ArrayBuffer differs from the display order, correct? If
>>> that's true, we definitely want to make sure to only do the below
>>> sorting when an element contains an unordred sequence.
>>>
>>>> construct a 'global' that lives on the Element Runtime Data (ERD).  It
>>>> shall be called childElementDeclPosition and pertains to the static
>>>> order of the element children of a complex type element.  When sorting
>>>> DINodes, childElementDeclPosition can be obtained by way of reaching
>>>> from the DINode to the coresponding ERD.  Any array shall be a single
>>>> entry in the sequence that needs to be sorted.  All elements in this
>>>> array will have the same childElementDeclPosition.
>>>>
>>>> The difficulty in implementing this is that an Element's
>>>> dsom.ComplexType can have groups within the groups before finding actual
>>>> element children.  Therefore in the Document Schema Object Model (DSOM)
>>>> the model group object shall have a startingChildElementDeclPosition and
>>>> an endingChildElementDeclPosition.
>>>>
>>>> 1. The startingChildElementPosition shall be passed to the constructor
>>>> for the Term.
>>>> 2. If the term is an element, the endingChildElementDeclPosition is
>>>> equal to the startingChildElementDeclPosition.  3. If the term is a
>>>> model group, then its first child term's
>>>> startingChildElementDeclPosition is equal to its own
>>>> startingChildElementDeclPosition, the 2nd to Nth child term's
>>>> startingChildElementDeclPosition is the prior term's
>>>> endingChildElementDeclPosition + 1.
>>>> 4. The endingChildElementDeclPosition for a model-group shall be the
>>>> endingChildElmentDeclPosition of its last child term.
>>>>
>>>
>>> Looks good to me. +1. This should definitely make implementation of
>>> unordered sequences and query style expressions easier. Removal of the
>>> slot concept should also help remove concepts that prevent faster schema
>>> compilation.
>>>
>>> - Steve
>>>
>>
>>
>
>

Re: Remove concept of slots from InfosetImpl Design Discussion

Reply via email to