Re: Remove concept of slots from InfosetImpl Design Discussion

Steve Lawrence Thu, 02 Nov 2017 09:14:50 -0700

After seeing a performance degredation with this recent change, I did
some profiling with PCAP to investigate the causes. I immediately saw a
lot of HashMap related allocations and insertions. I quickly hacked
together something to only create and insert into the HashMap if it was
used in expressions to see how it affected performance.


PCAP first, in a single parse, PCAP calls addChild about 1600 times, and
about 1000 of those times added a child that was used in an expression.
So this is pretty close to a worst case, where almost 2/3's of elements
need to be inserted into the HashMap. Fixing a couple other minor things
that showed up in profiling that I doubt had much of an affect, the
performance numbers I saw were:

With Slots:
   688.387 files/sec

Without Slots (unoptimized):
   553.145 files/sec

Without Slots (optimized):
   614.983 files/sec

So unoptimized is about 20% slower than slots, and optimized is about
10% slower than slots.


Looking at ATO, it calls addChild about 2600 times, with no elements
used in an expression, so HashMap allocations/lookups should be zero in
the optimized case. Performance numbers with this are:

With Slots:
   68.271 files/sec

Without Slots (unoptimized):
   64.623 files/sec

Without Slots (optimized):
   69.662 files/sec

So in this case, without slots/optimized is actually a hair faster than
with slots. Not sure why that would be the case, maybe there were lots
of empty slots in ATO taking up space? Regardless, this seems like a
good win.

We might be able to pick up that extra 10% in PCAP by interning QNames
to speed up HashMap lookups. Though, I didn't see anything in profiling
related to equality checks in the hash look ups, but sometimes
things are hard to see.

I've created DAFFODIL-1860 to track the issue for reducing HashMap
allocations and DAFFODIL-1861 to track the QName interning.

- Steve


On 10/23/2017 01:25 PM, Mike Beckerle wrote:
> No I think 2.1.0 as we don't understand the performance implications fully 
> yet.
> 
> 
> This should fix bug....
> 
> 
> https://opensource.ncsa.illinois.edu/jira/browse/DFDL-1773
> 
> [DFDL-1773] Choice ambiguous element name results in ... 
> <https://opensource.ncsa.illinois.edu/jira/browse/DFDL-1773>
> opensource.ncsa.illinois.edu
> Daffodil; DFDL-1773; Choice ambiguous element name results in failed 
> expression
> 
> 
> --------------------------------------------------------------------------------
> *From:* Taylor Wise
> *Sent:* Monday, October 23, 2017 12:43:36 PM
> *To:* Mike Beckerle; Steve Lawrence; [email protected]
> *Subject:* Re: Remove concept of slots from InfosetImpl Design Discussion
> 
> Should this be pushed to 2.0.0?
> 
> --------------------------------------------------------------------------------
> *From:* Mike Beckerle
> *Sent:* Monday, October 23, 2017 12:36:57 PM
> *To:* Taylor Wise; Steve Lawrence; [email protected]
> *Subject:* Re: Remove concept of slots from InfosetImpl Design Discussion
> 
> Yes. The one giant ticket about the space problem compiling (DFDL-1444) is 
> not 
> specific enough.
> 
> --------------------------------------------------------------------------------
> *From:* Taylor Wise
> *Sent:* Monday, October 23, 2017 12:28:04 PM
> *To:* Steve Lawrence; [email protected]; Mike Beckerle
> *Subject:* Re: Remove concept of slots from InfosetImpl Design Discussion
> 
> I don't believe there's a ticket associated with these changes.  Should I go 
> ahead and create one?
> 
> --------------------------------------------------------------------------------
> *From:* Steve Lawrence <[email protected]>
> *Sent:* Wednesday, October 18, 2017 8:21:02 AM
> *To:* Taylor Wise; [email protected]; Mike Beckerle
> *Subject:* Re: Remove concept of slots from InfosetImpl Design Discussion
> On 10/17/2017 07:57 PM, Taylor Wise wrote:
>> Been debugging after implementing your suggested changes.
>> 
>> 
>> First thing to note, is that by moving the creation of the array out of 
>> getChildArray that I had to make additional calls to 'addChild' in the 
>> InfosetInputter.  This corrected things on the 'parser' side.  However, on 
>> the 
>> unparser side it became evident that this 'addChild' function cannot always 
>> set
>> the parent.  There's an invariant expecting parent to be null along the 
>> execution branch of the unparser side.  As a result, I modified addChild to 
>> also
>> accept a default boolean called 'setParent'.  It defaults to true unless 
>> specified.  So now when we call 'addChild' in InfosetInputter on the 
>> unparser 
>> side, we actually call 'addChild(e, false)'.
> 
> This sounds to me like addChild is getting called twice on the same
> element and so it's parent is trying to be set twice. Might need some
> more investigation. Maybe the InfosetInputter needs more tweaks to
> support not using getChildArray? Hard to say without seeing your
> InfosetInputter changes.
> 
> Regarding the below tests, I think your assessment is correct. The
> password element doesn't exist and so we should never even try to get
> the -1 index. Instead we should get something like "password element
> does not exist" or something. It would be useful to have that test to
> ensure we get that error message.
> 
> An index of -1 (or anything less than 1) can definitely be caught
> statically at schema compile time. Worth creating a bug for that. But
> there will be times where the index is calculated so we'll still need
> this runtime check. I think it's still a test worth having.
> 
> I think you're modifications are reasonable based on what the tests are
> trying to tests. Probably to proper coverage of tests we need is:
> 
> - trying to access an array that does not exist with both valid and
> invalid indicies (less than 1 and greater than array size)
> - trying to access an array thta does exist but with invalid indicies
> (less than 1 and greater than array size)
> 
> So it's probably worth keeping the existing tests for the 'array does
> not exist' tests and ensure we get a reasonable error message not
> talking about indices. And then just make your modifications new tests.
> 
>> 
>> The above correction reduced the errors from 30+ to just 2.  The only two 
>> errors
>> remaining have to do with an array index out of bounds error that's 
>> expected.  I
>> believe that it's actually an issue with the test and not the code.
>> 
>> 
>> Error:
>> 
>> edu.illinois.ncsa.daffodil.tdml.TDMLException: Did not find diagnostic 
>> message 
>> "Value -1 is out of range" in any of the actual diagnostic messages:
>> Runtime Schema Definition Error: Expression Evaluation Error: Child element 
>> {http://example.com}password does not exist.
>> Schema context: hideme Location line 267 column 12 in 
>> file:/tmp/hiddenElem2658119394211244932.dfdl.xsd
>> Data location was preceding byte 0 limit(bytes) 8
>> 
>> Well, password doesn't exist because we are missing the 'pw:' initiator in 
>> the 
>> tdml:document content.  The code now doesn't know to create it because we 
>> haven't seen the initiator.  At least that's what I surmise.
>> 
>> 
>> Existing test:
>> 
>>    <!--
>>       Test Name: arrayIndexOutOfBounds_02
>>          Schema: hiddenElem
>>            Root: e1b
>>         Purpose: This test demonstrates that a proper error occurs if an 
>> array 
>> index in an expression is out of bounds
>>    -->
>> 
>>    <tdml:parserTestCase name="arrayIndexOutOfBounds_02" root="e1b"
>>      model="hiddenElem" description="Section 23 - DFDL Expressions">
>> 
>>      <tdml:document>p455w0rd</tdml:document>
>>        <tdml:errors>
>>          <tdml:error>Schema Definition Error</tdml:error>
>>          <tdml:error>expression evaluation error</tdml:error>
>>          <tdml:error>Value -1 is out of range</tdml:error>
>>          <tdml:error>with length 0</tdml:error>
>>        </tdml:errors>
>>    </tdml:parserTestCase>
>> 
>> Schema:
>>      <xs:group name="hiddenData">
>>        <xs:sequence dfdl:separator="|">
>>          <xs:element name="password" type="xs:string" dfdl:initiator="pw:" 
>> minOccurs="0" maxOccurs="1" dfdl:lengthKind="delimited"/>
>>        </xs:sequence>
>>      </xs:group>
>> 
>>      <xs:element name="e1">
>>        <xs:complexType>
>>          <xs:sequence>
>>            <xs:sequence dfdl:hiddenGroupRef="ex:hiddenData"/>
>>            <xs:element name="hideme" type="xs:string" dfdl:inputValueCalc="{ 
>> fn:concat(fn:substring(/ex:e1/ex:password[1], 1, 1), '***') }"/>
>>          </xs:sequence>
>>        </xs:complexType>
>>      </xs:element>
>>      <xs:element name="e1b">
>>        <xs:complexType>
>>          <xs:sequence>
>>            <xs:sequence dfdl:hiddenGroupRef="ex:hiddenData"/>
>>            <xs:element name="hideme" type="xs:string" dfdl:inputValueCalc="{ 
>> fn:concat(fn:substring(/ex:e1b/ex:password[-1], 1, 1), '***') }"/>
>>          </xs:sequence>
>>        </xs:complexType>
>>      </xs:element>
>> 
>> 
>> My modification:
>> 
>>    <!--
>>       Test Name: arrayIndexOutOfBounds_02
>>          Schema: hiddenElem
>>            Root: e1b
>>         Purpose: This test demonstrates that a proper error occurs if an 
>> array 
>> index in an expression is out of bounds
>>    -->
>> 
>>    <tdml:parserTestCase name="arrayIndexOutOfBounds_02" root="e1b"
>>      model="hiddenElem" description="Section 23 - DFDL Expressions">
>> 
>>      <tdml:document>pw:p455w0rd</tdml:document>
>>        <tdml:errors>
>>          <tdml:error>Schema Definition Error</tdml:error>
>>          <tdml:error>expression evaluation error</tdml:error>
>>          <tdml:error>Value -1 is out of range</tdml:error>
>>          <tdml:error>with length 1</tdml:error>
>>        </tdml:errors>
>>    </tdml:parserTestCase>
>> 
>> But honestly, should we really need to do this?  Shouldn't the code know 
>> that -1
>> is not a valid index without having to force construction of the password 
>> array?
>> 
>> I'm similarly going to have to modify this other test as well by 
>> constructing a
>> different root that attempts to access a character in a position greater 
>> than 
>> the length of the array, unless there's something I'm missing:
>> 
>>    <!--
>>       Test Name: arrayIndexOutOfBounds_01
>>          Schema: hiddenElem
>>            Root: e1
>>         Purpose: This test demonstrates that a proper error occurs if an 
>> array 
>> index in an expression is out of bounds
>>    -->
>> 
>>    <tdml:parserTestCase name="arrayIndexOutOfBounds_01" root="e1"
>>      model="hiddenElem" description="Section 23 - DFDL Expressions">
>> 
>>      <tdml:document>p455w0rd</tdml:document>
>>        <tdml:errors>
>>          <tdml:error>Schema Definition Error</tdml:error>
>>          <tdml:error>expression evaluation error</tdml:error>
>>          <tdml:error>Value 1 is out of range</tdml:error>
>>          <tdml:error>with length 0</tdml:error>
>>        </tdml:errors>
>>    </tdml:parserTestCase>
>> 
>> Proposed addition to defineSchema:
>> 
>>      <xs:element name="e1c">
>>        <xs:complexType>
>>          <xs:sequence>
>>            <xs:sequence dfdl:hiddenGroupRef="ex:hiddenData"/>
>>            <xs:element name="hideme" type="xs:string" dfdl:inputValueCalc="{ 
>> fn:concat(fn:substring(/ex:e1c/ex:password[10], 1, 1), '***') }"/>
>>          </xs:sequence>
>>        </xs:complexType>
>>      </xs:element>
>> 
>> Proposed modification to test:
>> 
>>    <!--
>>       Test Name: arrayIndexOutOfBounds_01
>>          Schema: hiddenElem
>>            Root: e1
>>         Purpose: This test demonstrates that a proper error occurs if an 
>> array 
>> index in an expression is out of bounds
>>    -->
>> 
>>    <tdml:parserTestCase name="arrayIndexOutOfBounds_01" root="e1c"
>>      model="hiddenElem" description="Section 23 - DFDL Expressions">
>> 
>>      <tdml:document>pw:p455w0rd</tdml:document>
>>        <tdml:errors>
>>          <tdml:error>Schema Definition Error</tdml:error>
>>          <tdml:error>expression evaluation error</tdml:error>
>>          <tdml:error>Value 10 is out of range</tdml:error>
>>          <tdml:error>with length 1</tdml:error>
>>        </tdml:errors>
>>    </tdml:parserTestCase>
>> 
>> 
>> 
>> 
>> --------------------------------------------------------------------------------
>> *From:* Steve Lawrence <[email protected]>
>> *Sent:* Tuesday, October 17, 2017 11:04:03 AM
>> *To:* Taylor Wise; [email protected]; Mike Beckerle
>> *Subject:* Re: Remove concept of slots from InfosetImpl Design Discussion
>> I don't think a DIArray could be created earlier and things appended to
>> it. But, currently, I think an empty DIArray could be created earlier.
>> 
>> And I think this is really just a problem with getChildArray always
>> creating DIArrays. What I suspect happened is that we had an array
>> following by something that accessed that array, e.g.:
>> 
>> <xs:sequence>
>>    <xs:element name="arr" maxOccurs="unbounded" ... />
>>    <xs:element name="count" dfdl:inputValueCalc="{ fn:count(../arr) }" />
>> </xs:sequence>
>> 
>> arr was slot 1, count was slot 2.
>> 
>> The setChildArray function used to just do something like
>> 
>>    _lastSlotAdded = slot
>> 
>> Which worked in the case of adding a new array--we add a new array, so
>> the slot it was added in was the last slot added.
>> 
>> But when fn:count was called, _lastSlotAdded was 2 (from the count
>> element). fn:count ended up calling getChildArray, which created a
>> DIArray and called setChildArray, which overwrote _lastSlotAdded to the
>> slot of the array, i.e. 1. So even though the array was finished and
>> couldn't be modified, the use of getChildArray in DPath ended up setting
>> _lastSlotAdded to the wrong value. Using max was a way to fix this so we
>> couldn't set it to an earlier value.
>> 
>> So, now, because getChildArray create elements, it sounds like things
>> could be created out of order. Rather than inserting an empty DIArray
>> into a slot, it will added it to the end and potentially mess things up.
>> This says to me we definitely want to modify getChildArray to not create
>> DIArrays. It should either return the array if it exists, or throw an
>> Infoset exception, very similar to getChild. Then we don't have to worry
>> about DIArray's being created at the wrong time. The only thing that
>> should create a DIArray is addChild.
>> 
>> 
>> On 10/17/2017 10:13 AM, Taylor Wise wrote:
>>> Hmm, very interesting.  This would greatly simplify things.  But what about 
>>> this
>>> code snippet from before my changes:
>>> 
>>> 
>>>    //  final def setChildArray(slot: Int, arr: DIArray) {
>>>    //    Assert.invariant(_slots(slot) eq null)
>>>    //    _slots(slot) = arr
>>>    //    //
>>>    //    // Because arrays are created if not existing, an expression like 
>>> ../A[1]
>>>    //    // can cause array A to come into existence (though empty) and A 
>>> might be
>>>    //    // part of an expression that is reaching backward into earlier 
>>> data (when
>>>    //    // parsing) so that we might be creating an array earlier in the 
>>> slots
>>>    //    // of this element. Hence, the slot being passed here is not 
>>> necessarily
>>>    //    // the last slot added. It might be before it.
>>>    //    //
>>>    //    _lastSlotAdded = math.max(slot, _lastSlotAdded)
>>>    //  }
>>> Given your statement:
>>> 
>>> 
>>> "the issue is determining if the DIArray exists or not. Everything
>>> after that is the same. And since children are inserted in the order
>>> that we see them, a DIArray will always be at the end of the children
>>> ArrayBuffer if it exists. If the DINode at the end of the children
>>> ArrayBuffer is not this array, it needs to be created."
>>> 
>>> 
>>> To me it seems to imply that a DIArray can come into existence without any 
>>> contents.  But then be appended to at a later time.  In which case it's 
>>> possible
>>> that this DIArray then is not the last item appended.  It could be 
>>> somewhere 
>>> earlier and then need to be added to.
>>> 
>>> --------------------------------------------------------------------------------
>>> *From:* Steve Lawrence <[email protected]>
>>> *Sent:* Tuesday, October 17, 2017 7:20:42 AM
>>> *To:* [email protected]; Mike Beckerle; Taylor Wise
>>> *Subject:* Re: Remove concept of slots from InfosetImpl Design Discussion
>>> Actually, I think we're over complicating the array situation. I think
>>> we can avoid needing a unique ID and I think we can also avoid needing
>>> to always insert arrays into a HashMap.
>>> 
>>> When addChild is called to add an DIElement to an array, there are two
>>> cases we need to think about:
>>> 
>>> 1) The DIArray does not exist and it must be created. The
>>> DIElement is appended to its contents.
>>> 
>>> 2) The DIArray does exist and we just need to find it. The
>>> DIElement is appended to its contents.
>>> 
>>> So the issue is determining if the DIArray exists or not. Everything
>>> after that is the same. And since children are inserted in the order
>>> that we see them, a DIArray will always be at the end of the children
>>> ArrayBuffer if it exists. If the DINode at the end of the children
>>> ArrayBuffer is not this array, it needs to be created. So addChild
>>> just becomes something like this:
>>> 
>>>    def addChild(e: DIElement) {
>>>      if (e.erd.isArray) {
>>>        if (children.isEmpty || children.last.erd != e.erd) {
>>>          // no children, or last child is not a DIArray for
>>>          // this element, create the DIArray
>>>          children.append(new DIArray(...))
>>>        }
>>>        // the array is now always last, add the new child to it
>>>        children.last.append(e)
>>>      } else {
>>>        children.append(e)
>>>      }
>>>    }
>>> 
>>> So we no longer need a unique ID or a tuple for the HashMap key. The
>>> fast HashMap lookup only needs to work on the name of the element, which
>>> allows query-style expressions to work in the future. And we now only
>>> need to insert DIArray's into the fast HashMap if they are used in
>>> expressions addChild no longer needs getChildArray, minimizing hash
>>> look ups.
>>> 
>>> Related, I'm wondering if getChildArray should change as well so that it
>>> does not create a DIArray if it doesn't exist? I think it seems
>>> reasonable that only addChild should create new DIArrays. And if
>>> something like DPath calls getChildArray, we probably don't want to end
>>> up creating an empty DIArray if it doesn't exist. It should probably
>>> just throw an InfosetNoSuchChildElementException if the array doesn't
>>> exist, just like getChild.
>>> 
>>> - Steve
>>> 
>>> On 10/16/2017 12:02 PM, Steve Lawrence wrote:
>>>> The ID might cause some issues when we want to support query-style
>>>> expressions. For example, say we have a schema like this:
>>>> 
>>>> <sequence>
>>>>   <xs:element name="array" maxOccurs="unbounded" ... />
>>>>   <xs:element name="element" ... />
>>>>   <xs:element name="array" maxOccurs="unbounded" ... />
>>>> </xs:sequence>
>>>> 
>>>> So two arrays with the same name separated by a single element. Each
>>>> element would have a unique ID, so would be inserted in the hash table
>>>> as something like
>>>> 
>>>> ("array", 1) -> Seq(DIArray)
>>>> ("element", 2) -> Seq(DISimple)
>>>> ("array", 3) -> Seq(DIArray)
>>>> 
>>>> A query style expression that accesses the "array" elements (e.g.
>>>> fn:count(./array)) would have trouble finding both arrays to get the
>>>> full count since the keys for the two arrays are the same.
>>>> 
>>>> Perhaps one option is to keep the name -> DINode map type, so we'd end
>>>> up with something like this:
>>>> 
>>>> "array" -> Seq(DIArray, DIArray)
>>>> "element" -> Seq(DISimple)
>>>> 
>>>> And then just iterator of the Seq until we find DIArray with the right
>>>> ID? That's could have a performance hit if there were lots of things
>>>> with the same name, but I suspect that won't be super common.
>>>> 
>>>> Alternatively, we could relook at chaning the InfosetImpl API so that
>>>> when an DIArray is created, it's store in the PState for quick access,
>>>> and getChildArray would essentially go away.
>>>> 
>>>> - Steve
>>>> 
>>>> 
>>>> On 10/13/2017 02:14 PM, Mike Beckerle wrote:
>>>>> I think the key has to change to include name and namespace of element, 
>>>>> and also an additional integer to disambiguate the first a element from 
>>>>> the 2nd, from the Nth. This integer need not be 1, 2, 3. It can be any 
>>>>> value at all, so long as they are distinct.
>>>>>
>>>>>
>>>>> Let's make sure another case works. E.g.,  ordered sequence like the one 
>>>>> you have here, another element to separate it from  an unordered sequence 
>>>>> that also contains elements named a.
>>>>>
>>>>>
>>>>> However, an unordered sequence can't have more than 1 element named a, so 
>>>>> if every element a declaration gets an additional integer, the one in the 
>>>>> unordered sequence will also get one, and that will behave properly.
>>>>>
>>>>>
>>>>> Is this integer the same breadth-first ordering thing we discussed 
>>>>> elsewhere? I expect it is.
>>>>>
>>>>> ________________________________
>>>>> From: Taylor Wise
>>>>> Sent: Friday, October 13, 2017 1:55:22 PM
>>>>> To: Steve Lawrence; [email protected]; Mike Beckerle
>>>>> Subject: Re: Remove concept of slots from InfosetImpl Design Discussion
>>>>>
>>>>>
>>>>> I'm down to a single error: test_dfdlPosition3
>>>>>
>>>>>
>>>>> java.lang.Exception:
>>>>> Comparison failed.
>>>>> Expected
>>>>>           
>>>>> <f><cnt1>2</cnt1><a><v>1</v></a><a><v>2</v></a><cnt2>2</cnt2><a><v>1</v></a><a><v>2</v></a></f>
>>>>> Actual
>>>>>           
>>>>> <f><cnt1>2</cnt1><a><v>1</v></a><a><v>2</v></a><a><v>1</v></a><a><v>2</v></a><cnt2>2</cnt2></f>
>>>>> Differences were (path, expected, actual):
>>>>>  (f,<cnt2>2</cnt2>,<a><v>1</v></a>)
>>>>> at 
>>>>> org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:192)
>>>>>
>>>>>     <xs:element name="f">
>>>>>       <xs:complexType>
>>>>>         <xs:sequence>
>>>>>           <xs:element name="cnt1" type="xs:int"
>>>>>             dfdl:lengthKind="delimited" dfdl:terminator=";" />
>>>>>           <xs:element name="a" minOccurs="0" maxOccurs="unbounded"
>>>>>             dfdl:occursCountKind="expression" dfdl:occursCount="{ 
>>>>> xs:int(../ex:cnt1) }">
>>>>>             <xs:complexType>
>>>>>               <xs:sequence>
>>>>>                 <xs:element name="v" type="xs:int"
>>>>>                   dfdl:inputValueCalc="{ dfdl:occursIndex() }" />
>>>>>               </xs:sequence>
>>>>>             </xs:complexType>
>>>>>           </xs:element>
>>>>>           <xs:element name="cnt2" type="xs:int"
>>>>>             dfdl:lengthKind="delimited" />
>>>>>           <xs:element name="a" minOccurs="0" maxOccurs="unbounded"
>>>>>             dfdl:occursCountKind="expression" dfdl:occursCount="{ 
>>>>> xs:int(../ex:cnt2) }">
>>>>>             <xs:complexType>
>>>>>               <xs:sequence>
>>>>>                 <xs:element name="v" type="xs:int"
>>>>>                   dfdl:inputValueCalc="{ dfdl:occursIndex() }" />
>>>>>               </xs:sequence>
>>>>>             </xs:complexType>
>>>>>           </xs:element>
>>>>>         </xs:sequence>
>>>>>       </xs:complexType>
>>>>>     </xs:element>
>>>>>
>>>>>
>>>>>
>>>>> I believe what's happening, from stepping through, is that the 
>>>>> getChildArray code is confused as to where to add the second set of <a/> 
>>>>> elements.  Below are the values of the structures at the point where 
>>>>> <cnt2/> has been added:
>>>>>
>>>>> ChildNodes aka Insertion Order
>>>>> ====
>>>>> ArrayBuffer(DISimple(name='cnt1' 2 valueLen(One(0)unk, One(8)unk)), 
>>>>> DIArray({http://example.com}a,ArrayBuffer(DIComplex(name='a'), 
>>>>> DIComplex(name='a'))), DISimple(name='cnt2' 2 valueLen(One(0)unk, 
>>>>> One(8)unk)))
>>>>>
>>>>> FastLookup
>>>>> ====
>>>>> Map(cnt2 -> ArrayBuffer(DISimple(name='cnt2' 2 valueLen(One(0)unk, 
>>>>> One(8)unk))), cnt1 -> ArrayBuffer(DISimple(name='cnt1' 2 
>>>>> valueLen(One(0)unk, One(8)unk))), a -> 
>>>>> ArrayBuffer(DIArray({http://example.com}a,ArrayBuffer(DIComplex(name='a'),
>>>>>  DIComplex(name='a')))))
>>>>>
>>>>> So when we go to add the next array 'a', it sees that 'a' is already in 
>>>>> the fastLookup and appends it there.  This adds it to the existing 
>>>>> DIArray in the insertion order list.  Really, it should create a new 
>>>>> DIArray inserting it after <cnt2/> in the insertion  order list.  And 
>>>>> inserting it after the DIArray in the fastLookup Hash: a ->
>>> Seq(DIArray(a, ...), DIArray(a, ...))
>>>>>
>>>>> How do we fix this?
>>>>>
>>>>> Because this is an 'ordered' sequence, can we look at the 
>>>>> slotIndexInParent?  Isn't this the position in the sequence definition as 
>>>>> defined in the schema above?  This should differentiate between the two 
>>>>> 'a' arrays.  And then for 'unordered' sequences, this  shouldn't matter.  
>>>>> We shouldn't run into two or more members with the same
>>> name and namespace.  Correct?
>>>>>
>>>>> This also means that the 'key' will have to change to include the 
>>>>> namespace in addition to the element name.
>>>>>
>>>>>
>>>>>
>>>>> ________________________________
>>>>> From: Steve Lawrence <[email protected]>
>>>>> Sent: Friday, October 13, 2017 8:45:06 AM
>>>>> To: [email protected]; Taylor Wise; Mike Beckerle
>>>>> Subject: Re: Remove concept of slots from InfosetImpl Design Discussion
>>>>>
>>>>> On 10/12/2017 08:19 PM, Taylor Wise wrote:
>>>>>> Been debugging using the TestSimpleTypes tests.  I'm down to just 1 
>>>>>> failure.  Expected results are slightly off.  I've got an additional 
>>>>>> <text/> in the result.  So, getting closer.  Stopping for the night.
>>>>>>
>>>>>>
>>>>>> I'm not so sure that we need the _lastSlotAdded variable anymore.  I 
>>>>>> changed it to be just a def that reflects the current length of 
>>>>>> childNodes (ArrayBuffer).
>>>>>> z
>>>>>>
>>>>>> I'm betting my current issue is within the restore functions I modified, 
>>>>>> but I'll continue looking at that in the morning.  I would think that 
>>>>>> 'restoring' would simply be copying the previous state... unless I'm 
>>>>>> misunderstanding something.
>>>>>>
>>>>>>
>>>>>>   abstract override def restoreInto(e: DIElement) {
>>>>>>     val c = e.asInstanceOf[DIComplex]
>>>>>>     c.restoreFrom(this)
>>>>>>     //c._lastSlotAdded = this._lastSlotAdded
>>>>>>     super.restoreInto(e)
>>>>>>   }
>>>>>>
>>>>>> final def restoreFrom(cs: DIComplexSharedImplMixin){
>>>>>>   // Shouldn't I be able to just say something like...
>>>>>>  //
>>>>>>  childNodes.clear()
>>>>>>  childNodes.append(cs.childNodes)
>>>>>>  fastLookup.clear()
>>>>>>  cs.fastLookup.toSeq.foreach( (name, seqDINodes) =>
>>>>>>        fastLookup.put(name, seqDINodes) )
>>>>>> }
>>>>>
>>>>> This is going to be somewhat inefficient. This essentially deletes and
>>>>> recreates the entire childNodes array and fastLookup hashmap everytime
>>>>> we backtrack. This could be pretty back if we childNodes had 1000
>>>>> elements and we just wanted to backtrack 1. Instead we just want to
>>>>> remove the elements that are being backtracked.
>>>>>
>>>>> So I think you still need a concept of _lastSlotAdded. But rather than
>>>>> the lastSlotAdded, it's just the length of the children (maybe call it
>>>>> _numChildren?). I would expect restoreFrom for DIComplex to look
>>>>> something like this (very similar to the existing restoreFrom):
>>>>>
>>>>> final def restoreFrom(cs: DIComplexSharedImplMixin) {
>>>>>   var i = childNodes.length - 1
>>>>>
>>>>>   // for each child we will remove, remove it from the fastLookup map.
>>>>>   // this should always be the last element in the hashmap seq
>>>>>   while (i >= cs._numChildren) {
>>>>>     var childToRemove = childNodes(i)
>>>>>     var fastSeq = fastLookup(childToRemove.name)
>>>>>     if (fastSeq.length == 1)
>>>>>       // last one, remove the whole key entry
>>>>>       fastLookup.remove(childToRemove.name)
>>>>>     else
>>>>>       // not the last one, just drop the end
>>>>>       fastSeq.dropRight(1)
>>>>>     i -= 1
>>>>>   }
>>>>>
>>>>>   // now just quickly remove all those children
>>>>>   childNodes.reduceToSize(cs._numChildren)
>>>>>   _numChildren = cs._numChildren
>>>>>
>>>>>   if (cs._arraySize.isDefined) {
>>>>>     childNodes(_numChildren - 1).reduceToSize(cs._arraySize.get)
>>>>>   }
>>>>> }
>>>>>
>>>>> The logic here is very similar to the CURRENT code, just is a little
>>>>> more simple since it can make certain assumptions. Like removing
>>>>> something form the fastSeq just means removing the last one
>>>>> (dropRight(1)), and the existing _arraySize handles the special case for
>>>>> DIArrays to simplify that logic.
>>>>>
>>>>> Note that we might want to rethink about the type fastLookup value.
>>>>> Rather than the value being a Seq, it probably wants to be some sort of
>>>>> mutable Seq that has constant time append dropRight. Maybe another
>>>>> ArrayBuf is the right data structure here? That would allow something like
>>>>>
>>>>>   fastLookup.reduceToSize(fastSeq.length - 1)
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> // ORIGINAL
>>>>>>   //    final def restoreFrom(cs: DIComplexSharedImplMixin) {
>>>>>>   //    Assert.invariant(_arraySize == MaybeInt.Nope)
>>>>>>   //
>>>>>>   //    var i = _lastSlotAdded
>>>>>>   //    while (i > cs._lastSlotAdded) {
>>>>>>   //      _slots(i) = null
>>>>>>   //      i -= 1
>>>>>>   //    }
>>>>>>   //    _lastSlotAdded = cs._lastSlotAdded
>>>>>>   //
>>>>>>   //    // check invariant. All slots after last slot added
>>>>>>   //    // should be null.
>>>>>>   //    {
>>>>>>   //      var i = _lastSlotAdded + 1
>>>>>>   //      while (i < _slots.length) {
>>>>>>   //        Assert.invariant(_slots(i) eq null)
>>>>>>   //        i = i + 1
>>>>>>   //      }
>>>>>>   //    }
>>>>>>   //
>>>>>>   //    if (cs._arraySize.isDefined) {
>>>>>>   //      
>>>>>> _slots(_lastSlotAdded).asInstanceOf[DIArray].reduceToSize(cs._arraySize.get)
>>>>>>   //    }
>>>>>>   //  }
>>>>>>
>>>>>> // CURRENT, lot of logic here.  Wondering if it's even necessary, thus 
>>>>>> the question above
>>>>>> //
>>>>>>   final def restoreFrom(cs: DIComplexSharedImplMixin) {
>>>>>>     Assert.invariant(_arraySize == MaybeInt.Nope)
>>>>>>
>>>>>>     var i = _lastSlotAdded
>>>>>>     while (i > cs._lastSlotAdded) {
>>>>>>       val lastNode = childNodes.last
>>>>>>       val name = lastNode.namedQName.toQNameString
>>>>>>
>>>>>>       lastNode match {
>>>>>>         case arr: DIArray if (arr.length > 1) => {
>>>>>>           // No need to delete the array, just the last DINode in it
>>>>>>           arr.pop()
>>>>>>         }
>>>>>>         case arr: DIArray if (arr.length == 1) => {
>>>>>>           // Last DINode in array, remove the array entirely
>>>>>>           //
>>>>>>           arr.pop()
>>>>>>           //
>>>>>>           // Update insertion order list
>>>>>>           //
>>>>>>           childNodes.remove(childNodes.length - 1)
>>>>>>           //
>>>>>>           // Update Fast Lookup
>>>>>>           //
>>>>>>           val seq = fastLookup(name)
>>>>>>           if (seq.length <= 1) {
>>>>>>             // Last DINode with THIS name, remove it entirely.
>>>>>>             //
>>>>>>             fastLookup.remove(name)
>>>>>>           } else {
>>>>>>             // Not the last DINode with THIS name, just remove this 
>>>>>> particular one.
>>>>>>             //
>>>>>>             val idx = seq.lastIndexWhere(node => 
>>>>>> node.isInstanceOf[DIArray])
>>>>>>             val result = seq diff Seq(lastNode)
>>>>>>             fastLookup.update(name, result)
>>>>>>           }
>>>>>>         }
>>>>>>         case _ => {
>>>>>>           //
>>>>>>           // Update insertion order list
>>>>>>           //
>>>>>>           childNodes.remove(childNodes.length - 1)
>>>>>>           //
>>>>>>           // Update Fast Lookup
>>>>>>           //
>>>>>>           val seq = fastLookup(name)
>>>>>>           if (seq.length <= 1) {
>>>>>>             // Last DINode with THIS name, remove it entirely
>>>>>>             //
>>>>>>             fastLookup.remove(name)
>>>>>>           } else {
>>>>>>             // Not the last DINode with THIS name.  Just remove this one!
>>>>>>             //
>>>>>>             val idx = seq.lastIndexWhere(node => 
>>>>>> !node.isInstanceOf[DIArray])
>>>>>>             val result = seq diff Seq(lastNode)
>>>>>>             fastLookup.update(name, result)
>>>>>>           }
>>>>>>         }
>>>>>>       }
>>>>>>     }
>>>>>>
>>>>>>   }
>>>>>>
>>>>>> The last failure in SimpleTypes, I haven't run the rest of the suite yet 
>>>>>> though.  Baby steps.
>>>>>> java.lang.Exception:
>>>>>> Comparison failed.
>>>>>> Expected
>>>>>>           
>>>>>> <hb_root01><text>57652063616E277420626520636F6E73756D6564206279206F757220706574747920646966666572656E63657320616E79206D6F7265</text><text>2057652077696C6C20626520756E6974656420696E206F757220636F6D6D6F6E20696E746572657374</text><text>205065726861707320697427732066617465207468617420746F6461792069732074686520347468206F66204A756C79</text><text>20616E6420796F752077696C6C206F6E636520616761696E206265206669676874696E6720666F72206F75722066726565646F6D</text></hb_root01>
>>>>>> Actual
>>>>>>           
>>>>>> <hb_root01><text>57652063616E277420626520636F6E73756D6564206279206F757220706574747920646966666572656E63657320616E79206D6F7265</text><text>2057652077696C6C20626520756E6974656420696E206F757220636F6D6D6F6E20696E746572657374</text><text>205065726861707320697427732066617465207468617420746F6461792069732074686520347468206F66204A756C79</text><text>20616E6420796F752077696C6C206F6E636520616761696E206265206669676874696E6720666F72206F75722066726565646F6D</text><text/></hb_root01>
>>>>>> Differences were (path, expected, actual):
>>>>>>  (hb_root01,List(),List(<text/>))
>>>>>> at 
>>>>>> edu.illinois.ncsa.daffodil.xml.XMLUtils$.compareAndReport(XMLUtils.scala:678)
>>>>>> at 
>>>>>> edu.illinois.ncsa.daffodil.tdml.VerifyTestCase$.verifyParserTestData(TDMLRunner.scala:1074)
>>>>>> at 
>>>>>> edu.illinois.ncsa.daffodil.tdml.ParserTestCase.runParseExpectSuccess(TDMLRunner.scala:776)
>>>>>> at 
>>>>>> edu.illinois.ncsa.daffodil.tdml.ParserTestCase$$anonfun$runProcessor$3.apply(TDMLRunner.scala:675)
>>>>>> at 
>>>>>> edu.illinois.ncsa.daffodil.tdml.ParserTestCase$$anonfun$runProcessor$3.apply(TDMLRunner.scala:673)
>>>>>> at scala.util.Either$RightProjection.foreach(Either.scala:468)
>>>>>> at 
>>>>>> edu.illinois.ncsa.daffodil.tdml.ParserTestCase.runProcessor(TDMLRunner.scala:673)
>>>>>> at edu.illinois.ncsa.daffodil.tdml.TestCase.run(TDMLRunner.scala:623)
>>>>>> at 
>>>>>> edu.illinois.ncsa.daffodil.tdml.DFDLTestSuite.runOneTestWithDataVolumes(TDMLRunner.scala:340)
>>>>>> at 
>>>>>> edu.illinois.ncsa.daffodil.tdml.DFDLTestSuite.runOneTest(TDMLRunner.scala:328)
>>>>>> at 
>>>>>> edu.illinois.ncsa.daffodil.tdml.Runner.runOneTest(RunnerFactory.scala:122)
>>>>>> at 
>>>>>> edu.illinois.ncsa.daffodil.section05.simple_types.TestSimpleTypes.test_hexBinary_Delimited_01(TestSimpleTypes.scala:63)
>>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>> at 
>>>>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>>>>> at 
>>>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>> at java.lang.reflect.Method.invoke(Method.java:498)
>>>>>> at 
>>>>>> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>>>>>> at 
>>>>>> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>>>>>> at 
>>>>>> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>>>>>> at 
>>>>>> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>>>>>> at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>>>>>> at 
>>>>>> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>>>>>> at 
>>>>>> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>>>>>> at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>>>>>> at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>>>>>> at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>>>>>> at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>>>>>> at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>>>>>> at 
>>>>>> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>>>>>> at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>>>>>> at 
>>>>>> org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:86)
>>>>>> at 
>>>>>> org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
>>>>>> at 
>>>>>> org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:459)
>>>>>> at 
>>>>>> org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:678)
>>>>>> at 
>>>>>> org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:382)
>>>>>> at 
>>>>>> org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:192)
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> ________________________________
>>>>>> From: Steve Lawrence <[email protected]>
>>>>>> Sent: Wednesday, October 11, 2017 4:07:51 PM
>>>>>> To: [email protected]; Mike Beckerle; Taylor Wise
>>>>>> Subject: Re: Remove concept of slots from InfosetImpl Design Discussion
>>>>>>
>>>>>> On 10/11/2017 01:01 PM, Mike Beckerle wrote:
>>>>>>> You have to hash lookup the name to append to an array if you're just 
>>>>>>> appending one thing at a time starting from the slot/hashID. The 
>>>>>>> current Infoset API hides the construction of the DIArray, and only 
>>>>>>> provides access via the slot/hashID, so there's no  way to avoid the 
>>>>>>> hash per each append.
>>>>>>
>>>>>> The hashmap and hashlookups should really only be needed for elements
>>>>>> that are used in expressions. We came up with that optimization so that
>>>>>> we could remove the vast majority of hash lookups. Due to how
>>>>>> getChildArray works, that optimization can't occur. I'm not sure how
>>>>>> much this will affect performance, but there are ways to avoid the need
>>>>>> for a hash lookup for arrays not used in expressions. The removal of
>>>>>> DIArray was one. Maybe API/parser changes is better.
>>>>>>
>>>>>>> But that's fixable if we want. Probably the API to infoset needs to 
>>>>>>> allow appending to an array more efficently, so that a loop like a 
>>>>>>> repeat combinator can be directly holding onto the DIArray's 
>>>>>>> arrayBuffer appending objects as the parser returns them.  But I would 
>>>>>>> not make this change as yet. First things first. Don't optimize
>>> without profiling info, etc.
>>>>>>>
>>>>>>>
>>>>>>> I'm very reluctant to remove DIArray because XML's data model isn't the 
>>>>>>> only game in town. The XML decision to not have arrays, just adjacent 
>>>>>>> things that happen to have the same name, has all sorts of negative 
>>>>>>> performance implications.
>>>>>>
>>>>>> The removal of a DIArray doesn't necessarily mean we don't have a
>>>>>> concept of an arrays. Every element added to Complex child arrayBuf
>>>>>> still knows if it part of an array via ERD.isArray. And comparing the
>>>>>> current DIElements ERD to the previous and next is enough to determine
>>>>>> where array starts and ends. From there, we can support any model that
>>>>>> requires a concept of an array. I'm not saying we should remove the
>>>>>> concept of arrays, just that our InfosetImpl doesn't necessarily need to
>>>>>> have a special DIArray. I'll agree that it does make the logic easier to
>>>>>> get it to a different model, but other things are more complicated.
>>>>>>
>>>>>>> Consider this: we could create DIArrays of elements of simple type that 
>>>>>>> don't even create a DISimple node for each array element. The values 
>>>>>>> could be stored in the DIArray's arrayBuffer directly, eliminating all 
>>>>>>> that overhead of a DISimple node. (This  was the original intent for 
>>>>>>> DIArray, but was punted to get it working sooner.)
>>> The opportunity to do this is entirely lost if you don't have DIArray, but 
>>> just
>>> a bunch of adjacent DIElements that happen to share ERD.
>>>>>>
>>>>>> Fair point. Seems a good argument for keeping DIArray, but if we don't
>>>>>> get around to doing this optimization, we're taking multiple hits. We're
>>>>>> still allocating many DISimples plus a DIArray, plus the hit from the
>>>>>> hashmap.
>>>>>>
>>>>>>> More speculatively, when we someday implement incremental array 
>>>>>>> trimming so that streaming parse/unparse can drop most array elements 
>>>>>>> once they've been delivered to consumer, well then we will definitely 
>>>>>>> want a DIArray object to abstract the complexity  of this trim behavior 
>>>>>>> and the underlying representation it requires.
>>>>>>
>>>>>> It might simplify this, but I'm not entirely convinced. I think the
>>>>>> trimming of unused Infoset elements is going to have some complex
>>>>>> heuristics regardless. Adding an extra heuristic for arrays won't
>>>>>> necessarily make this easier. But this is just speculation. And again,
>>>>>> we would still know elements are part of an array, it just takes a
>>>>>> couple extra comparisons to determine start/end. But I do agree that
>>>>>> removal of an an entire array is much easier with a DIArray concept.
>>>>>>
>>>>>>
>>>>>>
>>>>>> It seems to me these are good enough arguments for keeping DIArray
>>>>>> around, but it might be worth thinking a little bit about what changes
>>>>>> would be necessary to the API/parser to not need this hash lookup. If
>>>>>> it's a minor change, it could reduce a lot of hash lookups, especially
>>>>>> for large arrays, which are not uncommon.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> ________________________________
>>>>>>> From: Steve Lawrence <[email protected]>
>>>>>>> Sent: Wednesday, October 11, 2017 12:21 PM
>>>>>>> To: Taylor Wise; [email protected]
>>>>>>> Subject: Re: Remove concept of slots from InfosetImpl Design Discussion
>>>>>>>
>>>>>>> On 10/11/2017 12:11 PM, Taylor Wise wrote:
>>>>>>>>
>>>>>>>> What you'll notice is that the current implementation will construct 
>>>>>>>> the DIArray
>>>>>>>> if it doesn't already exist and append to it if it does already exist. 
>>>>>>>>  But it
>>>>>>>> needs the 'slot' number in order to check this.  The slot number is 
>>>>>>>> obtained
>>>>>>>> from DPathCompileInfo.slotIndexInParent. If we're removing this 
>>>>>>>> concept of slots
>>>>>>>> entirely, how can we accomplish this?  Will we have to do a named 
>>>>>>>> lookup then
>>>>>>>> every time we want to append to a DIArray?
>>>>>>> I think you're right. The way things work right now, anytime we add to a
>>>>>>> a DIArray we would have to do a hash lookup on the DIComplex. It's not
>>>>>>> limited to just DPath.
>>>>>>>
>>>>>>> Maybe this is another argument for removing the concept of a DIArray? We
>>>>>>> still need to know if a DIElement is part of an array for the
>>>>>>> InfosetOutputter (erd.isArray), but maybe adding an DIElement a DIArray
>>>>>>> element should just means creating a new DISimple/DIComplex and adding
>>>>>>> it to the DIComplex child ArrayBuffer and appending to the HashLookup
>>>>>>> Seq? This would also simplify some query style logic, e.g. fn:count.
>>>>>>>
>>>>>>> Alternatively, we could modify the parser logic so that when it's
>>>>>>> dealing with arrays it stores the DIArray locally somewhere (PState or
>>>>>>> Parser) and appends straight to that without having to ask the parent
>>>>>>> DIComplex for the DIArray? Not sure how big of a change that would be
>>>>>>> though.
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Current Implementation:
>>>>>>>>
>>>>>>>> override def addChild(e: InfosetElement): Unit = {
>>>>>>>>
>>>>>>>>   if (e.runtimeData.isArray)  {
>>>>>>>>
>>>>>>>>    // Make sure there is an array to accept the child
>>>>>>>>
>>>>>>>>    val arr = getChildArray(e.runtimeData) // creates if doesn't exist
>>>>>>>>
>>>>>>>>      Assert.invariant(arr ne null)
>>>>>>>>
>>>>>>>>      arr.append(e)
>>>>>>>>
>>>>>>>>    } else {
>>>>>>>>
>>>>>>>>      _slots(e.runtimeData.slotIndexInParent) = e.asInstanceOf[DINode]
>>>>>>>>
>>>>>>>>      _lastSlotAdded = e.runtimeData.slotIndexInParent
>>>>>>>>
>>>>>>>>    }
>>>>>>>>
>>>>>>>>    e.setParent(this)
>>>>>>>>
>>>>>>>> }
>>>>>>>>
>>>>>>>>
>>>>>>>>    final def getChildArray(childERD: ElementRuntimeData): InfosetArray 
>>>>>>>> = {
>>>>>>>>      Assert.usage(childERD.isArray)
>>>>>>>>      getChildArray(childERD.dpathElementCompileInfo)
>>>>>>>>    }
>>>>>>>>
>>>>>>>>    final def getChildArray(info: DPathElementCompileInfo): 
>>>>>>>> InfosetArray = {
>>>>>>>>      Assert.usage(info.isArray)
>>>>>>>>      val slot = info.slotIndexInParent
>>>>>>>>      getChildArray(slot)
>>>>>>>>    }
>>>>>>>>
>>>>>>>>    private def getChildArray(slot: Int): InfosetArray = {
>>>>>>>>      val slotVal = _slots(slot)
>>>>>>>>      if (slotVal ne null)
>>>>>>>>        slotVal match {
>>>>>>>>          case arr: DIArray => slotVal.asInstanceOf[DIArray]
>>>>>>>>          case _ => Assert.usageError("not an array")
>>>>>>>>        }
>>>>>>>>      else {
>>>>>>>>        val arrayERD = erd.childERDs(slot)
>>>>>>>>        // slot is null. There isn't even an array object yet.
>>>>>>>>        // create one (it will have zero entries)
>>>>>>>>        val ia = new DIArray(arrayERD, this)
>>>>>>>>        // no array there yet. So we have to create one.
>>>>>>>>        setChildArray(slot, ia)
>>>>>>>>        ia
>>>>>>>>      }
>>>>>>>>    }
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --------------------------------------------------------------------------------
>>>>>>>> *From:* Steve Lawrence <[email protected]>
>>>>>>>> *Sent:* Tuesday, October 3, 2017 10:05:43 AM
>>>>>>>> *To:* [email protected]; Taylor Wise
>>>>>>>> *Subject:* Re: Remove concept of slots from InfosetImpl Design 
>>>>>>>> Discussion
>>>>>>>> On 09/22/2017 04:25 PM, Taylor Wise wrote:
>>>>>>>>> Goal: Remove concept of slots from InfosetImpl
>>>>>>>>>
>>>>>>>>> Reasons:
>>>>>>>>>     1. So that element/node lookup can be name based.
>>>>>>>>>     2. Ease re-enabling of Unordered Sequences.
>>>>>>>>>     3. Ease enabling of 'query-style' expressions that return more 
>>>>>>>>> than
>>>>>>>>> one node.
>>>>>>>>>
>>>>>>>>> An Infoset is a document (DIDocument) and the document contains 
>>>>>>>>> exactly
>>>>>>>>> one element (DIElement) as its root.  These elements can be simple
>>>>>>>>> (DISimple) or complex (DIComplex).  Simple elements do not contain
>>>>>>>>> children, complex elements do.  DIArray represents an array (maxOccurs
>>>>>>>>>> 1). A DIArray shall always be a child of DIComplex.  All of these
>>>>>>>>> inherit from DINode.
>>>>>>>>>
>>>>>>>>> Currently DIComplex models its children as an Array of DINodes. The
>>>>>>>>> problem with the current implementation is that there's no way for us 
>>>>>>>>> to
>>>>>>>>> just ask for an element by name without walking the structure.
>>>>>>>>>
>>>>>>>>> The solution involves modifying DIComplex to represent its children 
>>>>>>>>> both
>>>>>>>>> as a mutable.ArrayBuffer[DINode] and a HashMap[String,
>>>>>>>>> mutable.Seq[DINode]].  The ArrayBuffer shall give us the necessary
>>>>>>>>> insertion ordering information we need regarding the children as well 
>>>>>>>>> as
>>>>>>>>> constant time indexing access to the child in question.  The 
>>>>>>>>> ArrayBuffer
>>>>>>>>> also has the added benefit of growing and shrinking as objects are 
>>>>>>>>> added
>>>>>>>>> and removed.  While the HashMap will allow us to associate a name
>>>>>>>>> (String) with DINodes that have that name thus providing a fast named
>>>>>>>>> lookup.
>>>>>>>>>
>>>>>>>>> This named lookup can be optimized further by only inserting into the
>>>>>>>>> HashMap those elements that we know are used in expressions. These are
>>>>>>>>> the elements that would most likely benefit from a fast named lookup 
>>>>>>>>> and
>>>>>>>>> will allow us to avoid additional overhead of hashing strings and
>>>>>>>>> allocating/appending to a sequence.
>>>>>>>>>
>>>>>>>>> The reason for having the HashMap contain a mutable.Seq[DINode] is 
>>>>>>>>> that
>>>>>>>>> it's possible for the following:
>>>>>>>>>
>>>>>>>>>     <element name="a" .../>
>>>>>>>>>     <element name="a" maxOccurs="unbounded"..../>
>>>>>>>>>     <element name="b".../>
>>>>>>>>>     <element name="a" ..../>
>>>>>>>>>
>>>>>>>>> Each element named 'a' would have an associated DINode.  Because there
>>>>>>>>> are multiple DINodes named 'a', we need to represent them in the 
>>>>>>>>> HashMap
>>>>>>>>> using the same key 'a'.  To do this, we add all instances of 'a' to a
>>>>>>>>> Seq[DINode].  In this particular case, the HashMap would look as 
>>>>>>>>> follows:
>>>>>>>>>
>>>>>>>>>     "a" -> mutable.Seq(DIElement, DIArray, DIElement)
>>>>>>>>>     "b" -> mutable.Seq(DIElement)
>>>>>>>>>
>>>>>>>>> The mutable.ArrayBuffer maintains the insertion order and would look 
>>>>>>>>> as
>>>>>>>>> follows:
>>>>>>>>>
>>>>>>>>>     mutable.ArrayBuffer(DIElement, DIArray, DIElement, DIElement)
>>>>>>>>>
>>>>>>>>> Back tracking shall be accomplished by going to the last item in the
>>>>>>>>> DIComplex ArrayBuffer and removing it.
>>>>>>>>>     1. If the last item is a DIElement, we simply delete it from the
>>>>>>>>> ArrayBuffer and then remove it also from the HashMap.
>>>>>>>>>     2. If the last item is a DIArray, we then have to look at the
>>>>>>>>> contents of the DIArray.
>>>>>>>>>         a. If the DIArray has more than one entry, remove the last
>>>>>>>>> entry.  Then update the associated entry in the ArrayBuffer and 
>>>>>>>>> HashMap
>>>>>>>>> with the modified DIArray.
>>>>>>>>
>>>>>>>> I think the second part of this is actually unnecessary. Since the
>>>>>>>> DIArray is mutable, we only need to remove its last child. The entry in
>>>>>>>> the ArrayBuffer and HashMap will still point to the same DIArray. The
>>>>>>>> only difference is the contents of the DIArray will be one smaller.
>>>>>>>>
>>>>>>>>>         b. If the DIArray has a single entry then we simply remove
>>>>>>>>> DIArray from ArrayBuffer and the HashMap.
>>>>>>>>>
>>>>>>>>>     <element name="root>
>>>>>>>>>          <sequence ../>
>>>>>>>>>          <element name="a" .../>
>>>>>>>>>          <element name="a" maxOccurs="unbounded">
>>>>>>>>>            ...
>>>>>>>>>            <element name="b" .../>
>>>>>>>>>            <element name="c" .../>
>>>>>>>>>            ...
>>>>>>>>>          </element>
>>>>>>>>>        </sequence>
>>>>>>>>>     </element>
>>>>>>>>>
>>>>>>>>> If the HashMap of DIComplex-root looks like the following:
>>>>>>>>>      "a" -> mutable.Seq(DIElement, DIArray)
>>>>>>>>>
>>>>>>>>> ArrayBuffer of DIComplex-root looks like the following:
>>>>>>>>>      0 -> DIElement("a")
>>>>>>>>>      1 -> DIArray("a")
>>>>>>>>>
>>>>>>>>> Where DIArray children ArrayBuffer looks like:
>>>>>>>>>      0 -> DIElement("b")
>>>>>>>>>      1 -> DIElement("c")
>>>>>>>>>
>>>>>>>>> Given the initial example if we backtrack(1):  // purge one item
>>>>>>>>>      DIArray:
>>>>>>>>>          0 -> DIElement("b")  // removed one item from DIArray
>>>>>>>>>
>>>>>>>>>      HashMap DIComplex-root:
>>>>>>>>>          "a" -> mutable.Seq(DIElement, DIArray)
>>>>>>>>>
>>>>>>>>>      ArrayBuffer DIComplex-root:
>>>>>>>>>          0 -> DIElement("a")
>>>>>>>>>          1 -> DIArray("a")
>>>>>>>>>
>>>>>>>>> Given the initial example if we backtrack(2): // purge two items
>>>>>>>>>      DIArray:
>>>>>>>>>          Empty
>>>>>>>>>
>>>>>>>>>      HashMap DIComplex-root:
>>>>>>>>>          "a" -> mutable.Seq(DIElement)  // Because DIArray had two 
>>>>>>>>> items
>>>>>>>>>                                         // removed and is now empty
>>>>>>>>>
>>>>>>>>>      ArrayBuffer DIComplex-root:
>>>>>>>>>          0 -> DIElement("a")
>>>>>>>>>
>>>>>>>>> ===
>>>>>>>>>
>>>>>>>>> class DIComplex {
>>>>>>>>>   def nSlots = erd.nChildSlots
>>>>>>>>
>>>>>>>> I think the concept of nSlots and nChildSlots should go away 
>>>>>>>> completely.
>>>>>>>> The children ArrayBuffer should just grow as elements are added and
>>>>>>>> shrink as they are backtracked. It might make sense to have a tunable
>>>>>>>> the determines the initial size of the children Array, but the Scala
>>>>>>>> defaults for initializing and growing a mutable array might already be
>>>>>>>> reasonable.
>>>>>>>>
>>>>>>>>>   val children: mutable.ArrayBuffer[DINode](nSlots) // insertion order
>>>>>>>>>   lazy val fastLookup: HashMap[String, mutable.Seq[DINode]]
>>>>>>>>
>>>>>>>> One thing that might make this change easier is to not worry about the
>>>>>>>> mutable.Seq, and just make this a HashMap[String, DINode]. The Seq is
>>>>>>>> really only necessary to support query style expressions, but that 
>>>>>>>> comes
>>>>>>>> with a whole other slew of concerns (e.g. fn:count, same name but
>>>>>>>> different types, etc.). There currently exist restrictions that do not
>>>>>>>> allow query style expressions, so this Seq should always be a size of 1
>>>>>>>> until remove those restrictions.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>   def addChild(node: DINode) {
>>>>>>>>>     children += node
>>>>>>>>>     fastLookup(node.name).append(node)
>>>>>>>>>   }
>>>>>>>>>
>>>>>>>>>   def getChild(name: String): immutable.Seq[DINode] {
>>>>>>>>>     fastLookup.getOrElse(name, mutable.Seq.empty) // fastLookup(name)
>>>>>>>>>   }
>>>>>>>>>
>>>>>>>>>   def backtrack(int n) {
>>>>>>>>>     for (i = 0; i < n; i++) {
>>>>>>>>>       val endIdx = children.size -1
>>>>>>>>>       val child = children.remove(endIdx) // children.pop()
>>>>>>>>>
>>>>>>>>>       fastLookup.get(child.name) match {  // 
>>>>>>>>> fastLookup(child.name).pop()
>>>>>>>>>         case None => // do nothing
>>>>>>>>>     case Some(nodesSeq) if nodesSeq.size == 0 => // Shouldn't happen?
>>>>>>>>>         case Some(nodesSeq) => {
>>>>>>>>>       val lastEntry = nodesSeq.last
>>>>>>>>>           lastEntry match {
>>>>>>>>>         case arr: DIArray => {
>>>>>>>>>               // If the array has >= 2 items, remove last one and 
>>>>>>>>> update
>>>>>>>>> nodesSeq
>>>>>>>>>               // with this array:
>>>>>>>>>               //  val lastIdx = nodesSeq.length - 1
>>>>>>>>>           //  arr.remove(arr.size - 1)
>>>>>>>>>               //  fastLookup.update(child.name, 
>>>>>>>>> nodesSeq.update(lastIdx,
>>>>>>>>> arr))
>>>>>>>>>               //
>>>>>>>>>               // If the array has 1 item, remove the array from 
>>>>>>>>> nodesSeq
>>>>>>>>>               //  fastLookup.update(child.name, nodesSeq.dropRight(1))
>>>>>>>>>         }
>>>>>>>>>             case _ => fastLookup.update(child.name, 
>>>>>>>>> nodesSeq.dropRight(1))
>>>>>>>>>           }
>>>>>>>>>
>>>>>>>>>         }
>>>>>>>>>       }
>>>>>>>>>     }
>>>>>>>>>   }
>>>>>>>>>
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> class DIArray {
>>>>>>>>>     private val initialSize =
>>>>>>>>> parent.tunable.initialElementOccurrencesHint.toInt // set to 10
>>>>>>>>>     protected final val _contents = new 
>>>>>>>>> ArrayBuffer[DIElement](initialSize)
>>>>>>>>>
>>>>>>>>>     def append(ie: InfosetElement): Unit = {
>>>>>>>>>         _contents.append(ie.asInstanceOf[DIElement])
>>>>>>>>>         ie.setArray(this)
>>>>>>>>>     }
>>>>>>>>>
>>>>>>>>>     //
>>>>>>>>>     // For backtracking purposes
>>>>>>>>>     //
>>>>>>>>>     def remove(n: Int): DIElement = {
>>>>>>>>>         _contents.remove(n)
>>>>>>>>>     }
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> ===
>>>>>>>>>
>>>>>>>>> To facillitate a breadth-first ordering scheme we shall have to
>>>>>>>>
>>>>>>>> That this is only necessary for unordered sequences, where the 
>>>>>>>> insertion
>>>>>>>> order in the ArrayBuffer differs from the display order, correct? If
>>>>>>>> that's true, we definitely want to make sure to only do the below
>>>>>>>> sorting when an element contains an unordred sequence.
>>>>>>>>
>>>>>>>>> construct a 'global' that lives on the Element Runtime Data (ERD).  It
>>>>>>>>> shall be called childElementDeclPosition and pertains to the static
>>>>>>>>> order of the element children of a complex type element.  When sorting
>>>>>>>>> DINodes, childElementDeclPosition can be obtained by way of reaching
>>>>>>>>> from the DINode to the coresponding ERD.  Any array shall be a single
>>>>>>>>> entry in the sequence that needs to be sorted.  All elements in this
>>>>>>>>> array will have the same childElementDeclPosition.
>>>>>>>>>
>>>>>>>>> The difficulty in implementing this is that an Element's
>>>>>>>>> dsom.ComplexType can have groups within the groups before finding 
>>>>>>>>> actual
>>>>>>>>> element children.  Therefore in the Document Schema Object Model 
>>>>>>>>> (DSOM)
>>>>>>>>> the model group object shall have a startingChildElementDeclPosition 
>>>>>>>>> and
>>>>>>>>> an endingChildElementDeclPosition.
>>>>>>>>>
>>>>>>>>> 1. The startingChildElementPosition shall be passed to the constructor
>>>>>>>>> for the Term.
>>>>>>>>> 2. If the term is an element, the endingChildElementDeclPosition is
>>>>>>>>> equal to the startingChildElementDeclPosition.  3. If the term is a
>>>>>>>>> model group, then its first child term's
>>>>>>>>> startingChildElementDeclPosition is equal to its own
>>>>>>>>> startingChildElementDeclPosition, the 2nd to Nth child term's
>>>>>>>>> startingChildElementDeclPosition is the prior term's
>>>>>>>>> endingChildElementDeclPosition + 1.
>>>>>>>>> 4. The endingChildElementDeclPosition for a model-group shall be the
>>>>>>>>> endingChildElmentDeclPosition of its last child term.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Looks good to me. +1. This should definitely make implementation of
>>>>>>>> unordered sequences and query style expressions easier. Removal of the
>>>>>>>> slot concept should also help remove concepts that prevent faster 
>>>>>>>> schema
>>>>>>>> compilation.
>>>>>>>>
>>>>>>>> - Steve
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>> 
>>> 
>> 
>

Re: Remove concept of slots from InfosetImpl Design Discussion

Reply via email to