After seeing a performance degredation with this recent change, I did some profiling with PCAP to investigate the causes. I immediately saw a lot of HashMap related allocations and insertions. I quickly hacked together something to only create and insert into the HashMap if it was used in expressions to see how it affected performance.
PCAP first, in a single parse, PCAP calls addChild about 1600 times, and about 1000 of those times added a child that was used in an expression. So this is pretty close to a worst case, where almost 2/3's of elements need to be inserted into the HashMap. Fixing a couple other minor things that showed up in profiling that I doubt had much of an affect, the performance numbers I saw were: With Slots: 688.387 files/sec Without Slots (unoptimized): 553.145 files/sec Without Slots (optimized): 614.983 files/sec So unoptimized is about 20% slower than slots, and optimized is about 10% slower than slots. Looking at ATO, it calls addChild about 2600 times, with no elements used in an expression, so HashMap allocations/lookups should be zero in the optimized case. Performance numbers with this are: With Slots: 68.271 files/sec Without Slots (unoptimized): 64.623 files/sec Without Slots (optimized): 69.662 files/sec So in this case, without slots/optimized is actually a hair faster than with slots. Not sure why that would be the case, maybe there were lots of empty slots in ATO taking up space? Regardless, this seems like a good win. We might be able to pick up that extra 10% in PCAP by interning QNames to speed up HashMap lookups. Though, I didn't see anything in profiling related to equality checks in the hash look ups, but sometimes things are hard to see. I've created DAFFODIL-1860 to track the issue for reducing HashMap allocations and DAFFODIL-1861 to track the QName interning. - Steve On 10/23/2017 01:25 PM, Mike Beckerle wrote: > No I think 2.1.0 as we don't understand the performance implications fully > yet. > > > This should fix bug.... > > > https://opensource.ncsa.illinois.edu/jira/browse/DFDL-1773 > > [DFDL-1773] Choice ambiguous element name results in ... > <https://opensource.ncsa.illinois.edu/jira/browse/DFDL-1773> > opensource.ncsa.illinois.edu > Daffodil; DFDL-1773; Choice ambiguous element name results in failed > expression > > > -------------------------------------------------------------------------------- > *From:* Taylor Wise > *Sent:* Monday, October 23, 2017 12:43:36 PM > *To:* Mike Beckerle; Steve Lawrence; [email protected] > *Subject:* Re: Remove concept of slots from InfosetImpl Design Discussion > > Should this be pushed to 2.0.0? > > -------------------------------------------------------------------------------- > *From:* Mike Beckerle > *Sent:* Monday, October 23, 2017 12:36:57 PM > *To:* Taylor Wise; Steve Lawrence; [email protected] > *Subject:* Re: Remove concept of slots from InfosetImpl Design Discussion > > Yes. The one giant ticket about the space problem compiling (DFDL-1444) is > not > specific enough. > > -------------------------------------------------------------------------------- > *From:* Taylor Wise > *Sent:* Monday, October 23, 2017 12:28:04 PM > *To:* Steve Lawrence; [email protected]; Mike Beckerle > *Subject:* Re: Remove concept of slots from InfosetImpl Design Discussion > > I don't believe there's a ticket associated with these changes. Should I go > ahead and create one? > > -------------------------------------------------------------------------------- > *From:* Steve Lawrence <[email protected]> > *Sent:* Wednesday, October 18, 2017 8:21:02 AM > *To:* Taylor Wise; [email protected]; Mike Beckerle > *Subject:* Re: Remove concept of slots from InfosetImpl Design Discussion > On 10/17/2017 07:57 PM, Taylor Wise wrote: >> Been debugging after implementing your suggested changes. >> >> >> First thing to note, is that by moving the creation of the array out of >> getChildArray that I had to make additional calls to 'addChild' in the >> InfosetInputter. This corrected things on the 'parser' side. However, on >> the >> unparser side it became evident that this 'addChild' function cannot always >> set >> the parent. There's an invariant expecting parent to be null along the >> execution branch of the unparser side. As a result, I modified addChild to >> also >> accept a default boolean called 'setParent'. It defaults to true unless >> specified. So now when we call 'addChild' in InfosetInputter on the >> unparser >> side, we actually call 'addChild(e, false)'. > > This sounds to me like addChild is getting called twice on the same > element and so it's parent is trying to be set twice. Might need some > more investigation. Maybe the InfosetInputter needs more tweaks to > support not using getChildArray? Hard to say without seeing your > InfosetInputter changes. > > Regarding the below tests, I think your assessment is correct. The > password element doesn't exist and so we should never even try to get > the -1 index. Instead we should get something like "password element > does not exist" or something. It would be useful to have that test to > ensure we get that error message. > > An index of -1 (or anything less than 1) can definitely be caught > statically at schema compile time. Worth creating a bug for that. But > there will be times where the index is calculated so we'll still need > this runtime check. I think it's still a test worth having. > > I think you're modifications are reasonable based on what the tests are > trying to tests. Probably to proper coverage of tests we need is: > > - trying to access an array that does not exist with both valid and > invalid indicies (less than 1 and greater than array size) > - trying to access an array thta does exist but with invalid indicies > (less than 1 and greater than array size) > > So it's probably worth keeping the existing tests for the 'array does > not exist' tests and ensure we get a reasonable error message not > talking about indices. And then just make your modifications new tests. > >> >> The above correction reduced the errors from 30+ to just 2. The only two >> errors >> remaining have to do with an array index out of bounds error that's >> expected. I >> believe that it's actually an issue with the test and not the code. >> >> >> Error: >> >> edu.illinois.ncsa.daffodil.tdml.TDMLException: Did not find diagnostic >> message >> "Value -1 is out of range" in any of the actual diagnostic messages: >> Runtime Schema Definition Error: Expression Evaluation Error: Child element >> {http://example.com}password does not exist. >> Schema context: hideme Location line 267 column 12 in >> file:/tmp/hiddenElem2658119394211244932.dfdl.xsd >> Data location was preceding byte 0 limit(bytes) 8 >> >> Well, password doesn't exist because we are missing the 'pw:' initiator in >> the >> tdml:document content. The code now doesn't know to create it because we >> haven't seen the initiator. At least that's what I surmise. >> >> >> Existing test: >> >> <!-- >> Test Name: arrayIndexOutOfBounds_02 >> Schema: hiddenElem >> Root: e1b >> Purpose: This test demonstrates that a proper error occurs if an >> array >> index in an expression is out of bounds >> --> >> >> <tdml:parserTestCase name="arrayIndexOutOfBounds_02" root="e1b" >> model="hiddenElem" description="Section 23 - DFDL Expressions"> >> >> <tdml:document>p455w0rd</tdml:document> >> <tdml:errors> >> <tdml:error>Schema Definition Error</tdml:error> >> <tdml:error>expression evaluation error</tdml:error> >> <tdml:error>Value -1 is out of range</tdml:error> >> <tdml:error>with length 0</tdml:error> >> </tdml:errors> >> </tdml:parserTestCase> >> >> Schema: >> <xs:group name="hiddenData"> >> <xs:sequence dfdl:separator="|"> >> <xs:element name="password" type="xs:string" dfdl:initiator="pw:" >> minOccurs="0" maxOccurs="1" dfdl:lengthKind="delimited"/> >> </xs:sequence> >> </xs:group> >> >> <xs:element name="e1"> >> <xs:complexType> >> <xs:sequence> >> <xs:sequence dfdl:hiddenGroupRef="ex:hiddenData"/> >> <xs:element name="hideme" type="xs:string" dfdl:inputValueCalc="{ >> fn:concat(fn:substring(/ex:e1/ex:password[1], 1, 1), '***') }"/> >> </xs:sequence> >> </xs:complexType> >> </xs:element> >> <xs:element name="e1b"> >> <xs:complexType> >> <xs:sequence> >> <xs:sequence dfdl:hiddenGroupRef="ex:hiddenData"/> >> <xs:element name="hideme" type="xs:string" dfdl:inputValueCalc="{ >> fn:concat(fn:substring(/ex:e1b/ex:password[-1], 1, 1), '***') }"/> >> </xs:sequence> >> </xs:complexType> >> </xs:element> >> >> >> My modification: >> >> <!-- >> Test Name: arrayIndexOutOfBounds_02 >> Schema: hiddenElem >> Root: e1b >> Purpose: This test demonstrates that a proper error occurs if an >> array >> index in an expression is out of bounds >> --> >> >> <tdml:parserTestCase name="arrayIndexOutOfBounds_02" root="e1b" >> model="hiddenElem" description="Section 23 - DFDL Expressions"> >> >> <tdml:document>pw:p455w0rd</tdml:document> >> <tdml:errors> >> <tdml:error>Schema Definition Error</tdml:error> >> <tdml:error>expression evaluation error</tdml:error> >> <tdml:error>Value -1 is out of range</tdml:error> >> <tdml:error>with length 1</tdml:error> >> </tdml:errors> >> </tdml:parserTestCase> >> >> But honestly, should we really need to do this? Shouldn't the code know >> that -1 >> is not a valid index without having to force construction of the password >> array? >> >> I'm similarly going to have to modify this other test as well by >> constructing a >> different root that attempts to access a character in a position greater >> than >> the length of the array, unless there's something I'm missing: >> >> <!-- >> Test Name: arrayIndexOutOfBounds_01 >> Schema: hiddenElem >> Root: e1 >> Purpose: This test demonstrates that a proper error occurs if an >> array >> index in an expression is out of bounds >> --> >> >> <tdml:parserTestCase name="arrayIndexOutOfBounds_01" root="e1" >> model="hiddenElem" description="Section 23 - DFDL Expressions"> >> >> <tdml:document>p455w0rd</tdml:document> >> <tdml:errors> >> <tdml:error>Schema Definition Error</tdml:error> >> <tdml:error>expression evaluation error</tdml:error> >> <tdml:error>Value 1 is out of range</tdml:error> >> <tdml:error>with length 0</tdml:error> >> </tdml:errors> >> </tdml:parserTestCase> >> >> Proposed addition to defineSchema: >> >> <xs:element name="e1c"> >> <xs:complexType> >> <xs:sequence> >> <xs:sequence dfdl:hiddenGroupRef="ex:hiddenData"/> >> <xs:element name="hideme" type="xs:string" dfdl:inputValueCalc="{ >> fn:concat(fn:substring(/ex:e1c/ex:password[10], 1, 1), '***') }"/> >> </xs:sequence> >> </xs:complexType> >> </xs:element> >> >> Proposed modification to test: >> >> <!-- >> Test Name: arrayIndexOutOfBounds_01 >> Schema: hiddenElem >> Root: e1 >> Purpose: This test demonstrates that a proper error occurs if an >> array >> index in an expression is out of bounds >> --> >> >> <tdml:parserTestCase name="arrayIndexOutOfBounds_01" root="e1c" >> model="hiddenElem" description="Section 23 - DFDL Expressions"> >> >> <tdml:document>pw:p455w0rd</tdml:document> >> <tdml:errors> >> <tdml:error>Schema Definition Error</tdml:error> >> <tdml:error>expression evaluation error</tdml:error> >> <tdml:error>Value 10 is out of range</tdml:error> >> <tdml:error>with length 1</tdml:error> >> </tdml:errors> >> </tdml:parserTestCase> >> >> >> >> >> -------------------------------------------------------------------------------- >> *From:* Steve Lawrence <[email protected]> >> *Sent:* Tuesday, October 17, 2017 11:04:03 AM >> *To:* Taylor Wise; [email protected]; Mike Beckerle >> *Subject:* Re: Remove concept of slots from InfosetImpl Design Discussion >> I don't think a DIArray could be created earlier and things appended to >> it. But, currently, I think an empty DIArray could be created earlier. >> >> And I think this is really just a problem with getChildArray always >> creating DIArrays. What I suspect happened is that we had an array >> following by something that accessed that array, e.g.: >> >> <xs:sequence> >> <xs:element name="arr" maxOccurs="unbounded" ... /> >> <xs:element name="count" dfdl:inputValueCalc="{ fn:count(../arr) }" /> >> </xs:sequence> >> >> arr was slot 1, count was slot 2. >> >> The setChildArray function used to just do something like >> >> _lastSlotAdded = slot >> >> Which worked in the case of adding a new array--we add a new array, so >> the slot it was added in was the last slot added. >> >> But when fn:count was called, _lastSlotAdded was 2 (from the count >> element). fn:count ended up calling getChildArray, which created a >> DIArray and called setChildArray, which overwrote _lastSlotAdded to the >> slot of the array, i.e. 1. So even though the array was finished and >> couldn't be modified, the use of getChildArray in DPath ended up setting >> _lastSlotAdded to the wrong value. Using max was a way to fix this so we >> couldn't set it to an earlier value. >> >> So, now, because getChildArray create elements, it sounds like things >> could be created out of order. Rather than inserting an empty DIArray >> into a slot, it will added it to the end and potentially mess things up. >> This says to me we definitely want to modify getChildArray to not create >> DIArrays. It should either return the array if it exists, or throw an >> Infoset exception, very similar to getChild. Then we don't have to worry >> about DIArray's being created at the wrong time. The only thing that >> should create a DIArray is addChild. >> >> >> On 10/17/2017 10:13 AM, Taylor Wise wrote: >>> Hmm, very interesting. This would greatly simplify things. But what about >>> this >>> code snippet from before my changes: >>> >>> >>> // final def setChildArray(slot: Int, arr: DIArray) { >>> // Assert.invariant(_slots(slot) eq null) >>> // _slots(slot) = arr >>> // // >>> // // Because arrays are created if not existing, an expression like >>> ../A[1] >>> // // can cause array A to come into existence (though empty) and A >>> might be >>> // // part of an expression that is reaching backward into earlier >>> data (when >>> // // parsing) so that we might be creating an array earlier in the >>> slots >>> // // of this element. Hence, the slot being passed here is not >>> necessarily >>> // // the last slot added. It might be before it. >>> // // >>> // _lastSlotAdded = math.max(slot, _lastSlotAdded) >>> // } >>> Given your statement: >>> >>> >>> "the issue is determining if the DIArray exists or not. Everything >>> after that is the same. And since children are inserted in the order >>> that we see them, a DIArray will always be at the end of the children >>> ArrayBuffer if it exists. If the DINode at the end of the children >>> ArrayBuffer is not this array, it needs to be created." >>> >>> >>> To me it seems to imply that a DIArray can come into existence without any >>> contents. But then be appended to at a later time. In which case it's >>> possible >>> that this DIArray then is not the last item appended. It could be >>> somewhere >>> earlier and then need to be added to. >>> >>> -------------------------------------------------------------------------------- >>> *From:* Steve Lawrence <[email protected]> >>> *Sent:* Tuesday, October 17, 2017 7:20:42 AM >>> *To:* [email protected]; Mike Beckerle; Taylor Wise >>> *Subject:* Re: Remove concept of slots from InfosetImpl Design Discussion >>> Actually, I think we're over complicating the array situation. I think >>> we can avoid needing a unique ID and I think we can also avoid needing >>> to always insert arrays into a HashMap. >>> >>> When addChild is called to add an DIElement to an array, there are two >>> cases we need to think about: >>> >>> 1) The DIArray does not exist and it must be created. The >>> DIElement is appended to its contents. >>> >>> 2) The DIArray does exist and we just need to find it. The >>> DIElement is appended to its contents. >>> >>> So the issue is determining if the DIArray exists or not. Everything >>> after that is the same. And since children are inserted in the order >>> that we see them, a DIArray will always be at the end of the children >>> ArrayBuffer if it exists. If the DINode at the end of the children >>> ArrayBuffer is not this array, it needs to be created. So addChild >>> just becomes something like this: >>> >>> def addChild(e: DIElement) { >>> if (e.erd.isArray) { >>> if (children.isEmpty || children.last.erd != e.erd) { >>> // no children, or last child is not a DIArray for >>> // this element, create the DIArray >>> children.append(new DIArray(...)) >>> } >>> // the array is now always last, add the new child to it >>> children.last.append(e) >>> } else { >>> children.append(e) >>> } >>> } >>> >>> So we no longer need a unique ID or a tuple for the HashMap key. The >>> fast HashMap lookup only needs to work on the name of the element, which >>> allows query-style expressions to work in the future. And we now only >>> need to insert DIArray's into the fast HashMap if they are used in >>> expressions addChild no longer needs getChildArray, minimizing hash >>> look ups. >>> >>> Related, I'm wondering if getChildArray should change as well so that it >>> does not create a DIArray if it doesn't exist? I think it seems >>> reasonable that only addChild should create new DIArrays. And if >>> something like DPath calls getChildArray, we probably don't want to end >>> up creating an empty DIArray if it doesn't exist. It should probably >>> just throw an InfosetNoSuchChildElementException if the array doesn't >>> exist, just like getChild. >>> >>> - Steve >>> >>> On 10/16/2017 12:02 PM, Steve Lawrence wrote: >>>> The ID might cause some issues when we want to support query-style >>>> expressions. For example, say we have a schema like this: >>>> >>>> <sequence> >>>> <xs:element name="array" maxOccurs="unbounded" ... /> >>>> <xs:element name="element" ... /> >>>> <xs:element name="array" maxOccurs="unbounded" ... /> >>>> </xs:sequence> >>>> >>>> So two arrays with the same name separated by a single element. Each >>>> element would have a unique ID, so would be inserted in the hash table >>>> as something like >>>> >>>> ("array", 1) -> Seq(DIArray) >>>> ("element", 2) -> Seq(DISimple) >>>> ("array", 3) -> Seq(DIArray) >>>> >>>> A query style expression that accesses the "array" elements (e.g. >>>> fn:count(./array)) would have trouble finding both arrays to get the >>>> full count since the keys for the two arrays are the same. >>>> >>>> Perhaps one option is to keep the name -> DINode map type, so we'd end >>>> up with something like this: >>>> >>>> "array" -> Seq(DIArray, DIArray) >>>> "element" -> Seq(DISimple) >>>> >>>> And then just iterator of the Seq until we find DIArray with the right >>>> ID? That's could have a performance hit if there were lots of things >>>> with the same name, but I suspect that won't be super common. >>>> >>>> Alternatively, we could relook at chaning the InfosetImpl API so that >>>> when an DIArray is created, it's store in the PState for quick access, >>>> and getChildArray would essentially go away. >>>> >>>> - Steve >>>> >>>> >>>> On 10/13/2017 02:14 PM, Mike Beckerle wrote: >>>>> I think the key has to change to include name and namespace of element, >>>>> and also an additional integer to disambiguate the first a element from >>>>> the 2nd, from the Nth. This integer need not be 1, 2, 3. It can be any >>>>> value at all, so long as they are distinct. >>>>> >>>>> >>>>> Let's make sure another case works. E.g., ordered sequence like the one >>>>> you have here, another element to separate it from an unordered sequence >>>>> that also contains elements named a. >>>>> >>>>> >>>>> However, an unordered sequence can't have more than 1 element named a, so >>>>> if every element a declaration gets an additional integer, the one in the >>>>> unordered sequence will also get one, and that will behave properly. >>>>> >>>>> >>>>> Is this integer the same breadth-first ordering thing we discussed >>>>> elsewhere? I expect it is. >>>>> >>>>> ________________________________ >>>>> From: Taylor Wise >>>>> Sent: Friday, October 13, 2017 1:55:22 PM >>>>> To: Steve Lawrence; [email protected]; Mike Beckerle >>>>> Subject: Re: Remove concept of slots from InfosetImpl Design Discussion >>>>> >>>>> >>>>> I'm down to a single error: test_dfdlPosition3 >>>>> >>>>> >>>>> java.lang.Exception: >>>>> Comparison failed. >>>>> Expected >>>>> >>>>> <f><cnt1>2</cnt1><a><v>1</v></a><a><v>2</v></a><cnt2>2</cnt2><a><v>1</v></a><a><v>2</v></a></f> >>>>> Actual >>>>> >>>>> <f><cnt1>2</cnt1><a><v>1</v></a><a><v>2</v></a><a><v>1</v></a><a><v>2</v></a><cnt2>2</cnt2></f> >>>>> Differences were (path, expected, actual): >>>>> (f,<cnt2>2</cnt2>,<a><v>1</v></a>) >>>>> at >>>>> org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:192) >>>>> >>>>> <xs:element name="f"> >>>>> <xs:complexType> >>>>> <xs:sequence> >>>>> <xs:element name="cnt1" type="xs:int" >>>>> dfdl:lengthKind="delimited" dfdl:terminator=";" /> >>>>> <xs:element name="a" minOccurs="0" maxOccurs="unbounded" >>>>> dfdl:occursCountKind="expression" dfdl:occursCount="{ >>>>> xs:int(../ex:cnt1) }"> >>>>> <xs:complexType> >>>>> <xs:sequence> >>>>> <xs:element name="v" type="xs:int" >>>>> dfdl:inputValueCalc="{ dfdl:occursIndex() }" /> >>>>> </xs:sequence> >>>>> </xs:complexType> >>>>> </xs:element> >>>>> <xs:element name="cnt2" type="xs:int" >>>>> dfdl:lengthKind="delimited" /> >>>>> <xs:element name="a" minOccurs="0" maxOccurs="unbounded" >>>>> dfdl:occursCountKind="expression" dfdl:occursCount="{ >>>>> xs:int(../ex:cnt2) }"> >>>>> <xs:complexType> >>>>> <xs:sequence> >>>>> <xs:element name="v" type="xs:int" >>>>> dfdl:inputValueCalc="{ dfdl:occursIndex() }" /> >>>>> </xs:sequence> >>>>> </xs:complexType> >>>>> </xs:element> >>>>> </xs:sequence> >>>>> </xs:complexType> >>>>> </xs:element> >>>>> >>>>> >>>>> >>>>> I believe what's happening, from stepping through, is that the >>>>> getChildArray code is confused as to where to add the second set of <a/> >>>>> elements. Below are the values of the structures at the point where >>>>> <cnt2/> has been added: >>>>> >>>>> ChildNodes aka Insertion Order >>>>> ==== >>>>> ArrayBuffer(DISimple(name='cnt1' 2 valueLen(One(0)unk, One(8)unk)), >>>>> DIArray({http://example.com}a,ArrayBuffer(DIComplex(name='a'), >>>>> DIComplex(name='a'))), DISimple(name='cnt2' 2 valueLen(One(0)unk, >>>>> One(8)unk))) >>>>> >>>>> FastLookup >>>>> ==== >>>>> Map(cnt2 -> ArrayBuffer(DISimple(name='cnt2' 2 valueLen(One(0)unk, >>>>> One(8)unk))), cnt1 -> ArrayBuffer(DISimple(name='cnt1' 2 >>>>> valueLen(One(0)unk, One(8)unk))), a -> >>>>> ArrayBuffer(DIArray({http://example.com}a,ArrayBuffer(DIComplex(name='a'), >>>>> DIComplex(name='a'))))) >>>>> >>>>> So when we go to add the next array 'a', it sees that 'a' is already in >>>>> the fastLookup and appends it there. This adds it to the existing >>>>> DIArray in the insertion order list. Really, it should create a new >>>>> DIArray inserting it after <cnt2/> in the insertion order list. And >>>>> inserting it after the DIArray in the fastLookup Hash: a -> >>> Seq(DIArray(a, ...), DIArray(a, ...)) >>>>> >>>>> How do we fix this? >>>>> >>>>> Because this is an 'ordered' sequence, can we look at the >>>>> slotIndexInParent? Isn't this the position in the sequence definition as >>>>> defined in the schema above? This should differentiate between the two >>>>> 'a' arrays. And then for 'unordered' sequences, this shouldn't matter. >>>>> We shouldn't run into two or more members with the same >>> name and namespace. Correct? >>>>> >>>>> This also means that the 'key' will have to change to include the >>>>> namespace in addition to the element name. >>>>> >>>>> >>>>> >>>>> ________________________________ >>>>> From: Steve Lawrence <[email protected]> >>>>> Sent: Friday, October 13, 2017 8:45:06 AM >>>>> To: [email protected]; Taylor Wise; Mike Beckerle >>>>> Subject: Re: Remove concept of slots from InfosetImpl Design Discussion >>>>> >>>>> On 10/12/2017 08:19 PM, Taylor Wise wrote: >>>>>> Been debugging using the TestSimpleTypes tests. I'm down to just 1 >>>>>> failure. Expected results are slightly off. I've got an additional >>>>>> <text/> in the result. So, getting closer. Stopping for the night. >>>>>> >>>>>> >>>>>> I'm not so sure that we need the _lastSlotAdded variable anymore. I >>>>>> changed it to be just a def that reflects the current length of >>>>>> childNodes (ArrayBuffer). >>>>>> z >>>>>> >>>>>> I'm betting my current issue is within the restore functions I modified, >>>>>> but I'll continue looking at that in the morning. I would think that >>>>>> 'restoring' would simply be copying the previous state... unless I'm >>>>>> misunderstanding something. >>>>>> >>>>>> >>>>>> abstract override def restoreInto(e: DIElement) { >>>>>> val c = e.asInstanceOf[DIComplex] >>>>>> c.restoreFrom(this) >>>>>> //c._lastSlotAdded = this._lastSlotAdded >>>>>> super.restoreInto(e) >>>>>> } >>>>>> >>>>>> final def restoreFrom(cs: DIComplexSharedImplMixin){ >>>>>> // Shouldn't I be able to just say something like... >>>>>> // >>>>>> childNodes.clear() >>>>>> childNodes.append(cs.childNodes) >>>>>> fastLookup.clear() >>>>>> cs.fastLookup.toSeq.foreach( (name, seqDINodes) => >>>>>> fastLookup.put(name, seqDINodes) ) >>>>>> } >>>>> >>>>> This is going to be somewhat inefficient. This essentially deletes and >>>>> recreates the entire childNodes array and fastLookup hashmap everytime >>>>> we backtrack. This could be pretty back if we childNodes had 1000 >>>>> elements and we just wanted to backtrack 1. Instead we just want to >>>>> remove the elements that are being backtracked. >>>>> >>>>> So I think you still need a concept of _lastSlotAdded. But rather than >>>>> the lastSlotAdded, it's just the length of the children (maybe call it >>>>> _numChildren?). I would expect restoreFrom for DIComplex to look >>>>> something like this (very similar to the existing restoreFrom): >>>>> >>>>> final def restoreFrom(cs: DIComplexSharedImplMixin) { >>>>> var i = childNodes.length - 1 >>>>> >>>>> // for each child we will remove, remove it from the fastLookup map. >>>>> // this should always be the last element in the hashmap seq >>>>> while (i >= cs._numChildren) { >>>>> var childToRemove = childNodes(i) >>>>> var fastSeq = fastLookup(childToRemove.name) >>>>> if (fastSeq.length == 1) >>>>> // last one, remove the whole key entry >>>>> fastLookup.remove(childToRemove.name) >>>>> else >>>>> // not the last one, just drop the end >>>>> fastSeq.dropRight(1) >>>>> i -= 1 >>>>> } >>>>> >>>>> // now just quickly remove all those children >>>>> childNodes.reduceToSize(cs._numChildren) >>>>> _numChildren = cs._numChildren >>>>> >>>>> if (cs._arraySize.isDefined) { >>>>> childNodes(_numChildren - 1).reduceToSize(cs._arraySize.get) >>>>> } >>>>> } >>>>> >>>>> The logic here is very similar to the CURRENT code, just is a little >>>>> more simple since it can make certain assumptions. Like removing >>>>> something form the fastSeq just means removing the last one >>>>> (dropRight(1)), and the existing _arraySize handles the special case for >>>>> DIArrays to simplify that logic. >>>>> >>>>> Note that we might want to rethink about the type fastLookup value. >>>>> Rather than the value being a Seq, it probably wants to be some sort of >>>>> mutable Seq that has constant time append dropRight. Maybe another >>>>> ArrayBuf is the right data structure here? That would allow something like >>>>> >>>>> fastLookup.reduceToSize(fastSeq.length - 1) >>>>> >>>>> >>>>> >>>>> >>>>>> // ORIGINAL >>>>>> // final def restoreFrom(cs: DIComplexSharedImplMixin) { >>>>>> // Assert.invariant(_arraySize == MaybeInt.Nope) >>>>>> // >>>>>> // var i = _lastSlotAdded >>>>>> // while (i > cs._lastSlotAdded) { >>>>>> // _slots(i) = null >>>>>> // i -= 1 >>>>>> // } >>>>>> // _lastSlotAdded = cs._lastSlotAdded >>>>>> // >>>>>> // // check invariant. All slots after last slot added >>>>>> // // should be null. >>>>>> // { >>>>>> // var i = _lastSlotAdded + 1 >>>>>> // while (i < _slots.length) { >>>>>> // Assert.invariant(_slots(i) eq null) >>>>>> // i = i + 1 >>>>>> // } >>>>>> // } >>>>>> // >>>>>> // if (cs._arraySize.isDefined) { >>>>>> // >>>>>> _slots(_lastSlotAdded).asInstanceOf[DIArray].reduceToSize(cs._arraySize.get) >>>>>> // } >>>>>> // } >>>>>> >>>>>> // CURRENT, lot of logic here. Wondering if it's even necessary, thus >>>>>> the question above >>>>>> // >>>>>> final def restoreFrom(cs: DIComplexSharedImplMixin) { >>>>>> Assert.invariant(_arraySize == MaybeInt.Nope) >>>>>> >>>>>> var i = _lastSlotAdded >>>>>> while (i > cs._lastSlotAdded) { >>>>>> val lastNode = childNodes.last >>>>>> val name = lastNode.namedQName.toQNameString >>>>>> >>>>>> lastNode match { >>>>>> case arr: DIArray if (arr.length > 1) => { >>>>>> // No need to delete the array, just the last DINode in it >>>>>> arr.pop() >>>>>> } >>>>>> case arr: DIArray if (arr.length == 1) => { >>>>>> // Last DINode in array, remove the array entirely >>>>>> // >>>>>> arr.pop() >>>>>> // >>>>>> // Update insertion order list >>>>>> // >>>>>> childNodes.remove(childNodes.length - 1) >>>>>> // >>>>>> // Update Fast Lookup >>>>>> // >>>>>> val seq = fastLookup(name) >>>>>> if (seq.length <= 1) { >>>>>> // Last DINode with THIS name, remove it entirely. >>>>>> // >>>>>> fastLookup.remove(name) >>>>>> } else { >>>>>> // Not the last DINode with THIS name, just remove this >>>>>> particular one. >>>>>> // >>>>>> val idx = seq.lastIndexWhere(node => >>>>>> node.isInstanceOf[DIArray]) >>>>>> val result = seq diff Seq(lastNode) >>>>>> fastLookup.update(name, result) >>>>>> } >>>>>> } >>>>>> case _ => { >>>>>> // >>>>>> // Update insertion order list >>>>>> // >>>>>> childNodes.remove(childNodes.length - 1) >>>>>> // >>>>>> // Update Fast Lookup >>>>>> // >>>>>> val seq = fastLookup(name) >>>>>> if (seq.length <= 1) { >>>>>> // Last DINode with THIS name, remove it entirely >>>>>> // >>>>>> fastLookup.remove(name) >>>>>> } else { >>>>>> // Not the last DINode with THIS name. Just remove this one! >>>>>> // >>>>>> val idx = seq.lastIndexWhere(node => >>>>>> !node.isInstanceOf[DIArray]) >>>>>> val result = seq diff Seq(lastNode) >>>>>> fastLookup.update(name, result) >>>>>> } >>>>>> } >>>>>> } >>>>>> } >>>>>> >>>>>> } >>>>>> >>>>>> The last failure in SimpleTypes, I haven't run the rest of the suite yet >>>>>> though. Baby steps. >>>>>> java.lang.Exception: >>>>>> Comparison failed. >>>>>> Expected >>>>>> >>>>>> <hb_root01><text>57652063616E277420626520636F6E73756D6564206279206F757220706574747920646966666572656E63657320616E79206D6F7265</text><text>2057652077696C6C20626520756E6974656420696E206F757220636F6D6D6F6E20696E746572657374</text><text>205065726861707320697427732066617465207468617420746F6461792069732074686520347468206F66204A756C79</text><text>20616E6420796F752077696C6C206F6E636520616761696E206265206669676874696E6720666F72206F75722066726565646F6D</text></hb_root01> >>>>>> Actual >>>>>> >>>>>> <hb_root01><text>57652063616E277420626520636F6E73756D6564206279206F757220706574747920646966666572656E63657320616E79206D6F7265</text><text>2057652077696C6C20626520756E6974656420696E206F757220636F6D6D6F6E20696E746572657374</text><text>205065726861707320697427732066617465207468617420746F6461792069732074686520347468206F66204A756C79</text><text>20616E6420796F752077696C6C206F6E636520616761696E206265206669676874696E6720666F72206F75722066726565646F6D</text><text/></hb_root01> >>>>>> Differences were (path, expected, actual): >>>>>> (hb_root01,List(),List(<text/>)) >>>>>> at >>>>>> edu.illinois.ncsa.daffodil.xml.XMLUtils$.compareAndReport(XMLUtils.scala:678) >>>>>> at >>>>>> edu.illinois.ncsa.daffodil.tdml.VerifyTestCase$.verifyParserTestData(TDMLRunner.scala:1074) >>>>>> at >>>>>> edu.illinois.ncsa.daffodil.tdml.ParserTestCase.runParseExpectSuccess(TDMLRunner.scala:776) >>>>>> at >>>>>> edu.illinois.ncsa.daffodil.tdml.ParserTestCase$$anonfun$runProcessor$3.apply(TDMLRunner.scala:675) >>>>>> at >>>>>> edu.illinois.ncsa.daffodil.tdml.ParserTestCase$$anonfun$runProcessor$3.apply(TDMLRunner.scala:673) >>>>>> at scala.util.Either$RightProjection.foreach(Either.scala:468) >>>>>> at >>>>>> edu.illinois.ncsa.daffodil.tdml.ParserTestCase.runProcessor(TDMLRunner.scala:673) >>>>>> at edu.illinois.ncsa.daffodil.tdml.TestCase.run(TDMLRunner.scala:623) >>>>>> at >>>>>> edu.illinois.ncsa.daffodil.tdml.DFDLTestSuite.runOneTestWithDataVolumes(TDMLRunner.scala:340) >>>>>> at >>>>>> edu.illinois.ncsa.daffodil.tdml.DFDLTestSuite.runOneTest(TDMLRunner.scala:328) >>>>>> at >>>>>> edu.illinois.ncsa.daffodil.tdml.Runner.runOneTest(RunnerFactory.scala:122) >>>>>> at >>>>>> edu.illinois.ncsa.daffodil.section05.simple_types.TestSimpleTypes.test_hexBinary_Delimited_01(TestSimpleTypes.scala:63) >>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) >>>>>> at >>>>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) >>>>>> at >>>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) >>>>>> at java.lang.reflect.Method.invoke(Method.java:498) >>>>>> at >>>>>> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) >>>>>> at >>>>>> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) >>>>>> at >>>>>> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) >>>>>> at >>>>>> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) >>>>>> at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) >>>>>> at >>>>>> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) >>>>>> at >>>>>> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) >>>>>> at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) >>>>>> at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) >>>>>> at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) >>>>>> at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) >>>>>> at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) >>>>>> at >>>>>> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) >>>>>> at org.junit.runners.ParentRunner.run(ParentRunner.java:363) >>>>>> at >>>>>> org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:86) >>>>>> at >>>>>> org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38) >>>>>> at >>>>>> org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:459) >>>>>> at >>>>>> org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:678) >>>>>> at >>>>>> org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:382) >>>>>> at >>>>>> org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:192) >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> ________________________________ >>>>>> From: Steve Lawrence <[email protected]> >>>>>> Sent: Wednesday, October 11, 2017 4:07:51 PM >>>>>> To: [email protected]; Mike Beckerle; Taylor Wise >>>>>> Subject: Re: Remove concept of slots from InfosetImpl Design Discussion >>>>>> >>>>>> On 10/11/2017 01:01 PM, Mike Beckerle wrote: >>>>>>> You have to hash lookup the name to append to an array if you're just >>>>>>> appending one thing at a time starting from the slot/hashID. The >>>>>>> current Infoset API hides the construction of the DIArray, and only >>>>>>> provides access via the slot/hashID, so there's no way to avoid the >>>>>>> hash per each append. >>>>>> >>>>>> The hashmap and hashlookups should really only be needed for elements >>>>>> that are used in expressions. We came up with that optimization so that >>>>>> we could remove the vast majority of hash lookups. Due to how >>>>>> getChildArray works, that optimization can't occur. I'm not sure how >>>>>> much this will affect performance, but there are ways to avoid the need >>>>>> for a hash lookup for arrays not used in expressions. The removal of >>>>>> DIArray was one. Maybe API/parser changes is better. >>>>>> >>>>>>> But that's fixable if we want. Probably the API to infoset needs to >>>>>>> allow appending to an array more efficently, so that a loop like a >>>>>>> repeat combinator can be directly holding onto the DIArray's >>>>>>> arrayBuffer appending objects as the parser returns them. But I would >>>>>>> not make this change as yet. First things first. Don't optimize >>> without profiling info, etc. >>>>>>> >>>>>>> >>>>>>> I'm very reluctant to remove DIArray because XML's data model isn't the >>>>>>> only game in town. The XML decision to not have arrays, just adjacent >>>>>>> things that happen to have the same name, has all sorts of negative >>>>>>> performance implications. >>>>>> >>>>>> The removal of a DIArray doesn't necessarily mean we don't have a >>>>>> concept of an arrays. Every element added to Complex child arrayBuf >>>>>> still knows if it part of an array via ERD.isArray. And comparing the >>>>>> current DIElements ERD to the previous and next is enough to determine >>>>>> where array starts and ends. From there, we can support any model that >>>>>> requires a concept of an array. I'm not saying we should remove the >>>>>> concept of arrays, just that our InfosetImpl doesn't necessarily need to >>>>>> have a special DIArray. I'll agree that it does make the logic easier to >>>>>> get it to a different model, but other things are more complicated. >>>>>> >>>>>>> Consider this: we could create DIArrays of elements of simple type that >>>>>>> don't even create a DISimple node for each array element. The values >>>>>>> could be stored in the DIArray's arrayBuffer directly, eliminating all >>>>>>> that overhead of a DISimple node. (This was the original intent for >>>>>>> DIArray, but was punted to get it working sooner.) >>> The opportunity to do this is entirely lost if you don't have DIArray, but >>> just >>> a bunch of adjacent DIElements that happen to share ERD. >>>>>> >>>>>> Fair point. Seems a good argument for keeping DIArray, but if we don't >>>>>> get around to doing this optimization, we're taking multiple hits. We're >>>>>> still allocating many DISimples plus a DIArray, plus the hit from the >>>>>> hashmap. >>>>>> >>>>>>> More speculatively, when we someday implement incremental array >>>>>>> trimming so that streaming parse/unparse can drop most array elements >>>>>>> once they've been delivered to consumer, well then we will definitely >>>>>>> want a DIArray object to abstract the complexity of this trim behavior >>>>>>> and the underlying representation it requires. >>>>>> >>>>>> It might simplify this, but I'm not entirely convinced. I think the >>>>>> trimming of unused Infoset elements is going to have some complex >>>>>> heuristics regardless. Adding an extra heuristic for arrays won't >>>>>> necessarily make this easier. But this is just speculation. And again, >>>>>> we would still know elements are part of an array, it just takes a >>>>>> couple extra comparisons to determine start/end. But I do agree that >>>>>> removal of an an entire array is much easier with a DIArray concept. >>>>>> >>>>>> >>>>>> >>>>>> It seems to me these are good enough arguments for keeping DIArray >>>>>> around, but it might be worth thinking a little bit about what changes >>>>>> would be necessary to the API/parser to not need this hash lookup. If >>>>>> it's a minor change, it could reduce a lot of hash lookups, especially >>>>>> for large arrays, which are not uncommon. >>>>>> >>>>>> >>>>>>> >>>>>>> ________________________________ >>>>>>> From: Steve Lawrence <[email protected]> >>>>>>> Sent: Wednesday, October 11, 2017 12:21 PM >>>>>>> To: Taylor Wise; [email protected] >>>>>>> Subject: Re: Remove concept of slots from InfosetImpl Design Discussion >>>>>>> >>>>>>> On 10/11/2017 12:11 PM, Taylor Wise wrote: >>>>>>>> >>>>>>>> What you'll notice is that the current implementation will construct >>>>>>>> the DIArray >>>>>>>> if it doesn't already exist and append to it if it does already exist. >>>>>>>> But it >>>>>>>> needs the 'slot' number in order to check this. The slot number is >>>>>>>> obtained >>>>>>>> from DPathCompileInfo.slotIndexInParent. If we're removing this >>>>>>>> concept of slots >>>>>>>> entirely, how can we accomplish this? Will we have to do a named >>>>>>>> lookup then >>>>>>>> every time we want to append to a DIArray? >>>>>>> I think you're right. The way things work right now, anytime we add to a >>>>>>> a DIArray we would have to do a hash lookup on the DIComplex. It's not >>>>>>> limited to just DPath. >>>>>>> >>>>>>> Maybe this is another argument for removing the concept of a DIArray? We >>>>>>> still need to know if a DIElement is part of an array for the >>>>>>> InfosetOutputter (erd.isArray), but maybe adding an DIElement a DIArray >>>>>>> element should just means creating a new DISimple/DIComplex and adding >>>>>>> it to the DIComplex child ArrayBuffer and appending to the HashLookup >>>>>>> Seq? This would also simplify some query style logic, e.g. fn:count. >>>>>>> >>>>>>> Alternatively, we could modify the parser logic so that when it's >>>>>>> dealing with arrays it stores the DIArray locally somewhere (PState or >>>>>>> Parser) and appends straight to that without having to ask the parent >>>>>>> DIComplex for the DIArray? Not sure how big of a change that would be >>>>>>> though. >>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> Current Implementation: >>>>>>>> >>>>>>>> override def addChild(e: InfosetElement): Unit = { >>>>>>>> >>>>>>>> if (e.runtimeData.isArray) { >>>>>>>> >>>>>>>> // Make sure there is an array to accept the child >>>>>>>> >>>>>>>> val arr = getChildArray(e.runtimeData) // creates if doesn't exist >>>>>>>> >>>>>>>> Assert.invariant(arr ne null) >>>>>>>> >>>>>>>> arr.append(e) >>>>>>>> >>>>>>>> } else { >>>>>>>> >>>>>>>> _slots(e.runtimeData.slotIndexInParent) = e.asInstanceOf[DINode] >>>>>>>> >>>>>>>> _lastSlotAdded = e.runtimeData.slotIndexInParent >>>>>>>> >>>>>>>> } >>>>>>>> >>>>>>>> e.setParent(this) >>>>>>>> >>>>>>>> } >>>>>>>> >>>>>>>> >>>>>>>> final def getChildArray(childERD: ElementRuntimeData): InfosetArray >>>>>>>> = { >>>>>>>> Assert.usage(childERD.isArray) >>>>>>>> getChildArray(childERD.dpathElementCompileInfo) >>>>>>>> } >>>>>>>> >>>>>>>> final def getChildArray(info: DPathElementCompileInfo): >>>>>>>> InfosetArray = { >>>>>>>> Assert.usage(info.isArray) >>>>>>>> val slot = info.slotIndexInParent >>>>>>>> getChildArray(slot) >>>>>>>> } >>>>>>>> >>>>>>>> private def getChildArray(slot: Int): InfosetArray = { >>>>>>>> val slotVal = _slots(slot) >>>>>>>> if (slotVal ne null) >>>>>>>> slotVal match { >>>>>>>> case arr: DIArray => slotVal.asInstanceOf[DIArray] >>>>>>>> case _ => Assert.usageError("not an array") >>>>>>>> } >>>>>>>> else { >>>>>>>> val arrayERD = erd.childERDs(slot) >>>>>>>> // slot is null. There isn't even an array object yet. >>>>>>>> // create one (it will have zero entries) >>>>>>>> val ia = new DIArray(arrayERD, this) >>>>>>>> // no array there yet. So we have to create one. >>>>>>>> setChildArray(slot, ia) >>>>>>>> ia >>>>>>>> } >>>>>>>> } >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> -------------------------------------------------------------------------------- >>>>>>>> *From:* Steve Lawrence <[email protected]> >>>>>>>> *Sent:* Tuesday, October 3, 2017 10:05:43 AM >>>>>>>> *To:* [email protected]; Taylor Wise >>>>>>>> *Subject:* Re: Remove concept of slots from InfosetImpl Design >>>>>>>> Discussion >>>>>>>> On 09/22/2017 04:25 PM, Taylor Wise wrote: >>>>>>>>> Goal: Remove concept of slots from InfosetImpl >>>>>>>>> >>>>>>>>> Reasons: >>>>>>>>> 1. So that element/node lookup can be name based. >>>>>>>>> 2. Ease re-enabling of Unordered Sequences. >>>>>>>>> 3. Ease enabling of 'query-style' expressions that return more >>>>>>>>> than >>>>>>>>> one node. >>>>>>>>> >>>>>>>>> An Infoset is a document (DIDocument) and the document contains >>>>>>>>> exactly >>>>>>>>> one element (DIElement) as its root. These elements can be simple >>>>>>>>> (DISimple) or complex (DIComplex). Simple elements do not contain >>>>>>>>> children, complex elements do. DIArray represents an array (maxOccurs >>>>>>>>>> 1). A DIArray shall always be a child of DIComplex. All of these >>>>>>>>> inherit from DINode. >>>>>>>>> >>>>>>>>> Currently DIComplex models its children as an Array of DINodes. The >>>>>>>>> problem with the current implementation is that there's no way for us >>>>>>>>> to >>>>>>>>> just ask for an element by name without walking the structure. >>>>>>>>> >>>>>>>>> The solution involves modifying DIComplex to represent its children >>>>>>>>> both >>>>>>>>> as a mutable.ArrayBuffer[DINode] and a HashMap[String, >>>>>>>>> mutable.Seq[DINode]]. The ArrayBuffer shall give us the necessary >>>>>>>>> insertion ordering information we need regarding the children as well >>>>>>>>> as >>>>>>>>> constant time indexing access to the child in question. The >>>>>>>>> ArrayBuffer >>>>>>>>> also has the added benefit of growing and shrinking as objects are >>>>>>>>> added >>>>>>>>> and removed. While the HashMap will allow us to associate a name >>>>>>>>> (String) with DINodes that have that name thus providing a fast named >>>>>>>>> lookup. >>>>>>>>> >>>>>>>>> This named lookup can be optimized further by only inserting into the >>>>>>>>> HashMap those elements that we know are used in expressions. These are >>>>>>>>> the elements that would most likely benefit from a fast named lookup >>>>>>>>> and >>>>>>>>> will allow us to avoid additional overhead of hashing strings and >>>>>>>>> allocating/appending to a sequence. >>>>>>>>> >>>>>>>>> The reason for having the HashMap contain a mutable.Seq[DINode] is >>>>>>>>> that >>>>>>>>> it's possible for the following: >>>>>>>>> >>>>>>>>> <element name="a" .../> >>>>>>>>> <element name="a" maxOccurs="unbounded"..../> >>>>>>>>> <element name="b".../> >>>>>>>>> <element name="a" ..../> >>>>>>>>> >>>>>>>>> Each element named 'a' would have an associated DINode. Because there >>>>>>>>> are multiple DINodes named 'a', we need to represent them in the >>>>>>>>> HashMap >>>>>>>>> using the same key 'a'. To do this, we add all instances of 'a' to a >>>>>>>>> Seq[DINode]. In this particular case, the HashMap would look as >>>>>>>>> follows: >>>>>>>>> >>>>>>>>> "a" -> mutable.Seq(DIElement, DIArray, DIElement) >>>>>>>>> "b" -> mutable.Seq(DIElement) >>>>>>>>> >>>>>>>>> The mutable.ArrayBuffer maintains the insertion order and would look >>>>>>>>> as >>>>>>>>> follows: >>>>>>>>> >>>>>>>>> mutable.ArrayBuffer(DIElement, DIArray, DIElement, DIElement) >>>>>>>>> >>>>>>>>> Back tracking shall be accomplished by going to the last item in the >>>>>>>>> DIComplex ArrayBuffer and removing it. >>>>>>>>> 1. If the last item is a DIElement, we simply delete it from the >>>>>>>>> ArrayBuffer and then remove it also from the HashMap. >>>>>>>>> 2. If the last item is a DIArray, we then have to look at the >>>>>>>>> contents of the DIArray. >>>>>>>>> a. If the DIArray has more than one entry, remove the last >>>>>>>>> entry. Then update the associated entry in the ArrayBuffer and >>>>>>>>> HashMap >>>>>>>>> with the modified DIArray. >>>>>>>> >>>>>>>> I think the second part of this is actually unnecessary. Since the >>>>>>>> DIArray is mutable, we only need to remove its last child. The entry in >>>>>>>> the ArrayBuffer and HashMap will still point to the same DIArray. The >>>>>>>> only difference is the contents of the DIArray will be one smaller. >>>>>>>> >>>>>>>>> b. If the DIArray has a single entry then we simply remove >>>>>>>>> DIArray from ArrayBuffer and the HashMap. >>>>>>>>> >>>>>>>>> <element name="root> >>>>>>>>> <sequence ../> >>>>>>>>> <element name="a" .../> >>>>>>>>> <element name="a" maxOccurs="unbounded"> >>>>>>>>> ... >>>>>>>>> <element name="b" .../> >>>>>>>>> <element name="c" .../> >>>>>>>>> ... >>>>>>>>> </element> >>>>>>>>> </sequence> >>>>>>>>> </element> >>>>>>>>> >>>>>>>>> If the HashMap of DIComplex-root looks like the following: >>>>>>>>> "a" -> mutable.Seq(DIElement, DIArray) >>>>>>>>> >>>>>>>>> ArrayBuffer of DIComplex-root looks like the following: >>>>>>>>> 0 -> DIElement("a") >>>>>>>>> 1 -> DIArray("a") >>>>>>>>> >>>>>>>>> Where DIArray children ArrayBuffer looks like: >>>>>>>>> 0 -> DIElement("b") >>>>>>>>> 1 -> DIElement("c") >>>>>>>>> >>>>>>>>> Given the initial example if we backtrack(1): // purge one item >>>>>>>>> DIArray: >>>>>>>>> 0 -> DIElement("b") // removed one item from DIArray >>>>>>>>> >>>>>>>>> HashMap DIComplex-root: >>>>>>>>> "a" -> mutable.Seq(DIElement, DIArray) >>>>>>>>> >>>>>>>>> ArrayBuffer DIComplex-root: >>>>>>>>> 0 -> DIElement("a") >>>>>>>>> 1 -> DIArray("a") >>>>>>>>> >>>>>>>>> Given the initial example if we backtrack(2): // purge two items >>>>>>>>> DIArray: >>>>>>>>> Empty >>>>>>>>> >>>>>>>>> HashMap DIComplex-root: >>>>>>>>> "a" -> mutable.Seq(DIElement) // Because DIArray had two >>>>>>>>> items >>>>>>>>> // removed and is now empty >>>>>>>>> >>>>>>>>> ArrayBuffer DIComplex-root: >>>>>>>>> 0 -> DIElement("a") >>>>>>>>> >>>>>>>>> === >>>>>>>>> >>>>>>>>> class DIComplex { >>>>>>>>> def nSlots = erd.nChildSlots >>>>>>>> >>>>>>>> I think the concept of nSlots and nChildSlots should go away >>>>>>>> completely. >>>>>>>> The children ArrayBuffer should just grow as elements are added and >>>>>>>> shrink as they are backtracked. It might make sense to have a tunable >>>>>>>> the determines the initial size of the children Array, but the Scala >>>>>>>> defaults for initializing and growing a mutable array might already be >>>>>>>> reasonable. >>>>>>>> >>>>>>>>> val children: mutable.ArrayBuffer[DINode](nSlots) // insertion order >>>>>>>>> lazy val fastLookup: HashMap[String, mutable.Seq[DINode]] >>>>>>>> >>>>>>>> One thing that might make this change easier is to not worry about the >>>>>>>> mutable.Seq, and just make this a HashMap[String, DINode]. The Seq is >>>>>>>> really only necessary to support query style expressions, but that >>>>>>>> comes >>>>>>>> with a whole other slew of concerns (e.g. fn:count, same name but >>>>>>>> different types, etc.). There currently exist restrictions that do not >>>>>>>> allow query style expressions, so this Seq should always be a size of 1 >>>>>>>> until remove those restrictions. >>>>>>>> >>>>>>>>> >>>>>>>>> def addChild(node: DINode) { >>>>>>>>> children += node >>>>>>>>> fastLookup(node.name).append(node) >>>>>>>>> } >>>>>>>>> >>>>>>>>> def getChild(name: String): immutable.Seq[DINode] { >>>>>>>>> fastLookup.getOrElse(name, mutable.Seq.empty) // fastLookup(name) >>>>>>>>> } >>>>>>>>> >>>>>>>>> def backtrack(int n) { >>>>>>>>> for (i = 0; i < n; i++) { >>>>>>>>> val endIdx = children.size -1 >>>>>>>>> val child = children.remove(endIdx) // children.pop() >>>>>>>>> >>>>>>>>> fastLookup.get(child.name) match { // >>>>>>>>> fastLookup(child.name).pop() >>>>>>>>> case None => // do nothing >>>>>>>>> case Some(nodesSeq) if nodesSeq.size == 0 => // Shouldn't happen? >>>>>>>>> case Some(nodesSeq) => { >>>>>>>>> val lastEntry = nodesSeq.last >>>>>>>>> lastEntry match { >>>>>>>>> case arr: DIArray => { >>>>>>>>> // If the array has >= 2 items, remove last one and >>>>>>>>> update >>>>>>>>> nodesSeq >>>>>>>>> // with this array: >>>>>>>>> // val lastIdx = nodesSeq.length - 1 >>>>>>>>> // arr.remove(arr.size - 1) >>>>>>>>> // fastLookup.update(child.name, >>>>>>>>> nodesSeq.update(lastIdx, >>>>>>>>> arr)) >>>>>>>>> // >>>>>>>>> // If the array has 1 item, remove the array from >>>>>>>>> nodesSeq >>>>>>>>> // fastLookup.update(child.name, nodesSeq.dropRight(1)) >>>>>>>>> } >>>>>>>>> case _ => fastLookup.update(child.name, >>>>>>>>> nodesSeq.dropRight(1)) >>>>>>>>> } >>>>>>>>> >>>>>>>>> } >>>>>>>>> } >>>>>>>>> } >>>>>>>>> } >>>>>>>>> >>>>>>>>> } >>>>>>>>> >>>>>>>>> class DIArray { >>>>>>>>> private val initialSize = >>>>>>>>> parent.tunable.initialElementOccurrencesHint.toInt // set to 10 >>>>>>>>> protected final val _contents = new >>>>>>>>> ArrayBuffer[DIElement](initialSize) >>>>>>>>> >>>>>>>>> def append(ie: InfosetElement): Unit = { >>>>>>>>> _contents.append(ie.asInstanceOf[DIElement]) >>>>>>>>> ie.setArray(this) >>>>>>>>> } >>>>>>>>> >>>>>>>>> // >>>>>>>>> // For backtracking purposes >>>>>>>>> // >>>>>>>>> def remove(n: Int): DIElement = { >>>>>>>>> _contents.remove(n) >>>>>>>>> } >>>>>>>>> } >>>>>>>>> >>>>>>>>> === >>>>>>>>> >>>>>>>>> To facillitate a breadth-first ordering scheme we shall have to >>>>>>>> >>>>>>>> That this is only necessary for unordered sequences, where the >>>>>>>> insertion >>>>>>>> order in the ArrayBuffer differs from the display order, correct? If >>>>>>>> that's true, we definitely want to make sure to only do the below >>>>>>>> sorting when an element contains an unordred sequence. >>>>>>>> >>>>>>>>> construct a 'global' that lives on the Element Runtime Data (ERD). It >>>>>>>>> shall be called childElementDeclPosition and pertains to the static >>>>>>>>> order of the element children of a complex type element. When sorting >>>>>>>>> DINodes, childElementDeclPosition can be obtained by way of reaching >>>>>>>>> from the DINode to the coresponding ERD. Any array shall be a single >>>>>>>>> entry in the sequence that needs to be sorted. All elements in this >>>>>>>>> array will have the same childElementDeclPosition. >>>>>>>>> >>>>>>>>> The difficulty in implementing this is that an Element's >>>>>>>>> dsom.ComplexType can have groups within the groups before finding >>>>>>>>> actual >>>>>>>>> element children. Therefore in the Document Schema Object Model >>>>>>>>> (DSOM) >>>>>>>>> the model group object shall have a startingChildElementDeclPosition >>>>>>>>> and >>>>>>>>> an endingChildElementDeclPosition. >>>>>>>>> >>>>>>>>> 1. The startingChildElementPosition shall be passed to the constructor >>>>>>>>> for the Term. >>>>>>>>> 2. If the term is an element, the endingChildElementDeclPosition is >>>>>>>>> equal to the startingChildElementDeclPosition. 3. If the term is a >>>>>>>>> model group, then its first child term's >>>>>>>>> startingChildElementDeclPosition is equal to its own >>>>>>>>> startingChildElementDeclPosition, the 2nd to Nth child term's >>>>>>>>> startingChildElementDeclPosition is the prior term's >>>>>>>>> endingChildElementDeclPosition + 1. >>>>>>>>> 4. The endingChildElementDeclPosition for a model-group shall be the >>>>>>>>> endingChildElmentDeclPosition of its last child term. >>>>>>>>> >>>>>>>> >>>>>>>> Looks good to me. +1. This should definitely make implementation of >>>>>>>> unordered sequences and query style expressions easier. Removal of the >>>>>>>> slot concept should also help remove concepts that prevent faster >>>>>>>> schema >>>>>>>> compilation. >>>>>>>> >>>>>>>> - Steve >>>>>>>> >>>>>>> >>>>>>> >>>>>> >>>>>> >>>>> >>>>> >>>> >>> >> >
