[jira] [Closed] (DAFFODIL-1597) Too many ways that encoding, byteOrder, etc. are being setup

Dave Thompson (JIRA) Tue, 30 Jan 2018 11:49:21 -0800

     [ 
https://issues.apache.org/jira/browse/DAFFODIL-1597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Dave Thompson closed DAFFODIL-1597.
-----------------------------------

Pulled latest updates from incubator-daffodil repository and verified the 
specified commit, ee4177b4c9bf84af2af19d8c9f67f283e233105e, was included.

Reviewed updated files to determine the effected tests. Executed the entire SBT 
test suite. Verified that the effected tests executed successfully.

> Too many ways that encoding, byteOrder, etc. are being setup
> ------------------------------------------------------------
>
>                 Key: DAFFODIL-1597
>                 URL: https://issues.apache.org/jira/browse/DAFFODIL-1597
>             Project: Daffodil
>          Issue Type: Improvement
>          Components: Back End, Clean Ups
>    Affects Versions: 2.0.0
>            Reporter: Michael Beckerle
>            Assignee: Dave Thompson
>            Priority: Major
>             Fix For: 2.1.0
>
>
> This issue affects both parser and unparser.
> It is a code-cleanup/maintainability, and perhaps a bit of a performance 
> issue.
> I'll use encoding as the property for discussion here, but the point applies 
> to byteOrder and possibly bitOrder and maybe a few other things like fillByte 
> also.
> We have this idea that since encoding often doesn't change at all in a data 
> format, that we insert an encoding-change processor (parser or unparser) 
> which just sets the new setting, which is then stored in the 
> DataInputStream/DataOutputStream object, and no per-element overhead is 
> encountered for setting up this property.
> For some formats, encoding changes. This becomes quite tricky if the change 
> occurs inside of a repeating element, as in that case, each repeat might need 
> to begin by setting the encoding to the right one for the start of the 
> repeating element, but then inside that element another set might change it, 
> and that might last until the end of the repeating element, hence, at the 
> start of the next repeat, we must set it back to the proper encoding for the 
> start of the element.
> So, unless the DFDL compiler can optimize it out, any repeating element 
> containing any text must begin with an encodingChange processor. Right now 
> the optimization looks for whether the entire schema has uniform encoding 
> (which is common), but any format that has a runtime-valued expression for 
> encoding will be deemed "unknowable", and the encoding will be assumed to be 
> variable. This is very pessimistic, but a better analysis depends on 
> determining that a runtime-valued expression for encoding is still defined at 
> a high-enough scope (such as top-level) that encoding is not 
> subject-to-change during the element in question. The analysis being done is 
> not that sophisticated currently. 
> The above might not matter much for encoding, but for byteOrder, where we 
> know there are formats that begin with a byte-order indication (such as 
> PCAP), this matters. 
> Now, there are also some parsers (and unparser) primitive processors that 
> expliclity set encoding, or byteOrder. before they carry out their 
> processing. 
> It is possible that encoding change processors are not being inserted 
> everywhere they are needed; hence, if the various setEncoding calls are 
> removed, it may break things. 
> Now, setting the encoding can check if the encoding is the same one it can do 
> little work other than an equality check. So rather than having lots of 
> encoding-change processors that are being inserted due to unsophisticated 
> compile-time optimization, it may actually be better performance to either 
> simply call setEncoding before every textual primitive operation, or it may 
> be better to use Evaluatables, and have the data stream layer access encoding 
> information via the Evaluatables mechanism. Then at least we would have only 
> one code path to focus on for performance improvement. 
> For unparsing: 
> slightly more complicated - any suspended operation needs to "freeze" the 
> value of encoding that is used to unparse, so that subsequent non-suspended 
> unparser operations that change the encoding will not result in the suspended 
> operation using the wrong one. 
> However, since the encoding change unparser will have been run before the 
> suspension is created, and since the data output stream of a suspension is 
> cloned for the suspension, the encoding should be correctly set for unparsers 
> that are suspended. *Should* being the key operative word here. This needs to 
> be verified. (Also, the cloning of the data output stream state is itself a 
> big performance worry, so the work done there needs to be minimized.)
> Whatever mechanism is chosen to resolve the parser issue, the unparser should 
> work the same way.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Closed] (DAFFODIL-1597) Too many ways that encoding, byteOrder, etc. are being setup

Reply via email to