[jira] [Assigned] (DAFFODIL-1597) Too many ways that encoding, byteOrder, etc. are being setup

Michael Beckerle (JIRA) Thu, 25 Jan 2018 13:56:22 -0800

     [ 
https://issues.apache.org/jira/browse/DAFFODIL-1597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Michael Beckerle reassigned DAFFODIL-1597:
------------------------------------------

    Assignee: Michael Beckerle

> Too many ways that encoding, byteOrder, etc. are being setup
> ------------------------------------------------------------
>
>                 Key: DAFFODIL-1597
>                 URL: https://issues.apache.org/jira/browse/DAFFODIL-1597
>             Project: Daffodil
>          Issue Type: Improvement
>          Components: Back End, Clean Ups
>    Affects Versions: 2.0.0
>            Reporter: Michael Beckerle
>            Assignee: Michael Beckerle
>            Priority: Major
>             Fix For: deferred
>
>
> This issue affects both parser and unparser.
> It is a code-cleanup/maintainability, and perhaps a bit of a performance 
> issue.
> I'll use encoding as the property for discussion here, but the point applies 
> to byteOrder and possibly bitOrder and maybe a few other things like fillByte 
> also.
> We have this idea that since encoding often doesn't change at all in a data 
> format, that we insert an encoding-change processor (parser or unparser) 
> which just sets the new setting, which is then stored in the 
> DataInputStream/DataOutputStream object, and no per-element overhead is 
> encountered for setting up this property.
> For some formats, encoding changes. This becomes quite tricky if the change 
> occurs inside of a repeating element, as in that case, each repeat might need 
> to begin by setting the encoding to the right one for the start of the 
> repeating element, but then inside that element another set might change it, 
> and that might last until the end of the repeating element, hence, at the 
> start of the next repeat, we must set it back to the proper encoding for the 
> start of the element.
> So, unless the DFDL compiler can optimize it out, any repeating element 
> containing any text must begin with an encodingChange processor. Right now 
> the optimization looks for whether the entire schema has uniform encoding 
> (which is common), but any format that has a runtime-valued expression for 
> encoding will be deemed "unknowable", and the encoding will be assumed to be 
> variable. This is very pessimistic, but a better analysis depends on 
> determining that a runtime-valued expression for encoding is still defined at 
> a high-enough scope (such as top-level) that encoding is not 
> subject-to-change during the element in question. The analysis being done is 
> not that sophisticated currently. 
> The above might not matter much for encoding, but for byteOrder, where we 
> know there are formats that begin with a byte-order indication (such as 
> PCAP), this matters. 
> Now, there are also some parsers (and unparser) primitive processors that 
> expliclity set encoding, or byteOrder. before they carry out their 
> processing. 
> It is possible that encoding change processors are not being inserted 
> everywhere they are needed; hence, if the various setEncoding calls are 
> removed, it may break things. 
> Now, setting the encoding can check if the encoding is the same one it can do 
> little work other than an equality check. So rather than having lots of 
> encoding-change processors that are being inserted due to unsophisticated 
> compile-time optimization, it may actually be better performance to either 
> simply call setEncoding before every textual primitive operation, or it may 
> be better to use Evaluatables, and have the data stream layer access encoding 
> information via the Evaluatables mechanism. Then at least we would have only 
> one code path to focus on for performance improvement. 
> For unparsing: 
> slightly more complicated - any suspended operation needs to "freeze" the 
> value of encoding that is used to unparse, so that subsequent non-suspended 
> unparser operations that change the encoding will not result in the suspended 
> operation using the wrong one. 
> However, since the encoding change unparser will have been run before the 
> suspension is created, and since the data output stream of a suspension is 
> cloned for the suspension, the encoding should be correctly set for unparsers 
> that are suspended. *Should* being the key operative word here. This needs to 
> be verified. (Also, the cloning of the data output stream state is itself a 
> big performance worry, so the work done there needs to be minimized.)
> Whatever mechanism is chosen to resolve the parser issue, the unparser should 
> work the same way.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Assigned] (DAFFODIL-1597) Too many ways that encoding, byteOrder, etc. are being setup

Reply via email to