[
https://issues.apache.org/jira/browse/DAFFODIL-1597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Michael Beckerle reassigned DAFFODIL-1597:
------------------------------------------
Assignee: Michael Beckerle
> Too many ways that encoding, byteOrder, etc. are being setup
> ------------------------------------------------------------
>
> Key: DAFFODIL-1597
> URL: https://issues.apache.org/jira/browse/DAFFODIL-1597
> Project: Daffodil
> Issue Type: Improvement
> Components: Back End, Clean Ups
> Affects Versions: 2.0.0
> Reporter: Michael Beckerle
> Assignee: Michael Beckerle
> Priority: Major
> Fix For: deferred
>
>
> This issue affects both parser and unparser.
> It is a code-cleanup/maintainability issue, and perhaps a bit of a
> performance issue as well.
> I'll use encoding as the property for discussion here, but the point applies
> to byteOrder and possibly bitOrder and maybe a few other things like fillByte
> also.
> The idea is that, since encoding often doesn't change at all within a data
> format, we insert an encoding-change processor (parser or unparser) which
> just sets the new value. That value is then stored in the
> DataInputStream/DataOutputStream object, so no per-element overhead is
> incurred for setting up this property.
> For some formats, encoding changes. This becomes quite tricky if the change
> occurs inside of a repeating element, as in that case, each repeat might need
> to begin by setting the encoding to the right one for the start of the
> repeating element, but then inside that element another set might change it,
> and that might last until the end of the repeating element, hence, at the
> start of the next repeat, we must set it back to the proper encoding for the
> start of the element.
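A toy model of the repeating-element problem described above, written in Python rather than Daffodil's Scala. The names `Stream`, `set_encoding`, `parse_record`, and `parse_repeats` are hypothetical stand-ins, not Daffodil APIs. The sketch only shows that an encoding held as stream state leaks from one repeat into the next unless each repeat begins with an encoding change.

```python
class Stream:
    """Stand-in for a data input stream that holds the current encoding as state."""

    def __init__(self, encoding):
        self.encoding = encoding

    def set_encoding(self, encoding):
        # What an encoding-change processor would do: update the stream state.
        self.encoding = encoding


def parse_record(stream, reset_at_start):
    """One occurrence of a repeating element: a header field, then a body
    field whose encoding differs from the header's."""
    if reset_at_start:
        stream.set_encoding("ascii")   # encoding change at the start of the repeat
    header_encoding = stream.encoding  # encoding the header field actually sees
    stream.set_encoding("utf-16")      # encoding change inside the element
    body_encoding = stream.encoding
    return header_encoding, body_encoding


def parse_repeats(count, reset_at_start):
    stream = Stream("ascii")
    return [parse_record(stream, reset_at_start) for _ in range(count)]


leaky = parse_repeats(2, reset_at_start=False)
fixed = parse_repeats(2, reset_at_start=True)
```

Without the reset, the second repeat's header field sees `utf-16`, left over from the first repeat's body. With the reset, every repeat starts in `ascii`.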
> So, unless the DFDL compiler can optimize it out, any repeating element
> containing any text must begin with an encodingChange processor. Right now
> the optimization looks for whether the entire schema has uniform encoding
> (which is common), but any format that has a runtime-valued expression for
> encoding will be deemed "unknowable", and the encoding will be assumed to be
> variable. This is very pessimistic, but a better analysis depends on
> determining that a runtime-valued expression for encoding is still defined at
> a high-enough scope (such as top-level) that encoding is not
> subject-to-change during the element in question. The analysis being done is
> not that sophisticated currently.
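The difference between the coarse whole-schema check and a scope-aware check might be sketched as follows. This is an illustrative Python model, not the DFDL compiler's actual analysis: `Elem`, `whole_schema_uniform`, `stable_within`, and the expression text `{ $fileEncoding }` are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Elem:
    name: str
    # None: inherit from the enclosing scope. Otherwise a constant such as
    # "ascii", or a runtime-valued expression written in braces.
    encoding: Optional[str] = None
    children: List["Elem"] = field(default_factory=list)


def is_runtime(encoding: str) -> bool:
    return encoding.startswith("{")


def declared_encodings(elem: Elem) -> Iterator[str]:
    """Every encoding declared on elem or anywhere below it."""
    if elem.encoding is not None:
        yield elem.encoding
    for child in elem.children:
        yield from declared_encodings(child)


def whole_schema_uniform(root: Elem) -> bool:
    """Coarse check: exactly one constant encoding for the entire schema."""
    found = set(declared_encodings(root))
    return len(found) == 1 and not any(is_runtime(e) for e in found)


def stable_within(elem: Elem) -> bool:
    """Finer check: nothing below elem redeclares encoding, so whatever is in
    effect when elem starts stays in effect until elem ends."""
    return all(not list(declared_encodings(c)) for c in elem.children)


packet = Elem("packet", children=[Elem("payload")])
root = Elem("file", encoding="{ $fileEncoding }",
            children=[Elem("header"), packet])
mixed = Elem("record", children=[Elem("note", encoding="utf-8")])
```

Here `whole_schema_uniform(root)` is false because the only declaration is runtime-valued, yet `stable_within(packet)` is true: the expression sits at top level and nothing inside `packet` overrides it, so a repeating `packet` would not need its own encoding change under this model.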
> The above might not matter much for encoding, but for byteOrder, where we
> know there are formats that begin with a byte-order indication (such as
> PCAP), this matters.
> Now, there are also some parser (and unparser) primitive processors that
> explicitly set encoding or byteOrder before they carry out their
> processing.
> It is possible that encoding change processors are not being inserted
> everywhere they are needed; hence, if the various setEncoding calls are
> removed, it may break things.
> Now, setting the encoding can check whether the new encoding is the same as
> the current one, in which case it does little work beyond an equality check.
> So rather than having lots of encoding-change processors inserted because of
> unsophisticated compile-time optimization, it may actually perform better
> either to simply call setEncoding before every textual primitive operation,
> or to use Evaluatables and have the data stream layer access encoding
> information via the Evaluatables mechanism. Then at least we would have only
> one code path to focus on for performance improvement.
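A minimal sketch of the "call setEncoding before every textual primitive" option, assuming the setter short-circuits when the value is unchanged. It is written in Python for illustration; `TextStream`, `set_encoding`, `parse_text`, and the `rebuilds` counter are hypothetical and do not reflect Daffodil's actual stream classes. Only `codecs.getincrementaldecoder` is a real library call.

```python
import codecs


class TextStream:
    """Stand-in for a data stream that caches a decoder for the current encoding."""

    def __init__(self, encoding):
        self.encoding = None
        self.decoder = None
        self.rebuilds = 0          # how often the expensive path was taken
        self.set_encoding(encoding)

    def set_encoding(self, encoding):
        if encoding == self.encoding:
            return                 # same value: only the equality check is paid
        self.encoding = encoding
        self.decoder = codecs.getincrementaldecoder(encoding)()
        self.rebuilds += 1

    def parse_text(self, data, encoding):
        # Every textual primitive sets the encoding first, unconditionally.
        self.set_encoding(encoding)
        return self.decoder.decode(data)


stream = TextStream("ascii")
fields = [
    stream.parse_text(b"ab", "ascii"),
    stream.parse_text(b"cd", "ascii"),
    stream.parse_text(b"h\x00i\x00", "utf-16-le"),
]
```

Three primitives run, but the decoder is rebuilt only when the encoding actually changes, so the per-primitive cost in the common uniform case is a single comparison.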
> For unparsing:
> This is slightly more complicated. Any suspended operation needs to "freeze"
> the value of encoding that is used to unparse, so that subsequent
> non-suspended unparser operations that change the encoding will not result
> in the suspended operation using the wrong one.
> However, since the encoding change unparser will have been run before the
> suspension is created, and since the data output stream of a suspension is
> cloned for the suspension, the encoding should be correctly set for unparsers
> that are suspended. *Should* being the key operative word here. This needs to
> be verified. (Also, the cloning of the data output stream state is itself a
> big performance worry, so the work done there needs to be minimized.)
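The expectation that cloning "freezes" the encoding for a suspension can be modeled as below. This is a Python sketch under stated assumptions: `OutStream`, `clone_state`, and `Suspension` are invented names, and the real data output stream carries far more state than one field. It illustrates the intended behavior and does not verify what Daffodil does.

```python
class OutStream:
    """Stand-in for a data output stream with encoding held as state."""

    def __init__(self, encoding):
        self.encoding = encoding
        self.chunks = []

    def clone_state(self):
        # Copy only the state a suspended unparser needs; the clone gets its
        # own buffer. Keeping this copy small is the performance concern.
        return OutStream(self.encoding)

    def put_text(self, text):
        self.chunks.append(text.encode(self.encoding))


class Suspension:
    """A deferred unparse that captures stream state when it is created."""

    def __init__(self, stream, text):
        self.stream = stream.clone_state()
        self.text = text

    def resume(self):
        self.stream.put_text(self.text)


main = OutStream("ascii")
suspended = Suspension(main, "len")  # created while the encoding is ascii
main.encoding = "utf-16-le"          # a later, non-suspended encoding change
main.put_text("A")
suspended.resume()                   # still unparses with ascii
```

Because the clone is taken at suspension time, the later change on `main` does not reach the suspended operation, which is the property the paragraph above says still needs to be verified in the real implementation.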
> Whatever mechanism is chosen to resolve the parser issue, the unparser should
> work the same way.