mbeckerle commented on code in PR #1138:
URL: https://github.com/apache/daffodil/pull/1138#discussion_r1447525715
##########
daffodil-runtime1/src/main/scala/org/apache/daffodil/runtime1/processors/EvTextNumber.scala:
##########
@@ -186,6 +186,31 @@ class TextNumberFormatEv(
}
}
+ // If textNumberPattern specifies a pad character before the number
pattern and without a
+ // positive prefix, then ICU defaults to a pad position of
PAD_BEFORE_PREFIX with no way to
+ // change it with just the pattern. This is reasonable for most cases,
like when the pad
+ // character is a space. However, if the pad character in
textNumberPattern is '0', then
+ // negative numbers are padded with a '0' before the negative sign. For
example, a pattern
+ // of "*0####0" unparses -123 to "0-123". This is very unlikely to be what
the user wants
+ // with this pattern.
+ //
+ // So in this very specific case, we change the pad position to
PAD_AFTER_PREFIX so the zero
+ // pad character appears after the negative sign, e.g. "-0123". Note that
the check for
+ // format width > 0 is used to test if padding is enabled at all--there is
no specific API
+ // for this, and this is how ICU determines to add padding or not.
+ //
+ // If a user really wants '0' characters to the left of the negative sign,
they can use
+ // textPadKind/textTrimKind and textNumberPadCharacter to uses Daffodils
padding logic
+ // instead of ICUs.
+ if (
Review Comment:
I want to explore a different fix idea.
Seems to me that ICU just has the behavior wrong, as in a design flaw/bug.
The ICU notion that only the positive pattern is used to determine
everything except the prefix and suffix (something I read in ICU doc) is simply
incorrect.
The actual use case we have is one where the pattern wants to be
"00000;-0000" but where those 4 zeros in the negative pattern are respected and
used to create only 4 digits for the negative case. But ICU doesn't do that.
The documented ICU behavior is that the prefix minus sign is taken from the
negative pattern, but the digits there are treated as if the pattern was
"00000;-#" so that the number of negative digits is ignored.
Let's call that behavior ICU1.
To me that design point in ICU is simply incorrect. Sometimes the number of
digits should be different for positive and negative values because of the
space taken up by prefix/suffix in fixed length contexts.
Now let's consider how to fix.
I believe the desired behavior should be that if the value is negative, the
negative pattern is used, ignoring the positive pattern entirely.
Let's call that behavior ICU2.
We can implement ICU2 behavior for unparsing by just creating two separate
patterns. One is the existing positive pattern, the other the existing negative
pattern, but _as a positive pattern_. Ex: if the pattern was "00000;-0000" we
would create "00000" and "-0000" with the latter being a positive pattern.
Then we would examine the value being unparsed and if negative, negate it so
it is positive, then use the second pattern to unparse it, which will lay down
the prefix minus sign as desired, along with 4 digits.
This eliminates any need to fuss with their padding stuff for this specific
use case, but it also fixes the broken behavior of padding because for the case
where you have "00000;-*00000" the padding would be after the prefix for the
negative case as the positive pattern is completely ignored for the negative
case.
This variation, what I'm calling ICU2 behavior, is fairly easy for us to
explain in Daffodil documentation, and can be motivated with this specific use
case as an example of why we don't just use regular ICU behavior.
This behavior is not backward compatible, so we probably want a property to
enable this, and I'd suggest we eventually would want this to become the
default behavior.
I think this has negligible performance impact. Just a test for whether a
value is negative.
Thoughts?
##########
daffodil-runtime1/src/main/scala/org/apache/daffodil/runtime1/processors/EvTextNumber.scala:
##########
@@ -186,6 +186,31 @@ class TextNumberFormatEv(
}
}
+ // If textNumberPattern specifies a pad character before the number
pattern and without a
+ // positive prefix, then ICU defaults to a pad position of
PAD_BEFORE_PREFIX with no way to
+ // change it with just the pattern. This is reasonable for most cases,
like when the pad
+ // character is a space. However, if the pad character in
textNumberPattern is '0', then
+ // negative numbers are padded with a '0' before the negative sign. For
example, a pattern
+ // of "*0####0" unparses -123 to "0-123". This is very unlikely to be what
the user wants
+ // with this pattern.
+ //
+ // So in this very specific case, we change the pad position to
PAD_AFTER_PREFIX so the zero
+ // pad character appears after the negative sign, e.g. "-0123". Note that
the check for
+ // format width > 0 is used to test if padding is enabled at all--there is
no specific API
+ // for this, and this is how ICU determines to add padding or not.
+ //
+ // If a user really wants '0' characters to the left of the negative sign,
they can use
+ // textPadKind/textTrimKind and textNumberPadCharacter to uses Daffodils
padding logic
+ // instead of ICUs.
+ if (
+ df.getFormatWidth > 0 &&
+ df.getPadCharacter == '0' &&
Review Comment:
I've seen data that does this with spaces, not just leading zeros E.g.,
where a field looks like "- 12.3" with the minus sign way over to the left.
I have also seen "( 12.34)"
So I would remove the check that the ICU pad character is a '0'.
I think the right behavior for this fix (not the ICU1 vs. ICU2 fix mentioned
in the other comment) is just for prefix is the positive prefix empty and the
negative prefix non-empty, then the padding behavior should come from the
negative case. Ditto for suffix.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]