mbeckerle commented on code in PR #1138:
URL: https://github.com/apache/daffodil/pull/1138#discussion_r1447736641
##########
daffodil-runtime1/src/main/scala/org/apache/daffodil/runtime1/processors/EvTextNumber.scala:
##########
@@ -186,6 +186,31 @@ class TextNumberFormatEv(
}
}
+ // If textNumberPattern specifies a pad character before the number
pattern and without a
+ // positive prefix, then ICU defaults to a pad position of
PAD_BEFORE_PREFIX with no way to
+ // change it with just the pattern. This is reasonable for most cases,
like when the pad
+ // character is a space. However, if the pad character in
textNumberPattern is '0', then
+ // negative numbers are padded with a '0' before the negative sign. For
example, a pattern
+ // of "*0####0" unparses -123 to "0-123". This is very unlikely to be what
the user wants
+ // with this pattern.
+ //
+ // So in this very specific case, we change the pad position to
PAD_AFTER_PREFIX so the zero
+ // pad character appears after the negative sign, e.g. "-0123". Note that
the check for
+ // format width > 0 is used to test if padding is enabled at all--there is
no specific API
+ // for this, and this is how ICU determines to add padding or not.
+ //
+ // If a user really wants '0' characters to the left of the negative sign,
they can use
+ // textPadKind/textTrimKind and textNumberPadCharacter to uses Daffodils
padding logic
+ // instead of ICUs.
+ if (
+ df.getFormatWidth > 0 &&
+ df.getPadCharacter == '0' &&
Review Comment:
One other question: I didn't see a description saying there could be only
one pad character placement.
Do we know that this is illegal: `* +*x####;* -*x####"` , which has padding
both before (spaces) and after ("x" chars) the prefix?
##########
daffodil-runtime1/src/main/scala/org/apache/daffodil/runtime1/processors/EvTextNumber.scala:
##########
@@ -186,6 +186,31 @@ class TextNumberFormatEv(
}
}
+ // If textNumberPattern specifies a pad character before the number
pattern and without a
+ // positive prefix, then ICU defaults to a pad position of
PAD_BEFORE_PREFIX with no way to
+ // change it with just the pattern. This is reasonable for most cases,
like when the pad
+ // character is a space. However, if the pad character in
textNumberPattern is '0', then
+ // negative numbers are padded with a '0' before the negative sign. For
example, a pattern
+ // of "*0####0" unparses -123 to "0-123". This is very unlikely to be what
the user wants
+ // with this pattern.
+ //
+ // So in this very specific case, we change the pad position to
PAD_AFTER_PREFIX so the zero
+ // pad character appears after the negative sign, e.g. "-0123". Note that
the check for
+ // format width > 0 is used to test if padding is enabled at all--there is
no specific API
+ // for this, and this is how ICU determines to add padding or not.
+ //
+ // If a user really wants '0' characters to the left of the negative sign,
they can use
+ // textPadKind/textTrimKind and textNumberPadCharacter to uses Daffodils
padding logic
+ // instead of ICUs.
+ if (
+ df.getFormatWidth > 0 &&
+ df.getPadCharacter == '0' &&
Review Comment:
Let me explore the property idea.
I think the property name should contain "ICU" e.g.,
dfdlx:icuNumberPatternPadPosition to distinguish it from the DFDL properties
for text numbers like dfdl:textNumberJustification, etc.
It does bring up the issue of having that property defined, but having the
pattern contradicting it, as in
the property has value beforePrefix, but the pattern is `"+* 00000;-* 0000"
which is after prefix. One would still need the "*" in the pattern to define
the ICU pad character to be used, but this brings up the possibility of it
being ambiguous and or contradictory.
We would still need the "*" and padding character in the pattern as a way of
specifying the padding character ICU would use, but... we would ignore where
that was expressed and use the property unless the value is 'fromPattern" which
defaults to beforePrefix behavior when ambiguous. (Not sure what it defaults
to if the ambiguity is about the position relative to a suffix, but it's
clearly one or other of beforeSuffix or afterSuffix)
This has the drawback that our patterns would not behave like ICU patterns
really in that in many cases where the "* " notation appears in the positive
pattern would be ignored.
I guess we could document as the property only being used if the prefix or
suffix of the positive pattern is empty so that it is ambiguous. But one could
still have contradictory information in the negative pattern like:
```
dfdlx:icuNumberPatternPadPosition="beforePrefix"
dfdlx:textNumberPattern="* 00000;-* 00000"
```
In this the positive pattern is ambiguous about the pad position w.r.t
prefix because the prefix is empty, but the negative pattern is not. It clearly
shows the padding as after prefix. But I guess ICU lets you put all sorts of
noise in the negative pattern that ends up being ignored.
I really do think that if the pad position is ambiguous in the positive
pattern, then the negative pattern should be used to determine it, as I think
this would be the typical case. If the negative pattern is absent, doesn't
express pad position, or is also ambiguous then I guess it should default to
beforePrefix which is the current behavior.
Ok, I've decided all this seems rather squishy and hard to explain compared
with saying that:
- Daffodil parses and unparses text numbers using the textNumberPattern
where the positive pattern is used for positive numbers, the negative pattern
(if defined) is used for negative numbers.
- This differs from the ICU specification which says that only the prefix
and suffix definitions are taken from the negative pattern, and other parts of
the negative pattern are ignored.
- This enables the negative pattern to have different (usually fewer) digit
specifiers which is a capability needed for fixed length data situations when
the sign indication (prefix and suffix) occupies character positions that can
be digits for a positive value.
- An example is a fixed length 5 character field that can contain 5 digits
for a positive value, but a minus sign and only 4 digits for a negative value.
Normal ICU behavior using pattern `"*000000;-0000"` will output 6 characters
for -12 being `"0-0012"`as normal ICU ignores the fact that the negative
pattern has only 4 zeros. DFDL will respect the 4 zeros in the negative pattern
and will output `"-0012"` occupying only the allowed 5 characters.
##########
daffodil-runtime1/src/main/scala/org/apache/daffodil/runtime1/processors/EvTextNumber.scala:
##########
@@ -186,6 +186,31 @@ class TextNumberFormatEv(
}
}
+ // If textNumberPattern specifies a pad character before the number
pattern and without a
+ // positive prefix, then ICU defaults to a pad position of
PAD_BEFORE_PREFIX with no way to
+ // change it with just the pattern. This is reasonable for most cases,
like when the pad
+ // character is a space. However, if the pad character in
textNumberPattern is '0', then
+ // negative numbers are padded with a '0' before the negative sign. For
example, a pattern
+ // of "*0####0" unparses -123 to "0-123". This is very unlikely to be what
the user wants
+ // with this pattern.
+ //
+ // So in this very specific case, we change the pad position to
PAD_AFTER_PREFIX so the zero
+ // pad character appears after the negative sign, e.g. "-0123". Note that
the check for
+ // format width > 0 is used to test if padding is enabled at all--there is
no specific API
+ // for this, and this is how ICU determines to add padding or not.
+ //
+ // If a user really wants '0' characters to the left of the negative sign,
they can use
+ // textPadKind/textTrimKind and textNumberPadCharacter to uses Daffodils
padding logic
+ // instead of ICUs.
+ if (
+ df.getFormatWidth > 0 &&
+ df.getPadCharacter == '0' &&
Review Comment:
Another wild idea. Why not add a whole second textNumberPattern property to
be used for negative numbers, when those need special treatment.
So you can specify a regular ICU pattern with pos + neg parts if you want
the regular ICU behavior.
You specify textNumberPositivePattern and textNumberNegativePattern as
separate properties if you need negative numbers to be parsed with the same
full behavior that is used for positive numbers (i.e.,without ignoring things
in the negative pattern like ICU normally would).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]