Re: [PR] [RFC] Add special case for textNumberPattern with a '0' pad character [daffodil]

via GitHub Wed, 10 Jan 2024 09:57:09 -0800


mbeckerle commented on code in PR #1138:
URL: https://github.com/apache/daffodil/pull/1138#discussion_r1447736641



##########
daffodil-runtime1/src/main/scala/org/apache/daffodil/runtime1/processors/EvTextNumber.scala:
##########
@@ -186,6 +186,31 @@ class TextNumberFormatEv(
       }
     }
 
+    // If textNumberPattern specifies a pad character before the number 
pattern and without a
+    // positive prefix, then ICU defaults to a pad position of 
PAD_BEFORE_PREFIX with no way to
+    // change it with just the pattern. This is reasonable for most cases, 
like when the pad
+    // character is a space. However, if the pad character in 
textNumberPattern is '0', then
+    // negative numbers are padded with a '0' before the negative sign. For 
example, a pattern
+    // of "*0####0" unparses -123 to "0-123". This is very unlikely to be what 
the user wants
+    // with this pattern.
+    //
+    // So in this very specific case, we change the pad position to 
PAD_AFTER_PREFIX so the zero
+    // pad character appears after the negative sign, e.g. "-0123". Note that 
the check for
+    // format width > 0 is used to test if padding is enabled at all--there is 
no specific API
+    // for this, and this is how ICU determines to add padding or not.
+    //
+    // If a user really wants '0' characters to the left of the negative sign, 
they can use
+    // textPadKind/textTrimKind and textNumberPadCharacter to uses Daffodils 
padding logic
+    // instead of ICUs.
+    if (
+      df.getFormatWidth > 0 &&
+      df.getPadCharacter == '0' &&

Review Comment:
   One other question: I didn't see a description saying there could be only 
one pad character placement. 
   Do we know that this is illegal:  `* +*x####;* -*x####"` , which has padding 
both before (spaces) and after ("x" chars) the prefix?



##########
daffodil-runtime1/src/main/scala/org/apache/daffodil/runtime1/processors/EvTextNumber.scala:
##########
@@ -186,6 +186,31 @@ class TextNumberFormatEv(
       }
     }
 
+    // If textNumberPattern specifies a pad character before the number 
pattern and without a
+    // positive prefix, then ICU defaults to a pad position of 
PAD_BEFORE_PREFIX with no way to
+    // change it with just the pattern. This is reasonable for most cases, 
like when the pad
+    // character is a space. However, if the pad character in 
textNumberPattern is '0', then
+    // negative numbers are padded with a '0' before the negative sign. For 
example, a pattern
+    // of "*0####0" unparses -123 to "0-123". This is very unlikely to be what 
the user wants
+    // with this pattern.
+    //
+    // So in this very specific case, we change the pad position to 
PAD_AFTER_PREFIX so the zero
+    // pad character appears after the negative sign, e.g. "-0123". Note that 
the check for
+    // format width > 0 is used to test if padding is enabled at all--there is 
no specific API
+    // for this, and this is how ICU determines to add padding or not.
+    //
+    // If a user really wants '0' characters to the left of the negative sign, 
they can use
+    // textPadKind/textTrimKind and textNumberPadCharacter to uses Daffodils 
padding logic
+    // instead of ICUs.
+    if (
+      df.getFormatWidth > 0 &&
+      df.getPadCharacter == '0' &&

Review Comment:
   Let me explore the property idea. 
   
   I think the property name should contain "ICU" e.g.,  
dfdlx:icuNumberPatternPadPosition to distinguish it from the DFDL properties 
for text numbers like dfdl:textNumberJustification, etc. 
   
   It does bring up the issue of having that property defined, but having the 
pattern contradicting it, as in 
   the property has value beforePrefix, but the pattern is `"+* 00000;-* 0000" 
which is after prefix. One would still need the "*" in the pattern to define 
the ICU pad character to be used, but this brings up the possibility of it 
being ambiguous and or contradictory. 
   
   We would still need the "*" and padding character in the pattern as a way of 
specifying the padding character ICU would use, but... we would ignore where 
that was expressed and use the property unless the value is 'fromPattern" which 
defaults to beforePrefix  behavior when ambiguous. (Not sure what it defaults 
to if the ambiguity is about the position relative to a suffix, but it's 
clearly one or other of beforeSuffix or afterSuffix)
   
   This has the drawback that our patterns would not behave like ICU patterns 
really in that in many cases where the "* " notation appears in the positive 
pattern would be ignored. 
   
   I guess we could document as the property only being used if the prefix or 
suffix of the positive pattern is empty so that it is ambiguous. But one could 
still have contradictory information in the negative pattern like:
   ```
   dfdlx:icuNumberPatternPadPosition="beforePrefix"
   dfdlx:textNumberPattern="* 00000;-* 00000"
   ```
   In this the positive pattern is ambiguous about the pad position w.r.t 
prefix because the prefix is empty, but the negative pattern is not. It clearly 
shows the padding as after prefix. But I guess ICU lets you put all sorts of 
noise in the negative pattern that ends up being ignored. 
   
   I really do think that if the pad position is ambiguous in the positive 
pattern, then the negative pattern should be used to determine it, as I think 
this would be the typical case. If the negative pattern is absent, doesn't 
express pad position, or is also ambiguous then I guess it should default to 
beforePrefix which is the current behavior. 
   
   Ok, I've decided all this seems rather squishy and hard to explain compared 
with saying that:
   - Daffodil parses and unparses text numbers using the textNumberPattern 
where the positive pattern is used for positive numbers, the negative pattern 
(if defined) is used for negative numbers. 
   - This differs from the ICU specification which says that only the prefix 
and suffix definitions are taken from the negative pattern, and other parts of 
the negative pattern are ignored. 
   - This enables the negative pattern to have different (usually fewer) digit 
specifiers which is a capability needed for fixed length data situations when 
the sign indication (prefix and suffix) occupies character positions that can 
be digits for a positive value. 
   - An example is a fixed length 5 character field that can contain 5 digits 
for a positive value, but a minus sign and only 4 digits for a negative value. 
Normal ICU behavior using pattern `"*000000;-0000"` will output 6 characters 
for -12 being `"0-0012"`as normal ICU ignores the fact that the negative 
pattern has only 4 zeros. DFDL will respect the 4 zeros in the negative pattern 
and will output `"-0012"` occupying only the allowed 5 characters. 
   
   



##########
daffodil-runtime1/src/main/scala/org/apache/daffodil/runtime1/processors/EvTextNumber.scala:
##########
@@ -186,6 +186,31 @@ class TextNumberFormatEv(
       }
     }
 
+    // If textNumberPattern specifies a pad character before the number 
pattern and without a
+    // positive prefix, then ICU defaults to a pad position of 
PAD_BEFORE_PREFIX with no way to
+    // change it with just the pattern. This is reasonable for most cases, 
like when the pad
+    // character is a space. However, if the pad character in 
textNumberPattern is '0', then
+    // negative numbers are padded with a '0' before the negative sign. For 
example, a pattern
+    // of "*0####0" unparses -123 to "0-123". This is very unlikely to be what 
the user wants
+    // with this pattern.
+    //
+    // So in this very specific case, we change the pad position to 
PAD_AFTER_PREFIX so the zero
+    // pad character appears after the negative sign, e.g. "-0123". Note that 
the check for
+    // format width > 0 is used to test if padding is enabled at all--there is 
no specific API
+    // for this, and this is how ICU determines to add padding or not.
+    //
+    // If a user really wants '0' characters to the left of the negative sign, 
they can use
+    // textPadKind/textTrimKind and textNumberPadCharacter to uses Daffodils 
padding logic
+    // instead of ICUs.
+    if (
+      df.getFormatWidth > 0 &&
+      df.getPadCharacter == '0' &&

Review Comment:
   Another wild idea. Why not add a whole second textNumberPattern property to 
be used for negative numbers, when those need special treatment. 
   
   So you can specify a regular ICU pattern with pos + neg parts if you want 
the regular ICU behavior.
   You specify textNumberPositivePattern and textNumberNegativePattern as 
separate properties if you need negative numbers to be parsed with the same 
full behavior that is used for positive numbers (i.e.,without ignoring things 
in the negative pattern like ICU normally would). 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] [RFC] Add special case for textNumberPattern with a '0' pad character [daffodil]

Reply via email to