Re: [PR] [RFC] Add special case for textNumberPattern with a '0' pad character [daffodil]

via GitHub Wed, 10 Jan 2024 07:18:14 -0800


mbeckerle commented on code in PR #1138:
URL: https://github.com/apache/daffodil/pull/1138#discussion_r1447525715



##########
daffodil-runtime1/src/main/scala/org/apache/daffodil/runtime1/processors/EvTextNumber.scala:
##########
@@ -186,6 +186,31 @@ class TextNumberFormatEv(
       }
     }
 
+    // If textNumberPattern specifies a pad character before the number 
pattern and without a
+    // positive prefix, then ICU defaults to a pad position of 
PAD_BEFORE_PREFIX with no way to
+    // change it with just the pattern. This is reasonable for most cases, 
like when the pad
+    // character is a space. However, if the pad character in 
textNumberPattern is '0', then
+    // negative numbers are padded with a '0' before the negative sign. For 
example, a pattern
+    // of "*0####0" unparses -123 to "0-123". This is very unlikely to be what 
the user wants
+    // with this pattern.
+    //
+    // So in this very specific case, we change the pad position to 
PAD_AFTER_PREFIX so the zero
+    // pad character appears after the negative sign, e.g. "-0123". Note that 
the check for
+    // format width > 0 is used to test if padding is enabled at all--there is 
no specific API
+    // for this, and this is how ICU determines to add padding or not.
+    //
+    // If a user really wants '0' characters to the left of the negative sign, 
they can use
+    // textPadKind/textTrimKind and textNumberPadCharacter to uses Daffodils 
padding logic
+    // instead of ICUs.
+    if (

Review Comment:
   I want to explore a different fix idea.
   
   Seems to me that ICU just has the behavior wrong, as in a design flaw/bug. 
   
   The ICU notion that only the positive pattern is used to determine 
everything except the prefix and suffix (something I read in ICU doc) is simply 
incorrect.
   
   The actual use case we have is one where the pattern wants to be 
"00000;-0000" but where those 4 zeros in the negative pattern are respected and 
used to create only 4 digits for the negative case. But ICU doesn't do that. 
The documented ICU behavior is that the prefix minus sign is taken from the 
negative pattern, but the digits there are treated as if the pattern was 
"00000;-#" so that the number of negative digits is ignored. 
   
   Let's call that behavior ICU1. 
   
   To me that design point in ICU is simply incorrect. Sometimes the number of 
digits should be different for positive and negative values because of the 
space taken up by prefix/suffix in fixed length contexts. 
   
   Now let's consider how to fix.
   
   I believe the desired behavior should be that if the value is negative, the 
negative pattern is used, ignoring the positive pattern entirely. 
   
   Let's call that behavior ICU2.
   
   We can implement ICU2 behavior for unparsing by just creating two separate 
patterns. One is the existing positive pattern, the other the existing negative 
pattern, but _as a positive pattern_. Ex: if the pattern was "00000;-0000" we 
would create "00000" and "-0000" with the latter being a positive pattern. 
   Then we would examine the value being unparsed and if negative, negate it so 
it is positive, then use the second pattern to unparse it, which will lay down 
the prefix minus sign as desired, along with 4 digits. 
   
   This eliminates any need to fuss with their padding stuff for this specific 
use case, but it also fixes the broken behavior of padding because for the case 
where you have "00000;-*00000" the padding would be after the prefix for the 
negative case as the positive pattern is completely ignored for the negative 
case. 
   
   This variation, what I'm calling ICU2 behavior, is fairly easy for us to 
explain in Daffodil documentation, and can be motivated with this specific use 
case as an example of why we don't just use regular ICU behavior. 
   
   This behavior is not backward compatible, so we probably want a property to 
enable this, and I'd suggest we eventually would want this to become the 
default behavior. 
   
   I think this has negligible performance impact. Just a test for whether a 
value is negative. 
   
   Thoughts?
   



##########
daffodil-runtime1/src/main/scala/org/apache/daffodil/runtime1/processors/EvTextNumber.scala:
##########
@@ -186,6 +186,31 @@ class TextNumberFormatEv(
       }
     }
 
+    // If textNumberPattern specifies a pad character before the number 
pattern and without a
+    // positive prefix, then ICU defaults to a pad position of 
PAD_BEFORE_PREFIX with no way to
+    // change it with just the pattern. This is reasonable for most cases, 
like when the pad
+    // character is a space. However, if the pad character in 
textNumberPattern is '0', then
+    // negative numbers are padded with a '0' before the negative sign. For 
example, a pattern
+    // of "*0####0" unparses -123 to "0-123". This is very unlikely to be what 
the user wants
+    // with this pattern.
+    //
+    // So in this very specific case, we change the pad position to 
PAD_AFTER_PREFIX so the zero
+    // pad character appears after the negative sign, e.g. "-0123". Note that 
the check for
+    // format width > 0 is used to test if padding is enabled at all--there is 
no specific API
+    // for this, and this is how ICU determines to add padding or not.
+    //
+    // If a user really wants '0' characters to the left of the negative sign, 
they can use
+    // textPadKind/textTrimKind and textNumberPadCharacter to uses Daffodils 
padding logic
+    // instead of ICUs.
+    if (
+      df.getFormatWidth > 0 &&
+      df.getPadCharacter == '0' &&

Review Comment:
   I've seen data that does this with spaces, not just leading zeros E.g., 
where a field looks like "-     12.3" with the minus sign way over to the left. 
I have also seen "(      12.34)" 
   
   So I would remove the check that the ICU pad character is a '0'. 
   
   I think the right behavior for this fix (not the ICU1 vs. ICU2 fix mentioned 
in the other comment) is just for prefix is the positive prefix empty and the 
negative prefix non-empty, then the padding behavior should come from the 
negative case. Ditto for suffix. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] [RFC] Add special case for textNumberPattern with a '0' pad character [daffodil]

Reply via email to