Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/14963 )
Change subject: IMPALA-9010: Add builtin mask functions
......................................................................

Patch Set 2:

(10 comments)

Thanks for your comments! Addressed them.

http://gerrit.cloudera.org:8080/#/c/14963/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/14963/2//COMMIT_MSG@25
PS2, Line 25: number of characters
> nit: Is it better to use "number of characters to retain" to make it cleare
I'm afraid not. It has different meanings in different functions. In mask_show_first_n(), it's the number of characters to retain. In mask_first_n(), it's the number of characters to mask. So I think it's better to leave it as is. The meaning is only clear together with the function name.

BTW, these descriptions are copied and merged from Hive's javadoc:
https://github.com/apache/hive/blob/ae008b7/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFMask.java#L40-L48
https://github.com/apache/hive/blob/ae008b7/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFMaskShowFirstN.java#L31-L38
https://github.com/apache/hive/blob/ae008b7/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFMaskShowLastN.java#L31-L38
https://github.com/apache/hive/blob/ae008b7/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFMaskFirstN.java#L31-L38
https://github.com/apache/hive/blob/ae008b7/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFMaskLastN.java#L31-L38
https://github.com/apache/hive/blob/ae008b7/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFMaskHash.java#L32

http://gerrit.cloudera.org:8080/#/c/14963/2//COMMIT_MSG@30
PS2, Line 30: digitChar - character to replace digit characters with. Specify -1
: to retain original character. Default value: 'n'
> After reading the description, I found that the difference between 'digitCh
digitChar is used for string values. numberChar is used for numeric values. E.g.
hive> select mask_show_first_n(cast(12345 as smallint), 3, 'x', 'x', 'x', -1, '5');
12355
hive> select mask_show_first_n("12345", 3, 'x', 'x', 'x', -1, '5');
'123xx'

http://gerrit.cloudera.org:8080/#/c/14963/2//COMMIT_MSG@34
PS2, Line 34: numberChar - character to replace digits in a number with. Valid
: values: 0-9. Default value: '1'
> Sorry I meant to say "Specify -1 to use the default value, i.e., 1" (if my
-1 is an invalid value for numberChar. All invalid values will be treated as the default value 1.

http://gerrit.cloudera.org:8080/#/c/14963/2/be/src/exprs/expr-test.cc
File be/src/exprs/expr-test.cc:

http://gerrit.cloudera.org:8080/#/c/14963/2/be/src/exprs/expr-test.cc@10449
PS2, Line 10449: // Error handling
> What happens when one would mask the day in 2019-02-02 to 30? Could you add
Done

http://gerrit.cloudera.org:8080/#/c/14963/1/be/src/exprs/mask-functions-ir.cc
File be/src/exprs/mask-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/14963/1/be/src/exprs/mask-functions-ir.cc@141
PS1, Line 141: }
> Awesome, thanks!
Done

http://gerrit.cloudera.org:8080/#/c/14963/2/be/src/exprs/mask-functions-ir.cc
File be/src/exprs/mask-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/14963/2/be/src/exprs/mask-functions-ir.cc@223
PS2, Line 223: !(1 <= day_value && day_value <= 31)
> This considers 31 as a valid day number for eg. February. Shouldn't this be
Good point! Hive rolls the extra days over into the next month...

hive> select mask(cast('2019-02-02' as date), -1, -1, -1, -1, -1, 29, -1, -1);
2019-03-01
hive> select mask(cast('2019-02-02' as date), -1, -1, -1, -1, -1, 31, -1, -1);
2019-03-03

We currently return NULL for these cases. I think it's due to the different behaviors of DATE in Hive and Impala. E.g. cast('2019-02-31' as date) results in '2019-03-03' in Hive but in an error in Impala. Updated the description about this.

I also found that Hive treats the yearValue as an offset from 1900. So yearValue=0 actually means masking the year field to 1900.
That's not what the descriptions say. What's worse, Hive can't mask the year to 1899 since -1 already means retaining the original value. So I created HIVE-22711, hoping Hive can change its behavior.

http://gerrit.cloudera.org:8080/#/c/14963/2/be/src/exprs/mask-functions-ir.cc@266
PS2, Line 266: 4
> Wouldn't it be nicer if this constant were defined at the beginning of the
Done

http://gerrit.cloudera.org:8080/#/c/14963/2/be/src/exprs/mask-functions-ir.cc@697
PS2, Line 697: (void)SHA256(val.ptr, val.len, sha256_hash.ptr);
> nit: Wouldn't using "discard_result" be nicer here?
Done

http://gerrit.cloudera.org:8080/#/c/14963/2/be/src/exprs/mask-functions.h
File be/src/exprs/mask-functions.h:

http://gerrit.cloudera.org:8080/#/c/14963/2/be/src/exprs/mask-functions.h@50
PS2, Line 50: number of characters
> nit: Is it better to use "number of characters to retain" to make it cleare
Ack

http://gerrit.cloudera.org:8080/#/c/14963/2/be/src/exprs/mask-functions.h@59
PS2, Line 59: /// numberChar - character to replace digits in a number with. Valid values: 0-9.
: /// Default value: '1'
> Sorry I meant to say "Specify -1 to use the default value, i.e., 1" (if my
-1 is an invalid value for numberChar. All invalid values (-1, 10, 99...) will be treated as the default value 1.
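
To make the numberChar fallback concrete, here is a minimal sketch of the rule as described in this thread. It is not the code in mask-functions-ir.cc: the helper names NormalizeNumberChar and MaskShowFirstN are hypothetical, the numeric input is assumed to be a non-negative 32-bit value, and the remaining mask arguments are left out for brevity.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper (not an Impala function): numberChar must be a digit
// in 0-9. Any other value (-1, 10, 99, ...) falls back to the default 1.
int NormalizeNumberChar(int number_char) {
  constexpr int kDefaultNumberChar = 1;
  return (0 <= number_char && number_char <= 9) ? number_char
                                                : kDefaultNumberChar;
}

// Hypothetical helper (not an Impala function): keeps the first 'char_count'
// digits of 'value' and replaces every remaining digit with numberChar.
// Assumption: 'value' is non-negative. The result is widened to int64_t so
// that a fully masked 10-digit value cannot overflow.
int64_t MaskShowFirstN(int32_t value, int char_count, int number_char) {
  assert(value >= 0);  // negative inputs are outside the scope of this sketch
  const char digit = static_cast<char>('0' + NormalizeNumberChar(number_char));
  std::string digits = std::to_string(value);
  for (int i = 0; i < static_cast<int>(digits.size()); ++i) {
    if (i >= char_count) digits[i] = digit;
  }
  return std::stoll(digits);
}
```

Under these assumptions, MaskShowFirstN(12345, 3, 5) gives 12355, matching the Hive output quoted earlier in this message, and passing -1 as numberChar behaves the same as passing 1.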
--
To view, visit http://gerrit.cloudera.org:8080/14963
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ica779a1bf63a085d51f3b533f654cbaac102a664
Gerrit-Change-Number: 14963
Gerrit-PatchSet: 2
Gerrit-Owner: Quanlong Huang <huangquanl...@gmail.com>
Gerrit-Reviewer: Fang-Yu Rao <fangyu....@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kdesc...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <norbert.lu...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <huangquanl...@gmail.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Comment-Date: Thu, 09 Jan 2020 09:06:45 +0000
Gerrit-HasComments: Yes