guojidan commented on PR #9413: URL: https://github.com/apache/arrow-datafusion/pull/9413#issuecomment-1980333704
> I think that this manner of rewriting might have several downsides that need to be considered. > > 1. Because `NVL` no longer exists in the function list, some of its features will be missing. For example, the feature in issue [Add ScalarUDFs in missing function hints #9392](https://github.com/apache/arrow-datafusion/issues/9392), and the hint after entering wrong parameters. > > On this PR's branch: > > ```shell > DataFusion CLI v36.0.0 > ❯ select nvl(1); > Error during planning: Invalid function 'nvl'. > Did you mean 'nanvl'? > ``` > > The hint on the main branch would be better. > > ``` > ❯ select nvl(1); > Error during planning: No function matches the given name and argument types 'nvl(Int64)'. You might need to add explicit type casts. > Candidate functions: > nvl(Boolean/UInt8/UInt16/UInt32/UInt64/Int8/Int16/Int32/Int64/Float32/Float64/Utf8/LargeUtf8, Boolean/UInt8/UInt16/UInt32/UInt64/Int8/Int16/Int32/Int64/Float32/Float64/Utf8/LargeUtf8) > ``` > > 2. There may be incorrect result when NVL's parameters contain a [volatile](https://github.com/apache/arrow-datafusion/blob/b2ff249bfb918ac6697dbc92b51262a7bdbb5971/datafusion/expr/src/expr.rs#L1903) expression. > For example, given an expression `expr` is `nullif(random()>0.5, true)`. The result of `select nvl(expr, false)`should always be false. > But if it is rewritten as `nvl(expr, expr, false)`, it could potentially yield a NULL value, when the random result of the first `expr` is less than 0.5, and the random result of the second `expr` is greater than 0.5. > > I have performed such a test on this branch. > > ``` > DataFusion CLI v36.0.0 > ❯ select nvl(nullif(random()>0.5, true), false) is null from (unnest(range(1,100))); > +--------------------------------------------------------------------------------------------------------------------------+ > | nvl2(nullif(random() > Float64(0.5),Boolean(true)),nullif(random() > Float64(0.5),Boolean(true)),Boolean(false)) IS NULL | > +--------------------------------------------------------------------------------------------------------------------------+ > | true | > | false | > | false | > | false | > | false | > ``` > > There is approximately a 25% chance of an incorrect result. > > ```shell > ❯ select count(*) from ( > select > nvl(nullif(random()>0.5, true), false) > from (unnest(range(1,10000))) > ) as t(a) > where a is null; > +----------+ > | COUNT(*) | > +----------+ > | 2517 | > +----------+ > ``` yes, need to redesign rewrite manner -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
