Hi, all.

We have shared many opinions from different perspectives. However, we didn't reach a consensus even on a partial merge that excludes something (on my PR, and on this mailing thread by Wenchen).
For the following claim, we have another alternative to mitigate it:

> I don't like it because it promotes the row-parameter API and forces users
> to implement it, even if the users want to only use the individual-parameters
> API.

Why don't we merge the AS-IS PR by adding something instead of excluding
something?

- R produceResult(InternalRow input);
+ default R produceResult(InternalRow input) throws Exception {
+   throw new UnsupportedOperationException();
+ }

By providing the default implementation, we technically no longer *force users
to implement it*, and we can properly document the expected usage.

What do you think?

Bests,
Dongjoon.

On Wed, Mar 3, 2021 at 10:28 AM Ryan Blue <rb...@netflix.com> wrote:

> Yes, GenericInternalRow is safe when types mismatch, with the cost of using
> Object[], and primitive types need to do boxing
>
> The question is not whether to use the magic functions, which would not need
> boxing. The question here is whether to use multiple ScalarFunction
> interfaces. Those interfaces would require boxing or using Object[], so
> there isn't a benefit.
>
> If we do want to reuse one UDF for different types, using "magical methods"
> solves the problem
>
> Yes, that's correct. We agree that magic methods are a good option for this.
>
> Again, the question we need to decide is whether to use InternalRow or
> interfaces like ScalarFunction2 for non-codegen. The option to use multiple
> interfaces is limited by type erasure because you can only have one set of
> type parameters. If you wanted to support both
> ScalarFunction2<Integer, Integer> and ScalarFunction2<Long, Long>, you'd
> have to fall back to ScalarFunction2<Object, Object> and cast.
>
> The point is that type erasure will commonly lead either to many different
> implementation classes (one for each type combination) or to parameterizing
> by Object, which defeats the purpose.
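The type-erasure limitation described above can be sketched in a few lines of Java. `ScalarFunction2` here is a simplified stand-in for illustration, not Spark's actual interface:

```java
// Simplified stand-in for illustration only; not Spark's actual interface.
interface ScalarFunction2<T, R> {
    R call(T left, T right);
}

// This does NOT compile: both parameterizations erase to the same
// call(Object, Object) method, so one class cannot implement both.
//
//   class Plus implements ScalarFunction2<Integer, Integer>,
//                         ScalarFunction2<Long, Long> { ... }
//
// The single-class fallback is to parameterize by Object and cast, which
// gives up compile-time type checks and boxes every primitive argument.
class ObjectPlus implements ScalarFunction2<Object, Object> {
    public Object call(Object left, Object right) {
        if (left instanceof Integer && right instanceof Integer) {
            return (Integer) left + (Integer) right;
        } else if (left instanceof Long && right instanceof Long) {
            return (Long) left + (Long) right;
        }
        throw new IllegalArgumentException("unsupported argument types");
    }
}

public class ErasureDemo {
    public static void main(String[] args) {
        ObjectPlus plus = new ObjectPlus();
        System.out.println(plus.call(1, 2));   // arguments boxed to Integer
        System.out.println(plus.call(1L, 2L)); // arguments boxed to Long
    }
}
```

This is only a sketch of the trade-off under discussion: either one implementation class per type combination, or a single Object-based class with runtime casts.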
>
> The alternative adds safety because correct types are produced by calls
> like getLong(0). Yes, this depends on the implementation making the correct
> calls, but it is better than using Object and casting.
>
> I can't think of real use cases that will force the individual-parameters
> approach to use Object instead of concrete types.
>
> I think this is addressed by the type erasure discussion above. A simple
> Plus method would require Object or 12 implementations for 2 arguments and
> 4 numeric types.
>
> And basically all varargs cases would need to use Object[]. Consider a UDF
> to create a map that requires string keys and some consistent type for
> values. This would be easy with the InternalRow API because you can use
> getString(pos) and get(pos + 1, valueType) to get the key/value pairs. Use
> of UTF8String vs. String will be checked at compile time.
>
> I agree that Object[] is worse than InternalRow
>
> Yes, and if we are always using Object because of type erasure, or using
> magic methods to get better performance, the utility of the parameterized
> interfaces is very limited.
>
> Because we want to expose the magic functions, the use of ScalarFunction2
> and similar is extremely limited because it is only for non-codegen.
> Varargs is by far the more common case. The InternalRow interface is also a
> very simple way to get started and ensures that Spark can always find the
> right method after the function is bound to input types.
>
> On Tue, Mar 2, 2021 at 6:35 AM Wenchen Fan <cloud0...@gmail.com> wrote:
>
>> Yes, GenericInternalRow is safe when types mismatch, with the cost of
>> using Object[], and primitive types need to do boxing. And this is a
>> runtime failure, which is absolutely worse than query-compile-time checks.
>> Also, don't forget my previous point: users need to specify the type and
>> index, such as row.getLong(0), which is error-prone.
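The behavior both sides refer to, a typed getter on a generic row failing fast with a ClassCastException rather than returning wrong data, can be sketched with a toy row class. `GenericRow` below is an illustrative stand-in, not Spark's `InternalRow` or `GenericInternalRow`:

```java
// Toy stand-in for a generic row (not Spark's class): values live in an
// Object[], and typed getters cast, so a mismatched call fails at runtime
// with a ClassCastException instead of silently returning wrong data.
class GenericRow {
    private final Object[] values;

    GenericRow(Object... values) {
        this.values = values;
    }

    long getLong(int pos) {
        return (Long) values[pos]; // throws ClassCastException on mismatch
    }
}

public class RowSafetyDemo {
    public static void main(String[] args) {
        GenericRow row = new GenericRow("not a long", 42L);

        System.out.println(row.getLong(1)); // correct position: prints 42

        try {
            row.getLong(0); // wrong position: String is not a Long
        } catch (ClassCastException e) {
            System.out.println("runtime failure: " + e);
        }
    }
}
```

This illustrates both points at once: the failure is detected (Ryan's point), but only at runtime and only if the implementation picks the right position and type (Wenchen's point).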
>>
>> > But we don't do that for any of the similar UDFs today so I'm skeptical
>> > that this would actually be a high enough priority to implement.
>>
>> I'd say this is a must-have if we go with the individual-parameters
>> approach. The Scala UDF today checks the method signature at compile time,
>> thanks to the Scala type tag. The Java UDF today doesn't check and is hard
>> to use.
>>
>> > You can't implement ScalarFunction2<Integer, Integer> and
>> > ScalarFunction2<Long, Long>.
>>
>> Can you elaborate? We have function binding, and we can use different UDFs
>> for different input types. If we do want to reuse one UDF for different
>> types, using "magical methods" solves the problem:
>>
>> class MyUDF {
>>   def call(i: Int): Int = ...
>>   def call(l: Long): Long = ...
>> }
>>
>> On the other side, I don't think the row-parameter approach can solve this
>> problem. The best I can think of is:
>>
>> class MyUDF extends ScalarFunction[Object] {
>>   def call(row: InternalRow): Object = {
>>     if (int input) row.getInt(0) ... else row.getLong(0) ...
>>   }
>> }
>>
>> This is worse because: 1) it needs if-else checks for the different input
>> types, and 2) the return type can only be Object, which causes boxing
>> issues.
>>
>> I agree that Object[] is worse than InternalRow. But I can't think of real
>> use cases that will force the individual-parameters approach to use Object
>> instead of concrete types.
>>
>> On Tue, Mar 2, 2021 at 3:36 AM Ryan Blue <rb...@netflix.com> wrote:
>>
>>> Thanks for adding your perspective, Erik!
>>>
>>> If the input is string type but the UDF implementation calls
>>> row.getLong(0), it returns wrong data
>>>
>>> I think this is misleading. It is true for UnsafeRow, but there is no
>>> reason why InternalRow should return incorrect values.
>>>
>>> The implementation in GenericInternalRow
>>> <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/rows.scala#L35>
>>> would throw a ClassCastException. I don't think that using a row is a bad
>>> option simply because UnsafeRow is unsafe.
>>>
>>> It's unlikely that UnsafeRow would be used to pass the data. The
>>> implementation would evaluate each argument expression and set the result
>>> in a generic row, then pass that row to the UDF. We can use whatever
>>> implementation we choose to provide better guarantees than unsafe.
>>>
>>> I think we should consider query-compile-time checks as nearly as good as
>>> Java-compile-time checks for the purposes of safety.
>>>
>>> I don't think I agree with this. A failure at query analysis time, just
>>> like one at runtime, still requires going back to a separate project,
>>> fixing something, and rebuilding. The time needed to fix a problem goes
>>> up significantly vs. compile-time checks. And that is even worse if the
>>> UDF is maintained by someone else.
>>>
>>> I think we also need to consider how common it would be that a use case
>>> can have the query-compile-time checks. Going through this in more detail
>>> below makes me think that it is unlikely that these checks would be used
>>> often because of the limitations of using an interface with type erasure.
>>>
>>> I believe that Wenchen's proposal will provide stronger
>>> query-compile-time safety
>>>
>>> The proposal could have better safety for each argument, assuming that we
>>> detect failures by looking at the parameter types using reflection in the
>>> analyzer. But we don't do that for any of the similar UDFs today, so I'm
>>> skeptical that this would actually be a high enough priority to implement.
>>>
>>> As Erik pointed out, type erasure also limits the effectiveness. You
>>> can't implement ScalarFunction2<Integer, Integer> and
>>> ScalarFunction2<Long, Long>.
>>> You can handle those cases using InternalRow, or you can handle them
>>> using VarargScalarFunction<Object>. That forces many use cases into
>>> varargs with Object, where you don't get any of the proposed analyzer
>>> benefits and lose compile-time checks. The only time the additional
>>> checks (if implemented) would help is when only one set of argument types
>>> is needed, because implementing ScalarFunction<Object, Object> defeats
>>> the purpose.
>>>
>>> It's worth noting that safety for the magic methods would be identical
>>> between the two options, so the trade-off to consider is for varargs and
>>> non-codegen cases. Combining the limitations discussed, this has better
>>> safety guarantees only if you need just one set of types for each number
>>> of arguments and are using the non-codegen path. Since varargs is one of
>>> the primary reasons to use this API, I don't think that it is a good idea
>>> to use Object[] instead of InternalRow.
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
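The map-building varargs UDF mentioned in the thread can be sketched with a toy row class. Both `GenericRow` and `MakeMapUDF` below, and the `produceResult` shape, are illustrative assumptions for this sketch, not the proposed Spark API:

```java
import java.util.HashMap;
import java.util.Map;

// Toy stand-in for InternalRow (not Spark's class).
class GenericRow {
    private final Object[] values;

    GenericRow(Object... values) {
        this.values = values;
    }

    int numFields() {
        return values.length;
    }

    String getString(int pos) {
        return (String) values[pos]; // fails fast if the key is not a String
    }

    Object get(int pos) {
        return values[pos];
    }
}

// Sketch of the varargs map-building UDF discussed above: string keys at even
// positions, values of one consistent type at odd positions. With a row
// parameter, one implementation covers any number of key/value pairs.
class MakeMapUDF {
    Map<String, Object> produceResult(GenericRow input) {
        Map<String, Object> result = new HashMap<>();
        for (int pos = 0; pos + 1 < input.numFields(); pos += 2) {
            result.put(input.getString(pos), input.get(pos + 1));
        }
        return result;
    }
}

public class MakeMapDemo {
    public static void main(String[] args) {
        Map<String, Object> m =
            new MakeMapUDF().produceResult(new GenericRow("a", 1L, "b", 2L));
        System.out.println(m.get("a") + " " + m.get("b")); // prints 1 2
    }
}
```

The equivalent individual-parameters version would need `Object[]` varargs and casts, which is the comparison the thread is weighing.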