rednaxelafx commented on pull request #31709: URL: https://github.com/apache/spark/pull/31709#issuecomment-791275810
So I did some investigation and found that: - For the test case I gave in this PR (emulating a kind of Scala class nesting scenario), before this PR it'll fail to codegen because of the "Malformed class name" issue. So that's something that never worked before. In practice, it isn't that hard to hit it, if you run Spark code through a Scala REPL, because the Scala REPL would use either nested objects or nested classes to implement the notion of multiple commands. (Well, there are some REPL-specific code paths in Spark that avoids hitting this problem in some cases though...e.g. [this one](https://github.com/apache/spark/blob/dbce74d39d192f702972df06382591b3e8e43a77/core/src/main/scala/org/apache/spark/util/Utils.scala#L2920-L2929) ) - After applying this PR, the use of `Utils.getSimpleName` actually returns a wrong name for generated-Java-code's compilation purposes. We do need a real "simple name", but when the Scala nested class is not generated by the Scala REPL, the one returned by `Utils.getSimpleName` only strips off the package name and maybe the trailing dollar-sign, but that's it. So in the case of the test case in this PR, we'd want the simple name to be `MalformedNameExample`, but we'd actually get back `ExpressionEncoderSuite$MalformedClassObject$MalformedNameExample` instead. That'll cause Janino compilation to fail. - Actually, what's more interesting is the different rules of Java's nested classes and Scala's. Since there is no such notion of a "companion object" or built-in "singleton object" notion on the Java language level, at least not in Java 8, the Java language syntax for creating a new instance of a inner class object doesn't work well for Scala's `class Foo { object Bar { class Baz {...} } }` kind of nesting, i.e. even when we do use the correct "simple name" (e.g. `MalformedNameExample`), there would still be no valid Java syntax to create it (i.e. `outerObj.new InnerClass(...)` doesn't work when `outerObj` is a Scala `object`) - Then comes what's interesting. In Spark's tests, some environment setup will cause the codegen fallback to be in no-fallback mode, meaning a failed compilation will cause test failure; in some other setup, the fallback to interpreted is allowed, so compilation failure does not cause test failure. The reason why my test case in this PR sometimes pass and sometimes fail is exactly that: the compilation actually always fail due to incorrect generated code, but in some test setup it's allowed to fallback so it doesn't fail the test, but in some other setup it's banned from falling back, thus failing the test. At this point I think it's a bit moot to make the generated code actually work for the `class Foo { object Bar { class Baz {...} } }` pattern mentioned here, as there's no Java syntax available for it anyway. Maybe there are workarounds, but we'll have to investigate a bit more to find them. The original intent of this PR was to make sure when using `Dataset` on this kind of Scala class nesting it can still run without user-visible failures. I think that intent is fulfilled with the fix in this PR when running in non-testing mode -- the codegen fallback to interpreter would step in and make it work, instead of getting a JDK `java.lang.InternalError` that Spark's codegen doesn't handle. cc @retronym I wonder if we can get some help from some wonderful Scala folks ^_^ Thanks! cc @maropu @viirya @kiszk @cloud-fan do @dongjoon-hyun @HyukjinKwon you have any idea what the environment setup differences are with running `build/sbt catalyst/test` vs `build/sbt "catalyst/testOnly *ExpressionEncoderSuite"`? On branch-3.1, the former shows the compilation error as test failure, whereas the latter doesn't show it but under the hood the same compilation error actually happened as well... --- Some more background on Java's "object creation expression", aka the `new` expression. Roughly speaking, there are two variants of the `new` expression in Java, 1. Unqualified: that's the regular `new ClassName(args)` form. The way to create non-nested object instances in the normal case, but can also create nested object instances if the enclosing object is `this`. 2. Qualified: the way to create nested objects. The syntax looks like `outerObjRef.new InnerClassSimpleName(args)`. The `outerObjRef` here would be an expression that refers to the enclosing object of the inner object, and the `InnerClassSimpleName` would be a simple class name of the inner class. Note that both `javac` and Janino demands that this `InnerClassSimpleName` cannot have dots (`.`) in it, so you can't use the fully qualified class name in this place. The compiler will check if the `InnerClassSimpleName` is indeed a member class of the static type of the `outerObjRef`. In Scala, when you declare: ```scala object Foo { class Bar { } } ``` then `Foo`'s class name is `Foo$`, but `Bar`'s class name isn't in the Java convention of `outerClassName + '$' + innerClassSimpleName` (which would have been `Foo$$Bar`), but instead it's just `Foo$Bar`. There's quite a bit of information recorded in the Class file that at runtime, reflection is still able to recognize the enclosing/member relationships, e.g. ``` val fooDollarClazz = Foo.getClass fooDollarClazz.getDeclaredClasses //=> Array[Class[_]] = Array(class Foo$Bar) val barClazz = classOf[Foo.Bar] barClazz.getEnclosingClass //=> Class[_] = class Foo$ barClazz.getDeclaringClass //=> Class[_] = class Foo$ ``` When Janino tries to find a member type, unfortunately, it doesn't use runtime reflection to figure out the member type, but instead it just relies on the Java naming convention of `outerClassName + '$' + innerClassSimpleName`, https://github.com/janino-compiler/janino/blob/8142ec43a3be288e214de30c9f57fe50e9047766/janino/src/main/java/org/codehaus/janino/IClass.java#L803-L876 , like this: ```java String memberDescriptor = Descriptor.fromClassName( Descriptor.toClassName(this.getDescriptor()) + '$' + name ); ``` which will fail to find the `Bar` type in a `fooObjRef.new Bar()` expression. Same deal with `javac`. Here's an example: foo.scala ```scala object Foo { class Bar { } } ``` TestScalaNestedType.java ```java import java.lang.reflect.*; public class TestScalaNestedType { public static void main(String[] args) throws Exception { Class<?> fooDollarClazz = Foo$.class; // Class.forName("Foo$"); Field fooModuleField = fooDollarClazz.getDeclaredField("MODULE$"); Foo$ fooObjRef = (Foo$) fooModuleField.get(null); // compiles fine up to the above fooObjRef.new Bar(); // fails compilation: error: cannot find symbol Bar in Foo$ } } ``` command line: ``` $ javac -g TestScalaNestedType.java TestScalaNestedType.java:10: error: cannot find symbol fooObjRef.new Bar(); ^ symbol: class Bar location: class Foo$ 1 error ``` If we change the `new` expression to: `fooObjRef.new Foo.Bar()`, then the error becomes: ``` $ javac -g TestScalaNestedType.java TestScalaNestedType.java:10: error: '(' expected fooObjRef.new Foo.Bar(); ^ 1 error ``` obviously the dot (`.`) isn't accepted there. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org