rednaxelafx commented on pull request #31709:
URL: https://github.com/apache/spark/pull/31709#issuecomment-791275810


   So I did some investigation and found that:
   - For the test case I gave in this PR (emulating a kind of Scala class nesting scenario), before this PR it would fail to codegen because of the "Malformed class name" issue. So that's something that never worked before. In practice, it isn't that hard to hit: if you run Spark code through a Scala REPL, the REPL uses either nested objects or nested classes to implement the notion of multiple commands. (Well, there are some REPL-specific code paths in Spark that avoid hitting this problem in some cases, e.g. [this one](https://github.com/apache/spark/blob/dbce74d39d192f702972df06382591b3e8e43a77/core/src/main/scala/org/apache/spark/util/Utils.scala#L2920-L2929).)
   - After applying this PR, the use of `Utils.getSimpleName` actually returns a wrong name for the purpose of compiling the generated Java code. We do need a real "simple name", but when the Scala nested class is not generated by the Scala REPL, the name returned by `Utils.getSimpleName` only strips off the package name and maybe the trailing dollar sign, nothing more. So for the test case in this PR, we'd want the simple name to be `MalformedNameExample`, but we'd actually get back `ExpressionEncoderSuite$MalformedClassObject$MalformedNameExample` instead. That causes Janino compilation to fail.
   - Actually, what's more interesting is the difference between the nesting rules of Java's nested classes and Scala's. Since there is no notion of a "companion object" or built-in "singleton object" at the Java language level, at least not in Java 8, the Java syntax for creating a new instance of an inner class doesn't work for Scala's `class Foo { object Bar { class Baz {...} } }` kind of nesting. In other words, even when we do use the correct "simple name" (e.g. `MalformedNameExample`), there is still no valid Java syntax to create it (`outerObj.new InnerClass(...)` doesn't work when `outerObj` is a Scala `object`).
   - Then comes the interesting part. In Spark's tests, some environment setups put the codegen fallback in no-fallback mode, meaning a failed compilation causes a test failure; in other setups, falling back to the interpreted path is allowed, so a compilation failure does not fail the test. That's exactly why my test case in this PR sometimes passes and sometimes fails: the compilation always fails due to incorrect generated code, but in some test setups the fallback is allowed so the test passes, while in other setups the fallback is banned, so the test fails.
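   To illustrate the simple-name problem above with plain Java nested classes (the class names here are made up for illustration, not Spark's actual classes):
   ```java
   // A minimal sketch of why stripping only the package name is not enough:
   // for a nested class, the binary name still carries the enclosing-class
   // prefixes, which are not usable where a simple name is required.
   public class SimpleNameDemo {
     static class Outer {
       static class Inner { }
     }

     public static void main(String[] args) {
       Class<?> c = Outer.Inner.class;
       String binaryName = c.getName(); // "SimpleNameDemo$Outer$Inner"

       // Stripping only the package prefix (a naive "simple name"):
       String stripped = binaryName.substring(binaryName.lastIndexOf('.') + 1);
       System.out.println(stripped);          // SimpleNameDemo$Outer$Inner
       System.out.println(c.getSimpleName()); // Inner
     }
   }
   ```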
   
   At this point I think it's a bit moot to make the generated code actually 
work for the `class Foo { object Bar { class Baz {...} } }` pattern mentioned 
here, as there's no Java syntax available for it anyway. Maybe there are 
workarounds, but we'll have to investigate a bit more to find them.
   
   The original intent of this PR was to make sure that when using `Dataset` with this kind of Scala class nesting, it can still run without user-visible failures. I think that intent is fulfilled by the fix in this PR when running in non-testing mode -- the codegen fallback to the interpreter steps in and makes it work, instead of surfacing a JDK `java.lang.InternalError` that Spark's codegen doesn't handle.
   
   cc @retronym I wonder if we can get some help from some wonderful Scala 
folks ^_^ Thanks!
   
   cc @maropu @viirya @kiszk @cloud-fan @dongjoon-hyun @HyukjinKwon do you have any idea what the environment setup differences are between running `build/sbt catalyst/test` vs `build/sbt "catalyst/testOnly *ExpressionEncoderSuite"`? On branch-3.1, the former shows the compilation error as a test failure, whereas the latter doesn't show it, but under the hood the same compilation error actually happens as well...
   
   ---
   
   Some more background on Java's "object creation expression", aka the `new` 
expression.
   
   Roughly speaking, there are two variants of the `new` expression in Java:
   1. Unqualified: the regular `new ClassName(args)` form. It's the way to create non-nested object instances in the normal case, but it can also create nested object instances if the enclosing object is `this`.
   2. Qualified: the way to create nested objects. The syntax looks like `outerObjRef.new InnerClassSimpleName(args)`. The `outerObjRef` here is an expression that refers to the enclosing object of the inner object, and `InnerClassSimpleName` is the simple class name of the inner class. Note that both `javac` and Janino demand that `InnerClassSimpleName` contain no dots (`.`), so you can't use the fully qualified class name in this position. The compiler checks whether `InnerClassSimpleName` is indeed a member class of the static type of `outerObjRef`.
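   For reference, here's what the qualified form looks like when everything is plain Java (class names here are hypothetical, just for illustration):
   ```java
   // A qualified "object creation expression": the inner (non-static) class
   // instance is created against an explicit enclosing instance.
   public class QualifiedNewDemo {
     class Inner {
       int answer() { return 42; }
     }

     public static void main(String[] args) {
       QualifiedNewDemo outer = new QualifiedNewDemo();
       // outerObjRef.new InnerClassSimpleName(args)
       Inner inner = outer.new Inner();
       System.out.println(inner.answer()); // prints 42
     }
   }
   ```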
   
   In Scala, when you declare:
   ```scala
   object Foo {
     class Bar { }
   }
   ```
   then `Foo`'s class name is `Foo$`, but `Bar`'s class name doesn't follow the Java convention of `outerClassName + '$' + innerClassSimpleName` (which would have been `Foo$$Bar`); instead it's just `Foo$Bar`.
   There's enough information recorded in the class file that, at runtime, reflection is still able to recognize the enclosing/member relationships, e.g.
   ```
   val fooDollarClazz = Foo.getClass
   fooDollarClazz.getDeclaredClasses //=> Array[Class[_]] = Array(class Foo$Bar)
   
   val barClazz = classOf[Foo.Bar]
   barClazz.getEnclosingClass //=> Class[_] = class Foo$
   barClazz.getDeclaringClass //=> Class[_] = class Foo$
   ```
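   For contrast, the Java-side convention can be seen with an ordinary Java nested class (a hypothetical example, not Spark code):
   ```java
   // In Java, a nested class's binary name is the enclosing class's name,
   // a '$', and the nested class's simple name.
   public class NamingDemo {
     static class Inner { }

     public static void main(String[] args) {
       System.out.println(Inner.class.getName());                     // NamingDemo$Inner
       System.out.println(Inner.class.getEnclosingClass().getName()); // NamingDemo
     }
   }
   ```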
   
   When Janino tries to find a member type, unfortunately, it doesn't use runtime reflection; instead it just relies on the Java naming convention of `outerClassName + '$' + innerClassSimpleName`,
https://github.com/janino-compiler/janino/blob/8142ec43a3be288e214de30c9f57fe50e9047766/janino/src/main/java/org/codehaus/janino/IClass.java#L803-L876
 , like this:
   ```java
               String memberDescriptor = Descriptor.fromClassName(
                   Descriptor.toClassName(this.getDescriptor())
                   + '$'
                   + name
               );
   ```
   which will fail to find the `Bar` type in a `fooObjRef.new Bar()` expression.
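   A reflection-based lookup, by contrast, would have found the member type regardless of the naming convention. This is a hedged sketch of that alternative approach (class names are hypothetical; it is not what Janino actually does):
   ```java
   import java.util.Optional;

   // Resolving a member class via runtime reflection (getDeclaredClasses +
   // getSimpleName) rather than by '$'-concatenating names.
   public class MemberLookupDemo {
     static class Holder {
       static class Bar { }
     }

     static Optional<Class<?>> findMember(Class<?> outer, String simpleName) {
       for (Class<?> member : outer.getDeclaredClasses()) {
         if (member.getSimpleName().equals(simpleName)) {
           return Optional.of(member);
         }
       }
       return Optional.empty();
     }

     public static void main(String[] args) {
       System.out.println(findMember(Holder.class, "Bar").isPresent()); // true
       System.out.println(findMember(Holder.class, "Baz").isPresent()); // false
     }
   }
   ```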
   
   Same deal with `javac`. Here's an example:
   foo.scala
   ```scala
   object Foo {
     class Bar {
     }
   }
   ```
   TestScalaNestedType.java
   ```java
   import java.lang.reflect.*;
   
   public class TestScalaNestedType {
     public static void main(String[] args) throws Exception {
       Class<?> fooDollarClazz = Foo$.class; // Class.forName("Foo$");
       Field fooModuleField = fooDollarClazz.getDeclaredField("MODULE$");
       Foo$ fooObjRef = (Foo$) fooModuleField.get(null);
       // compiles fine up to the above
   
    fooObjRef.new Bar(); // fails compilation: error: cannot find symbol Bar in Foo$
     }
   }
   ```
   command line:
   ```
   $ javac -g TestScalaNestedType.java 
   TestScalaNestedType.java:10: error: cannot find symbol
       fooObjRef.new Bar();
                     ^
     symbol:   class Bar
     location: class Foo$
   1 error
   ```
   
   If we change the `new` expression to: `fooObjRef.new Foo.Bar()`, then the 
error becomes:
   ```
   $ javac -g TestScalaNestedType.java 
   TestScalaNestedType.java:10: error: '(' expected
       fooObjRef.new Foo.Bar();
                        ^
   1 error
   ```
   Obviously, the dot (`.`) isn't accepted there.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


