aokolnychyi opened a new pull request #24515: [SPARK-14083][WIP] Basic bytecode 
analyzer to speed up Datasets
URL: https://github.com/apache/spark/pull/24515
 
 
   **Disclaimer!** The purpose of this PoC PR is to trigger a discussion on 
whether it is feasible to leverage bytecode analysis to speed up Datasets. This 
PR shows what can be achieved by interpreting stack-based bytecode directly. 
The current version does not handle all edge cases. If the community agrees 
that bytecode analysis can give substantial benefits, we will need to define 
the scope (i.e., which cases we are targeting) and decide which approach to 
take (e.g., stack-based bytecode interpretation, control-flow graphs, AST, 
Jimple). I would like to emphasize that this PR doesn't answer those questions. 
Instead, it shows what can be done with the simplest approach. Also, there is 
no intention to integrate this logic into Spark directly; it might live in a 
separate package.
   
   ### Scope
   
   Bytecode analysis can be used for several purposes:
   - Provide information on what's going on inside closures so that Spark can 
perform additional optimizations while still relying on closures during 
execution (e.g., analyze which fields are accessed/modified and use that 
information to optimize the plan).
   - Replace typed operations that rely on closures with equivalent untyped 
operations that rely on Catalyst expressions.
   
   Rewriting closures is the more challenging option but gives greater 
benefits, so this PR focuses on the second use case.
   
   Right now, the code covers typed map and filter operations that involve 
primitive values, boxed values (not all are implemented yet), and objects that 
are represented as structs (e.g., case classes). The same logic can be applied 
to UDFs/UDAFs and other operations.
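
   To make this concrete, here is a simplified illustration (not code from 
this PR) of the kind of rewrite we are after: the typed filter deserializes 
every row into a `Person` just to invoke the closure, while the derived 
untyped filter works on columns directly and stays visible to the optimizer.

```scala
import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Int)

object FilterRewriteExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    val people = Seq(Person("Ann", 25), Person("Bob", 17)).toDS()

    // typed: every row is deserialized into a Person to invoke the closure
    val typed = people.filter(p => p.age > 21)

    // untyped equivalent that bytecode analysis could derive automatically:
    // no object allocation, and the predicate can be pushed down
    val untyped = people.filter($"age" > 21)

    typed.explain()
    untyped.explain()
  }
}
```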
   
   ### Algorithm
   
   There are two ways to derive Catalyst expressions:
   - Translate stack-based bytecode directly.
   - Convert bytecode into an intermediate format that is easier to work with 
(e.g., control-flow graphs, AST, Jimple).
   
   Both approaches have their own tradeoffs, which are well described in the 
comments on [SPARK-14083](https://issues.apache.org/jira/browse/SPARK-14083). 
This PR translates stack-based bytecode directly and simulates what happens to 
the operand stack. The optimizer is supposed to simplify the derived 
expression. I think it is valid to start with the simplest approach and 
evaluate its limitations before considering a more advanced implementation.
   
   Originally, we wanted to cover only trivial cases. However, the current 
scope is much bigger, so an intermediate format is worth considering, and we 
would be glad to hear more opinions. As mentioned before, the decision depends 
heavily on our target use cases.
   
   This PR adds a new optimizer rule `RewriteTypedOperations` that uses 
`TypedOperations` to convert typed operations into untyped ones. 
`TypedOperations` follows a straightforward algorithm, described below.
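
   Conceptually, the rule looks like the following sketch (heavily 
simplified; `convertFilter` is a stand-in for the translation logic in 
`TypedOperations`, not the exact code in this PR):

```scala
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan, TypedFilter}
import org.apache.spark.sql.catalyst.rules.Rule

// simplified sketch: rewrite typed filters whose closures we can translate
object RewriteTypedOperations extends Rule[LogicalPlan] {

  override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case f: TypedFilter =>
      convertFilter(f) match {
        // the closure was translated into a Catalyst predicate
        case Some(condition) => Filter(condition, f.child)
        // leave the typed operation as-is if we cannot translate it
        case None => f
      }
  }

  // stand-in for the bytecode translation logic in TypedOperations
  private def convertFilter(f: TypedFilter): Option[Expression] = ???
}
```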
   
   #### Step 1: Get closure bytecode
   
   First of all, we need to get the bytecode for the closure. Scala 2.12 
migrated to Java lambdas and uses `LambdaMetafactory` (LMF); there seem to be 
rare cases when Scala doesn't use LMF. This PR relies on the existing logic in 
Spark to obtain a `SerializedLambda`, which has enough information to locate 
the bytecode for closures.
   
   Scala uses "adapted" methods for LMF to encapsulate differences in boxing 
semantics (e.g., unboxing null yields 0 in Scala but throws a 
`NullPointerException` in Java). The current code obtains the bytecode of the 
non-adapted method whenever the arguments are primitives, avoiding a round of 
unnecessary boxing/unboxing.
   
   See `ClosureUtils$getMethod` for more information.
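
   For illustration, a minimal sketch of this step using plain ASM (the 
helper names are illustrative; Spark shades ASM, and the real code also 
handles the adapted/non-adapted distinction):

```scala
import java.lang.invoke.SerializedLambda

import org.objectweb.asm.ClassReader
import org.objectweb.asm.tree.{ClassNode, MethodNode}

import scala.collection.JavaConverters._

// serializable lambdas define a synthetic writeReplace method that
// returns a SerializedLambda describing the implementation method
def getSerializedLambda(closure: AnyRef): SerializedLambda = {
  val writeReplace = closure.getClass.getDeclaredMethod("writeReplace")
  writeReplace.setAccessible(true)
  writeReplace.invoke(closure).asInstanceOf[SerializedLambda]
}

// locate the bytecode of the implementation method with ASM
def getMethodNode(lambda: SerializedLambda): MethodNode = {
  // getImplClass returns an internal name, e.g. "com/example/Job$"
  val resource = lambda.getImplClass + ".class"
  val stream = Thread.currentThread().getContextClassLoader.getResourceAsStream(resource)
  val classNode = new ClassNode()
  new ClassReader(stream).accept(classNode, 0)
  classNode.methods.asScala
    .find(m => m.name == lambda.getImplMethodName &&
               m.desc == lambda.getImplMethodSignature)
    .getOrElse(throw new IllegalStateException("Implementation method not found"))
}
```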
   
   #### Step 2: Build a correct local variable array
   
   Once we have the bytecode for our closure, we need to build a local 
variable array that references the correct Catalyst expressions. This can be 
achieved by translating the deserializer of the typed operation. Deserializers 
define how data is converted from Spark's internal format into Java objects. 
Frequently, deserializers contain `StaticInvoke`, `Invoke`, and `NewInstance` 
expressions, which can be translated using the same algorithm.
   
   See `TypedOperations$convertDeserializer` for more information.
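
   A minimal sketch of this step (illustrative; it ignores details like 
`this` slots and long/double values occupying two slots):

```scala
import org.apache.spark.sql.catalyst.expressions.Expression

// slot 0 of the local variable array holds the translated deserializer,
// i.e. the Catalyst expression representing the closure's argument;
// the remaining slots are filled as store instructions are interpreted
def initLocalVars(deserializedArg: Expression, maxLocals: Int): Array[Expression] = {
  val locals = new Array[Expression](maxLocals)
  locals(0) = deserializedArg
  locals
}
```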
   
   #### Step 3: Create an operand stack
   
   The next step is to create an operand stack for storing partial Catalyst 
expressions. The current code uses `mutable.ArrayStack[Expression]` for this 
purpose.
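
   In other words (illustrative snippet):

```scala
import org.apache.spark.sql.catalyst.expressions.{Expression, Literal}

import scala.collection.mutable

// the stack mirrors the JVM operand stack but holds partially built
// Catalyst expressions instead of runtime values; e.g. interpreting
// iconst_1 pushes Literal(1)
val stack = mutable.ArrayStack[Expression]()
stack.push(Literal(1))
```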
   
   #### Step 4: Interpret instructions
   
   Once we have the bytecode, the local variable array, and the operand stack, 
we can walk the bytecode instructions one by one and simulate what happens to 
the operand stack.
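
   A sketch of the interpreter core for two representative instructions 
(illustrative; the real code covers many more opcodes):

```scala
import org.apache.spark.sql.catalyst.expressions.{Add, Expression}
import org.objectweb.asm.Opcodes
import org.objectweb.asm.tree.{AbstractInsnNode, InsnNode, VarInsnNode}

import scala.collection.mutable

// each instruction manipulates the expression stack exactly the way
// it would manipulate the JVM operand stack at runtime
def interpret(
    insn: AbstractInsnNode,
    locals: Array[Expression],
    stack: mutable.ArrayStack[Expression]): Unit = insn match {
  case v: VarInsnNode if v.getOpcode == Opcodes.ILOAD =>
    // loading an int local variable pushes its expression onto the stack
    stack.push(locals(v.`var`))
  case i: InsnNode if i.getOpcode == Opcodes.IADD =>
    // iadd pops two operands and pushes a Catalyst Add in their place
    val right = stack.pop()
    val left = stack.pop()
    stack.push(Add(left, right))
  case other =>
    throw new UnsupportedOperationException(s"Unsupported instruction: $other")
}
```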
   
   #### Step 5: Assign the result back to the expected attributes
   
   Once we have the result expression, we need to assign it to the expected 
columns. At this point, the serializer becomes important, as it contains 
information about the expected attributes and their data types.
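
   A sketch of this step (illustrative): keeping the names and expression ids 
of the expected attributes stable means parent operators still resolve after 
the rewrite.

```scala
import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, Expression, NamedExpression}

// bind the derived expressions to the attributes the typed operation
// was expected to produce, preserving names and expression ids
def assignResults(
    expectedOutput: Seq[Attribute],
    results: Seq[Expression]): Seq[NamedExpression] = {
  expectedOutput.zip(results).map { case (attr, result) =>
    Alias(result, attr.name)(exprId = attr.exprId)
  }
}
```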
   
   ### Trying things out
   
   `DatasetBytecodeAnalysisBenchmark` and `BytecodeAnalysisSuite` show the 
currently supported cases. The focus is on the conceptual approach, not on 
addressing every possible method/instruction.
   
   ### Open Questions
   
   - We need to discuss the scope of this work. As mentioned before, we can 
either translate closures into Catalyst expressions or just derive additional 
information about the content of closures (if that's useful enough).
   - We need to discuss the overall approach we want to take (e.g., bytecode 
interpretation, control-flow graphs, AST, Jimple).
   - We need to discuss which library (if any) to use.
   - We need to handle overly complicated result expressions, as they can slow 
down the computation. For example, we can introduce thresholds. Apart from 
that, we can add more optimization rules to simplify expressions.
   - We need to ensure the conversion is lightweight and doesn't slow down the 
job planning time.
   - We need to handle cases when the conversion doesn't terminate (e.g., 
infinite recursion).
   - We need to ensure that edge cases work properly (e.g., null handling, 
exceptions, arithmetic expressions).
   - We need to decide how to properly handle flow-control instructions (e.g., 
if statements). The current code handles them via recursion and jumps; see the 
sketch below.
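
   A sketch of how an `ifeq` could be handled (illustrative only; 
`interpretFrom` is a hypothetical helper, and the real code must also deal 
with merging stacks and detecting loops):

```scala
import org.apache.spark.sql.catalyst.expressions.{EqualTo, Expression, If, Literal}
import org.objectweb.asm.Opcodes
import org.objectweb.asm.tree.{AbstractInsnNode, JumpInsnNode}

import scala.collection.mutable

// hypothetical helper that resumes interpretation at a given instruction
// and returns the expression left on top of the stack
def interpretFrom(
    insn: AbstractInsnNode,
    locals: Array[Expression],
    stack: mutable.ArrayStack[Expression]): Expression = ???

// ifeq forks interpretation into two branches that are recombined
// into a single Catalyst If expression
def interpretIfeq(
    j: JumpInsnNode,
    locals: Array[Expression],
    stack: mutable.ArrayStack[Expression]): Expression = {
  require(j.getOpcode == Opcodes.IFEQ)
  // ifeq jumps when the popped operand equals 0
  val condition = EqualTo(stack.pop(), Literal(0))
  val whenZero = interpretFrom(j.label, locals, stack.clone())
  val whenNonZero = interpretFrom(j.getNext, locals, stack.clone())
  If(condition, whenZero, whenNonZero)
}
```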
