[ 
https://issues.apache.org/jira/browse/DRILL-4777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15644704#comment-15644704
 ] 

Paul Rogers commented on DRILL-4777:
------------------------------------

The generated code can be improved in many ways.

1. Today, we generate code based on the "holder" pattern; we generate code that 
creates temporary "holder" instances, then we go into the byte code and do 
"scalar replacement" to replace the temporary objects with local variables. 
Better code generation would not even create the temporary objects in the first 
place.
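
As a rough illustration of the difference (a hand-written sketch, not Drill's 
actual generated code; `IntHolder` here is a simplified stand-in and 
`AddSketch` is a made-up name), compare the holder style with code that uses 
local variables from the start:

```java
// Simplified stand-in for a holder class; an assumption for illustration.
final class IntHolder {
  int value;
}

final class AddSketch {
  // Holder style: each value is copied into a temporary object first.
  // Scalar replacement would later have to turn these objects into locals.
  static int addWithHolders(int[] a, int[] b, int row) {
    IntHolder left = new IntHolder();
    left.value = a[row];
    IntHolder right = new IntHolder();
    right.value = b[row];
    IntHolder out = new IntHolder();
    out.value = left.value + right.value;
    return out.value;
  }

  // Direct style: the generator emits local variables in the first place.
  static int addDirect(int[] a, int[] b, int row) {
    int left = a[row];
    int right = b[row];
    return left + right;
  }
}
```

Both methods compute the same result; the second simply has nothing left for 
a byte-code rewrite to remove.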

2. Because of the holders, null checking is awkward. Instead, code can be very 
simple:

{code}
if ( left.isNull( ) && right.isNull( ) ) { return 0; }
if ( left.isNull( ) ) { return -1; }
if ( right.isNull( ) ) { return 1; }
// Do code for non-null case
{code}

To handle the "sort nulls high" and "sort nulls low" cases, simply replace the 
"1" and "-1" above with a variable and its negation, where the variable is set 
to either 1 (nulls low) or -1 (nulls high).

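A minimal sketch of that idea, assuming a made-up `NullableInt` wrapper in 
place of a real nullable vector accessor (none of these names are Drill 
classes):

```java
// Hypothetical nullable value; stands in for a nullable vector accessor.
final class NullableInt {
  private final Integer value;
  NullableInt(Integer value) { this.value = value; }
  boolean isNull() { return value == null; }
  int get() { return value; }
}

final class NullAwareCompare {
  static final int NULLS_LOW = 1;    // nulls sort before non-null values
  static final int NULLS_HIGH = -1;  // nulls sort after non-null values

  // nullDir is NULLS_LOW or NULLS_HIGH.
  static int compare(NullableInt left, NullableInt right, int nullDir) {
    if (left.isNull() && right.isNull()) { return 0; }
    if (left.isNull()) { return -nullDir; }
    if (right.isNull()) { return nullDir; }
    // Code for the non-null case
    return Integer.compare(left.get(), right.get());
  }
}
```
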
3. The code generation mechanism reverse engineers functions, generates code, 
merges code and does scalar replacement. There is no evidence that this is 
necessary. Instead, do the above to avoid the need for scalar replacement, and 
have the generated class extend the template to avoid the need for code merge 
(and to produce smaller generated files).
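
A sketch of what "extend the template" could look like, using plain Java 
inheritance. `CompareTemplate` and `GeneratedCompare` are hypothetical names, 
and the second class is hand-written here to stand in for generated code:

```java
// Hypothetical template: hand-written and compiled once with the product.
abstract class CompareTemplate {
  // Shared logic stays in the template, so it is never copied or merged.
  final int compareRows(int[] left, int leftRow, int[] right, int rightRow) {
    return doEval(left[leftRow], right[rightRow]);
  }

  // The only part that varies per query.
  protected abstract int doEval(int left, int right);
}

// Stand-in for a generated class: it contains only the query-specific method.
final class GeneratedCompare extends CompareTemplate {
  @Override
  protected int doEval(int left, int right) {
    return Integer.compare(left, right);
  }
}
```

The generated source then holds one small method rather than a merged copy of 
the template body.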

4. When expressions contain multiple parts (e.g. ORDER BY a, b), we generate 
one big wad of code for both cases. But, that leads to very large code and code 
that is unlikely to be reused. Instead, generate code as a compound collection 
of classes. In (very crude) pseudo-code:

{code}
class CompareVarChar {
  // Code for the varchar case, parameterized by vector location
}
class Compare17 {
  public int doEval( ) {
    int result = CompareVarChar.compare( left, right, 1 );
    if ( result == 0 ) { result = CompareVarChar.compare( left, right, 2 ); }
    return result;
  }
}
{code}
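
A runnable variant of the same idea, with assumptions stated: rows are plain 
`String[]` arrays rather than value vectors, and object composition 
(`CompoundCompare`, a made-up name) stands in for the small generated class 
`Compare17`:

```java
interface RowComparator {
  int compare(String[] left, String[] right);
}

// Reusable piece: compares one VARCHAR-like column, parameterized by position.
final class CompareVarChar implements RowComparator {
  private final int column;

  CompareVarChar(int column) { this.column = column; }

  @Override
  public int compare(String[] left, String[] right) {
    return left[column].compareTo(right[column]);
  }
}

// Small per-query piece: chains reusable comparators; first non-zero wins.
final class CompoundCompare implements RowComparator {
  private final RowComparator[] parts;

  CompoundCompare(RowComparator... parts) { this.parts = parts; }

  @Override
  public int compare(String[] left, String[] right) {
    for (RowComparator part : parts) {
      int result = part.compare(left, right);
      if (result != 0) { return result; }
    }
    return 0;
  }
}
```

For ORDER BY a, b this would be built as 
`new CompoundCompare(new CompareVarChar(0), new CompareVarChar(1))`; only the 
small compound piece is specific to the query.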

5. We use a Guava LocalCache to cache code and ensure that we generate only one 
copy for each unique class. But the key is the source code itself, so each call 
to `hashCode` or `equals` works with the entire source code string, which is 
very inefficient. At least cache the hash value. Better, create a definition 
that is used to generate the code (the set of all parameters needed for CG), 
and use that as the key. Generate code only when necessary.
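
A sketch of the "definition as key" approach. Everything here is hypothetical: 
`CompareDefinition` and `CodeCache` are made-up names, the two fields are only 
examples of CG parameters, and a `ConcurrentHashMap` stands in for the Guava 
cache so that the example is self-contained:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical definition: the parameters that drive code generation.
final class CompareDefinition {
  private final List<String> columnTypes;
  private final boolean nullsHigh;
  private final int hash; // computed once; the object is immutable

  CompareDefinition(List<String> columnTypes, boolean nullsHigh) {
    this.columnTypes = Collections.unmodifiableList(new ArrayList<>(columnTypes));
    this.nullsHigh = nullsHigh;
    this.hash = 31 * this.columnTypes.hashCode() + Boolean.hashCode(nullsHigh);
  }

  List<String> columnTypes() { return columnTypes; }
  boolean nullsHigh() { return nullsHigh; }

  @Override
  public boolean equals(Object o) {
    if (this == o) { return true; }
    if (!(o instanceof CompareDefinition)) { return false; }
    CompareDefinition other = (CompareDefinition) o;
    return nullsHigh == other.nullsHigh && columnTypes.equals(other.columnTypes);
  }

  @Override
  public int hashCode() { return hash; }
}

final class CodeCache {
  // Stand-in for the Guava cache; keyed by the definition, not the source.
  private final ConcurrentHashMap<CompareDefinition, String> sources =
      new ConcurrentHashMap<>();
  private final AtomicInteger generated = new AtomicInteger();

  // Source is generated only on a cache miss.
  String sourceFor(CompareDefinition def) {
    return sources.computeIfAbsent(def, d -> {
      generated.incrementAndGet();
      return "// compare " + d.columnTypes() + ", nullsHigh=" + d.nullsHigh();
    });
  }

  int generatedCount() { return generated.get(); }
}
```

Two requests with equal definitions compare a handful of small fields and 
trigger a single generation, instead of building and hashing two full source 
strings.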

> Fuse generated code to reduce code size and gain performance improvement
> ------------------------------------------------------------------------
>
>                 Key: DRILL-4777
>                 URL: https://issues.apache.org/jira/browse/DRILL-4777
>             Project: Apache Drill
>          Issue Type: Bug
>            Reporter: Chunhui Shi
>            Assignee: Arina Ielchiieva
>
> Drill generates code for operators, compiles the classes, and loads them on 
> the fly during a query. However, for large queries the generated code can 
> reach hundreds of KB or more. We have seen multiple issues reported when 
> generated code is too big, either due to Java's size limit on a single method 
> or due to degraded performance when compiling or executing. Also, when I 
> looked at JIT optimization logs, there were many complaints about 'hot method 
> too big'.
> Some measures can be considered to reduce the code size, such as:
> 1) For now, Drill embeds function calls' code directly into the generated 
> code, which turns a one-line function call into 5-10 lines of code in the 
> generated Java classes. If we inject these functions as private functions of 
> the classes and call them directly in the main function body, this could 
> reduce code size, while the cost of the function call can be erased by JIT 
> inlining.
> 2) Drill generates one variable for each column. If the number of columns 
> grows to dozens or a hundred, the code becomes redundant. We could consider 
> using an array to store the value vectors and looping over it, so the code 
> size is reduced even more.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
