GitHub user kiszk opened a pull request:
https://github.com/apache/spark/pull/11984
[SPARK-14138][SQL] Fix generated SpecificColumnarIterator code that can exceed
the JVM method size limit for cached DataFrames
## What changes were proposed in this pull request?
This PR reduces the Java bytecode size of the methods in
```SpecificColumnarIterator``` using two approaches:
1. Generate and call a ```getTYPEColumnAccessor()``` method for each type that
is actually used when instantiating accessors
2. Group long runs of method calls (more than 4000) into separate helper methods
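The second approach can be sketched outside of Spark's codegen as follows. This is a minimal, hypothetical illustration (the class and method names are mine, not Spark's): one oversized method body is split into fixed-size helpers so that no single compiled method exceeds the JVM's 64KB-per-method bytecode limit.

```java
// Hypothetical sketch of approach 2 (names are illustrative, not Spark's):
// instead of emitting thousands of statements into one method body, the
// code generator groups them into fixed-size helper methods so no single
// method's bytecode exceeds the JVM's 64KB-per-method limit.
public class ChunkedInit {
    static final int NUM_COLUMNS = 4000;
    final int[] accessorIds = new int[NUM_COLUMNS];

    // Each helper stands in for a bounded run of generated statements
    // (e.g. 1000 accessor initializations per method).
    private void initGroup0() { for (int i = 0;    i < 1000; i++) accessorIds[i] = i; }
    private void initGroup1() { for (int i = 1000; i < 2000; i++) accessorIds[i] = i; }
    private void initGroup2() { for (int i = 2000; i < 3000; i++) accessorIds[i] = i; }
    private void initGroup3() { for (int i = 3000; i < 4000; i++) accessorIds[i] = i; }

    // The top-level method shrinks to a handful of calls, each compiling
    // to a small, fixed amount of bytecode.
    public void init() {
        initGroup0();
        initGroup1();
        initGroup2();
        initGroup3();
    }
}
```

The real patch emits straight-line generated statements rather than loops, but the grouping principle is the same: the per-method statement count is bounded, so the bytecode size of each method is bounded too.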
## How was this patch tested?
Added a new unit test to ```InMemoryColumnarQuerySuite```
Here is the generated code:
```java
/* 033 */   private org.apache.spark.sql.execution.columnar.CachedBatch batch = null;
/* 034 */
/* 035 */   private org.apache.spark.sql.execution.columnar.IntColumnAccessor accessor;
/* 036 */   private org.apache.spark.sql.execution.columnar.IntColumnAccessor accessor1;
/* 037 */
/* 038 */   public SpecificColumnarIterator() {
/* 039 */     this.nativeOrder = ByteOrder.nativeOrder();
/* 040 */     this.mutableRow = new MutableUnsafeRow(rowWriter);
/* 041 */   }
/* 042 */
/* 043 */   public void initialize(Iterator input, DataType[] columnTypes, int[] columnIndexes,
/* 044 */     boolean columnNullables[]) {
/* 045 */     this.input = input;
/* 046 */     this.columnTypes = columnTypes;
/* 047 */     this.columnIndexes = columnIndexes;
/* 048 */   }
/* 049 */
/* 050 */
/* 051 */   private org.apache.spark.sql.execution.columnar.IntColumnAccessor getIntColumnAccessor(int idx) {
/* 052 */     byte[] buffer = batch.buffers()[columnIndexes[idx]];
/* 053 */     return new org.apache.spark.sql.execution.columnar.IntColumnAccessor(ByteBuffer.wrap(buffer).order(nativeOrder));
/* 054 */   }
/* 055 */
/* 056 */
/* 057 */
/* 058 */
/* 059 */
/* 060 */
/* 061 */   public boolean hasNext() {
/* 062 */     if (currentRow < numRowsInBatch) {
/* 063 */       return true;
/* 064 */     }
/* 065 */     if (!input.hasNext()) {
/* 066 */       return false;
/* 067 */     }
/* 068 */
/* 069 */     batch = (org.apache.spark.sql.execution.columnar.CachedBatch) input.next();
/* 070 */     currentRow = 0;
/* 071 */     numRowsInBatch = batch.numRows();
/* 072 */     accessor = getIntColumnAccessor(0);
/* 073 */     accessor1 = getIntColumnAccessor(1);
/* 074 */
/* 075 */     return hasNext();
/* 076 */   }
/* 077 */
/* 078 */   public InternalRow next() {
/* 079 */     currentRow += 1;
/* 080 */     bufferHolder.reset();
/* 081 */     rowWriter.zeroOutNullBytes();
/* 082 */     accessor.extractTo(mutableRow, 0);
/* 083 */     accessor1.extractTo(mutableRow, 1);
/* 084 */     unsafeRow.setTotalSize(bufferHolder.totalSize());
/* 085 */     return unsafeRow;
/* 086 */   }
```
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/kiszk/spark SPARK-14138
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/11984.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #11984
----
commit ab67d33787e568245c9e2ab30e51b471f21fa2ed
Author: Kazuaki Ishizaki <[email protected]>
Date: 2016-03-27T04:15:06Z
make code size of hasNext() smaller by preparing get*Acceessor() methods
group a lot of calls into a method
commit fea2a524bbd5b1d0d285e02e6eda590d1f7d67e3
Author: Kazuaki Ishizaki <[email protected]>
Date: 2016-03-27T04:15:38Z
add test case
----
---