mbutrovich opened a new pull request, #892:
URL: https://github.com/apache/datafusion-comet/pull/892

   ## Which issue does this PR close?
   
   Closes #798.
   
   ## Rationale for this change
   
   `CometSparkToColumnar` currently iterates over input batches with a row 
iterator, even when the child already produces columnar data.
   
   ## What changes are included in this PR?
   
   - Extended `ArrowWriter` to write an entire column at a time, with a fast 
path for columns that do not contain any NULL values.
   - Made `ArrowBatchIterator` into an abstract class, `SparkToArrowConverter`, 
with specialized subclasses for each source batch type 
(`ArrowBatchIteratorFromColumnBatch`, `ArrowBatchIteratorFromInternalRows`). 
Naming feedback is welcome.
   - `CometSparkToColumnarExec` now uses one of these specialized iterators, 
chosen according to the child's output.
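
For readers who have not looked at the diff, the column-at-a-time idea can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in this PR: `IntColumn`, `IntColumnWriter`, `writeCol`, and `of` are hypothetical names, and the plain arrays stand in for Arrow value and validity buffers. The accessor names `hasNull`, `isNullAt`, and `getInt` are chosen to resemble a columnar vector interface, but the interface here is an assumption made for the example.

```java
import java.util.Arrays;

/** Illustrative sketch only; none of these types are Comet, Spark, or Arrow classes. */
final class ColumnWriteSketch {

    /** Hypothetical read-only int column, loosely modeled on a columnar vector. */
    interface IntColumn {
        int size();
        boolean hasNull();
        boolean isNullAt(int row);
        int getInt(int row);
    }

    /** Hypothetical destination: a value buffer plus per-row validity flags. */
    static final class IntColumnWriter {
        final int[] values;
        final boolean[] valid;
        int count = 0;

        IntColumnWriter(int capacity) {
            values = new int[capacity];
            valid = new boolean[capacity];
        }

        /** Appends a whole column; the caller must ensure enough remaining capacity. */
        void writeCol(IntColumn col) {
            int n = col.size();
            if (!col.hasNull()) {
                // Fast path: no per-row null check, validity set in one bulk fill.
                for (int i = 0; i < n; i++) {
                    values[count + i] = col.getInt(i);
                }
                Arrays.fill(valid, count, count + n, true);
            } else {
                // General path: check each row and leave NULL rows marked invalid.
                for (int i = 0; i < n; i++) {
                    if (!col.isNullAt(i)) {
                        values[count + i] = col.getInt(i);
                        valid[count + i] = true;
                    }
                }
            }
            count += n;
        }
    }

    /** Builds an in-memory column from boxed values; null entries represent NULLs. */
    static IntColumn of(Integer... data) {
        boolean anyNull = Arrays.stream(data).anyMatch(v -> v == null);
        return new IntColumn() {
            public int size() { return data.length; }
            public boolean hasNull() { return anyNull; }
            public boolean isNullAt(int row) { return data[row] == null; }
            public int getInt(int row) { return data[row]; }
        };
    }

    public static void main(String[] args) {
        IntColumnWriter writer = new IntColumnWriter(5);
        writer.writeCol(of(1, 2));        // takes the no-NULL fast path
        writer.writeCol(of(3, null, 5));  // takes the per-row path
        System.out.println(Arrays.toString(writer.values));
        System.out.println(Arrays.toString(writer.valid));
    }
}
```

The point of the sketch is only the shape of the loop: the NULL check is hoisted out of the per-row loop when the source column reports that it has no NULLs, which is the kind of saving a row iterator cannot offer.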
   
   ## How are these changes tested?
   
   There is a new test to exercise various types coming out of the Spark 
Parquet reader (BatchScan), and a new config in CometExecSuite (`SQL Parquet - 
Comet (Scan disabled, Exec)`) to test this configuration. On the main branch, 
the benchmark results are:
   ```
   OpenJDK 64-Bit Server VM 11.0.24+8-LTS on Mac OS X 14.6.1
   Apple M3 Max
   Project + Filter Exec (0.0% zeros):        Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   -------------------------------------------------------------------------------------------------------------------------
   SQL Parquet - Spark                                   57             73          13        183.4           5.5       1.0X
   SQL Parquet - Comet (Scan)                            53             67          10        198.4           5.0       1.1X
   SQL Parquet - Comet (Scan, Exec)                      52             64           9        201.4           5.0       1.1X
   SQL Parquet - Comet (Scan disabled, Exec)             97            109           8        108.6           9.2       0.6X

   OpenJDK 64-Bit Server VM 11.0.24+8-LTS on Mac OS X 14.6.1
   Apple M3 Max
   Project + Filter Exec (50.0% zeros):       Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   -------------------------------------------------------------------------------------------------------------------------
   SQL Parquet - Spark                                   42             49           8        250.2           4.0       1.0X
   SQL Parquet - Comet (Scan)                            41             54           9        254.2           3.9       1.0X
   SQL Parquet - Comet (Scan, Exec)                      45             59          10        234.8           4.3       0.9X
   SQL Parquet - Comet (Scan disabled, Exec)             85             98          12        123.3           8.1       0.5X

   OpenJDK 64-Bit Server VM 11.0.24+8-LTS on Mac OS X 14.6.1
   Apple M3 Max
   Project + Filter Exec (95.0% zeros):       Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   -------------------------------------------------------------------------------------------------------------------------
   SQL Parquet - Spark                                   27             36          10        384.7           2.6       1.0X
   SQL Parquet - Comet (Scan)                            28             46          12        372.0           2.7       1.0X
   SQL Parquet - Comet (Scan, Exec)                      33             53          12        319.5           3.1       0.8X
   SQL Parquet - Comet (Scan disabled, Exec)             73             86          12        143.4           7.0       0.4X
   ```
   With the changes in this PR:
   ```
   OpenJDK 64-Bit Server VM 11.0.24+8-LTS on Mac OS X 14.6.1
   Apple M3 Max
   Project + Filter Exec (0.0% zeros):        Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   -------------------------------------------------------------------------------------------------------------------------
   SQL Parquet - Spark                                   63             78          12        165.3           6.1       1.0X
   SQL Parquet - Comet (Scan)                            54             72          12        195.2           5.1       1.2X
   SQL Parquet - Comet (Scan, Exec)                      54             66          11        193.9           5.2       1.2X
   SQL Parquet - Comet (Scan disabled, Exec)             79             94          11        133.2           7.5       0.8X

   OpenJDK 64-Bit Server VM 11.0.24+8-LTS on Mac OS X 14.6.1
   Apple M3 Max
   Project + Filter Exec (50.0% zeros):       Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   -------------------------------------------------------------------------------------------------------------------------
   SQL Parquet - Spark                                   43             58          13        246.1           4.1       1.0X
   SQL Parquet - Comet (Scan)                            42             60          14        250.3           4.0       1.0X
   SQL Parquet - Comet (Scan, Exec)                      46             66          15        229.0           4.4       0.9X
   SQL Parquet - Comet (Scan disabled, Exec)             72             94          15        145.5           6.9       0.6X

   OpenJDK 64-Bit Server VM 11.0.24+8-LTS on Mac OS X 14.6.1
   Apple M3 Max
   Project + Filter Exec (95.0% zeros):       Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   -------------------------------------------------------------------------------------------------------------------------
   SQL Parquet - Spark                                   27             40          14        383.7           2.6       1.0X
   SQL Parquet - Comet (Scan)                            28             47          14        376.2           2.7       1.0X
   SQL Parquet - Comet (Scan, Exec)                      34             56          15        312.5           3.2       0.8X
   SQL Parquet - Comet (Scan disabled, Exec)             63             87          18        166.2           6.0       0.4X
   ```
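
To compare the two runs for the `SQL Parquet - Comet (Scan disabled, Exec)` configuration, the change in best time can be computed directly from the tables. The snippet below is only a convenience sketch; the class and method names are made up for the example, and the inputs are the Best Time(ms) values copied from the results above.

```java
/** Illustrative helper for comparing the benchmark rows; not part of the PR. */
final class BenchmarkDelta {

    /** Percentage reduction going from `before` to `after` (positive means faster). */
    static double percentReduction(double before, double after) {
        return (before - after) / before * 100.0;
    }

    public static void main(String[] args) {
        // Best Time(ms) for "Scan disabled, Exec" at 0%, 50%, and 95% zeros.
        String[] workload = {"0.0% zeros", "50.0% zeros", "95.0% zeros"};
        double[] mainBranch = {97, 85, 73};
        double[] thisPr = {79, 72, 63};

        for (int i = 0; i < workload.length; i++) {
            System.out.printf("%s: %.1f%% lower best time%n",
                    workload[i], percentReduction(mainBranch[i], thisPr[i]));
        }
    }
}
```

Best time for this configuration drops by roughly 19%, 15%, and 14% at 0%, 50%, and 95% zeros respectively. The reported standard deviations are 8 to 18 ms, so these single-run figures should be read as indicative rather than precise.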


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

