mbutrovich opened a new pull request, #892:
URL: https://github.com/apache/datafusion-comet/pull/892
## Which issue does this PR close?
Closes #798.
## Rationale for this change
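`CometSparkToColumnarExec` currently iterates over its input batches with a row
iterator, regardless of whether the data is already columnar. For columnar
input this adds per-row overhead that a column-at-a-time conversion can avoid.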
## What changes are included in this PR?
- Extended `ArrowWriter` to write an entire column at a time, with an optimization for columns that contain no NULL values.
- Turned `ArrowBatchIterator` into an abstract class, `SparkToArrowConverter`, with specialized subclasses for each source batch type (`ArrowBatchIteratorFromColumnBatch`, `ArrowBatchIteratorFromInternalRows`). Naming feedback is welcome.
- `CometSparkToColumnarExec` now picks one of these specialized iterators based on the child's output.
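
To make the idea behind these changes concrete, here is a minimal, language-neutral sketch in Python. It is not the Scala code in this PR: `Column`, `ColumnWriter`, `convert_rows`, `convert_columns`, and `convert` are hypothetical stand-ins for a Spark column vector, an Arrow field writer, the two specialized iterators, and the selection done in `CometSparkToColumnarExec`. It only illustrates the difference between writing value by value and writing a whole column, with a fast path when the column has no NULLs.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Column:
    """Hypothetical stand-in for a column of a columnar batch."""
    values: List[int]
    nulls: List[bool]  # nulls[i] is True when row i is NULL

    def has_null(self) -> bool:
        return any(self.nulls)


class ColumnWriter:
    """Hypothetical stand-in for a writer that fills one output column."""

    def __init__(self) -> None:
        self.out: List[Optional[int]] = []

    def write_value(self, value: Optional[int]) -> None:
        # Row path: one call per value.
        self.out.append(value)

    def write_column(self, col: Column) -> None:
        if not col.has_null():
            # Fast path: no NULLs, so copy the column without per-value checks.
            self.out.extend(col.values)
        else:
            # Slow path: consult the null flag for every value.
            self.out.extend(
                None if is_null else value
                for value, is_null in zip(col.values, col.nulls)
            )


def convert_rows(rows: Sequence[Sequence[Optional[int]]]) -> List[List[Optional[int]]]:
    """Row-based conversion: visit every value of every row."""
    num_cols = len(rows[0]) if rows else 0
    writers = [ColumnWriter() for _ in range(num_cols)]
    for row in rows:
        for writer, value in zip(writers, row):
            writer.write_value(value)
    return [writer.out for writer in writers]


def convert_columns(batch: Sequence[Column]) -> List[List[Optional[int]]]:
    """Column-based conversion: one write call per column."""
    writers = [ColumnWriter() for _ in batch]
    for writer, col in zip(writers, batch):
        writer.write_column(col)
    return [writer.out for writer in writers]


def convert(child_is_columnar: bool, data) -> List[List[Optional[int]]]:
    """Pick the converter that matches the shape of the child's output."""
    return convert_columns(data) if child_is_columnar else convert_rows(data)
```

Both paths produce the same output for the same data; only the access pattern differs. For example, `convert(True, [Column([1, 2, 3], [False, False, False])])` takes the no-NULL fast path, while a column with any `True` in `nulls` falls back to the per-value check.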
## How are these changes tested?
There is a new test that exercises various types coming out of the Spark
Parquet reader (BatchScan), and a new config in CometExecSuite
(`SQL Parquet - Comet (Scan disabled, Exec)`) to test this configuration.

On the main branch, the benchmark results are:
```
OpenJDK 64-Bit Server VM 11.0.24+8-LTS on Mac OS X 14.6.1
Apple M3 Max
Project + Filter Exec (0.0% zeros):         Best Time(ms)  Avg Time(ms)     Stdev(ms)     Rate(M/s)   Per Row(ns)      Relative
-------------------------------------------------------------------------------------------------------------------------------
SQL Parquet - Spark                                    57            73            13         183.4           5.5          1.0X
SQL Parquet - Comet (Scan)                             53            67            10         198.4           5.0          1.1X
SQL Parquet - Comet (Scan, Exec)                       52            64             9         201.4           5.0          1.1X
SQL Parquet - Comet (Scan disabled, Exec)              97           109             8         108.6           9.2          0.6X

OpenJDK 64-Bit Server VM 11.0.24+8-LTS on Mac OS X 14.6.1
Apple M3 Max
Project + Filter Exec (50.0% zeros):        Best Time(ms)  Avg Time(ms)     Stdev(ms)     Rate(M/s)   Per Row(ns)      Relative
-------------------------------------------------------------------------------------------------------------------------------
SQL Parquet - Spark                                    42            49             8         250.2           4.0          1.0X
SQL Parquet - Comet (Scan)                             41            54             9         254.2           3.9          1.0X
SQL Parquet - Comet (Scan, Exec)                       45            59            10         234.8           4.3          0.9X
SQL Parquet - Comet (Scan disabled, Exec)              85            98            12         123.3           8.1          0.5X

OpenJDK 64-Bit Server VM 11.0.24+8-LTS on Mac OS X 14.6.1
Apple M3 Max
Project + Filter Exec (95.0% zeros):        Best Time(ms)  Avg Time(ms)     Stdev(ms)     Rate(M/s)   Per Row(ns)      Relative
-------------------------------------------------------------------------------------------------------------------------------
SQL Parquet - Spark                                    27            36            10         384.7           2.6          1.0X
SQL Parquet - Comet (Scan)                             28            46            12         372.0           2.7          1.0X
SQL Parquet - Comet (Scan, Exec)                       33            53            12         319.5           3.1          0.8X
SQL Parquet - Comet (Scan disabled, Exec)              73            86            12         143.4           7.0          0.4X
```
With the changes in this PR:
```
OpenJDK 64-Bit Server VM 11.0.24+8-LTS on Mac OS X 14.6.1
Apple M3 Max
Project + Filter Exec (0.0% zeros):         Best Time(ms)  Avg Time(ms)     Stdev(ms)     Rate(M/s)   Per Row(ns)      Relative
-------------------------------------------------------------------------------------------------------------------------------
SQL Parquet - Spark                                    63            78            12         165.3           6.1          1.0X
SQL Parquet - Comet (Scan)                             54            72            12         195.2           5.1          1.2X
SQL Parquet - Comet (Scan, Exec)                       54            66            11         193.9           5.2          1.2X
SQL Parquet - Comet (Scan disabled, Exec)              79            94            11         133.2           7.5          0.8X

OpenJDK 64-Bit Server VM 11.0.24+8-LTS on Mac OS X 14.6.1
Apple M3 Max
Project + Filter Exec (50.0% zeros):        Best Time(ms)  Avg Time(ms)     Stdev(ms)     Rate(M/s)   Per Row(ns)      Relative
-------------------------------------------------------------------------------------------------------------------------------
SQL Parquet - Spark                                    43            58            13         246.1           4.1          1.0X
SQL Parquet - Comet (Scan)                             42            60            14         250.3           4.0          1.0X
SQL Parquet - Comet (Scan, Exec)                       46            66            15         229.0           4.4          0.9X
SQL Parquet - Comet (Scan disabled, Exec)              72            94            15         145.5           6.9          0.6X

OpenJDK 64-Bit Server VM 11.0.24+8-LTS on Mac OS X 14.6.1
Apple M3 Max
Project + Filter Exec (95.0% zeros):        Best Time(ms)  Avg Time(ms)     Stdev(ms)     Rate(M/s)   Per Row(ns)      Relative
-------------------------------------------------------------------------------------------------------------------------------
SQL Parquet - Spark                                    27            40            14         383.7           2.6          1.0X
SQL Parquet - Comet (Scan)                             28            47            14         376.2           2.7          1.0X
SQL Parquet - Comet (Scan, Exec)                       34            56            15         312.5           3.2          0.8X
SQL Parquet - Comet (Scan disabled, Exec)              63            87            18         166.2           6.0          0.4X
```
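
For a quick before/after comparison of the affected configuration, the short sketch below computes the best-time ratio for `SQL Parquet - Comet (Scan disabled, Exec)` from the two sets of results. The numbers are copied from the benchmark output; the `best_ms` dictionary and the `speedup` helper are only illustrative. Since these come from single runs with standard deviations of 8 to 18 ms, the ratios are indicative rather than precise.

```python
# Best Time (ms) for "SQL Parquet - Comet (Scan disabled, Exec)":
# (main branch, this PR), copied from the benchmark output.
best_ms = {
    "0.0% zeros": (97, 79),
    "50.0% zeros": (85, 72),
    "95.0% zeros": (73, 63),
}


def speedup(main_ms: float, pr_ms: float) -> float:
    """Ratio of main-branch time to PR time; above 1.0 means the PR is faster."""
    return main_ms / pr_ms


for case, (main_ms, pr_ms) in best_ms.items():
    print(f"{case}: {main_ms} ms -> {pr_ms} ms ({speedup(main_ms, pr_ms):.2f}x)")
```

This prints ratios of 1.23x, 1.18x, and 1.16x for the three cases. Comparing absolute times is more reliable here than comparing the `Relative` column across runs, because `Relative` is measured against the Spark baseline of the same run, and that baseline itself moved between runs (for example 57 ms to 63 ms in the 0.0% zeros case).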
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]