GitHub user nongli commented on a diff in the pull request:
https://github.com/apache/spark/pull/10628#discussion_r49392863
--- Diff:
sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OnHeapColumnVector.java
---
@@ -0,0 +1,165 @@
+package org.apache.spark.sql.execution.vectorized;
+
+import org.apache.spark.sql.types.DataType;
+import org.apache.spark.sql.types.DoubleType;
+import org.apache.spark.sql.types.IntegerType;
+import org.apache.spark.unsafe.Platform;
+
+import java.nio.ByteBuffer;
+import java.nio.DoubleBuffer;
+
+/**
+ * A column backed by an in memory JVM array. This stores the NULLs as a byte per value
+ * and a java array for the values.
+ */
+public final class OnHeapColumnVector extends ColumnVector {
+ // The data stored in these arrays need to maintain binary compatible. We can
+ // directly pass this buffer to external components.
+
+ // This is faster than a boolean array and we optimize this over memory footprint.
--- End diff --
Added a bitset/byte array benchmark. Bytes are 50% faster. We can revisit
this, though. For off-heap in particular, the popcnt instructions are useful.
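
To make the trade-off concrete, here is a minimal sketch of the two null-tracking
layouts being compared. It is not the code from this PR: the class and method
names (NullTrackingSketch, ByteNulls, BitsetNulls, putNull, isNullAt, numNulls)
are hypothetical and chosen only for illustration. The byte layout spends one
byte per value but needs no shifting or masking on access. The bitset layout
packs 64 values per long and can count NULLs with Long.bitCount, which the JIT
typically compiles to a popcnt instruction on hardware that supports it.

```java
final class NullTrackingSketch {
  /** Hypothetical layout: one byte per value, 1 = NULL, 0 = non-NULL. */
  static final class ByteNulls {
    private final byte[] nulls;

    ByteNulls(int capacity) { nulls = new byte[capacity]; }

    void putNull(int rowId) { nulls[rowId] = 1; }

    boolean isNullAt(int rowId) { return nulls[rowId] == 1; }

    // Counting NULLs touches every byte.
    int numNulls() {
      int n = 0;
      for (byte b : nulls) n += b;
      return n;
    }
  }

  /** Hypothetical layout: one bit per value, packed into 64-bit words. */
  static final class BitsetNulls {
    private final long[] words;

    BitsetNulls(int capacity) { words = new long[(capacity + 63) >>> 6]; }

    // Each access needs a shift and a mask to locate the bit.
    void putNull(int rowId) { words[rowId >>> 6] |= 1L << (rowId & 63); }

    boolean isNullAt(int rowId) {
      return (words[rowId >>> 6] & (1L << (rowId & 63))) != 0;
    }

    // Counting NULLs handles 64 values per step via a population count.
    int numNulls() {
      int n = 0;
      for (long w : words) n += Long.bitCount(w);
      return n;
    }
  }

  public static void main(String[] args) {
    ByteNulls bytes = new ByteNulls(100);
    BitsetNulls bits = new BitsetNulls(100);
    for (int rowId : new int[] {3, 64, 99}) {
      bytes.putNull(rowId);
      bits.putNull(rowId);
    }
    System.out.println("byte nulls: " + bytes.numNulls()
        + ", bitset nulls: " + bits.numNulls());
  }
}
```

The sketch only shows the access patterns. It does not reproduce the benchmark
mentioned above, so it says nothing about the measured 50% difference.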