zeroshade opened a new pull request, #707:
URL: https://github.com/apache/arrow-go/pull/707

   ### Rationale for this change
   Boolean columns are currently converted twice when transferring between Arrow and Parquet: the bit-packed data is expanded into a `[]bool` slice and then packed back into bits.
   
   ### What changes are included in this PR?
   
   **1. Arrow bitutil (`arrow/bitutil/bitmaps.go`)**
   - Added `AppendBitmap()` method to `BitmapWriter`
   - Directly copies bits from source bitmap using efficient `CopyBitmap()`
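
To illustrate what an append-style bitmap copy does, here is a minimal standalone sketch. It does not use the arrow-go API: `appendBits`, `bitIsSet`, and `setBitTo` are hypothetical helpers, and the bit-by-bit loop is only the simplest correct form of the operation, not the optimized copy the PR relies on. It assumes LSB-first bit order, which is what Arrow bitmaps use.

```go
package main

import "fmt"

// bitIsSet reports whether bit i is set in an LSB-first packed bitmap.
func bitIsSet(buf []byte, i int) bool {
	return buf[i/8]&(1<<(uint(i)%8)) != 0
}

// setBitTo sets or clears bit i in an LSB-first packed bitmap.
func setBitTo(buf []byte, i int, v bool) {
	if v {
		buf[i/8] |= 1 << (uint(i) % 8)
	} else {
		buf[i/8] &^= 1 << (uint(i) % 8)
	}
}

// appendBits (hypothetical) copies n bits from src, starting at bit
// srcOffset, into dst starting at bit dstOffset. It returns the next
// free bit position in dst so that calls can be chained.
func appendBits(dst []byte, dstOffset int, src []byte, srcOffset, n int) int {
	for i := 0; i < n; i++ {
		setBitTo(dst, dstOffset+i, bitIsSet(src, srcOffset+i))
	}
	return dstOffset + n
}

func main() {
	src := []byte{0b10110101, 0b00000011}
	dst := make([]byte, 2)

	pos := appendBits(dst, 0, src, 2, 5)  // append bits 2..6 of src
	pos = appendBits(dst, pos, src, 8, 2) // then bits 8..9 of src

	fmt.Printf("dst[0]=%08b next=%d\n", dst[0], pos)
}
```

No intermediate `[]bool` is ever allocated; the source offset and the write position can both be unaligned.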
   
   **2. Parquet encoder (`parquet/internal/encoding/boolean_encoder.go`)**
   - Added `PutBitmap()` method to `PlainBooleanEncoder`
   - Writes bitmap data directly without bool slice conversion
   
   **3. Parquet decoder (`parquet/internal/encoding/boolean_decoder.go`)**
   - Added `DecodeToBitmap()` method to `PlainBooleanDecoder`
   - Reads directly into output bitmap
   - Optimized fast path for byte-aligned cases
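
The byte-aligned fast path can be sketched as follows. This is an illustration under stated assumptions, not the PR's code: `decodeToBitmap` here is a hypothetical free function, and it assumes the encoded input and the output bitmap share the same LSB-first bit packing (true of Parquet PLAIN booleans and Arrow bitmaps), so whole bytes can be copied when the output offset falls on a byte boundary.

```go
package main

import "fmt"

// getBit reads bit i from an LSB-first packed bitmap.
func getBit(buf []byte, i int) bool {
	return buf[i/8]&(1<<(uint(i)%8)) != 0
}

// setBit sets or clears bit i in an LSB-first packed bitmap.
func setBit(buf []byte, i int, v bool) {
	if v {
		buf[i/8] |= 1 << (uint(i) % 8)
	} else {
		buf[i/8] &^= 1 << (uint(i) % 8)
	}
}

// decodeToBitmap (hypothetical) copies the first n bits of src into out,
// starting at bit outOffset.
func decodeToBitmap(src []byte, n int, out []byte, outOffset int) {
	start := 0
	if outOffset%8 == 0 {
		// Fast path: the output is byte-aligned, so all complete
		// bytes can be moved with a single copy.
		fullBytes := n / 8
		copy(out[outOffset/8:], src[:fullBytes])
		start = fullBytes * 8
	}
	// Remaining bits (or all bits, when unaligned) are copied one at a
	// time so that neighbouring bits in out are left untouched.
	for i := start; i < n; i++ {
		setBit(out, outOffset+i, getBit(src, i))
	}
}

func main() {
	src := []byte{0xA5, 0x03}

	aligned := make([]byte, 3)
	decodeToBitmap(src, 10, aligned, 8) // takes the fast path

	unaligned := make([]byte, 3)
	decodeToBitmap(src, 10, unaligned, 3) // falls back to per-bit copy

	fmt.Println(aligned, unaligned)
}
```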
   
   **4. Column writer (`parquet/file/column_writer_types.gen.go`)**
   - Added `WriteBitmapBatch()` for non-nullable boolean columns
   - Added `WriteBitmapBatchSpaced()` for nullable boolean columns
   - Internal helper methods `writeBitmapValues()` and `writeBitmapValuesSpaced()`
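
For the nullable ("spaced") case, the input carries a slot for every row, including nulls, plus a validity bitmap. Parquet data pages store only the non-null values, so the spaced input has to be compacted at some point. The sketch below shows that compaction on raw bitmaps; `compactValid` is a hypothetical helper written for this illustration and is not claimed to match how the PR implements `WriteBitmapBatchSpaced()`.

```go
package main

import "fmt"

// getBit reads bit i from an LSB-first packed bitmap.
func getBit(buf []byte, i int) bool {
	return buf[i/8]&(1<<(uint(i)%8)) != 0
}

// compactValid (hypothetical) walks n slots of a spaced boolean bitmap
// and packs only the slots marked valid into a new dense bitmap.
// It returns the dense bitmap and the number of non-null values.
func compactValid(values, valid []byte, n int) ([]byte, int) {
	dense := make([]byte, (n+7)/8)
	count := 0
	for i := 0; i < n; i++ {
		if !getBit(valid, i) {
			continue // null slot: nothing is written to the data page
		}
		if getBit(values, i) {
			dense[count/8] |= 1 << (uint(count) % 8)
		}
		count++
	}
	// Trim to the bytes actually needed for count bits.
	return dense[:(count+7)/8], count
}

func main() {
	// Slots 0..5 hold true,false,true,true,false,true (LSB first).
	values := []byte{0b00101101}
	// Slots 2 and 4 are null.
	valid := []byte{0b00101011}

	dense, count := compactValid(values, valid, 6)
	fmt.Printf("dense=%08b count=%d\n", dense[0], count)
}
```

This per-slot inspection of the validity bitmap is the kind of extra work the nullable path has to do compared with the non-nullable one.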
   
   **5. Arrow-Parquet bridge (`parquet/pqarrow/encode_arrow.go`)**
   - Modified `writeDenseArrow()` to detect boolean arrays
   - Uses bitmap methods when available
   - Falls back to original `[]bool` path if needed (backward compatible)
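
The "use bitmap methods when available, otherwise fall back" behaviour can be expressed in Go with an optional-interface type assertion. The following is a sketch of that pattern only; the interface names, encoder types, and the `PutBitmap` signature shown here are assumptions made for the example and may differ from the ones in the PR.

```go
package main

import "fmt"

// boolEncoder is a hypothetical encoder that only knows the []bool path.
type boolEncoder interface {
	Put(values []bool)
}

// bitmapEncoder is a hypothetical optional extension that accepts
// bit-packed input directly.
type bitmapEncoder interface {
	boolEncoder
	PutBitmap(bitmap []byte, offset, length int)
}

type sliceOnlyEncoder struct{ values []bool }

func (e *sliceOnlyEncoder) Put(values []bool) { e.values = append(e.values, values...) }

type bitmapCapableEncoder struct {
	sliceOnlyEncoder
	bitsWritten int
}

func (e *bitmapCapableEncoder) PutBitmap(bitmap []byte, offset, length int) {
	e.bitsWritten += length // a real encoder would copy the bits here
}

// writeBooleans prefers the direct bitmap path and falls back to
// expanding into []bool when the encoder does not support it.
// It returns the name of the path taken.
func writeBooleans(enc boolEncoder, bitmap []byte, offset, length int) string {
	if be, ok := enc.(bitmapEncoder); ok {
		be.PutBitmap(bitmap, offset, length)
		return "bitmap"
	}
	values := make([]bool, length)
	for i := range values {
		pos := offset + i
		values[i] = bitmap[pos/8]&(1<<(uint(pos)%8)) != 0
	}
	enc.Put(values)
	return "bool-slice"
}

func main() {
	bitmap := []byte{0b00000101}
	fmt.Println(writeBooleans(&bitmapCapableEncoder{}, bitmap, 0, 3))
	fmt.Println(writeBooleans(&sliceOnlyEncoder{}, bitmap, 0, 3))
}
```

Because the check happens at the call site, existing encoders that implement only the `[]bool` method keep working unchanged.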
   
   
   ### Are these changes tested?
   
   Yes, and new benchmarks are added as appropriate
   
   ### Are there any user-facing changes?
   
   Just performance:
   
   ### Non-Nullable Boolean Columns
   ```
   BenchmarkBooleanBitmapWrite/1K-16          314847    19126 ns/op    6.54 MB/s    36057 B/op    237 allocs/op
   BenchmarkBooleanBitmapWrite/10K-16         174715    33985 ns/op   36.78 MB/s    53266 B/op    247 allocs/op
   BenchmarkBooleanBitmapWrite/100K-16         34099   175655 ns/op   71.16 MB/s   218866 B/op    340 allocs/op
   BenchmarkBooleanBitmapWrite/1M-16            3778  1568818 ns/op   79.68 MB/s  1763712 B/op   1237 allocs/op
   ```
   
   ### Nullable Boolean Columns (10% null rate)
   ```
   BenchmarkBooleanBitmapWriteNullable/1K-16    214921    28002 ns/op    4.46 MB/s    39706 B/op    249 allocs/op
   BenchmarkBooleanBitmapWriteNullable/10K-16    44618   134483 ns/op    9.29 MB/s   113690 B/op    268 allocs/op
   BenchmarkBooleanBitmapWriteNullable/100K-16    5239  1149658 ns/op   10.87 MB/s   657178 B/op    451 allocs/op
   BenchmarkBooleanBitmapWriteNullable/1M-16       556 10926274 ns/op   11.44 MB/s  5575200 B/op   2219 allocs/op
   ```
   
   **Key Observations:**
   - Direct bitmap path successfully avoids `[]bool` conversion
   - Throughput scales well with data size (6.5 → 80 MB/s for non-nullable)
   - Allocation counts grow slowly with input size (237 allocs/op at 1K values, 1237 allocs/op at 1M for non-nullable)
   - Nullable columns have overhead from validity bitmap processing (expected): about 7x slower than non-nullable at 1M values
   
   
   

