joechenrh opened a new pull request, #880:
URL: https://github.com/apache/arrow-go/pull/880

   ### Rationale for this change
   
   Addresses #865. Today the reader materializes each full uncompressed data 
page before decoding, so peak memory scales with the page's uncompressed size — 
a problem for files with very large data pages (e.g. a 1 GB compressed / 1.4 GB 
uncompressed page). This adds an **opt-in** path that keeps peak memory near 
the requested batch size for the common large-page shape.
   
   The approach was discussed and agreed on in the issue.
   
   ### What changes are included in this PR
   
   Opt-in via `ReaderProperties.EnablePageStreaming` (default **false**, no 
behavior change). When set, **eligible** data pages skip full-page 
decompression and decode their values incrementally; every ineligible page 
falls back to the existing materialized path.
   
   **Scope (first cut):**
   - Data Page **V1 + V2**, **PLAIN** encoding
   - Physical types **`byte_array`** and **`fixed_len_byte_array`**
   - Codecs **`UNCOMPRESSED` / `GZIP` / `BROTLI` / `ZSTD`** (explicit allowlist: 
Parquet's Snappy pages use the raw block format, whereas the Snappy library's 
`NewReader` expects the framed format, and LZ4_RAW has no streaming reader)
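
The scope rules above amount to an allowlist check per page. A minimal sketch, assuming a hypothetical `pageInfo` struct with string fields; none of these names come from the PR, and the real reader presumably works with its own typed enums:

```go
package main

import "fmt"

// pageInfo is a hypothetical summary of the page properties that decide
// eligibility; it is not a type from arrow-go.
type pageInfo struct {
	Encoding     string
	PhysicalType string
	Codec        string
}

// Explicit allowlists: anything not listed falls back to the materialized path.
var streamableCodecs = map[string]bool{
	"UNCOMPRESSED": true,
	"GZIP":         true,
	"BROTLI":       true,
	"ZSTD":         true,
}

var streamableTypes = map[string]bool{
	"BYTE_ARRAY":           true,
	"FIXED_LEN_BYTE_ARRAY": true,
}

// eligible reports whether a page may take the streaming path.
func eligible(p pageInfo) bool {
	return p.Encoding == "PLAIN" &&
		streamableTypes[p.PhysicalType] &&
		streamableCodecs[p.Codec]
}

func main() {
	fmt.Println(eligible(pageInfo{"PLAIN", "BYTE_ARRAY", "ZSTD"}))
	fmt.Println(eligible(pageInfo{"PLAIN", "BYTE_ARRAY", "SNAPPY"}))
}
```

An allowlist rather than a denylist means a codec added later stays on the materialized path until someone confirms it has a usable streaming reader.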
   
   **Design:**
   - `parquet/internal/encoding/streaming`: a small `ValueBuffer` — an 
incremental byte source that reads from an `io.Reader`, slides as values are 
consumed, grows to fit a single oversized value, and owns the page's stream 
(drained + closed on release).
   - The materialized `PlainByteArrayDecoder` / `PlainFixedLenByteArrayDecoder` 
keep their `[]byte` logic byte-for-byte, gaining only a `src` field + a 
one-line guard in `Decode`/`Discard`. Streaming decode lives beside them; 
`SetData` resets `src`, so a cached decoder reused for a materialized page 
reverts. Other decoders are untouched.
   - Page reader: for eligible pages, V2 levels are read raw (they are stored 
uncompressed, with their byte lengths in the page header) and V1 levels are 
peeled off the decompressed stream using the same length rules as 
`LevelDecoder.SetData`; values become a `ValueBuffer` 
over the compressed region. On page release the underlying stream is drained + 
closed so the reader lands on the next page header even if values were skipped.
   - Column reader: the same cached PLAIN decoder is fed via `SetSource` 
(streaming) or `SetData` (materialized). The public `Page`/`DataPage` 
interfaces are unchanged (the streaming accessor is reached via an unexported 
interface, so external implementations don't break).
   
   ### Are these changes tested?
   
   Yes.
   - All existing encoding tests pass unchanged (materialized parity); the 
streaming decoders are additionally exercised with a one-byte-at-a-time reader 
and oversized values.
   - End-to-end round-trip (`parquet/file/streaming_page_test.go`): streaming 
vs materialized reads are asserted identical across V1/V2 and each codec, for 
**required** and **nullable** columns (the nullable case covers the V1 
def-level peel + streaming `DecodeSpaced`), including values larger than both 
a page and the stream buffer.
   
   ### Are there any user-facing changes?
   
   One new opt-in field, `ReaderProperties.EnablePageStreaming` (default 
false). No change to existing behavior or public interfaces.
   
   Follow-ups (out of scope here): fixed-width numeric PLAIN types, raw-Snappy 
and LZ4_RAW streaming, and a memory benchmark.
   

