joechenrh opened a new pull request, #880: URL: https://github.com/apache/arrow-go/pull/880
### Rationale for this change

Addresses #865. Today the reader materializes each full uncompressed data page before decoding, so peak memory scales with the page's uncompressed size, a problem for files with very large data pages (e.g. a 1 GB compressed / 1.4 GB uncompressed page). This adds an **opt-in** path that keeps peak memory near the requested batch size for the common large-page shape. Approach was discussed and agreed on the issue.

### What changes are included in this PR

Opt-in via `ReaderProperties.EnablePageStreaming` (default **false**, no behavior change). When set, **eligible** data pages skip full-page decompression and decode their values incrementally; every ineligible page falls back to the existing materialized path.

**Scope (first cut):**
- Data Page **V1 + V2**, **PLAIN** encoding
- Physical types **`byte_array`** and **`fixed_len_byte_array`**
- Codecs **`UNCOMPRESSED` / `GZIP` / `BROTLI` / `ZSTD`** (explicit allowlist: Snappy is a raw block whereas its `NewReader` is the framed format, and LZ4_RAW has no streaming reader)

**Design:**
- `parquet/internal/encoding/streaming`: a small `ValueBuffer`, an incremental byte source that reads from an `io.Reader`, slides as values are consumed, grows to fit a single oversized value, and owns the page's stream (drained + closed on release).
- The materialized `PlainByteArrayDecoder` / `PlainFixedLenByteArrayDecoder` keep their `[]byte` logic byte-for-byte, gaining only a `src` field + a one-line guard in `Decode`/`Discard`. Streaming decode lives beside them; `SetData` resets `src`, so a cached decoder reused for a materialized page reverts. Other decoders are untouched.
- Page reader: for eligible pages, V2 levels are read raw (uncompressed, header byte-lengths) and V1 levels are peeled off the decompressed stream using the same length rules as `LevelDecoder.SetData`; values become a `ValueBuffer` over the compressed region. On page release the underlying stream is drained + closed so the reader lands on the next page header even if values were skipped.
- Column reader: the same cached PLAIN decoder is fed via `SetSource` (streaming) or `SetData` (materialized).

The public `Page`/`DataPage` interfaces are unchanged (the streaming accessor is reached via an unexported interface, so external implementations don't break).

### Are these changes tested?

Yes.
- All existing encoding tests pass unchanged (materialized parity); the streaming decoders are additionally exercised with a one-byte-at-a-time reader and oversized values.
- End-to-end round-trip (`parquet/file/streaming_page_test.go`): streaming vs materialized reads are asserted identical across V1/V2 and each codec, for **required** and **nullable** columns (the nullable case covers the V1 def-level peel + streaming `DecodeSpaced`), including values larger than a page and the stream buffer.

### Are there any user-facing changes?

One new opt-in field, `ReaderProperties.EnablePageStreaming` (default false). No change to existing behavior or public interfaces.

Follow-ups (out of scope here): fixed-width numeric PLAIN types, raw-Snappy and LZ4_RAW streaming, and a memory benchmark.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
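
For readers following the `ValueBuffer` design described in the PR body, here is a minimal standalone sketch of the idea. It is not the PR's code: `valueBuffer`, `next`, `decodeByteArrays`, `encodePlain` and `roundTrip` are hypothetical names chosen for illustration, and the drain + close on release is left out. The sketch assumes only the Parquet PLAIN layout for `byte_array` (a 4-byte little-endian length followed by the value bytes) and uses the standard library's `iotest.OneByteReader` to mimic the one-byte-at-a-time reader mentioned in the tests.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"io"
	"strings"
	"testing/iotest"
)

// valueBuffer is a hypothetical stand-in for the PR's ValueBuffer: it pulls
// bytes from an io.Reader on demand; buf[off:end] holds the unread bytes.
type valueBuffer struct {
	src      io.Reader
	buf      []byte
	off, end int
}

func newValueBuffer(src io.Reader, size int) *valueBuffer {
	return &valueBuffer{src: src, buf: make([]byte, size)}
}

// next returns the next n bytes and marks them consumed. The returned slice
// aliases the buffer and is only valid until the following call to next.
func (b *valueBuffer) next(n int) ([]byte, error) {
	if b.end-b.off < n {
		// Slide unread bytes to the front so the free space is contiguous.
		b.end = copy(b.buf, b.buf[b.off:b.end])
		b.off = 0
		if n > len(b.buf) {
			// Grow just enough to hold one oversized value.
			grown := make([]byte, n)
			copy(grown, b.buf[:b.end])
			b.buf = grown
		}
		read, err := io.ReadAtLeast(b.src, b.buf[b.end:], n-b.end)
		b.end += read
		if err != nil {
			return nil, err
		}
	}
	v := b.buf[b.off : b.off+n]
	b.off += n
	return v, nil
}

// decodeByteArrays reads count PLAIN-encoded byte_array values.
func decodeByteArrays(b *valueBuffer, count int) ([]string, error) {
	out := make([]string, 0, count)
	for i := 0; i < count; i++ {
		hdr, err := b.next(4)
		if err != nil {
			return nil, err
		}
		val, err := b.next(int(binary.LittleEndian.Uint32(hdr)))
		if err != nil {
			return nil, err
		}
		out = append(out, string(val)) // copy, because val aliases the buffer
	}
	return out, nil
}

// encodePlain builds a PLAIN byte_array payload for the demo.
func encodePlain(vals ...string) string {
	var sb strings.Builder
	for _, v := range vals {
		var hdr [4]byte
		binary.LittleEndian.PutUint32(hdr[:], uint32(len(v)))
		sb.Write(hdr[:])
		sb.WriteString(v)
	}
	return sb.String()
}

// roundTrip encodes vals, then decodes them through a one-byte-at-a-time
// reader using a buffer of bufSize bytes.
func roundTrip(bufSize int, vals ...string) ([]string, error) {
	src := iotest.OneByteReader(strings.NewReader(encodePlain(vals...)))
	return decodeByteArrays(newValueBuffer(src, bufSize), len(vals))
}

func main() {
	// The 8-byte buffer is smaller than the last value, forcing one grow.
	vals, err := roundTrip(8, "a", "parquet", strings.Repeat("x", 40))
	if err != nil {
		panic(err)
	}
	for _, v := range vals {
		fmt.Println(len(v), v)
	}
}
```

In this sketch the buffer never holds more than one value (or the initial capacity, whichever is larger), which is the property that keeps peak memory independent of the page's uncompressed size.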
