AaronReboot opened a new issue, #820: URL: https://github.com/apache/arrow-go/issues/820
### Describe the bug, including details regarding any error messages, version, and platform. `(*file.Writer).startFile()` panics with `"failed to write magic number"` when the underlying sink's first `Write` returns an error or short-writes. Because `NewParquetWriter` calls `startFile` synchronously inside the constructor — and `NewParquetWriter` has no error return — there is no way for a consumer to recover via the standard Go error handling pattern. **Code** ([`parquet/file/file_writer.go:201-204`](https://github.com/apache/arrow-go/blob/main/parquet/file/file_writer.go#L201-L204)): ```go n, err := fw.sink.Write(magic) if n != 4 || err != nil { panic("failed to write magic number") } ``` **Constructor signature** ([`parquet/file/file_writer.go:72`](https://github.com/apache/arrow-go/blob/main/parquet/file/file_writer.go#L72)): ```go func NewParquetWriter(w io.Writer, sc *schema.GroupNode, opts ...WriteOption) *Writer ``` The higher-level [`pqarrow.NewFileWriter`](https://github.com/apache/arrow-go/blob/main/parquet/pqarrow/file_writer.go#L135) does have an `error` return, but it propagates the panic from `file.NewParquetWriter` through unconverted — so callers of either constructor must wrap the call in `defer recover()` to handle a flaky sink. ### Why this matters in practice When the `sink` is backed by a network-attached writer — e.g., a `*storage.Writer` from `cloud.google.com/go/storage`, a Cloud Storage resumable upload, an HDFS client, an S3 multipart upload — a transient network blip on the first 4-byte write reliably crashes the consumer process. Concrete example: a Kubernetes-deployed Go service that writes ~thousands of Parquet files per day to GCS sees ~1-2 pod crashes per 5 pod-hours under normal cluster conditions, all from this panic. Each crash: 1. Loses the entire worker process (not just the failing call) 2. Forces a container restart 3. Leaves any in-flight resumable upload session orphaned in GCS until session timeout 4. Requires application-level orphan-recovery machinery to reclaim work The panic stack consistently looks like: ``` panic: failed to write magic number goroutine N [running]: github.com/apache/arrow-go/v18/parquet/file.(*Writer).startFile(...) github.com/apache/arrow-go/[email protected]/parquet/file/file_writer.go:203 +0x257 github.com/apache/arrow-go/v18/parquet/file.NewParquetWriter(...) github.com/apache/arrow-go/[email protected]/parquet/file/file_writer.go:90 +0x2d4 github.com/apache/arrow-go/v18/parquet/pqarrow.NewFileWriter(...) github.com/apache/arrow-go/[email protected]/parquet/pqarrow/file_writer.go:166 +0x34d <consumer code calling pqarrow.NewFileWriter> ``` Workarounds exist (`defer recover()` at the call site) but they're brittle: - Tied to an undocumented panic-string format - Forced on every consumer of the public API - Leave any partial sink state (open upload session, buffered bytes) for the caller to clean up ### Expected behavior Per Go's errors-as-values convention and the contract suggested by `pqarrow.NewFileWriter`'s `(*FileWriter, error)` signature, a failure to write the magic header should surface as an error to the caller, not a panic. ### Suggested fixes (smallest change first) 1. **Add a sibling constructor** that returns `(*Writer, error)`, e.g.: ```go func NewParquetWriterWithError(w io.Writer, sc *schema.GroupNode, opts ...WriteOption) (*Writer, error) ``` Non-breaking. Existing callers keep current panic semantics; new callers opt into clean error handling. `pqarrow.NewFileWriter` switches to the new constructor internally so its `error` return becomes meaningful for this failure mode. 2. **Defer `startFile()` until first append.** Move the magic-header write to lazy execution on the first `AppendRowGroup` / `AppendBufferedRowGroup` call — both of which already have error paths. Slightly larger change but doesn't add a new public symbol. 3. **Lazy error pattern.** Have `startFile` store its error in an unexported `Writer.initErr` field, and have every other public method (`AppendRowGroup`, `Close`, `WritePageIndex`, …) short-circuit on it. Non-breaking but touches every public method. ### Reproduction sketch ```go // Synthetic flaky sink that fails the first Write. type flakySink struct{ writes int } func (f *flakySink) Write(p []byte) (int, error) { f.writes++ if f.writes == 1 { return 0, errors.New("synthesized network blip") } return len(p), nil } func main() { // ... build a trivial schema.GroupNode ... defer func() { if r := recover(); r != nil { log.Printf("recovered: %v", r) // prints "failed to write magic number" } }() _ = file.NewParquetWriter(&flakySink{}, schemaNode) // panics here } ``` ### Component(s) Parquet ### Version `github.com/apache/arrow-go/[email protected]` (also reproducible on current `main`; the [code at line 201-203](https://github.com/apache/arrow-go/blob/main/parquet/file/file_writer.go#L201-L204) is unchanged). ### Related [#794](https://github.com/apache/arrow-go/issues/794) — adjacent territory (Close() buffer leak when io.Writer fails mid-flush), different code path. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
