nguyennk92 commented on PR #17347: URL: https://github.com/apache/beam/pull/17347#issuecomment-1097659294
> > A few changes (mostly the Apache licensing) before this looks good. I'd also suggest looking at what it would take to write a Splittable DoFn (https://beam.apache.org/documentation/programming-guide/#splittable-dofns) version of this so the reads could scale.
> >
> > It looks like that parquet package could easily support a per-record level read for subfile splitting too. The metadata includes the [number of rows](https://pkg.go.dev/github.com/xitongsys/parquet-go/reader#ParquetReader.GetNumRows), and you can also [skip them](https://pkg.go.dev/github.com/xitongsys/parquet-go/reader#ParquetReader.SkipRows).

Yes. My current approach is quite naive: it reads the whole Parquet file into memory. Unlike in Java, `filesystem.OpenRead()` doesn't return a reader that supports `io.Seeker`, which is required for processing Parquet files. I am trying to change `OpenRead()` so that it returns an `io.ReadSeekCloser`.
