[GitHub] [beam] nguyennk92 commented on pull request #17347: implement parquetio to read/write parquet files

GitBox Wed, 13 Apr 2022 00:36:41 -0700


nguyennk92 commented on PR #17347:
URL: https://github.com/apache/beam/pull/17347#issuecomment-1097659294


   > > A few changes (mostly the Apache licensing) before this looks good. I'd also suggest looking at what it would take to write a Splittable DoFn (https://beam.apache.org/documentation/programming-guide/#splittable-dofns) version of this so the reads could scale.
   > 
   > It looks like that parquet package could easily support a per-record level read for subfile splitting too. The metadata includes the [number of rows](https://pkg.go.dev/github.com/xitongsys/parquet-go/reader#ParquetReader.GetNumRows), and you can also [skip them](https://pkg.go.dev/github.com/xitongsys/parquet-go/reader#ParquetReader.SkipRows).
   
   Yes. For now I am using a fairly naive approach that reads the whole Parquet 
file into memory. Unlike Java, the reader returned by `filesystem.OpenRead()` 
doesn't support `io.Seeker`, which is required for processing Parquet files. I 
am trying to change `OpenRead()` so that it returns an `io.ReadSeekCloser`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
