[GitHub] [arrow] emkornfield commented on a change in pull request #11146: ARROW-13984: [Go][Parquet] file handling for go parquet, just the readers

GitBox Sat, 23 Oct 2021 10:17:52 -0700


emkornfield commented on a change in pull request #11146:
URL: https://github.com/apache/arrow/pull/11146#discussion_r734992683




##########
File path: go/parquet/file/row_group_reader.go
##########
@@ -0,0 +1,130 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+package file
+
+import (
+       "github.com/apache/arrow/go/arrow/ipc"
+       "github.com/apache/arrow/go/parquet"
+       "github.com/apache/arrow/go/parquet/internal/encryption"
+       "github.com/apache/arrow/go/parquet/internal/utils"
+       "github.com/apache/arrow/go/parquet/metadata"
+       "golang.org/x/xerrors"
+)
+
+const (
+       maxDictHeaderSize int64 = 100
+)
+
+// RowGroupReader is the primary interface for reading a single row group
+type RowGroupReader struct {
+       r             ipc.ReadAtSeeker
+       sourceSz      int64
+       fileMetadata  *metadata.FileMetaData
+       rgMetadata    *metadata.RowGroupMetaData
+       props         *parquet.ReaderProperties
+       fileDecryptor encryption.FileDecryptor
+}
+
+// MetaData returns the metadata of the current Row Group
+func (r *RowGroupReader) MetaData() *metadata.RowGroupMetaData { return r.rgMetadata }
+
+// NumColumns returns the number of columns of data as defined in the metadata of this row group
+func (r *RowGroupReader) NumColumns() int { return r.rgMetadata.NumColumns() }
+
+// NumRows returns the number of rows in just this row group
+func (r *RowGroupReader) NumRows() int64 { return r.rgMetadata.NumRows() }
+
+// ByteSize returns the full byte size of this row group as defined in its metadata
+func (r *RowGroupReader) ByteSize() int64 { return r.rgMetadata.TotalByteSize() }
+
+// Column returns a column reader for the requested (0-indexed) column
+//
+// panics if passed a column not in the range [0, NumColumns)
+func (r *RowGroupReader) Column(i int) ColumnChunkReader {
+       if i >= r.NumColumns() || i < 0 {
+               panic(xerrors.Errorf("parquet: trying to read column index %d but row group metadata only has %d columns", i, r.rgMetadata.NumColumns()))
+       }
+
+       descr := r.fileMetadata.Schema.Column(i)
+       pageRdr, err := r.GetColumnPageReader(i)
+       if err != nil {
+               panic(xerrors.Errorf("parquet: unable to initialize page reader: %w", err))
+       }
+       return NewColumnReader(descr, pageRdr, r.props.Allocator())
+}
+
+func (r *RowGroupReader) GetColumnPageReader(i int) (PageReader, error) {
+       col, err := r.rgMetadata.ColumnChunk(i)
+       if err != nil {
+               return nil, err
+       }
+
+       colStart := col.DataPageOffset()
+       if col.HasDictionaryPage() && col.DictionaryPageOffset() > 0 && colStart > col.DictionaryPageOffset() {
+               colStart = col.DictionaryPageOffset()
+       }
+
+       colLen := col.TotalCompressedSize()
+       if r.fileMetadata.WriterVersion().LessThan(metadata.Parquet816FixedVersion) {
+               bytesRemain := r.sourceSz - (colStart + colLen)

Review comment:
       Could you replicate the comment from C++ on why this is being done?
   
   Also, in C++ we ran into some UBSAN issues due to overflow, so it would be
good to validate somewhere that colStart and colLen are both positive and
individually less than r.sourceSz.
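
A minimal sketch of the kind of check meant here. It is not taken from the PR or from the C++ code: `validateColumnRange` and `paddedColumnLen` are hypothetical helper names, the error messages are made up, and the final "range must fit inside the source" check goes slightly beyond the two conditions named above. The rationale in the comment on `paddedColumnLen` is my recollection of the C++ reader's explanation and should be checked against that source before being copied.

```go
package main

import "fmt"

// maxDictHeaderSize mirrors the constant declared in the diff above.
const maxDictHeaderSize int64 = 100

// validateColumnRange is a hypothetical helper. It rejects offsets and
// lengths that are negative or larger than the source, so that later
// arithmetic on them cannot overflow int64.
func validateColumnRange(colStart, colLen, sourceSz int64) error {
	if colStart < 0 || colLen < 0 {
		return fmt.Errorf("parquet: invalid column metadata (offset %d, size %d)", colStart, colLen)
	}
	if colStart > sourceSz || colLen > sourceSz {
		return fmt.Errorf("parquet: column chunk (offset %d, size %d) exceeds source size %d", colStart, colLen, sourceSz)
	}
	// Both values are now in [0, sourceSz], so this subtraction cannot overflow.
	if colLen > sourceSz-colStart {
		return fmt.Errorf("parquet: column chunk ends past source size %d", sourceSz)
	}
	return nil
}

// paddedColumnLen is a hypothetical helper for the PARQUET-816 workaround.
// Assumed rationale (to be confirmed against the C++ reader): old parquet-mr
// writers did not count the dictionary page header in the total compressed
// size, so the reader pads the length by up to maxDictHeaderSize bytes,
// capped at the bytes remaining in the source.
// It expects inputs that already passed validateColumnRange.
func paddedColumnLen(colStart, colLen, sourceSz int64) int64 {
	bytesRemain := sourceSz - (colStart + colLen)
	padding := maxDictHeaderSize
	if bytesRemain < padding {
		padding = bytesRemain
	}
	return colLen + padding
}
```

In GetColumnPageReader the validation would run right after colStart and colLen are computed and before the writer-version branch, returning the error instead of continuing.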




