zeroshade commented on code in PR #197:
URL: https://github.com/apache/arrow-go/pull/197#discussion_r1859285011
##########
parquet/pqarrow/properties.go:
##########
@@ -165,6 +165,11 @@ type ArrowReadProperties struct {
 	Parallel bool
 	// BatchSize is the size used for calls to NextBatch when reading whole columns
 	BatchSize int64
+	// Setting ForceLarge to true will force the reader to use LargeString/LargeBinary
+	// for string and binary columns respectively, instead of the default variants. This
+	// can be necessary if you know that there are columns which contain more than 2GB of
+	// data, which would prevent use of int32 offsets.
+	ForceLarge bool
Review Comment:
Well, if you don't use this option, you can still read the parquet file; it
would just require manually shrinking the batch size. I can definitely change
this to make it a per-column option. That's fine, albeit a larger change, since
we don't currently expose which column we're determining the type for to the
function that builds the arrow type.
Alternately, we could utilize the column metadata for the row groups and
decide ahead of time to switch to the Large variant for a column if the
metadata says it is large enough to warrant it, but that would get really
complex with row groups that may or may not be large enough to require it, etc.
The other alternative would be to forcibly reduce the batch size when reading
to accommodate?
Thoughts?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]