zeroshade commented on code in PR #197:
URL: https://github.com/apache/arrow-go/pull/197#discussion_r1859285011
##########
parquet/pqarrow/properties.go:
##########
@@ -165,6 +165,11 @@ type ArrowReadProperties struct {
 	Parallel bool
 	// BatchSize is the size used for calls to NextBatch when reading whole columns
 	BatchSize int64
+	// Setting ForceLarge to true will force the reader to use LargeString/LargeBinary
+	// for string and binary columns respectively, instead of the default variants. This
+	// can be necessary if you know that there are columns which contain more than 2GB of
+	// data, which would prevent use of int32 offsets.
+	ForceLarge bool
Review Comment:
Well, if you don't use this option, you can still read the parquet file; it
would just require manually shrinking the batch size. I can definitely change
this to make it a per-column option. That's fine, albeit a larger change, since
we don't currently expose which column we're determining the type for to the
function that builds the arrow type.
Alternately, we could utilize the column metadata for the row groups and
decide ahead of time to switch to the Large variant for a column if the
metadata says it is large enough to warrant it, but that would get really
complex with row groups that may or may not be large enough to require it, etc.
The other alternative would be to forcibly reduce the batch size when reading
to accommodate?
Thoughts?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]