corwinjoy commented on code in PR #39677:
URL: https://github.com/apache/arrow/pull/39677#discussion_r1456613598
##########
cpp/src/generated/parquet_types.cpp:
##########
@@ -8412,6 +8413,24 @@ std::ostream& operator<<(std::ostream& out, const FileMetaData& obj)
return out;
}
+// As far as I can tell, the thrift protocol interface does not have a way to directly skip
+// bytes without parsing them.
+// For now, add a hack to access the transport protocol directly to add a
+// skip_bytes method.
+// If we decide to go this way, we can upgrade the thrift library to add this method.
+class CastCompactProtocol {
+public:
+ void *vtbl;
+ ::apache::thrift::transport::TBufferBase *trans_;
+};
Review Comment:
I realize the above is a hack; it is only meant as a proof of concept. See
the discussion below.
##########
cpp/src/generated/parquet_types.cpp:
##########
@@ -8481,11 +8501,27 @@ uint32_t FileMetaData::read(::apache::thrift::protocol::TProtocol* iprot) {
uint32_t _size326;
::apache::thrift::protocol::TType _etype329;
xfer += iprot->readListBegin(_etype329, _size326);
+ if(read_only_rowgroup_0) {
+ this->row_groups.resize(1);
+ } else {
this->row_groups.resize(_size326);
+ }
+
uint32_t _i330;
+ uint32_t rowgroup_size;
for (_i330 = 0; _i330 < _size326; ++_i330)
{
- xfer += this->row_groups[_i330].read(iprot);
+ rowgroup_size = this->row_groups[_i330].read(iprot);
+ xfer += rowgroup_size;
+ if(read_only_rowgroup_0) {
+ break;
+ }
+ }
+ if(read_only_rowgroup_0) {
+ // skip the remaining rowgroups
+ uint32_t skip_len = (_size326 -1) * rowgroup_size;
+ skip_bytes(iprot, skip_len);
+ xfer += skip_len;
Review Comment:
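This is probably unsafe: the Thrift compact protocol encodes integers as
variable-length varints, so each rowgroup may serialize to a different number
of bytes. `(_size326 - 1) * rowgroup_size` is only correct when every remaining
rowgroup has the same encoded size as the first one. (It does work in the
simple test cases.)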
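To illustrate the concern, here is a minimal standalone sketch (not Thrift
code) of the zigzag + varint sizing that the compact protocol applies to
integers. The helper names `ZigZag64`, `VarintSize`, `CompactI64Size`,
`ActualSize` and `FixedSizeEstimate` are made up for this example, and a
"rowgroup" is reduced to a single i64 value, which is a big simplification.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Zigzag-map a signed value so that small magnitudes become small unsigned
// values (0 -> 0, -1 -> 1, 1 -> 2, ...).
std::uint64_t ZigZag64(std::int64_t v) {
  return (static_cast<std::uint64_t>(v) << 1) ^
         static_cast<std::uint64_t>(v >> 63);
}

// Number of bytes a varint takes: 7 payload bits per byte.
std::size_t VarintSize(std::uint64_t v) {
  std::size_t n = 1;
  while (v >= 0x80) {
    v >>= 7;
    ++n;
  }
  return n;
}

// Encoded size of one i64 value under zigzag + varint.
std::size_t CompactI64Size(std::int64_t v) { return VarintSize(ZigZag64(v)); }

// True encoded size of all values (stand-ins for per-rowgroup fields).
std::size_t ActualSize(const std::vector<std::int64_t>& values) {
  std::size_t total = 0;
  for (std::int64_t v : values) total += CompactI64Size(v);
  return total;
}

// The estimate used in the diff: element count times the size of the first.
std::size_t FixedSizeEstimate(const std::vector<std::int64_t>& values) {
  if (values.empty()) return 0;
  return values.size() * CompactI64Size(values.front());
}
```

With values such as `{10, 1000000, 5}` the two functions disagree, so a skip
length derived from the first element would land in the middle of a later
one. Real rowgroups carry fields like row counts and file offsets whose
magnitudes differ from one rowgroup to the next, so the same effect applies.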
I can think of a few other options and would appreciate suggestions:
1. Just exit after reading the first rowgroup and fill out the remaining
fields with defaults. (This would be a bummer, because it could then not
support encryption.)
2. Reorder the items in the thrift file so that the (long) rowgroup section
comes last.
3. Change the parquet format so that the metadata is encoded in a binary
format rather than a compact one, to allow random access / skipping sections.