corwinjoy commented on issue #39676: URL: https://github.com/apache/arrow/issues/39676#issuecomment-1901466983
@mapleFU wrote:
> I understand why don't read all row-group metadata, but why a "first RowGroup" is read in this experiment? Since we already has schema here: https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L1116

I like this idea, and I think it has the potential to be even faster than what I have done in the PR. To be specific, in parquet_types.cpp:8435 we have:
```
uint32_t FileMetaData::read(::apache::thrift::protocol::TProtocol* iprot) {

  ::apache::thrift::protocol::TInputRecursionTracker tracker(*iprot);
  uint32_t xfer = 0;
  std::string fname;
  ::apache::thrift::protocol::TType ftype;
  int16_t fid;
  bool read_only_rowgroup_0 = this->read_only_rowgroup_0;

  xfer += iprot->readStructBegin(fname);

  using ::apache::thrift::protocol::TProtocolException;

  bool isset_version = false;
  bool isset_schema = false;
  bool isset_num_rows = false;
  bool isset_row_groups = false;

  while (true)
  {
    xfer += iprot->readFieldBegin(fname, ftype, fid);
    if (ftype == ::apache::thrift::protocol::T_STOP) {
      break;
    }
    switch (fid)
    {
      case 1:
        if (ftype == ::apache::thrift::protocol::T_I32) {
          xfer += iprot->readI32(this->version);
          isset_version = true;
        } else {
          xfer += iprot->skip(ftype);
        }
        break;
      case 2:
        if (ftype == ::apache::thrift::protocol::T_LIST) {
          {
            this->schema.clear();
            uint32_t _size321;
            ::apache::thrift::protocol::TType _etype324;
            xfer += iprot->readListBegin(_etype324, _size321);
            this->schema.resize(_size321);
            uint32_t _i325;
            for (_i325 = 0; _i325 < _size321; ++_i325)
            {
              xfer += this->schema[_i325].read(iprot);
            }
            xfer += iprot->readListEnd();
          }
          isset_schema = true;
        } else {
          xfer += iprot->skip(ftype);
        }
        break;
...
```

So the second field read is the schema, which doesn't even show up in the profile, so I think reading it may be quite fast. But there is a problem: I think we still need to construct a prototype RowGroup from it in order for the readers to work.
@mapleFU, do you know how to create a RowGroup from the schema? Is there such a function?
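To make the "read the schema, skip the row groups" idea concrete, here is a minimal, self-contained sketch. It is only an illustration under loud assumptions: the wire format is a made-up tag/length/value layout, not Thrift's compact protocol, and `ToyFooter`, `EncodeI32`, `AppendField` and `ReadSchemaOnly` are hypothetical names that do not exist in Arrow or Parquet. The field ids (1 = version, 2 = schema, 4 = row_groups) mirror `FileMetaData` in parquet.thrift.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Toy wire format (NOT Thrift): each field is
//   [1-byte field id][4-byte little-endian payload length][payload]
// and a field id of 0 ends the struct, loosely analogous to T_STOP.
struct ToyFooter {
  int32_t version = 0;
  std::string schema;        // stand-in for list<SchemaElement>
  bool has_schema = false;
  size_t bytes_skipped = 0;  // payload bytes that were never decoded
};

// Hypothetical helper: encode a 32-bit integer as 4 little-endian bytes.
inline std::string EncodeI32(int32_t v) {
  std::string out;
  uint32_t u = static_cast<uint32_t>(v);
  for (int s = 0; s < 32; s += 8) out.push_back(static_cast<char>((u >> s) & 0xFF));
  return out;
}

// Hypothetical helper: append one field to the toy footer buffer.
inline void AppendField(std::vector<uint8_t>& buf, uint8_t id,
                        const std::string& payload) {
  buf.push_back(id);
  uint32_t n = static_cast<uint32_t>(payload.size());
  for (int s = 0; s < 32; s += 8) buf.push_back(static_cast<uint8_t>(n >> s));
  buf.insert(buf.end(), payload.begin(), payload.end());
}

// Decode only version (id 1) and schema (id 2); skip everything else,
// including the potentially huge row_groups field (id 4).
inline ToyFooter ReadSchemaOnly(const std::vector<uint8_t>& buf) {
  ToyFooter out;
  size_t pos = 0;
  while (true) {
    if (pos >= buf.size()) throw std::runtime_error("truncated footer");
    uint8_t fid = buf[pos++];
    if (fid == 0) break;
    if (buf.size() - pos < 4) throw std::runtime_error("truncated length");
    uint32_t len = 0;
    for (int s = 0; s < 32; s += 8) len |= static_cast<uint32_t>(buf[pos++]) << s;
    if (len > buf.size() - pos) throw std::runtime_error("truncated payload");
    switch (fid) {
      case 1: {
        if (len != 4) throw std::runtime_error("bad version length");
        uint32_t v = 0;
        for (int s = 0; s < 4; ++s) v |= static_cast<uint32_t>(buf[pos + s]) << (8 * s);
        out.version = static_cast<int32_t>(v);
        break;
      }
      case 2:
        out.schema.assign(buf.begin() + pos, buf.begin() + pos + len);
        out.has_schema = true;
        break;
      default:
        out.bytes_skipped += len;  // skipped without decoding
        break;
    }
    pos += len;
  }
  assert(pos <= buf.size());
  return out;
}
```

One caveat on the analogy: the toy format can jump over a field because every payload is length-prefixed. As far as I know, Thrift structs carry no such length prefix, so `iprot->skip(ftype)` still has to walk the nested row-group data. Skipping would avoid building the `RowGroup` objects, but not the parse itself.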
As to why I am using RowGroups: the data readers seem to be tightly coupled to the RowGroup metadata. For example, reader.cc:262:
```
Status ReadColumn(int i, const std::vector<int>& row_groups, ColumnReader* reader,
                  std::shared_ptr<ChunkedArray>* out) {
  BEGIN_PARQUET_CATCH_EXCEPTIONS
  // TODO(wesm): This calculation doesn't make much sense when we have repeated
  // schema nodes
  int64_t records_to_read = 0;
  for (auto row_group : row_groups) {
    // Can throw exception
    records_to_read +=
        reader_->metadata()->RowGroup(row_group)->ColumnChunk(i)->num_values();
  }
  ...
  // and this uses the data page records from the rowgroup
  return reader->NextBatch(records_to_read, out);
}
```

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
