marcin-krystianc commented on issue #38149: URL: https://github.com/apache/arrow/issues/38149#issuecomment-1830137167
> Also we don't need to keep the file open. Just caching the `FileMetadata` like what it does in `ParquetFragment` That is a quite good workaround, but unfortunately, it will not work for us, In our scenario, it is not feasible to keep the metadata in memory for all the files we are working with. We would run out of RAM if we did that. > Also you can try to profile the deserialization, lets find should possible optimizations here. I think the current thrift deserialzation might be low performance.. It "feels" to be slow but there are no obvious culprits: - I tried to use the "skip" method to skip reading some row groups and columns. Even if we skip everything it only cuts the decoding time in half. Note that if we skip-reading the metadata data, there are basically no virtual calls, and also a lot of objects to store the column and row metadata don't need to be created. - I tried a profiler but doesn't show anything obvious - My hunch is, (but I don't know how to easily change the implementation to prove it) is that the main inefficiency is caused by memory-copying the data from the input buffer "byte by byte", e.g: ``` cpp libparquet.so.1500!apache::thrift::transport::TBufferBase::readAll(apache::thrift::transport::TBufferBase * const this, uint8_t * buf, uint32_t len) (\usr\include\thrift\transport\TBufferTransports.h:81) libparquet.so.1500!apache::thrift::transport::TMemoryBuffer::readAll(apache::thrift::transport::TMemoryBuffer * const this, uint8_t * buf, uint32_t len) (\usr\include\thrift\transport\TBufferTransports.h:696) libparquet.so.1500!apache::thrift::protocol::TCompactProtocolT<apache::thrift::transport::TMemoryBuffer>::readByte(apache::thrift::protocol::TCompactProtocolT<apache::thrift::transport::TMemoryBuffer> * const this, int8_t & byte) (\usr\include\thrift\protocol\TCompactProtocol.tcc:620) libparquet.so.1500!apache::thrift::protocol::TCompactProtocolT<apache::thrift::transport::TMemoryBuffer>::readFieldBegin(apache::thrift::protocol::TCompactProtocolT<apache::thrift::transport::TMemoryBuffer> * const this, std::string & name, apache::thrift::protocol::TType & fieldType, int16_t & fieldId) (\usr\include\thrift\protocol\TCompactProtocol.tcc:481) libparquet.so.1500!apache::thrift::protocol::TVirtualProtocol<apache::thrift::protocol::TCompactProtocolT<apache::thrift::transport::TMemoryBuffer>, apache::thrift::protocol::TProtocolDefaults>::readFieldBegin_virt(apache::thrift::protocol::TVirtualProtocol<apache::thrift::protocol::TCompactProtocolT<apache::thrift::transport::TMemoryBuffer>, apache::thrift::protocol::TProtocolDefaults> * const this, std::string & name, apache::thrift::protocol::TType & fieldType, int16_t & fieldId) (\usr\include\thrift\protocol\TVirtualProtocol.h:415) libparquet.so.1500!apache::thrift::protocol::TProtocol::readFieldBegin(apache::thrift::protocol::TProtocol * const this, std::string & name, apache::thrift::protocol::TType & fieldType, int16_t & fieldId) (\usr\include\thrift\protocol\TProtocol.h:423) libparquet.so.1500!parquet::format::FileMetaData::read(parquet::format::FileMetaData * const this, apache::thrift::protocol::TProtocol * iprot) (\src\arrow\cpp\src\generated\parquet_types.cpp:8011) libparquet.so.1500!parquet::ThriftDeserializer::DeserializeUnencryptedMessage<parquet::format::FileMetaData>(parquet::ThriftDeserializer * const this, const uint8_t * buf, uint32_t * len, parquet::format::FileMetaData * deserialized_msg) (\src\arrow\cpp\src\parquet\thrift_internal.h:455) libparquet.so.1500!parquet::ThriftDeserializer::DeserializeMessage<parquet::format::FileMetaData>(parquet::ThriftDeserializer * const this, const uint8_t * buf, uint32_t * len, parquet::format::FileMetaData * deserialized_msg, parquet::Decryptor * decryptor) (\src\arrow\cpp\src\parquet\thrift_internal.h:409) libparquet.so.1500!parquet::FileMetaData::FileMetaDataImpl::FileMetaDataImpl(parquet::FileMetaData::FileMetaDataImpl * const this, const void * metadata, uint32_t * metadata_len, parquet::ReaderProperties properties, std::shared_ptr<parquet::InternalFileDecryptor> file_decryptor) (\src\arrow\cpp\src\parquet\metadata.cc:606) libparquet.so.1500!parquet::FileMetaData::FileMetaData(parquet::FileMetaData * const this, const void * metadata, uint32_t * metadata_len, const parquet::ReaderProperties & properties, std::shared_ptr<parquet::InternalFileDecryptor> file_decryptor) (\src\arrow\cpp\src\parquet\metadata.cc:884) libparquet.so.1500!parquet::FileMetaData::Make(const void * metadata, uint32_t * metadata_len, const parquet::ReaderProperties & properties, std::shared_ptr<parquet::InternalFileDecryptor> file_decryptor) (\src\arrow\cpp\src\parquet\metadata.cc:871) libparquet.so.1500!parquet::SerializedFile::ParseUnencryptedFileMetadata(parquet::SerializedFile * const this, const std::shared_ptr<arrow::Buffer> & metadata_buffer, const uint32_t metadata_len) (\src\arrow\cpp\src\parquet\file_reader.cc:626) libparquet.so.1500!parquet::SerializedFile::ParseMetaData(parquet::SerializedFile * const this) (\src\arrow\cpp\src\parquet\file_reader.cc:444) libparquet.so.1500!parquet::ParquetFileReader::Contents::Open(std::shared_ptr<arrow::io::RandomAccessFile> source, const parquet::ReaderProperties & props, std::shared_ptr<parquet::FileMetaData> metadata) (\src\arrow\cpp\src\parquet\file_reader.cc:764) libparquet.so.1500!parquet::ParquetFileReader::Open(std::shared_ptr<arrow::io::RandomAccessFile> source, const parquet::ReaderProperties & props, std::shared_ptr<parquet::FileMetaData> metadata) (\src\arrow\cpp\src\parquet\file_reader.cc:802) ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
