marcin-krystianc commented on issue #38149:
URL: https://github.com/apache/arrow/issues/38149#issuecomment-1830137167

   
   > Also we don't need to keep the file open. Just caching the `FileMetadata` 
like what it does in `ParquetFragment`
   
   That is a quite good workaround, but unfortunately, it will not work for us, 
   In our scenario, it is not feasible to keep the metadata in memory for all 
the files we are working with. We would run out of RAM if we did that.
   
   > Also you can try to profile the deserialization, lets find should possible 
optimizations here. I think the current thrift deserialzation might be low 
performance..
   
   It "feels" to be slow but there are no obvious culprits: 
   
   - I tried to use the "skip" method to skip reading some row groups and 
columns. Even if we skip everything it only cuts the decoding time in half. 
Note that if we skip-reading the metadata data, there are basically no virtual 
calls, and also a lot of objects to store the column and row metadata don't 
need to be created.
   - I tried a profiler but doesn't show anything obvious
   - My hunch is, (but I don't know how to easily change the implementation to 
prove it) is that the main inefficiency is caused by memory-copying the data 
from the input buffer "byte by byte", e.g:
   ``` cpp
   
libparquet.so.1500!apache::thrift::transport::TBufferBase::readAll(apache::thrift::transport::TBufferBase
 * const this, uint8_t * buf, uint32_t len) 
(\usr\include\thrift\transport\TBufferTransports.h:81)
   
libparquet.so.1500!apache::thrift::transport::TMemoryBuffer::readAll(apache::thrift::transport::TMemoryBuffer
 * const this, uint8_t * buf, uint32_t len) 
(\usr\include\thrift\transport\TBufferTransports.h:696)
   
libparquet.so.1500!apache::thrift::protocol::TCompactProtocolT<apache::thrift::transport::TMemoryBuffer>::readByte(apache::thrift::protocol::TCompactProtocolT<apache::thrift::transport::TMemoryBuffer>
 * const this, int8_t & byte) 
(\usr\include\thrift\protocol\TCompactProtocol.tcc:620)
   
libparquet.so.1500!apache::thrift::protocol::TCompactProtocolT<apache::thrift::transport::TMemoryBuffer>::readFieldBegin(apache::thrift::protocol::TCompactProtocolT<apache::thrift::transport::TMemoryBuffer>
 * const this, std::string & name, apache::thrift::protocol::TType & fieldType, 
int16_t & fieldId) (\usr\include\thrift\protocol\TCompactProtocol.tcc:481)
   
libparquet.so.1500!apache::thrift::protocol::TVirtualProtocol<apache::thrift::protocol::TCompactProtocolT<apache::thrift::transport::TMemoryBuffer>,
 
apache::thrift::protocol::TProtocolDefaults>::readFieldBegin_virt(apache::thrift::protocol::TVirtualProtocol<apache::thrift::protocol::TCompactProtocolT<apache::thrift::transport::TMemoryBuffer>,
 apache::thrift::protocol::TProtocolDefaults> * const this, std::string & name, 
apache::thrift::protocol::TType & fieldType, int16_t & fieldId) 
(\usr\include\thrift\protocol\TVirtualProtocol.h:415)
   
libparquet.so.1500!apache::thrift::protocol::TProtocol::readFieldBegin(apache::thrift::protocol::TProtocol
 * const this, std::string & name, apache::thrift::protocol::TType & fieldType, 
int16_t & fieldId) (\usr\include\thrift\protocol\TProtocol.h:423)
   
libparquet.so.1500!parquet::format::FileMetaData::read(parquet::format::FileMetaData
 * const this, apache::thrift::protocol::TProtocol * iprot) 
(\src\arrow\cpp\src\generated\parquet_types.cpp:8011)
   
libparquet.so.1500!parquet::ThriftDeserializer::DeserializeUnencryptedMessage<parquet::format::FileMetaData>(parquet::ThriftDeserializer
 * const this, const uint8_t * buf, uint32_t * len, 
parquet::format::FileMetaData * deserialized_msg) 
(\src\arrow\cpp\src\parquet\thrift_internal.h:455)
   
libparquet.so.1500!parquet::ThriftDeserializer::DeserializeMessage<parquet::format::FileMetaData>(parquet::ThriftDeserializer
 * const this, const uint8_t * buf, uint32_t * len, 
parquet::format::FileMetaData * deserialized_msg, parquet::Decryptor * 
decryptor) (\src\arrow\cpp\src\parquet\thrift_internal.h:409)
   
libparquet.so.1500!parquet::FileMetaData::FileMetaDataImpl::FileMetaDataImpl(parquet::FileMetaData::FileMetaDataImpl
 * const this, const void * metadata, uint32_t * metadata_len, 
parquet::ReaderProperties properties, 
std::shared_ptr<parquet::InternalFileDecryptor> file_decryptor) 
(\src\arrow\cpp\src\parquet\metadata.cc:606)
   libparquet.so.1500!parquet::FileMetaData::FileMetaData(parquet::FileMetaData 
* const this, const void * metadata, uint32_t * metadata_len, const 
parquet::ReaderProperties & properties, 
std::shared_ptr<parquet::InternalFileDecryptor> file_decryptor) 
(\src\arrow\cpp\src\parquet\metadata.cc:884)
   libparquet.so.1500!parquet::FileMetaData::Make(const void * metadata, 
uint32_t * metadata_len, const parquet::ReaderProperties & properties, 
std::shared_ptr<parquet::InternalFileDecryptor> file_decryptor) 
(\src\arrow\cpp\src\parquet\metadata.cc:871)
   
libparquet.so.1500!parquet::SerializedFile::ParseUnencryptedFileMetadata(parquet::SerializedFile
 * const this, const std::shared_ptr<arrow::Buffer> & metadata_buffer, const 
uint32_t metadata_len) (\src\arrow\cpp\src\parquet\file_reader.cc:626)
   
libparquet.so.1500!parquet::SerializedFile::ParseMetaData(parquet::SerializedFile
 * const this) (\src\arrow\cpp\src\parquet\file_reader.cc:444)
   
libparquet.so.1500!parquet::ParquetFileReader::Contents::Open(std::shared_ptr<arrow::io::RandomAccessFile>
 source, const parquet::ReaderProperties & props, 
std::shared_ptr<parquet::FileMetaData> metadata) 
(\src\arrow\cpp\src\parquet\file_reader.cc:764)
   
libparquet.so.1500!parquet::ParquetFileReader::Open(std::shared_ptr<arrow::io::RandomAccessFile>
 source, const parquet::ReaderProperties & props, 
std::shared_ptr<parquet::FileMetaData> metadata) 
(\src\arrow\cpp\src\parquet\file_reader.cc:802)
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to