Hi Alex,

It looks like the mallocs are coming from Thrift (parquet/parquet_types.cpp is generated by Thrift). I'm not sure if we can do much about this. I'm curious whether it's possible to pass a custom STL allocator to Thrift so that we could use a different allocation strategy than the default STL allocator.
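To make the allocator idea concrete, here is a rough sketch of what a custom STL allocator looks like in general. It is not Thrift or parquet-cpp code: `Arena`, `ArenaAllocator`, and `ResizeNamesInArena` are hypothetical names invented for illustration, and I don't know whether the Thrift code generator can be told to emit containers with a non-default allocator. It assumes C++11.

```cpp
#include <cstddef>
#include <new>
#include <string>
#include <vector>

// Hypothetical bump-pointer arena: one upfront allocation, then each
// request just advances an offset. Memory is released all at once when
// the arena is destroyed.
class Arena {
 public:
  explicit Arena(std::size_t capacity)
      : buf_(static_cast<char*>(::operator new(capacity))),
        capacity_(capacity), used_(0), allocations_(0) {}
  ~Arena() { ::operator delete(buf_); }
  Arena(const Arena&) = delete;
  Arena& operator=(const Arena&) = delete;

  void* Allocate(std::size_t bytes, std::size_t align) {
    // Round the current offset up to the requested alignment.
    std::size_t start = (used_ + align - 1) / align * align;
    if (start + bytes > capacity_) throw std::bad_alloc();
    used_ = start + bytes;
    ++allocations_;
    return buf_ + start;
  }

  std::size_t used() const { return used_; }
  std::size_t allocations() const { return allocations_; }

 private:
  char* buf_;
  std::size_t capacity_;
  std::size_t used_;
  std::size_t allocations_;
};

// Minimal C++11 allocator that forwards to an Arena. The arena must
// outlive every container that uses it.
template <typename T>
class ArenaAllocator {
 public:
  typedef T value_type;

  explicit ArenaAllocator(Arena* arena) : arena_(arena) {}
  template <typename U>
  ArenaAllocator(const ArenaAllocator<U>& other) : arena_(other.arena()) {}

  T* allocate(std::size_t n) {
    return static_cast<T*>(arena_->Allocate(n * sizeof(T), alignof(T)));
  }
  void deallocate(T*, std::size_t) {}  // no-op: freed with the arena

  Arena* arena() const { return arena_; }

 private:
  Arena* arena_;
};

template <typename T, typename U>
bool operator==(const ArenaAllocator<T>& a, const ArenaAllocator<U>& b) {
  return a.arena() == b.arena();
}
template <typename T, typename U>
bool operator!=(const ArenaAllocator<T>& a, const ArenaAllocator<U>& b) {
  return !(a == b);
}

// Mirrors the vector<std::string>::resize call seen in the backtrace,
// but draws the vector's storage from the arena instead of malloc.
// Returns how many requests the arena has served so far.
std::size_t ResizeNamesInArena(Arena* arena, std::size_t n) {
  ArenaAllocator<std::string> alloc(arena);
  std::vector<std::string, ArenaAllocator<std::string>> names(alloc);
  names.resize(n);
  return arena->allocations();
}
```

Note that only the vector's element storage comes from the arena here; each `std::string` still uses the default allocator for its own character data. The generated structs declare plain `std::vector<std::string>`, so using something like this would require changes in the generated code.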
- Wes

On Mon, Jul 30, 2018 at 1:54 PM, ALeX Wang <[email protected]> wrote:
> Hi,
>
> I'm reading parquet file (generated by Java parquet library). Our schema
> has 400 columns (including non-array elements, 1-dimensional array
> elements).
>
> I'm using release 1.3.1, gcc 4.8.5, boost static library 1.53,
>
> I compile parquet-cpp with following cmake options,
> ```
> cmake3 -DCMAKE_BUILD_TYPE=Debug -DPARQUET_BUILD_EXAMPLES=OFF
> -DPARQUET_BUILD_TESTS=OFF -DPARQUET_ARROW_LINKAGE="static"
> -DPARQUET_BUILD_SHARED=OFF -DPARQUET_BOOST_USE_SHARED=OFF .
> ```
>
> One thing we noticed is that the cpp library conducts a lot of small
> mallocs during the open file and the reading metadata phases... shown
> below:
>
> ```
> (gdb) where
> #0 0x00007fdf40594801 in malloc () from /lib64/libc.so.6
> #1 0x00007fdf40e52ecd in operator new(unsigned long) () from
> /lib64/libstdc++.so.6
> #2 0x0000000000ea16c0 in __gnu_cxx::new_allocator<std::string>::allocate
> (this=0x33e6930, __n=3) at /usr/include/c++/4.8.2/ext/new_allocator.h:104
> #3 0x0000000000e9eabb in std::_Vector_base<std::string,
> std::allocator<std::string> >::_M_allocate (this=0x33e6930, __n=3) at
> /usr/include/c++/4.8.2/bits/stl_vector.h:168
> #4 0x0000000000ecf512 in std::vector<std::string,
> std::allocator<std::string> >::_M_default_append (this=0x33e6930, __n=3) at
> /usr/include/c++/4.8.2/bits/vector.tcc:549
> #5 0x0000000000eca887 in std::vector<std::string,
> std::allocator<std::string> >::resize (this=0x33e6930, __new_size=3) at
> /usr/include/c++/4.8.2/bits/stl_vector.h:667
> #6 0x0000000000ebd589 in parquet::format::ColumnMetaData::read
> (this=0x33e6908, iprot=0x3337300) at
> /opt/parquet-cpp/src/parquet/parquet_types.cpp:3845
> #7 0x0000000000ebf9ed in parquet::format::ColumnChunk::read
> (this=0x33e68f0, iprot=0x3337300) at
> /opt/parquet-cpp/src/parquet/parquet_types.cpp:4246
> #8 0x0000000000ec0cd2 in parquet::format::RowGroup::read (this=0x33cf7c0,
> iprot=0x3337300) at /opt/parquet-cpp/src/parquet/parquet_types.cpp:4451
> #9 0x0000000000ec4e22 in parquet::format::FileMetaData::read
> (this=0x3337270, iprot=0x3337300) at
> /opt/parquet-cpp/src/parquet/parquet_types.cpp:5385
> #10 0x0000000000e9364d in
> parquet::DeserializeThriftMsg<parquet::format::FileMetaData>
> (buf=0x7fdf2cace040 "\025\002\031\374\313\004H\bsessions\025\374\005",
> len=0x7ffc8c96ff34, deserialized_msg=0x3337270) at
> /opt/parquet-cpp/src/parquet/thrift.h:119
> #11 0x0000000000e8fda5 in
> parquet::FileMetaData::FileMetaDataImpl::FileMetaDataImpl (this=0x3302fb0,
> metadata=0x7fdf2cace040 "\025\002\031\374\313\004H\bsessions\025\374\005",
> metadata_len=0x7ffc8c96ff34) at /opt/parquet-cpp/src/parquet/metadata.cc:303
> #12 0x0000000000e8bf4f in parquet::FileMetaData::FileMetaData
> (this=0x31a4ca0, metadata=0x7fdf2cace040
> "\025\002\031\374\313\004H\bsessions\025\374\005",
> metadata_len=0x7ffc8c96ff34) at /opt/parquet-cpp/src/parquet/metadata.cc:403
> #13 0x0000000000e8bee3 in parquet::FileMetaData::Make
> (metadata=0x7fdf2cace040 "\025\002\031\374\313\004H\bsessions\025\374\005",
> metadata_len=0x7ffc8c96ff34) at /opt/parquet-cpp/src/parquet/metadata.cc:398
> #14 0x0000000000e87572 in parquet::SerializedFile::ParseMetaData
> (this=0x3241450) at /opt/parquet-cpp/src/parquet/file_reader.cc:213
> #15 0x0000000000e858d4 in parquet::ParquetFileReader::Contents::Open
> (source=std::unique_ptr<parquet::RandomAccessSource> containing 0x0,
> props=..., metadata=std::shared_ptr (empty) 0x0) at
> /opt/parquet-cpp/src/parquet/file_reader.cc:247
> #16 0x0000000000e85a6f in parquet::ParquetFileReader::Open
> (source=std::unique_ptr<parquet::RandomAccessSource> containing 0x0,
> props=..., metadata=std::shared_ptr (empty) 0x0) at
> /opt/parquet-cpp/src/parquet/file_reader.cc:265
> #17 0x0000000000e859ba in parquet::ParquetFileReader::Open
> (source=std::shared_ptr (count 2, weak 0) 0x32e2e80, props=...,
> metadata=std::shared_ptr (empty) 0x0) at
> /opt/parquet-cpp/src/parquet/file_reader.cc:259
> #18 0x0000000000e85df4 in parquet::ParquetFileReader::OpenFile
> (path="/data-slow/data0/test_parquet_file/seg0/0_1530129731023-1530136801030",
> memory_map=false, props=..., metadata=std::shared_ptr (empty) 0x0) at
> /opt/parquet-cpp/src/parquet/file_reader.cc:287
>
> (gdb) info br
> Num Type Disp Enb Address What
> 1 breakpoint keep y <MULTIPLE>
> breakpoint already hit 2679 times
> ignore next 2321 hits
> ```
>
> I set the breakpoint to `malloc`, above ^
>
> This seems to be the case regardless of mmap option.
>
> Would really appreciate some pointer on how to avoid this.
>
> Thanks,
> Alex Wang,
>
> --
> Alex Wang,
> Open vSwitch developer
