Hi,
I'm reading a Parquet file generated by the Java Parquet library. Our schema
has 400 columns (a mix of non-array elements and 1-dimensional array
elements).
I'm using parquet-cpp release 1.3.1 with gcc 4.8.5 and the Boost 1.53 static
library. I compile parquet-cpp with the following cmake options:
```
cmake3 -DCMAKE_BUILD_TYPE=Debug -DPARQUET_BUILD_EXAMPLES=OFF \
  -DPARQUET_BUILD_TESTS=OFF -DPARQUET_ARROW_LINKAGE="static" \
  -DPARQUET_BUILD_SHARED=OFF -DPARQUET_BOOST_USE_SHARED=OFF .
```
One thing we noticed is that the C++ library performs a lot of small
mallocs while opening the file and reading the metadata. One such call is
shown below:
```
(gdb) where
#0 0x00007fdf40594801 in malloc () from /lib64/libc.so.6
#1 0x00007fdf40e52ecd in operator new(unsigned long) () from
/lib64/libstdc++.so.6
#2 0x0000000000ea16c0 in __gnu_cxx::new_allocator<std::string>::allocate
(this=0x33e6930, __n=3) at /usr/include/c++/4.8.2/ext/new_allocator.h:104
#3 0x0000000000e9eabb in std::_Vector_base<std::string,
std::allocator<std::string> >::_M_allocate (this=0x33e6930, __n=3) at
/usr/include/c++/4.8.2/bits/stl_vector.h:168
#4 0x0000000000ecf512 in std::vector<std::string,
std::allocator<std::string> >::_M_default_append (this=0x33e6930, __n=3) at
/usr/include/c++/4.8.2/bits/vector.tcc:549
#5 0x0000000000eca887 in std::vector<std::string,
std::allocator<std::string> >::resize (this=0x33e6930, __new_size=3) at
/usr/include/c++/4.8.2/bits/stl_vector.h:667
#6 0x0000000000ebd589 in parquet::format::ColumnMetaData::read
(this=0x33e6908, iprot=0x3337300) at
/opt/parquet-cpp/src/parquet/parquet_types.cpp:3845
#7 0x0000000000ebf9ed in parquet::format::ColumnChunk::read
(this=0x33e68f0, iprot=0x3337300) at
/opt/parquet-cpp/src/parquet/parquet_types.cpp:4246
#8 0x0000000000ec0cd2 in parquet::format::RowGroup::read (this=0x33cf7c0,
iprot=0x3337300) at /opt/parquet-cpp/src/parquet/parquet_types.cpp:4451
#9 0x0000000000ec4e22 in parquet::format::FileMetaData::read
(this=0x3337270, iprot=0x3337300) at
/opt/parquet-cpp/src/parquet/parquet_types.cpp:5385
#10 0x0000000000e9364d in
parquet::DeserializeThriftMsg<parquet::format::FileMetaData>
(buf=0x7fdf2cace040 "\025\002\031\374\313\004H\bsessions\025\374\005",
len=0x7ffc8c96ff34, deserialized_msg=0x3337270) at
/opt/parquet-cpp/src/parquet/thrift.h:119
#11 0x0000000000e8fda5 in
parquet::FileMetaData::FileMetaDataImpl::FileMetaDataImpl (this=0x3302fb0,
metadata=0x7fdf2cace040 "\025\002\031\374\313\004H\bsessions\025\374\005",
metadata_len=0x7ffc8c96ff34) at /opt/parquet-cpp/src/parquet/metadata.cc:303
#12 0x0000000000e8bf4f in parquet::FileMetaData::FileMetaData
(this=0x31a4ca0, metadata=0x7fdf2cace040
"\025\002\031\374\313\004H\bsessions\025\374\005",
metadata_len=0x7ffc8c96ff34) at /opt/parquet-cpp/src/parquet/metadata.cc:403
#13 0x0000000000e8bee3 in parquet::FileMetaData::Make
(metadata=0x7fdf2cace040 "\025\002\031\374\313\004H\bsessions\025\374\005",
metadata_len=0x7ffc8c96ff34) at /opt/parquet-cpp/src/parquet/metadata.cc:398
#14 0x0000000000e87572 in parquet::SerializedFile::ParseMetaData
(this=0x3241450) at /opt/parquet-cpp/src/parquet/file_reader.cc:213
#15 0x0000000000e858d4 in parquet::ParquetFileReader::Contents::Open
(source=std::unique_ptr<parquet::RandomAccessSource> containing 0x0,
props=..., metadata=std::shared_ptr (empty) 0x0) at
/opt/parquet-cpp/src/parquet/file_reader.cc:247
#16 0x0000000000e85a6f in parquet::ParquetFileReader::Open
(source=std::unique_ptr<parquet::RandomAccessSource> containing 0x0,
props=..., metadata=std::shared_ptr (empty) 0x0) at
/opt/parquet-cpp/src/parquet/file_reader.cc:265
#17 0x0000000000e859ba in parquet::ParquetFileReader::Open
(source=std::shared_ptr (count 2, weak 0) 0x32e2e80, props=...,
metadata=std::shared_ptr (empty) 0x0) at
/opt/parquet-cpp/src/parquet/file_reader.cc:259
#18 0x0000000000e85df4 in parquet::ParquetFileReader::OpenFile
(path="/data-slow/data0/test_parquet_file/seg0/0_1530129731023-1530136801030",
memory_map=false, props=..., metadata=std::shared_ptr (empty) 0x0) at
/opt/parquet-cpp/src/parquet/file_reader.cc:287
(gdb) info br
Num Type Disp Enb Address What
1 breakpoint keep y <MULTIPLE>
breakpoint already hit 2679 times
ignore next 2321 hits
```
The backtrace above comes from a breakpoint I set on `malloc`; gdb reports
that it had already been hit 2679 times.
This seems to be the case regardless of the mmap option.
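To illustrate the pattern in frames #2 to #6, here is a minimal standalone sketch that does not use parquet-cpp at all. It replaces the global `operator new` to count heap allocations and then mimics a per-column `std::vector<std::string>::resize(3)`. The names `FakeColumnMetaData`, `path_parts`, and `count_allocations` are hypothetical and only stand in for what the Thrift-generated reader appears to do. It assumes each column holds one small string vector.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>
#include <string>
#include <vector>

// Incremented by every call to the replaced global operator new.
static std::size_t g_alloc_count = 0;

// Replacement allocation function: count the call, then defer to malloc.
void* operator new(std::size_t size) {
    ++g_alloc_count;
    void* p = std::malloc(size ? size : 1);
    if (!p) throw std::bad_alloc();
    return p;
}

void operator delete(void* p) noexcept { std::free(p); }
void operator delete(void* p, std::size_t) noexcept { std::free(p); }

// Hypothetical stand-in for a per-column metadata struct that owns a
// small vector of strings (not the real parquet::format::ColumnMetaData).
struct FakeColumnMetaData {
    std::vector<std::string> path_parts;
};

// Returns how many operator new calls happen while filling in
// `num_columns` structs, each resizing its string vector to `path_len`.
std::size_t count_allocations(std::size_t num_columns, std::size_t path_len) {
    std::vector<FakeColumnMetaData> columns;
    columns.reserve(num_columns);  // done up front so it is not counted below

    const std::size_t before = g_alloc_count;
    for (std::size_t i = 0; i < num_columns; ++i) {
        columns.emplace_back();
        // Same call as frame #5: the vector allocates its own small buffer.
        columns.back().path_parts.resize(path_len);
    }
    assert(columns.size() == num_columns);
    return g_alloc_count - before;
}
```

Calling `count_allocations(400, 3)` from a `main` should report at least one allocation per column, so with 400 columns and several row groups the count grows quickly even before any string contents are read.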
I would really appreciate any pointers on how to avoid this.
Thanks,
--
Alex Wang,
Open vSwitch developer