Adding the mailing list back and adding the benchmark script

I notice one likely-serious problem: you are spawning num_columns *
num_row_groups threads all at once. Based on what you've described
about your data, that's ~300 threads simultaneously. I would recommend
setting the number of threads equal to the number of CPU cores and
reading the row groups serially to obtain best performance.

On Wed, Mar 14, 2018 at 4:22 PM, Wes McKinney <[email protected]> wrote:
> hi Jin Hai,
>
> The test data file is not provided -- do you have a benchmark case
> that also generates a file on disk?  Otherwise, I can generate a
> synthetic test dataset.
>
> By the way, I'm just planning to generate a FlameGraph from recorded
> perf data to see where time is being spent. You can do this too:
>
> https://github.com/brendangregg/FlameGraph
>
> - Wes
>
> On Wed, Mar 14, 2018 at 2:47 AM,  <[email protected]> wrote:
>> Hi Wes,
>>
>> The attached program is the benchmark; could you please help check why
>> the performance is not good?
>>
>> Ubuntu-16.04.2
>> Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
>> processor number: 8
>> mem: 16G
>> disk: nvme (about 2GB/s)
>>
>> Performance(release build):
>> Multi thread:  about 10s, I/O=250M/s
>>
>> Best Regards
>> Jin Hai
>> ----- Original Message -----
>> From: <[email protected]>
>> To: "mildwolf_jh" <[email protected]>, "Wes McKinney"
>> <[email protected]>
>> Subject: Re: Re: Re: Re: parquet performance
>> Date: 2018-03-11 16:36
>>
>> We rewrote the test case and the performance is better now, but it still
>> can't reach the I/O limit. Here is the information and the benchmark program:
>>
>> File information:
>> file size: 2.5G
>> group count = 6; (each group has 10,000,000 rows)
>> row count = 50,000,999;
>> column count = 51;  (types: int16, int32, int64, float, double, byte_array,
>> fixed_len_byte_array)
>>
>> System information:
>> Ubuntu-16.04.2
>> processor: Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz
>> processor number: 8
>> mem: 32G
>> disk: SSD
>>
>> Performance(release build):
>> Single thread:  about 17s, I/O = 147M/s
>> Multi thread:  about 4.5s, I/O=555M/s
>> =========================
>> Another hardware platform:
>> Ubuntu-16.04.2
>> Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
>> processor number: 8
>> mem: 16G
>> disk: nvme (about 2GB/s)
>>
>> Performance(release build):
>> Multi thread:  about 10s, I/O=250M/s
>>
>> benchmark program is attached.
>>
>> ----- Original Message -----
>> From: <[email protected]>
>> To: "Wes McKinney" <[email protected]>
>> Subject: Re: Re: Re: parquet performance
>> Date: 2018-03-10 00:31
>>
>> OK, I will try to put together the benchmark program and send it to you later.
>>
>> ----- Original Message -----
>> From: Wes McKinney <[email protected]>
>> To: [email protected]
>> Subject: Re: Re: parquet performance
>> Date: 2018-03-10 00:12
>>
>>
>>> It's about 50MB/s reading with API readRowGroup, with 8 cores 3.3GHz CPU.
>> This seems slow. I'd like to help diagnose what is going wrong; is
>> there any chance you could produce a benchmarking program? We can help
>> investigate. Are you building with -DCMAKE_BUILD_TYPE=release ?
>> On Fri, Mar 9, 2018 at 11:09 AM, <[email protected]> wrote:
>>> In fact, I already tried several numeric types: int16/int32/int64,
>>> float32/float64. It's about 50MB/s reading with the readRowGroup API, on
>>> an 8-core 3.3GHz CPU. And yes, I just use one thread to do it. It seems
>>> the CPU spends too much time on some kind of synchronization checks.
>>> You are right; checking the code, it seems the TransferFunctor template
>>> reads the data. Am I right?
>>> Back to my issue: is there a best practice for reading parquet into an
>>> arrow table, so the I/O speed can get close to its limit when we read
>>> one column? Or is it some kind of parquet limitation?
>>> Thank you for your prompt response.
>>> ----- Original Message -----
>>> From: Wes McKinney <[email protected]>
>>> To: Parquet Dev <[email protected]>, [email protected]
>>> Subject: Re: parquet performance
>>> Date: 2018-03-09 23:14
>>>
>>>
>>> hello,
>>> It sounds like you are talking about the C++ implementation in
>>>
>>> https://github.com/apache/parquet-cpp/blob/master/src/parquet/arrow/reader.cc,
>>> is that right?
>>> Which data types are you benchmarking? My understanding is that we are
>>> not appending 1 cell at a time. Let us know.
>>> Thanks
>>> Wes
>>> On Fri, Mar 9, 2018 at 9:55 AM, <[email protected]> wrote:
>>>> Hi, I am testing parquet->arrow performance and find it's really slow to
>>>> read a parquet file into an arrow table. When I check the parquet source
>>>> code, it seems parquet needs to check for null values and use the arrow
>>>> Append method to insert cells one by one. We can use multithreading to
>>>> speed up reading when we have several columns in a fragment, but the I/O
>>>> performance is still far from its limit. I want to know whether there is
>>>> any way parquet can reach better reading performance?
#include <fstream>
#include <iostream>
#include <vector>
#include <memory>
#include <chrono>
#include <thread>
#include <functional>

#include <arrow/io/file.h>
#include <arrow/util/logging.h>
#include <parquet/arrow/reader.h>
#include <parquet/arrow/writer.h>

using stdclock = std::chrono::high_resolution_clock;

int main(int argc, char** argv) {
  stdclock::time_point start = stdclock::now();

  parquet::ReaderProperties props;
  props.set_buffer_size(1024*1024);
  props.enable_buffered_stream();

  std::unique_ptr<parquet::ParquetFileReader> parquet_reader =
      parquet::ParquetFileReader::OpenFile("1.parquet", true, props);

  std::shared_ptr<parquet::FileMetaData> file_metadata = parquet_reader->metadata();

  int num_row_groups = file_metadata->num_row_groups();
  std::cout << "Groups = " << num_row_groups << std::endl;

  int num_columns = file_metadata->num_columns();
  std::cout << "Columns = " << num_columns << std::endl;

  parquet::arrow::FileReader arrow_reader(arrow::default_memory_pool(), std::move(parquet_reader));

  arrow_reader.set_num_threads(num_columns);
  // One task per row group: read every listed column of that group into a table.
  std::function<void(int, const std::vector<int>&, parquet::arrow::FileReader*)> workFunction = [](
      int fragId,
      const std::vector<int>& colIds,
      parquet::arrow::FileReader* reader) {
    std::shared_ptr<arrow::Table> table;
    reader->ReadRowGroup(fragId, colIds, &table);
  };

  // Read all columns; spawn one thread per row group, each of which may in
  // turn use up to num_columns internal threads (see the discussion above).
  std::vector<int> indices;
  for (int i = 0; i < num_columns; i++)
    indices.push_back(i);
  std::vector<std::thread> work_threads;
  for (int i = 0; i < num_row_groups; i++) {
    work_threads.push_back(std::thread(workFunction, i, indices, &arrow_reader));
  }

  for (auto &thread : work_threads) {
    thread.join();
  }

  stdclock::time_point end = stdclock::now();
  double span = (std::chrono::duration<double, std::milli>(end - start)).count();
  std::cout << "Totally cost: " << span << " ms" << std::endl;

  return 0;
}
