Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9693 )

Change subject: IMPALA-5842: Write page index in Parquet files
......................................................................


Patch Set 10:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/9693/10/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/9693/10/be/src/exec/hdfs-parquet-table-writer.cc@301
PS10, Line 301:   std::vector<std::string> min_values_;
> I'm still concerned about the amount of untracked memory from min_values_ a
I had a discussion with the Parquet guys who implement the page indexes for 
parquet-mr and they plan to truncate the min/max values at 64 bytes. It seems 
reasonable because it is unlikely that users want to filter against very long 
string data.

Maybe we could also use 64 bytes by default and let the users override it with 
an advanced query option.
(specifying different sizes for different columns would make more sense, but I 
have no idea how to do that)

If we go with this default, do you still think it will end up using too much 
untracked memory?

As you wrote, if we start using StringValue/StringBuffer, we will need to 
convert min/max values to vectors of strings when we write out the column 
index, which is a lot of unnecessary copying and a short temporary peak of 
untracked memory.
Or, to avoid the untracked memory peak, we can implement our own ColumnIndex 
class that uses StringBuffers and has a 'write(TProtocol *prot)' method. I 
think we can't avoid copying everything, because the TProtocol class only 
accepts std::strings for writing binary data.

To summarize, I can only see 3 options, all with some drawbacks:

* use std::strings

    pro: code is straight-forward
    pro: no copying
    con: lot of untracked memory

* use StringBuffers

    pro: avoids long-living untracked memory
    con: lots of copying
    con: short-living untracked memory

* implement our own ColumnIndex class

    pro: avoids untracked memory
    con: lots of copying
    con: need to maintain that class

Am I missing something? Which option do you prefer?



--
To view, visit http://gerrit.cloudera.org:8080/9693
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
Gerrit-Change-Number: 9693
Gerrit-PatchSet: 10
Gerrit-Owner: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Reviewer: Anonymous Coward #248
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Lars Volker <l...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Comment-Date: Mon, 16 Apr 2018 13:48:20 +0000
Gerrit-HasComments: Yes

Reply via email to