Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/9693 )
Change subject: IMPALA-5842: Write page index in Parquet files ...................................................................... Patch Set 10: (1 comment) http://gerrit.cloudera.org:8080/#/c/9693/10/be/src/exec/hdfs-parquet-table-writer.cc File be/src/exec/hdfs-parquet-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/9693/10/be/src/exec/hdfs-parquet-table-writer.cc@301 PS10, Line 301: std::vector<std::string> min_values_; > I'm still concerned about the amount of untracked memory from min_values_ a I had a discussion with the Parquet guys who implement the page indexes for parquet-mr and they plan to truncate the min/max values at 64 bytes. It seems reasonable because it is unlikely that users want to filter against very long string data. Maybe we could also use 64 bytes by default and let the users override it with an advanced query option. (specifying different sizes for different columns would make more sense, but I have no idea how to do that) If we go with this default, do you still think it will end up using too much untracked memory? As you wrote, if we start using StringValue/StringBuffer, we will need to convert min/max values to vectors of strings when we write out the column index, which is a lot of unnecessary copying and a short temporary peak of untracked memory. Or, to avoid the untracked memory peak, we can implement our own ColumnIndex class that uses StringBuffers and has a 'write(TProtocol *prot)' method. I think we can't avoid copying everything, because the TProtocol class only accepts std::strings for writing binary data. To summarize, I can only see 3 options, all with some drawbacks: * use std::strings pro: code is straight-forward pro: no copying con: lot of untracked memory * use StringBuffers pro: avoids long-living untracked memory con: lots of copying con: short-living untracked memory * implement our own ColumnIndex class pro: avoids untracked memory con: lots of copying con: need to maintain that class Am I missing something? Which option do you prefer? -- To view, visit http://gerrit.cloudera.org:8080/9693 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 Gerrit-Change-Number: 9693 Gerrit-PatchSet: 10 Gerrit-Owner: Zoltan Borok-Nagy <[email protected]> Gerrit-Reviewer: Anonymous Coward #248 Gerrit-Reviewer: Csaba Ringhofer <[email protected]> Gerrit-Reviewer: Lars Volker <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]> Gerrit-Comment-Date: Mon, 16 Apr 2018 13:48:20 +0000 Gerrit-HasComments: Yes
