Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9693 )

Change subject: IMPALA-5842: Write page index in Parquet files
......................................................................


Patch Set 15:

(29 comments)

http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.h
File be/src/exec/hdfs-parquet-table-writer.h:

http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.h@108
PS15, Line 108: PAGE_INDEX_TRUNCATE_LENGTH
> PAGE_INDEX_MAX_STRING_LENGTH?
Changed the name


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.h@130
PS15, Line 130: Write
> nit: "Writes"
Done, also fixed the previous and the next comment as well.


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@48
PS15, Line 48: using namespace parquet;
> Can you remove this? I find omitting the parquet:: prefix confusing, and we
I agree, done.


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@109
PS15, Line 109: HdfsSinkTable
> HdfsTableSink?
Done


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@118
PS15, Line 118:     
table_sink_mem_tracker_->Release(page_index_memory_consumption_);
> Can we do this in Close() instead? We have generally been trying to do reso
Done


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@312
PS15, Line 312:   OffsetIndex offset_index_;
> I find it useful to prefix Parquet types with parquet:: and we do that in p
I'm also in favor of prefixing. Done.


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@318
PS15, Line 318:   MemTracker *table_sink_mem_tracker_;
> nit: MemTracker*
Done


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@651
PS15, Line 651:       location.compressed_page_size = -1;
> Probably doesn't have much of an effect, but should we set the size to 0 fo
Yeah, it makes more sense.


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@668
PS15, Line 668:     location.compressed_page_size = 
page.header.compressed_page_size + len;
> Should we add a comment for this line to explain that one includes the leng
Added a comment.


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@752
PS15, Line 752:   string min_val, max_val;
> I think we usually do one declaration per line, but I don't feel strongly a
Done


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@754
PS15, Line 754:     Status s_min = TruncateMinValue(page_stats.min_value, 
PAGE_INDEX_TRUNCATE_LENGTH,
> Could this truncate non-string values if PAGE_INDEX_TRUNCATE_LENGTH was sma
Added a comment.


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@761
PS15, Line 761:     column_index_.null_pages.push_back(true);
> Maybe DCHECK that both are false, to catch errors that result in only one b
Added DCHECK.


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@766
PS15, Line 766: Consume
> We prefer TryConsume() and returning a MEM_LIMIT_EXCEEDED error synchronous
Switched to TryConsume() and check its result.


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@1220
PS15, Line 1220:   // Currently we only support Parquet files with one row 
group.
> Can you add a sentence why this particular code relies on this limited supp
Updated the comment.


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@1225
PS15, Line 1225: ++)
> nit: ++i
Switched to ++i.
I need 'i' to index 'columns_' and 'row_group->columns'.
The same stands for the next for-loop.


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/parquet-column-stats.h
File be/src/exec/parquet-column-stats.h:

http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/parquet-column-stats.h@88
PS15, Line 88:       if (a != a && b != b) return 0;
> Can we use std::isnan()?
Yes, done.


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/parquet-column-stats.h@113
PS15, Line 113: Used for merging page statistics into column chunk statistics.
> I think it's better to omit usage from the comment here. Instead say that i
Changed the added comment to your sentence.


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/parquet-column-stats.h@164
PS15, Line 164: Let's a
> nit: "We assume..." so it reads less like a suggestions.
Done


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/parquet-column-stats.h@165
PS15, Line 165: If not all values are equal
> What happens for NaN, i.e. when all values are NaN?
In ColumnStats<T>::Merge(), MinMaxTrait<T>::Compare() will compare the NaNs, 
and the current implementation will say that they are equal.
So, in the end the ascending and descending boundary order flags will be true. 
Just like for the case when all the values are equal. And when both flags are 
true, GetBoundaryOrder() will return ASCENDING. Maybe it should return 
UNORDERED, if you think it'd be more reasonable.


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/parquet-column-stats.inline.h
File be/src/exec/parquet-column-stats.inline.h:

http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/parquet-column-stats.inline.h@54
PS15, Line 54:         if (MinMaxTrait<T>::Compare(prev_page_max_value_, 
cs->max_value_) > 0 ||
> I cannot convince myself that prev_page_max_value_ has to be initialized he
It is. If has_min_max_values_ is true, it means that we already invoked 
Update().
And after the Update() call we immediately have set the prev values.


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/parquet-column-stats.inline.h@66
PS15, Line 66:     Update(cs->min_value_, cs->max_value_);
> If you do the order checks after this line, you can assume that has_min_max
Unfortunately not. 'has_min_max_values_ == true' implicitly also means that the 
prev page values are set.
If I move the Update up there it will also set has_min_max_values_ to true even 
though we don't have prev page values yet.

Maybe I could introduce another flag called 'has_prev_page_values_' to make the 
code more clear.


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/parquet-column-stats.inline.h@224
PS15, Line 224:     Update(cs->min_value_, cs->max_value_);
> same here
Same problem as above


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/util/string-util-test.cc
File be/src/util/string-util-test.cc:

http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/util/string-util-test.cc@33
PS15, Line 33: void EvalTruncation(const string& original, const string& 
expected_result,
> Why did you fold both test methods into one and use an enum for the test, b
In the test it was easier to organize the code like this.

In non-test code I prefer to have different functions with fewer parameters. I 
don't feel very strong about it, so if you want, I can merge TruncateUp() and 
TruncateDown() to a single Truncate() function.


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/util/string-util-test.cc@41
PS15, Line 41:   EXPECT_EQ(expected_result, result);
> This checks result without checking the return status of TruncateMin/Max
Added ASSERT_OKs


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/util/string-util-test.cc@45
PS15, Line 45:   EvalTruncation("0123456789", "0123456789", 100, MIN);
> Can you add tests with empty string, string that has \0 in it, and char min
Added new tests


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/util/string-util-test.cc@68
PS15, Line 68:
> Can you add a test for [0, 0x7F, 0xFF] to check that we don't rely on signe
Added new tess. Already had tests with 0xFF at L57-L66


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/util/string-util.h
File be/src/util/string-util.h:

http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/util/string-util.h@29
PS15, Line 29: Status TruncateMinValue(const std::string& str, int32_t 
max_length, std::string* result);
> These seem similar to numeric rounding. Maybe name them TruncateUp and Trun
Done, I like the new names.


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/util/string-util.h@36
PS15, Line 36: Status TruncateMaxValue(const std::string& str, int32_t 
max_length, std::string* result);
> This one should probably have a WARN_IF_UNUSED annotation since it modifies
Done


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/util/string-util.cc
File be/src/util/string-util.cc:

http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/util/string-util.cc@46
PS15, Line 46:   (*result)[i] += 1;
> (*result)[i] could be 0x7F here and this addition could overflow, resulting
Using an unsigned char instead.



--
To view, visit http://gerrit.cloudera.org:8080/9693
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
Gerrit-Change-Number: 9693
Gerrit-PatchSet: 15
Gerrit-Owner: Zoltan Borok-Nagy <[email protected]>
Gerrit-Reviewer: Anonymous Coward #248
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Lars Volker <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Wed, 09 May 2018 15:10:13 +0000
Gerrit-HasComments: Yes

Reply via email to