Alex Behm has posted comments on this change.

Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in 
Parquet
......................................................................


Patch Set 4:

(14 comments)

Still need to look at tests. Code is getting close.

http://gerrit.cloudera.org:8080/#/c/5939/4/be/src/exec/hdfs-scan-node-base.h
File be/src/exec/hdfs-scan-node-base.h:

Line 304:   // Time zone for adjusting timestamp values read from Parquet files 
written by
single line


http://gerrit.cloudera.org:8080/#/c/5939/4/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

Line 246:           
parent_->state_->LogError(ErrorMsg(TErrorCode::PARQUET_MR_WRITE_ZONE_INVALID,
I don't think logging an error is the right thing to do here. We will scan all 
files and just keep appending to the error log, but no useful results can come 
of a query in this state.

This error is not recoverable, so I think we should terminate this scan and the 
query.

Consider moving this code into Reset() or better add a new Open() or Init() 
function that is called once per column reader from 
HdfsParquetScanner::CreateColumnReaders(). That way you can return a Status 
early and not scan any data.


Line 653:     } else {
As mentioned above, we should terminate the query early when in such a bad 
state. Imo, there should be a DCHECK here.


http://gerrit.cloudera.org:8080/#/c/5939/4/be/src/exprs/timestamp-functions.cc
File be/src/exprs/timestamp-functions.cc:

Line 194:   for (const auto& tz_region: tz_region_list_) {
No need to use auto here. const string& is clearer


http://gerrit.cloudera.org:8080/#/c/5939/4/be/src/exprs/timezone_db.h
File be/src/exprs/timezone_db.h:

Line 40:   /// If 'tz' is not found in the database, null pointer is returned.
null pointer -> nullptr


Line 46:   /// If 'tz' is not found in the database, null pointer is returned.
null pointer -> nullptr


http://gerrit.cloudera.org:8080/#/c/5939/3/be/src/runtime/timestamp-value.cc
File be/src/runtime/timestamp-value.cc:

Line 125:   return true;
> What do you mean? The documentation doesn't say anything about an exception
Agree, I could not find any indication in the code either.


http://gerrit.cloudera.org:8080/#/c/5939/4/be/src/runtime/timestamp-value.h
File be/src/runtime/timestamp-value.h:

Line 199:   /// conversion was successfull and false otherwise.
Comment on what the state of *this is when false is returned.


Line 202:   /// Converts from UTC to local time in the given timezone. Returns 
true if conversion
Comment on what the state of *this is when false is returned.


http://gerrit.cloudera.org:8080/#/c/5939/4/be/src/service/fe-support.cc
File be/src/service/fe-support.cc:

Line 357:   const char *tz = env->GetStringUTFChars(timezone, NULL);
const char*


http://gerrit.cloudera.org:8080/#/c/5939/4/be/src/service/impala-server.cc
File be/src/service/impala-server.cc:

Line 122:     "the parquet.mr.int96.write.zone table property to UTC for new 
tables. This changes "
consistent placing of space (at line end or at line beginning)


http://gerrit.cloudera.org:8080/#/c/5939/3/fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java
File fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java:

Line 67:   }
> I was copying the code that validates the skip.header.line.count table prop
My apologies, but I was wrong on this one. I think it's unfortunate, but better 
to check this for every query. The alternative is to catch the issue during 
table loading, but Impala has no way to fix such tables today, because you can 
only ALTER tables that are fully loaded. That's an independent issue, but for 
now, it does seem better to check every time like for skip.header.linecount


http://gerrit.cloudera.org:8080/#/c/5939/3/fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java
File fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java:

Line 184:       String timezone = 
getTblProperties().get(HdfsTable.TBL_PROP_PARQUET_MR_WRITE_ZONE);
> Correct. I need some help here. How can I check if the table that will be c
You can check getFileFormat() here. The only non-HDFS formats we have are HBase 
and Kudu.


http://gerrit.cloudera.org:8080/#/c/5939/4/fe/src/main/java/org/apache/impala/service/FeSupport.java
File fe/src/main/java/org/apache/impala/service/FeSupport.java:

Line 87:   // Returns true if timezone String is valid, false otherwise.
valid according to the BE timezone database


-- 
To view, visit http://gerrit.cloudera.org:8080/5939
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges <[email protected]>
Gerrit-Reviewer: Alex Behm <[email protected]>
Gerrit-Reviewer: Attila Jeges <[email protected]>
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Taras Bobrovytsky <[email protected]>
Gerrit-Reviewer: Zoltan Ivanfi <[email protected]>
Gerrit-HasComments: Yes

Reply via email to