Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15818 )

Change subject: IMPALA-9512: Full ACID Milestone 2: Validate rows against the 
valid write id list
......................................................................


Patch Set 5: Code-Review+1

(3 comments)

I only have some nice-to-have comments.
I can give a +2 if no one else plans to look at the patch.

http://gerrit.cloudera.org:8080/#/c/15818/5//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/15818/5//COMMIT_MSG@35
PS5, Line 35: but since Hive Streaming doesn't write
            : statistics we validate the write id per ORC row batch.
I think there is still a way to optimize this without statistics:
we could read the stripe in two passes:
1. read only a single row from the writeId column (it should be the same as all 
the others)
2. read the rest of the columns only if the writeId was valid

Even if the stripe is needed, this would still be faster, as we wouldn't need to 
unpack the writeId column (stored in RLE, I guess) and fill batches with its value.
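The two-pass idea could be sketched roughly like below. This is only an illustration, not the actual Impala/ORC scanner API: `Stripe` and `ReadStripeTwoPass` are made-up names, and the stripe is modeled as a struct instead of real RLE-encoded column streams.

```cpp
#include <cstdint>
#include <unordered_set>
#include <vector>

// Hive Streaming writes a single write id per stripe, so peeking at one row of
// the writeId column is enough to decide the whole stripe.
struct Stripe {
  int64_t write_id;               // pass 1 reads only this value
  std::vector<int> payload_rows;  // stand-in for the remaining columns
};

// Pass 1: check the write id using a single row of the stripe.
// Pass 2: materialize the remaining columns only if the write id is valid;
// otherwise the whole stripe is skipped without unpacking any other column.
std::vector<int> ReadStripeTwoPass(const Stripe& stripe,
    const std::unordered_set<int64_t>& valid_write_ids) {
  if (valid_write_ids.count(stripe.write_id) == 0) return {};  // skip stripe
  return stripe.payload_rows;
}
```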

I am totally ok with leaving it as it is; simplicity seems more important than 
optimization at this point, but the trade-off could be mentioned to make this 
design decision clear.


http://gerrit.cloudera.org:8080/#/c/15818/5/be/src/exec/acid-metadata-utils.h
File be/src/exec/acid-metadata-utils.h:

http://gerrit.cloudera.org:8080/#/c/15818/5/be/src/exec/acid-metadata-utils.h@26
PS5, Line 26: ValidWriteIdList
optional: Some unit tests would be nice for this class, as it hides some real 
complexity behind a simple interface.

Another "would be nice" thing to consider is moving most of the complexity to the 
frontend and passing the "decoded" validWriteIdList to the backend in a Thrift 
struct. Keeping it all in the frontend would make it easier to change the 
logic, e.g. if HMS changes something in the format.
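For the unit tests, something along these lines could pin down the behavior. Note that `SimpleValidWriteIdList` and the assumed HMS string layout (`"<db.table>:<highWatermark>:<minOpenWriteId>:<open ids>:<aborted ids>"`) are my own sketch, not the patch's actual class; the real tests should of course target acid-metadata-utils.h and verify the format against real HMS output.

```cpp
#include <cstdint>
#include <set>
#include <sstream>
#include <string>
#include <vector>

std::vector<std::string> Split(const std::string& s, char delim) {
  std::vector<std::string> parts;
  std::stringstream ss(s);
  std::string part;
  while (std::getline(ss, part, delim)) parts.push_back(part);
  return parts;
}

struct SimpleValidWriteIdList {
  int64_t high_watermark = -1;
  std::set<int64_t> invalid_ids;  // open + aborted write ids

  // A write id is valid if it is at or below the high watermark and is
  // neither open nor aborted.
  bool IsWriteIdValid(int64_t write_id) const {
    return write_id <= high_watermark && invalid_ids.count(write_id) == 0;
  }
};

// Assumed format: "<db.table>:<highWatermark>:<minOpenWriteId>:<open>:<aborted>"
SimpleValidWriteIdList Parse(const std::string& s) {
  std::vector<std::string> fields = Split(s, ':');
  SimpleValidWriteIdList list;
  list.high_watermark = std::stoll(fields[1]);
  for (size_t i = 3; i < fields.size(); ++i) {
    for (const std::string& id : Split(fields[i], ',')) {
      if (!id.empty()) list.invalid_ids.insert(std::stoll(id));
    }
  }
  return list;
}
```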


http://gerrit.cloudera.org:8080/#/c/15818/3/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/15818/3/be/src/exec/hdfs-orc-scanner.cc@517
PS3, Line 517: tch* row_batch) {
> Done
The handling of this case is still not clear to me.
I understand that we can't answer count(*) based on stripe metadata alone, as we 
have to check the actual writeId. My question is: which columns are we reading in 
this case? Only the one with the writeId?



--
To view, visit http://gerrit.cloudera.org:8080/15818
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I5ed74585a2d73ebbcee763b0545be4412926299d
Gerrit-Change-Number: 15818
Gerrit-PatchSet: 5
Gerrit-Owner: Zoltan Borok-Nagy <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Wed, 06 May 2020 12:06:49 +0000
Gerrit-HasComments: Yes
