Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/15818 )
Change subject: IMPALA-9512: Full ACID Milestone 2: Validate rows against the valid write id list ...................................................................... Patch Set 5: Code-Review+1 (3 comments) I only have some nice to have comments. I can give +2 if no one else plans to look at the patch. http://gerrit.cloudera.org:8080/#/c/15818/5//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/15818/5//COMMIT_MSG@35 PS5, Line 35: but since Hive Streaming doesn't write : statistics we validate the write id per ORC row batch. I think there is still a way to optimize this without statistics: we could read the stripe in two passes: 1. only reading a single row from the writeIds (which should be the same as all the others) 2. read the rest of the columns if the writeId was valid Even if the stripe is needed this would still be faster, as we wouldn't need to unpack the writeId (stored in RLE I guess) and fill batches with its value. I am totally ok with leaving it as it is, simplicity seems more important than optimization at this point, but it could be mentioned to make this design decision clear. http://gerrit.cloudera.org:8080/#/c/15818/5/be/src/exec/acid-metadata-utils.h File be/src/exec/acid-metadata-utils.h: http://gerrit.cloudera.org:8080/#/c/15818/5/be/src/exec/acid-metadata-utils.h@26 PS5, Line 26: ValidWriteIdList optional: Some unit tests would be nice for this class as it hides some real complexity behind a simple interface. Another "would be nice thing" to consider is move most of the complexity to the frontend, and pass the "decoded" validWriteIdList to the backend in a Thrift struct. Keeping it all in the frontend would make it easier to change the logic, e.g. if HMS changes something in the format. http://gerrit.cloudera.org:8080/#/c/15818/3/be/src/exec/hdfs-orc-scanner.cc File be/src/exec/hdfs-orc-scanner.cc: http://gerrit.cloudera.org:8080/#/c/15818/3/be/src/exec/hdfs-orc-scanner.cc@517 PS3, Line 517: tch* row_batch) { > Done The handling of this case is still not clear to me. I understand that we can't answer count(*) based on stripe metadata, as we have to check the actual writeId. My question is: what columns are we reading in this case? Only the one with writeId? -- To view, visit http://gerrit.cloudera.org:8080/15818 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I5ed74585a2d73ebbcee763b0545be4412926299d Gerrit-Change-Number: 15818 Gerrit-PatchSet: 5 Gerrit-Owner: Zoltan Borok-Nagy <[email protected]> Gerrit-Reviewer: Csaba Ringhofer <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]> Gerrit-Comment-Date: Wed, 06 May 2020 12:06:49 +0000 Gerrit-HasComments: Yes
