Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/12163 )
Change subject: IMPALA-7087: Read Parquet decimal columns with lower precision/scale ...................................................................... Patch Set 1: > > (1 comment) > > > > Thanks for taking a look @Csaba! I've only responded to your > > comment on overflow handling so far, as I want to get that > behavior > > locked down first. > > Are you sure that the query will be aborted - isn't it only one > warning / file? If there are "normal" files processed before the > one with overflows, then some rows will be returned before running > to the error. In the current implementation it is also possible > that some rows are returned from the file before reaching the > overflowing value. > > I am also not completely sure about the best approach. > > I would prefer the error to be non-fatal, so to print one > warning/file and continue the query. > > If no important use case is lost this way, then I would prefer to > use the formula and skip files based on metadata instead of > checking every row. > > This would make it possible to print a useful error message, so > that the user will know the precision needed to process the file, > and can call ALTER TABLE to increase precision / decrease scale and > make the file usable. If the error is printed based the first row, > then this can become a long iterative process. Thats a fair point. It is throwing an error. The last test in parquet-decimal-precision-and-scale-widening.test has a `CATCH` section, so if I understand the logic in ImpalaTestSuite#run_test_case then that means the query threw an exception. I agree that its possible for some rows to be returned before this error is throw, but I guess thats an issue with many runtime errors in Impala. I'm good with printing a non-fatal error and setting the row to NULL. This is what Hive does, so we will be consistent with Hive. Using the formula will prevent overflows from happening in the first place, so it somewhat orthogonal to the discussion of error handling. The only drawback I can see to using the formula you suggested is that it is not exactly SQL Standard compliant. It's entirely possible that the formula prevents a query from running, even though none of the values in the table would cause an overflow. The standard says that values should be rounded or truncated to the match target precision / scale, but I guess it doesn't explicitly mention overflows so its hard to say. I think that following Hive's behavior might be best here. I've added @Lars to the review to see what he thinks. -- To view, visit http://gerrit.cloudera.org:8080/12163 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iafc8efd12379a39756e3e70f022a81a636dadb61 Gerrit-Change-Number: 12163 Gerrit-PatchSet: 1 Gerrit-Owner: Sahil Takiar <[email protected]> Gerrit-Reviewer: Csaba Ringhofer <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Lars Volker <[email protected]> Gerrit-Reviewer: Sahil Takiar <[email protected]> Gerrit-Comment-Date: Mon, 07 Jan 2019 21:32:17 +0000 Gerrit-HasComments: No
