Lars Volker has posted comments on this change. ( http://gerrit.cloudera.org:8080/12163 )
Change subject: IMPALA-7087: Read Parquet decimal columns with lower precision/scale
......................................................................

Patch Set 1:

> > > (1 comment)
> > >
> > > Thanks for taking a look @Csaba! I've only responded to your
> > > comment on overflow handling so far, as I want to get that
> > > behavior locked down first.
> >
> > Are you sure that the query will be aborted - isn't it only one
> > warning / file? If there are "normal" files processed before the
> > one with overflows, then some rows will be returned before running
> > into the error. In the current implementation it is also possible
> > that some rows are returned from the file before reaching the
> > overflowing value.
> >
> > I am also not completely sure about the best approach.
> >
> > I would prefer the error to be non-fatal, so to print one
> > warning/file and continue the query.
> >
> > If no important use case is lost this way, then I would prefer to
> > use the formula and skip files based on metadata instead of
> > checking every row.
> >
> > This would make it possible to print a useful error message, so
> > that the user will know the precision needed to process the file,
> > and can call ALTER TABLE to increase precision / decrease scale
> > and make the file usable. If the error is printed based on the
> > first row, then this can become a long iterative process.
>
> That's a fair point. It is throwing an error. The last test in
> parquet-decimal-precision-and-scale-widening.test has a `CATCH`
> section, so if I understand the logic in
> ImpalaTestSuite#run_test_case, that means the query threw an
> exception. I agree that it's possible for some rows to be returned
> before this error is thrown, but I guess that's an issue with many
> runtime errors in Impala.
>
> I'm good with printing a non-fatal error and setting the row to
> NULL. This is what Hive does, so we will be consistent with Hive.
> Using the formula will prevent overflows from happening in the
> first place, so it is somewhat orthogonal to the discussion of error
> handling. The only drawback I can see to using the formula you
> suggested is that it is not exactly SQL Standard compliant. It's
> entirely possible that the formula prevents a query from running,
> even though none of the values in the table would cause an overflow.
> The standard says that values should be rounded or truncated to
> match the target precision / scale, but I guess it doesn't
> explicitly mention overflows, so it's hard to say.
>
> I think that following Hive's behavior might be best here. I've
> added @Lars to the review to see what he thinks.

I agree that following Hive's behavior would be best. For other errors
we also usually issue a warning and return NULL, unless abort_on_error
is set.

On a side note: if the Parquet file has min/max statistics, we should
be able to tell whether an overflow would happen or not. However, if
we want to have per-row behavior, that won't help us much.

--
To view, visit http://gerrit.cloudera.org:8080/12163
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iafc8efd12379a39756e3e70f022a81a636dadb61
Gerrit-Change-Number: 12163
Gerrit-PatchSet: 1
Gerrit-Owner: Sahil Takiar <stak...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Lars Volker <l...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <stak...@cloudera.com>
Gerrit-Comment-Date: Wed, 09 Jan 2019 18:42:20 +0000
Gerrit-HasComments: No
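[Editor's note] The two approaches debated in the thread - a metadata-only
check based on the precision/scale formula versus a per-row overflow check -
can be illustrated with the following hypothetical Python sketch. This is not
Impala code (Impala's scanner is C++); `widening_is_safe` and `fits` are
invented names, and the formula shown (the destination type must keep at
least as many integer digits as the source) is an assumption inferred from
the discussion above, not a quote from the patch.

```python
from decimal import Decimal, ROUND_HALF_UP

def widening_is_safe(src_precision, src_scale, dst_precision, dst_scale):
    """Metadata-only check (hypothetical): no value representable as
    DECIMAL(src_precision, src_scale) can overflow
    DECIMAL(dst_precision, dst_scale) as long as the destination keeps
    at least as many integer digits as the source."""
    return dst_precision - dst_scale >= src_precision - src_scale

def fits(value, precision, scale):
    """Per-row check (hypothetical): round the value to the target
    scale, then verify its total digit count does not exceed the target
    precision."""
    rounded = value.quantize(Decimal(1).scaleb(-scale),
                             rounding=ROUND_HALF_UP)
    return len(rounded.as_tuple().digits) <= precision

# Metadata check: DECIMAL(9,2) -> DECIMAL(8,2) loses an integer digit,
# so the file would be rejected up front even if no row overflows.
print(widening_is_safe(9, 2, 8, 2))   # False
# Per-row check: 1234.5 does not fit DECIMAL(5,2), but 12.345 does
# after rounding to 12.35.
print(fits(Decimal("1234.5"), 5, 2))  # False
print(fits(Decimal("12.345"), 5, 2))  # True
```

This also illustrates the SQL-compliance concern raised above: the
metadata check rejects a DECIMAL(9,2) file targeting DECIMAL(8,2) even
when every stored value would individually pass the per-row check.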