[Impala-ASF-CR] IMPALA-7087: Read Parquet decimal columns with lower precision/scale

Sahil Takiar (Code Review) Mon, 07 Jan 2019 13:32:30 -0800

Sahil Takiar has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12163 )


Change subject: IMPALA-7087: Read Parquet decimal columns with lower 
precision/scale
......................................................................


Patch Set 1:

> > (1 comment)
 > >
 > > Thanks for taking a look @Csaba! I've only responded to your
 > > comment on overflow handling so far, as I want to get that
 > behavior
 > > locked down first.
 >
 > Are you sure that the query will be aborted - isn't it only one
 > warning / file? If there are "normal" files processed before the
 > one with overflows, then some rows will be returned before running
 > to the error. In the current implementation it is also possible
 > that some rows are returned from the file before reaching the
 > overflowing value.
 >
 > I am also not completely sure about the best approach.
 >
 > I would prefer the error to be non-fatal, so to print one
 > warning/file and continue the query.
 >
 > If no important use case is lost this way, then I would prefer to
 > use the formula and skip files based on metadata instead of
 > checking every row.
 >
 > This would make it possible to print a useful error message, so
 > that the user will know the precision needed to process the file,
 > and can call ALTER TABLE to increase precision / decrease scale and
 > make the file usable. If the error is printed based the first row,
 > then this can become a long iterative process.

Thats a fair point. It is throwing an error. The last test in 
parquet-decimal-precision-and-scale-widening.test has a `CATCH` section, so if 
I understand the logic in ImpalaTestSuite#run_test_case then that means the 
query threw an exception. I agree that its possible for some rows to be 
returned before this error is throw, but I guess thats an issue with many 
runtime errors in Impala.

I'm good with printing a non-fatal error and setting the row to NULL. This is 
what Hive does, so we will be consistent with Hive.

Using the formula will prevent overflows from happening in the first place, so 
it somewhat orthogonal to the discussion of error handling. The only drawback I 
can see to using the formula you suggested is that it is not exactly SQL 
Standard compliant. It's entirely possible that the formula prevents a query 
from running, even though none of the values in the table would cause an 
overflow. The standard says that values should be rounded or truncated to the 
match target precision / scale, but I guess it doesn't explicitly mention 
overflows so its hard to say.

I think that following Hive's behavior might be best here. I've added @Lars to 
the review to see what he thinks.


--
To view, visit http://gerrit.cloudera.org:8080/12163
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iafc8efd12379a39756e3e70f022a81a636dadb61
Gerrit-Change-Number: 12163
Gerrit-PatchSet: 1
Gerrit-Owner: Sahil Takiar <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Lars Volker <[email protected]>
Gerrit-Reviewer: Sahil Takiar <[email protected]>
Gerrit-Comment-Date: Mon, 07 Jan 2019 21:32:17 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-7087: Read Parquet decimal columns with lower precision/scale

Reply via email to