Lars Volker has posted comments on this change. ( http://gerrit.cloudera.org:8080/12163 )

Change subject: IMPALA-7087: Read Parquet decimal columns with lower 
precision/scale
......................................................................


Patch Set 1:

> > > (1 comment)
 > > >
 > > > Thanks for taking a look @Csaba! I've only responded to your
 > > > comment on overflow handling so far, as I want to get that
 > > > behavior locked down first.
 > >
 > > Are you sure that the query will be aborted - isn't it only one
 > > warning / file? If there are "normal" files processed before the
 > > one with overflows, then some rows will be returned before
 > > running into the error. In the current implementation it is also
 > > possible that some rows are returned from the file before
 > > reaching the overflowing value.
 > >
 > > I am also not completely sure about the best approach.
 > >
 > > I would prefer the error to be non-fatal, so to print one
 > > warning/file and continue the query.
 > >
 > > If no important use case is lost this way, then I would prefer to
 > > use the formula and skip files based on metadata instead of
 > > checking every row.
 > >
 > > This would make it possible to print a useful error message, so
 > > that the user will know the precision needed to process the file,
 > > and can call ALTER TABLE to increase precision / decrease scale
 > > and make the file usable. If the error is printed based on the
 > > first row, then this can become a long iterative process.
 >
 > That's a fair point. It is throwing an error. The last test in
 > parquet-decimal-precision-and-scale-widening.test has a `CATCH`
 > section, so if I understand the logic in ImpalaTestSuite#run_test_case
 > then that means the query threw an exception. I agree that it's
 > possible for some rows to be returned before this error is thrown,
 > but I guess that's an issue with many runtime errors in Impala.
 >
 > I'm good with printing a non-fatal error and setting the row to
 > NULL. This is what Hive does, so we will be consistent with Hive.
 >
 > Using the formula will prevent overflows from happening in the
 > first place, so it is somewhat orthogonal to the discussion of
 > error handling. The only drawback I can see to using the formula
 > you suggested is that it is not exactly SQL Standard compliant.
 > It's entirely possible that the formula prevents a query from
 > running, even though none of the values in the table would cause
 > an overflow. The standard says that values should be rounded or
 > truncated to match the target precision / scale, but I guess it
 > doesn't explicitly mention overflows, so it's hard to say.
 >
 > I think that following Hive's behavior might be best here. I've
 > added @Lars to the review to see what he thinks.
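To make the metadata-based check concrete, here is a rough sketch of the formula being discussed (helper name is mine, not Impala code): a file typed decimal(p1, s1) can never overflow when read as decimal(p2, s2) if the target type reserves at least as many integer digits as the file type can use. As noted above, this check can be conservative, rejecting files whose actual values would all fit.

```python
def can_always_convert(file_prec, file_scale, target_prec, target_scale):
    # A value of decimal(file_prec, file_scale) has at most
    # (file_prec - file_scale) integer digits. If the target type
    # reserves at least that many integer digits, no value in the
    # file can overflow, regardless of the actual data.
    return (target_prec - target_scale) >= (file_prec - file_scale)

# decimal(9,2) -> decimal(10,3): 7 integer digits available -> always safe
# decimal(9,2) -> decimal(9,3):  only 6 integer digits -> may overflow
```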

I agree that following Hive's behavior would be best. For other errors we also 
usually issue a warning and return null, unless abort_on_error is set.
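For illustration, the Hive-style per-value behavior would look roughly like this (a sketch, not Impala's implementation; the warning list stands in for whatever error-reporting mechanism the scanner uses):

```python
from decimal import Decimal, ROUND_HALF_UP

def convert_decimal(value, target_prec, target_scale, warnings):
    # Rescale to the target scale (rounding, per the SQL standard),
    # then check whether the result still fits in target_prec digits.
    quantum = Decimal(1).scaleb(-target_scale)
    rescaled = value.quantize(quantum, rounding=ROUND_HALF_UP)
    if abs(rescaled) >= Decimal(10) ** (target_prec - target_scale):
        # Non-fatal: record a warning and return NULL, as Hive does.
        warnings.append("decimal overflow converting %s" % value)
        return None
    return rescaled
```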

On a side note: If the parquet file has min/max statistics we should be able to 
tell whether an overflow would happen or not. However, if we want to have 
per-row behavior, that won't help us much.
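A sketch of that statistics check (assuming the min/max values are available as decimals; helper name is hypothetical): if both extremes fit after rescaling, no value in the file can overflow, so the file could be admitted without per-row checks.

```python
from decimal import Decimal, ROUND_HALF_UP

def stats_rule_out_overflow(col_min, col_max, target_prec, target_scale):
    # Overflow is impossible for the whole column iff both the min and
    # max statistics still fit in target_prec digits after rescaling.
    # Note the rescale step: a value like 99.995 rounds up to 100.00
    # and overflows decimal(4,2) even though 99.995 < 100.
    quantum = Decimal(1).scaleb(-target_scale)
    limit = Decimal(10) ** (target_prec - target_scale)
    return all(
        abs(v.quantize(quantum, rounding=ROUND_HALF_UP)) < limit
        for v in (col_min, col_max)
    )
```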


--
To view, visit http://gerrit.cloudera.org:8080/12163
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iafc8efd12379a39756e3e70f022a81a636dadb61
Gerrit-Change-Number: 12163
Gerrit-PatchSet: 1
Gerrit-Owner: Sahil Takiar <stak...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Lars Volker <l...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <stak...@cloudera.com>
Gerrit-Comment-Date: Wed, 09 Jan 2019 18:42:20 +0000
Gerrit-HasComments: No
