Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/12163 )

Change subject: IMPALA-7087: Read Parquet decimal columns with lower 
precision/scale
......................................................................


Patch Set 1:

> > Starting a new thread, since the last one was getting long.
 > >
 > > I started working on getting this patch in line with Hive's
 > > behavior - e.g. if an overflow occurs, set the row to NULL and
 > emit
 > > a warning. I hit an issue in the design of ScalarColumnReader::ConvertSlot.
 > > The ConvertSlot method returns a bool, if true the execution
 > > continues, if false the query fails. There is no clean way for
 > > ConvertSlot to set the tuple to NULL and continue processing.
 > > ScalarColumnReader::ValidateValue on the other hand does allow for
 > > this. If ValidateValue returns false, and the parse_status is
 > ok()
 > > then the tuple is set to NULL and execution continues. If it
 > > returns false and the parse_status() is an error, it aborts the
 > > query. If it returns true, execution continues.
 > >
 > > I checked and all current implementations of ConvertSlot always
 > > return true, no matter what. So I propose changing ConvertSlot's
 > > return semantics to match those of ValidateValue.
 > >
 > > Any objections?
 >
 > I would still prefer the "exclude the whole file with warning if
 > overflow is possible" approach, as it would:
 > - give a useful error message
 > - probably make conversion faster, as a simple multiplication would
 > be enough
 > - would be simpler and need less testing
 >
 > If this is not an option (because enabling the potentially
 > overflowing conversion is really useful in some use cases), then I
 > agree with changing ConvertSlot() as you described.
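To make the proposed semantics concrete, here is a rough C++ sketch (simplified, hypothetical types and names, not Impala's actual ScalarColumnReader interfaces) of a conversion following ValidateValue()'s contract: returning false with an ok() parse status means "NULL the slot and keep going", while false with an error status would abort the query.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for Impala's parse_status machinery.
struct ParseStatus {
  bool ok_ = true;
  bool ok() const { return ok_; }
};

// Proposed ConvertSlot()-style contract, matching ValidateValue():
//   true                   -> converted value written, keep the slot
//   false, status ok()     -> overflow: caller sets slot to NULL, continues
//   false, status not ok() -> unrecoverable error: query aborts
// Scales 'unscaled' up by 10^delta_scale; 'max_abs' is the largest
// magnitude the target decimal precision can represent.
bool ConvertDecimalSlot(int64_t unscaled, int delta_scale, int64_t max_abs,
                        int64_t* out, ParseStatus* status) {
  int64_t scaled = unscaled;
  for (int i = 0; i < delta_scale; ++i) {
    if (scaled > max_abs / 10 || scaled < -(max_abs / 10)) {
      // Overflow: leave status ok() so the row is NULLed and scanning
      // continues (warning emission omitted from this sketch).
      return false;
    }
    scaled *= 10;
  }
  *out = scaled;
  return true;
}
```

In the real patch the overflow path would also emit a warning through the scanner's usual error reporting; that part is omitted here.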

A few points to consider on the "exclude the whole file with warning if 
overflow is possible" approach:

1: I don't think we should exclude entire files, with just a warning, merely 
because an overflow might be possible. That could significantly throw off query 
results. A more fine-grained approach, where we essentially exclude individual 
rows by setting them to NULL, would be better.

2: Allowing for overflows only gives the user more flexibility; if they don't 
want overflows, they can set abort_on_error and the query will fail instead of 
rows being set to NULL.

3: Setting overflowed slots to NULL is consistent with how Impala handles 
similar issues elsewhere; most of the ValidateValue methods follow this 
approach.

4: Yes, there is a performance overhead to checking for overflows. I haven't 
measured how much it is, but consider that many decimal operations performed by 
Impala expressions (e.g. decimal multiplication, addition, etc.) invoke the 
same overflow check.

5: Currently, if an Impala query multiplies two slots and the result overflows, 
the entire query fails. So I'm also ok with keeping the current behavior of 
this patch since it matches how Impala currently deals with overflows. However, 
I would still prefer to follow Hive's behavior because (1) consistency with 
Hive is useful, and (2) users can still set abort_on_error=true if they want.
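For comparison, the "exclude the whole file" alternative essentially reduces to a static check on the two decimal types: overflow is impossible iff the table type has at least as many integer digits as the file type. A hedged sketch (hypothetical helper name; assumes the table scale is >= the file scale, so conversion only ever scales values up):

```cpp
// Hypothetical predicate for the file-level exclusion approach: returns
// true if some value representable in the file's decimal(precision, scale)
// could overflow the table's decimal type after rescaling.
bool OverflowPossible(int file_precision, int file_scale,
                      int table_precision, int table_scale) {
  int file_int_digits = file_precision - file_scale;    // digits before the point
  int table_int_digits = table_precision - table_scale;
  return file_int_digits > table_int_digits;
}
```

Under that approach, a file for which this predicate returns true would be skipped entirely with a warning, rather than converted row by row.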

If discussing this over Hangouts is easier, happy to set up a call.


--
To view, visit http://gerrit.cloudera.org:8080/12163
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iafc8efd12379a39756e3e70f022a81a636dadb61
Gerrit-Change-Number: 12163
Gerrit-PatchSet: 1
Gerrit-Owner: Sahil Takiar <stak...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Lars Volker <l...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <stak...@cloudera.com>
Gerrit-Comment-Date: Thu, 24 Jan 2019 15:47:58 +0000
Gerrit-HasComments: No
