[
https://issues.apache.org/jira/browse/DRILL-4704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15412329#comment-15412329
]
ASF GitHub Bot commented on DRILL-4704:
---------------------------------------
Github user daveoshinsky commented on the issue:
https://github.com/apache/drill/pull/517
The overflow problem will require separate changes, apart from any short-term
changes we make to fix this issue (see DRILL-4834). However, if I understand
some of Paul's earlier findings, the two issues are related as follows. If we
choose a large precision based only on the kind of integer that is input to
the cast (either an int or a long, rather than the minimum needed to represent
the integer value being cast), then we are more likely to encounter the
overflow issue, because this precision is more likely to exceed the capacity of
the destination decimal of the cast. Is that right, Paul? Given the risk of
overflow, it is safer to choose the minimum precision possible, which is what
my changes to CastIntDecimal.java do. We can't avoid overflow in every
situation, but we can try to minimize the chances of it. My understanding is
that the actual integer value being cast isn't available in
ExpressionTreeMaterializer.java; at least, I did not see how to get it. That
integer value is required to compute the minimum precision needed to represent
it, so, as I said earlier, I don't see how to fix this problem reliably in
ExpressionTreeMaterializer.java. If you (Jinfeng) know how to do that, why
don't you take the changes in PR-517 and then check in changes on top of those
to fix the problem your way? At the least, you get my unit test, which covers a
condition (DRILL-4704) that no earlier unit test exercised.
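
To make the "minimum precision" idea above concrete, here is a minimal Java
sketch. It is not the code in CastIntDecimal.java or PR-517; the class and
method names are hypothetical, and it assumes a scale of 0. It only shows why
the actual integer value is needed: the precision is the number of decimal
digits in that value.

```java
// Hypothetical illustration only; not taken from Drill's source.
final class MinPrecision {

    /** Returns the number of decimal digits needed to represent value (at least 1). */
    static int minPrecision(long value) {
        int digits = 1;
        // Divide the signed value directly; this avoids Math.abs(Long.MIN_VALUE) overflowing.
        while (value / 10 != 0) {
            value /= 10;
            digits++;
        }
        return digits;
    }

    public static void main(String[] args) {
        System.out.println(minPrecision(100L));              // 3
        System.out.println(minPrecision(Integer.MAX_VALUE)); // 10
        System.out.println(minPrecision(Long.MIN_VALUE));    // 19
    }
}
```

A precision chosen from the input type alone would correspond to the last two
lines (10 digits for any int, 19 for any long), regardless of the value.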
On Monday, August 8, 2016 3:25 PM, Jinfeng Ni
<[email protected]> wrote:
I'm not fully convinced that we should check the precision for each input
value in this castIntDecimal function. The argument of the proposed patch is
that the parameter "precision=0" is not valid and has to be calculated on the
fly, or "dynamically", for each input value. To me: 1) if "precision=0" is a
wrong input to this function, then we should fix the place where precision=0 is
passed in; 2) why would you treat only "precision=0"? What if I pass in
"precision=1" and an integer value of 123456? In that case, do we think
"precision=1" is valid or not? Regarding the overflow problem, that seems to be
a separate issue, and it probably applies to most of the existing decimal
functions.
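
The question raised above, whether a declared precision is valid for a given
integer, can be sketched as a simple check. This is a hypothetical helper, not
part of Drill; it assumes scale 0 and uses java.math.BigDecimal only to count
digits.

```java
// Hypothetical illustration only; not taken from Drill's source.
final class PrecisionCheck {

    /** Returns true when value has at most `precision` decimal digits (scale 0 assumed). */
    static boolean fits(long value, int precision) {
        // BigDecimal.precision() is the digit count of the unscaled value; zero has precision 1.
        return precision > 0
                && java.math.BigDecimal.valueOf(value).precision() <= precision;
    }

    public static void main(String[] args) {
        System.out.println(fits(100L, 0));    // false: precision 0 holds no digits
        System.out.println(fits(123456L, 1)); // false: six digits do not fit in one
        System.out.println(fits(123456L, 6)); // true
    }
}
```

Under this check, "precision=0" and "precision=1 with 123456" fail for the same
reason, which is the point of the second question.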
> select statement behavior is inconsistent for decimal values in parquet
> -----------------------------------------------------------------------
>
> Key: DRILL-4704
> URL: https://issues.apache.org/jira/browse/DRILL-4704
> Project: Apache Drill
> Issue Type: Bug
> Components: Functions - Drill
> Affects Versions: 1.6.0
> Environment: Windows 7 Pro, Java 1.8.0_91
> Reporter: Dave Oshinsky
> Fix For: Future
>
>
> A select statement that searches a parquet file for a decimal value matching
> a specific value behaves inconsistently. The query expressed most simply
> finds nothing:
> 0: jdbc:drill:zk=local> select * from dfs.`c:/archiveHR/HR.EMPLOYEES` where
> employee_id = 100;
> +--------------+-------------+------------+--------+---------------+-----------+
> | EMPLOYEE_ID  | FIRST_NAME  | LAST_NAME  | EMAIL  | PHONE_NUMBER  | HIRE_DATE |
> +--------------+-------------+------------+--------+---------------+-----------+
> +--------------+-------------+------------+--------+---------------+-----------+
> No rows selected (0.348 seconds)
> The query can be modified to find the matching row in a few ways, such as the
> following (using between instead of '=', changing 100 to 100.0, or casting as
> decimal):
> 0: jdbc:drill:zk=local> select * from dfs.`c:/archiveHR/HR.EMPLOYEES` where
> employee_id between 100 and 100;
> +--------------+-------------+------------+--------+---------------+-----------+
> | EMPLOYEE_ID  | FIRST_NAME  | LAST_NAME  | EMAIL  | PHONE_NUMBER  | HIR       |
> +--------------+-------------+------------+--------+---------------+-----------+
> | 100          | Steven      | King       | SKING  | 515.123.4567  | 2003-06-1 |
> +--------------+-------------+------------+--------+---------------+-----------+
> 1 row selected (0.226 seconds)
> 0: jdbc:drill:zk=local> select * from dfs.`c:/archiveHR/HR.EMPLOYEES` where
> employee_id = 100.0;
> +--------------+-------------+------------+--------+---------------+-----------+
> | EMPLOYEE_ID  | FIRST_NAME  | LAST_NAME  | EMAIL  | PHONE_NUMBER  | HIR       |
> +--------------+-------------+------------+--------+---------------+-----------+
> | 100          | Steven      | King       | SKING  | 515.123.4567  | 2003-06-1 |
> +--------------+-------------+------------+--------+---------------+-----------+
> 1 row selected (0.259 seconds)
> 0: jdbc:drill:zk=local> select * from dfs.`c:/archiveHR/HR.EMPLOYEES` where
> cast(employee_id AS DECIMAL) = 100;
> +--------------+-------------+------------+--------+---------------+-----------+
> | EMPLOYEE_ID  | FIRST_NAME  | LAST_NAME  | EMAIL  | PHONE_NUMBER  | HIR       |
> +--------------+-------------+------------+--------+---------------+-----------+
> | 100          | Steven      | King       | SKING  | 515.123.4567  | 2003-06-1 |
> +--------------+-------------+------------+--------+---------------+-----------+
> 1 row selected (0.232 seconds)
> 0: jdbc:drill:zk=local>
> The schema of the parquet data that is being searched is as follows:
> $ java -jar parquet-tools*1.jar meta c:/archiveHR/HR.EMPLOYEES/1.parquet
> file: file:/c:/archiveHR/HR.EMPLOYEES/1.parquet
> creator: parquet-mr version 1.8.1 (build 4aba4dae7bb0d4edbcf7923ae1339f28fd3f7fcf)
> .....
> file schema: HR.EMPLOYEES
> --------------------------------------------------------------------------------
> EMPLOYEE_ID: REQUIRED FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:0
> FIRST_NAME: OPTIONAL BINARY O:UTF8 R:0 D:1
> LAST_NAME: REQUIRED BINARY O:UTF8 R:0 D:0
> EMAIL: REQUIRED BINARY O:UTF8 R:0 D:0
> PHONE_NUMBER: OPTIONAL BINARY O:UTF8 R:0 D:1
> HIRE_DATE: REQUIRED BINARY O:UTF8 R:0 D:0
> JOB_ID: REQUIRED BINARY O:UTF8 R:0 D:0
> SALARY: OPTIONAL FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:1
> COMMISSION_PCT: OPTIONAL FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:1
> MANAGER_ID: OPTIONAL FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:1
> DEPARTMENT_ID: OPTIONAL FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:1
> row group 1: RC:107 TS:9943 OFFSET:4
> --------------------------------------------------------------------------------
> EMPLOYEE_ID: FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:4 SZ:360/355/0.99 VC:107 ENC:PLAIN,BIT_PACKED
> FIRST_NAME: BINARY SNAPPY DO:0 FPO:364 SZ:902/1058/1.17 VC:107 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED
> LAST_NAME: BINARY SNAPPY DO:0 FPO:1266 SZ:913/1111/1.22 VC:107 ENC:PLAIN,BIT_PACKED
> EMAIL: BINARY SNAPPY DO:0 FPO:2179 SZ:977/1184/1.21 VC:107 ENC:PLAIN,BIT_PACKED
> PHONE_NUMBER: BINARY SNAPPY DO:0 FPO:3156 SZ:750/1987/2.65 VC:107 ENC:PLAIN,RLE,BIT_PACKED
> HIRE_DATE: BINARY SNAPPY DO:0 FPO:3906 SZ:874/2636/3.02 VC:107 ENC:PLAIN_DICTIONARY,BIT_PACKED
> JOB_ID: BINARY SNAPPY DO:0 FPO:4780 SZ:254/302/1.19 VC:107 ENC:PLAIN_DICTIONARY,BIT_PACKED
> SALARY: FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:5034 SZ:419/580/1.38 VC:107 ENC:PLAIN,RLE,BIT_PACKED
> COMMISSION_PCT: FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:5453 SZ:97/113/1.16 VC:107 ENC:PLAIN,RLE,BIT_PACKED
> MANAGER_ID: FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:5550 SZ:168/363/2.16 VC:107 ENC:PLAIN,RLE,BIT_PACKED
> DEPARTMENT_ID: FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:5718 SZ:94/254/2.70 VC:107 ENC:PLAIN,RLE,BIT_PACKED