[jira] [Commented] (DRILL-4704) select statement behavior is inconsistent for decimal values in parquet

Dave Oshinsky (JIRA) Wed, 13 Jul 2016 07:39:35 -0700

    [ 
https://issues.apache.org/jira/browse/DRILL-4704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15375108#comment-15375108
 ]


Dave Oshinsky commented on DRILL-4704:
--------------------------------------

One more thing to add... In the scenario addressed by this fix, there is a 
const integer that came from the SQL select where clause, and that is what is 
being casted into a decimal, for comparison with each a decimal value in each 
row.   Since the integer is constant, a thoroughly optimized piece of code 
would cast it just once to a decimal, and use the stored decimal to compare 
with the decimal value in each row.  It appears that that is not what this code 
is doing. It is casting repeatedly, for each row.  So, rather than be concerned 
with just a few added integer divisions by 10 (a small number around log10(the 
const integer) + 1 divides), this code should ideally be heavily modified to do 
the cast operation just once.  That's too big a change for me to make for this 
fix.  Besides, which is preferable?  Code that does not function properly 
(DRILL-4704), and is not well optimized?  Or code that behaves correctly, and 
is only slightly less well optimized? 

    On Tuesday, July 12, 2016 2:27 PM, ASF GitHub Bot (JIRA) <[email protected]> 
wrote:
 

 
    [ 
https://issues.apache.org/jira/browse/DRILL-4704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15373401#comment-15373401
 ] 

ASF GitHub Bot commented on DRILL-4704:
---------------------------------------

Github user daveoshinsky commented on a diff in the pull request:

    https://github.com/apache/drill/pull/517#discussion_r70493981
  
    --- Diff: 
exec/java-exec/src/main/codegen/templates/Decimal/CastIntDecimal.java ---
    @@ -64,7 +64,15 @@ public void setup() {
    
        public void eval() {
            out.scale = (int) scale.value;
    -        out.precision = (int) precision.value;
    --- End diff --
    
    I tried doing it in the setup method, and it did not work. It seems the 
integer value to be casted was not available at that time.  Since the number of 
divisions by 10 (these are cheap integer divisions) is proportional to 
log10(the integer number), I don't see how this is expensive, especially 
compared to the expense of reading the rows from whatever device. 
    
        On Tuesday, July 12, 2016 1:56 PM, Aman Sinha 
<[email protected]> wrote:
    
    
    In exec/java-exec/src/main/codegen/templates/Decimal/CastIntDecimal.java:> 
@@ -64,7 +64,15 @@ public void setup() {
    >  
    >      public void eval() {
    >          out.scale = (int) scale.value;
    > -        out.precision = (int) precision.value;
    The eval() method is called for every row, so there will be a performance 
overhead. I agree about the storage cost if we assume worst case precision; 
however the per-row evaluation should be as lightweight as possible and likely 
outweighs the storage consideration. One option is to treat the IntDecimal and 
BigIntDecimal separately. For IntDecimal the max int value is 2^31 - 1 which 
has a max precision of 10. Similarly, there is an upper bound on the 
BigIntDecimal. You could initialize the precision in the setup() method which 
is called once per new schema (in the normal case the schema of that column 
does not change, this would avoid unnecessarily computing the precision). —
    You are receiving this because you authored the thread.
    Reply to this email directly, view it on GitHub, or mute the thread.  
    
      





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)



> select statement behavior is inconsistent for decimal values in parquet
> -----------------------------------------------------------------------
>
>                 Key: DRILL-4704
>                 URL: https://issues.apache.org/jira/browse/DRILL-4704
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Functions - Drill
>    Affects Versions: 1.6.0
>         Environment: Windows 7 Pro, Java 1.8.0_91
>            Reporter: Dave Oshinsky
>             Fix For: Future
>
>
> A select statement that searches a parquet file for a decimal value matching 
> a specific value behaves inconsistently.  The query expressed most simply 
> finds nothing:
> 0: jdbc:drill:zk=local> select * from dfs.`c:/archiveHR/HR.EMPLOYEES` where 
> employee_id = 100;
> +--------------+-------------+------------+--------+---------------+-----------+
> | EMPLOYEE_ID  | FIRST_NAME  | LAST_NAME  | EMAIL  | PHONE_NUMBER  | 
> HIRE_DATE |
> +--------------+-------------+------------+--------+---------------+-----------+
> +--------------+-------------+------------+--------+---------------+-----------+
> No rows selected (0.348 seconds)
> The query can be modified to find the matching row in a few ways, such as the 
> following (using between instead of '=', changing 100 to 100.0, or casting as 
> decimal:
> 0: jdbc:drill:zk=local> select * from dfs.`c:/archiveHR/HR.EMPLOYEES` where 
> employee_id between 100 and 100;
> +--------------+-------------+------------+--------+---------------+-----------+
> | EMPLOYEE_ID  | FIRST_NAME  | LAST_NAME  | EMAIL  | PHONE_NUMBER  |       
> HIR |
> +--------------+-------------+------------+--------+---------------+-----------+
> | 100          | Steven      | King       | SKING  | 515.123.4567  | 
> 2003-06-1 |
> +--------------+-------------+------------+--------+---------------+-----------+
> 1 row selected (0.226 seconds)
> 0: jdbc:drill:zk=local> select * from dfs.`c:/archiveHR/HR.EMPLOYEES` where 
> employee_id = 100.0;
> +--------------+-------------+------------+--------+---------------+-----------+
> | EMPLOYEE_ID  | FIRST_NAME  | LAST_NAME  | EMAIL  | PHONE_NUMBER  |       
> HIR |
> +--------------+-------------+------------+--------+---------------+-----------+
> | 100          | Steven      | King       | SKING  | 515.123.4567  | 
> 2003-06-1 |
> +--------------+-------------+------------+--------+---------------+-----------+
> 1 row selected (0.259 seconds)
> 0: jdbc:drill:zk=local> select * from dfs.`c:/archiveHR/HR.EMPLOYEES` where 
> cast(employee_id AS DECIMAL) = 100;
> +--------------+-------------+------------+--------+---------------+-----------+
> | EMPLOYEE_ID  | FIRST_NAME  | LAST_NAME  | EMAIL  | PHONE_NUMBER  |       
> HIR |
> +--------------+-------------+------------+--------+---------------+-----------+
> | 100          | Steven      | King       | SKING  | 515.123.4567  | 
> 2003-06-1 |
> +--------------+-------------+------------+--------+---------------+-----------+
> 1 row selected (0.232 seconds)
> 0: jdbc:drill:zk=local>
> The schema of the parquet data that is being searched is as follows:
> $ java -jar parquet-tools*1.jar meta c:/archiveHR/HR.EMPLOYEES/1.parquet
> file:           file:/c:/archiveHR/HR.EMPLOYEES/1.parquet
> creator:        parquet-mr version 1.8.1 (build 
> 4aba4dae7bb0d4edbcf7923ae1339f28fd3f7fcf)
> .....
> file schema:    HR.EMPLOYEES
> --------------------------------------------------------------------------------
> EMPLOYEE_ID:    REQUIRED FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:0
> FIRST_NAME:     OPTIONAL BINARY O:UTF8 R:0 D:1
> LAST_NAME:      REQUIRED BINARY O:UTF8 R:0 D:0
> EMAIL:          REQUIRED BINARY O:UTF8 R:0 D:0
> PHONE_NUMBER:   OPTIONAL BINARY O:UTF8 R:0 D:1
> HIRE_DATE:      REQUIRED BINARY O:UTF8 R:0 D:0
> JOB_ID:         REQUIRED BINARY O:UTF8 R:0 D:0
> SALARY:         OPTIONAL FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:1
> COMMISSION_PCT: OPTIONAL FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:1
> MANAGER_ID:     OPTIONAL FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:1
> DEPARTMENT_ID:  OPTIONAL FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:1
> row group 1:    RC:107 TS:9943 OFFSET:4
> --------------------------------------------------------------------------------
> EMPLOYEE_ID:     FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:4 SZ:360/355/0.99 
> VC:107 ENC:PLAIN,BIT_PACKED
> FIRST_NAME:      BINARY SNAPPY DO:0 FPO:364 SZ:902/1058/1.17 VC:107 
> ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED
> LAST_NAME:       BINARY SNAPPY DO:0 FPO:1266 SZ:913/1111/1.22 VC:107 
> ENC:PLAIN,BIT_PACKED
> EMAIL:           BINARY SNAPPY DO:0 FPO:2179 SZ:977/1184/1.21 VC:107 
> ENC:PLAIN,BIT_PACKED
> PHONE_NUMBER:    BINARY SNAPPY DO:0 FPO:3156 SZ:750/1987/2.65 VC:107 
> ENC:PLAIN,RLE,BIT_PACKED
> HIRE_DATE:       BINARY SNAPPY DO:0 FPO:3906 SZ:874/2636/3.02 VC:107 
> ENC:PLAIN_DICTIONARY,BIT_PACKED
> JOB_ID:          BINARY SNAPPY DO:0 FPO:4780 SZ:254/302/1.19 VC:107 
> ENC:PLAIN_DICTIONARY,BIT_PACKED
> SALARY:          FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:5034 SZ:419/580/1.38 
> VC:107 ENC:PLAIN,RLE,BIT_PACKED
> COMMISSION_PCT:  FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:5453 SZ:97/113/1.16 
> VC:107 ENC:PLAIN,RLE,BIT_PACKED
> MANAGER_ID:      FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:5550 SZ:168/363/2.16 
> VC:107 ENC:PLAIN,RLE,BIT_PACKED
> DEPARTMENT_ID:   FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:5718 SZ:94/254/2.70 
> VC:107 ENC:PLAIN,RLE,BIT_PACKED



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (DRILL-4704) select statement behavior is inconsistent for decimal values in parquet

Reply via email to