[ 
https://issues.apache.org/jira/browse/IMPALA-9747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17144096#comment-17144096
 ] 

ASF subversion and git services commented on IMPALA-9747:
---------------------------------------------------------

Commit e6c930a38f54899e66ad83ab88d886dcd4c869f9 in impala's branch 
refs/heads/master from Daniel Becker
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=e6c930a ]

IMPALA-9747: More fine-grained codegen for text file scanners

Currently if the materialization of any column cannot be codegen'd
because its type is unsupported (e.g. CHAR(N)), the whole codegen is
cancelled for the text scanner.

This commit adds the function TextConverter::SupportsCodegenWriteSlot
that returns whether the given ColumnType is supported. If the type is
not supported, HdfsScanner codegens code that calls the interpreted
version instead of failing codegen. For other columns codegen is used as
usual.
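
The per-column fallback described above can be sketched roughly as follows.
This is a simplified illustration, not the actual Impala code: ColType,
WritePath and PlanWritePaths are hypothetical names, the signature of
SupportsCodegenWriteSlot is reduced to a plain enum argument, and CHAR is
the only unsupported type assumed here because it is the one named above.

```cpp
#include <vector>

// Hypothetical, simplified stand-in for Impala's ColumnType.
enum class ColType { INT, DOUBLE, STRING, CHAR };

// Hypothetical tag for how a single column's slot gets written.
enum class WritePath { CODEGEN, INTERPRETED };

// Illustrates the idea behind TextConverter::SupportsCodegenWriteSlot:
// report whether specialised code can be generated for a column type.
// Assumption: only CHAR is treated as unsupported in this sketch.
bool SupportsCodegenWriteSlot(ColType type) {
  return type != ColType::CHAR;
}

// Chooses a write path per column instead of giving up on codegen for
// the whole scanner as soon as one unsupported column is seen.
std::vector<WritePath> PlanWritePaths(const std::vector<ColType>& columns) {
  std::vector<WritePath> paths;
  paths.reserve(columns.size());
  for (ColType type : columns) {
    paths.push_back(SupportsCodegenWriteSlot(type) ? WritePath::CODEGEN
                                                   : WritePath::INTERPRETED);
  }
  return paths;
}
```

With columns {INT, CHAR, STRING}, only the CHAR column is routed to the
interpreted path; the other two keep the codegen'd path.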

Benchmarks:
  Copied and modified a TPCH table with scale factor 5 to add a CHAR
  column to it:

    USE tpch5;
    CREATE TABLE IF NOT EXISTS lineitem_char AS
    SELECT *, CAST(l_shipdate AS CHAR(10)) l_shipdate_char
    FROM lineitem;

  Ran the following query 100 times after one warm-up run with and
  without this change:

    SELECT *
    FROM tpch5.lineitem_char
    WHERE
      l_partkey BETWEEN 500 AND 500000 AND
      l_linestatus = 'F' AND
      l_quantity < 35 AND
      l_extendedprice BETWEEN 2000 AND 8000 AND
      l_discount > 0 AND
      l_tax BETWEEN 0.04 AND 0.06 AND
      l_returnflag IN ('A', 'N') AND
      l_shipdate_char < '1996-06-20'
    ORDER BY l_shipdate_char
    LIMIT 10;

  Without this commit: mean: 2.92, standard deviation: 0.13.
  With this commit:    mean: 2.21, standard deviation: 0.072.

Testing:
  The interesting cases regarding char are covered in
  https://github.com/apache/impala/blob/0167c5b4242fcebf6be19aba5ecfb440204278ad/testdata/workloads/functional-query/queries/QueryTest/chars.test

Change-Id: Id370193af578ecf23ed3c6bfcc65fec448156fa3
Reviewed-on: http://gerrit.cloudera.org:8080/16059
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>


> More fine-grained codegen for text file scanners
> ------------------------------------------------
>
>                 Key: IMPALA-9747
>                 URL: https://issues.apache.org/jira/browse/IMPALA-9747
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend
>            Reporter: Csaba Ringhofer
>            Assignee: Daniel Becker
>            Priority: Major
>
> Currently if the materialization of any column cannot be codegen'd for some 
> reason (e.g. it is CHAR(N)), then the whole codegen is cancelled for the text 
> scanner, see:
> https://github.com/apache/impala/blob/b5805de3e65fd1c7154e4169b323bb38ddc54f4f/be/src/exec/text-converter.cc#L112
> https://github.com/apache/impala/blob/58273fff601dcc763ac43f7cc275a174a2e18b6b/be/src/exec/hdfs-scanner.cc#L342
> It would be much better to use the non-codegen'd path only for the problematic 
> columns, use the codegen'd materialization for the rest, and always do 
> conjunct evaluation with codegen.
> The codegen'd path orders slots based on the conjuncts that use them and 
> evaluates each conjunct as soon as the slots it needs become available, so if 
> the row is dropped then the rest of the slots do not need to be materialized. A 
> simple solution would be to always do non-codegen'd slot materialization first 
> so that those slots are ready if a conjunct needs them. Moving the columns that 
> are not used by conjuncts to the end could be a further optimization.
> This came up during the materialization of BINARY columns, which need 
> base64 decoding during materialization.
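
The slot ordering proposed in the issue description can be sketched as
below. This only illustrates the proposal; it is not taken from the
Impala sources and the commit may order slots differently. SlotInfo,
MaterializationRank and OrderSlots are hypothetical names.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical description of one slot to materialize.
struct SlotInfo {
  int id;
  bool codegen_supported;  // can the write be codegen'd?
  bool used_by_conjunct;   // does any conjunct read this slot?
};

// Lower rank is materialized earlier:
//   0: non-codegen'd slots that a conjunct needs (must be ready first)
//   1: codegen'd slots (their internal, conjunct-driven order is kept)
//   2: non-codegen'd slots no conjunct reads (the "further optimization")
int MaterializationRank(const SlotInfo& slot) {
  if (slot.codegen_supported) return 1;
  return slot.used_by_conjunct ? 0 : 2;
}

// A stable sort keeps the existing relative order within each rank.
std::vector<SlotInfo> OrderSlots(std::vector<SlotInfo> slots) {
  std::stable_sort(slots.begin(), slots.end(),
                   [](const SlotInfo& a, const SlotInfo& b) {
                     return MaterializationRank(a) < MaterializationRank(b);
                   });
  return slots;
}
```

Dropping rank 2 (treating every non-codegen'd slot as rank 0) gives the
"simple solution" from the description.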



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
