[ 
https://issues.apache.org/jira/browse/IMPALA-12796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17817568#comment-17817568
 ] 

ASF subversion and git services commented on IMPALA-12796:
----------------------------------------------------------

Commit 13030f840a23e67c5e9923e8b1abab3b717c106a in impala's branch 
refs/heads/master from Riza Suminto
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=13030f840 ]

IMPALA-12796: Add is_footer_only in TFileSplitGeneratorSpec

Several tests in test_scanners.py failed with wrong row counts on the S3
target filesystem after IMPALA-12631. The S3 filesystem has no blocks, so
the planner produces TFileSplitGeneratorSpec instead of
TScanRangeLocationList, and IMPALA-12631 missed the necessary changes in
TFileSplitGeneratorSpec. Meanwhile, it had already changed the behavior
of hdfs-parquet-scanner.cc: for each scan range, the new code loops over
file_metadata_.row_groups, while the old code took just one entry of
file_metadata_.row_groups after calling NextRowGroup().

This patch addresses the issue by adding an is_footer_only field to
TFileSplitGeneratorSpec and scheduling accordingly in schedule.cc. It
also adds the field 'is_footer_scanner_' in hdfs-columnar-scanner.h to
check that the optimized count star is only applied to the footer range.
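
The overcount described above can be modeled with a short sketch. This
is not Impala code: split_file, count_star_every_range and
count_star_footer_only are hypothetical names, and the file size and
row group sizes are made up. It only assumes that a file is cut into
fixed-size scan ranges and that the footer range is the one ending at
the end of the file.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # (offset, length) in bytes


def split_file(file_length: int, max_scan_range_length: int) -> List[Range]:
    """Cut [0, file_length) into ranges of at most max_scan_range_length bytes."""
    return [
        (offset, min(max_scan_range_length, file_length - offset))
        for offset in range(0, file_length, max_scan_range_length)
    ]


def count_star_every_range(ranges: List[Range], row_group_rows: List[int]) -> int:
    """Model of the regression: every scan range sums all row groups."""
    return sum(sum(row_group_rows) for _ in ranges)


def count_star_footer_only(
    ranges: List[Range], file_length: int, row_group_rows: List[int]
) -> int:
    """Model of the fix: only the range that ends at end-of-file counts rows."""
    total = 0
    for offset, length in ranges:
        is_footer_range = offset + length == file_length  # assumed definition
        if is_footer_range:
            total += sum(row_group_rows)
    return total
```

For a hypothetical 10-byte file with row groups of 10 and 15 rows, cut
into 1-byte ranges (as with max_scan_range_length=1),
count_star_every_range returns 250, while count_star_footer_only
returns the true 25.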

Testing:
- Pass core tests with S3 target filesystem.

Change-Id: Iaa6e3c14debe68cf601131c6594774c8c695923e
Reviewed-on: http://gerrit.cloudera.org:8080/21021
Reviewed-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>


> Several tests in test_scanners.py failed by wrong row counts
> ------------------------------------------------------------
>
>                 Key: IMPALA-12796
>                 URL: https://issues.apache.org/jira/browse/IMPALA-12796
>             Project: IMPALA
>          Issue Type: Bug
>            Reporter: Quanlong Huang
>            Assignee: Riza Suminto
>            Priority: Critical
>             Fix For: Impala 4.4.0
>
>
> Several tests in test_scanners.py failed with wrong row counts. The actual row 
> count is extremely large, e.g.
> {noformat}
> query_test/test_scanners.py:97: in test_scanners
>     self.run_test_case('QueryTest/scanners', new_vector)
> common/impala_test_suite.py:756: in run_test_case
>     self.__verify_results_and_errors(vector, test_section, result, use_db)
> common/impala_test_suite.py:589: in __verify_results_and_errors
>     replace_filenames_with_placeholder)
> common/test_result_verifier.py:487: in verify_raw_results
>     VERIFIER_MAP[verifier](expected, actual)
> common/test_result_verifier.py:296: in verify_query_result_is_equal
>     assert expected_results == actual_results
> E   assert Comparing QueryTestResults (expected vs actual):
> E     100 != 378250{noformat}
> The query is
> {code:sql}
> select count(*) from alltypessmall;{code}
> Console output:
> {noformat}
> SET 
> client_identifier=query_test/test_scanners.py::TestScannersAllTableFormats::()::test_scanners[protocol:beeswax|table_format:parquet/none|exec_option:{'test_replan':1;'batch_size':0;'num_nodes':0;'disable_codegen_rows_threshold':0;'disable_codegen':False;'abort_on_error':1;;
> -- connecting to: localhost:21000
> -- 2024-02-07 11:06:41,035 INFO     MainThread: Could not connect to ('::1', 
> 21000, 0, 0)
> Traceback (most recent call last):
>   File 
> "/data/jenkins/workspace/impala-asf-master-core-s3-data-cache/Impala-Toolchain/toolchain-packages-gcc10.4.0/thrift-0.16.0-p6/python/lib/python2.7/site-packages/thrift/transport/TSocket.py",
>  line 137, in open
>     handle.connect(sockaddr)
>   File 
> "/data/jenkins/workspace/impala-asf-master-core-s3-data-cache/Impala-Toolchain/toolchain-packages-gcc10.4.0/python-2.7.16/lib/python2.7/socket.py",
>  line 228, in meth
>     return getattr(self._sock,name)(*args)
> error: [Errno 111] Connection refused
> -- connecting to localhost:21050 with impyla
> -- 2024-02-07 11:06:41,035 INFO     MainThread: Could not connect to ('::1', 
> 21050, 0, 0)
> Traceback (most recent call last):
>   File 
> "/data/jenkins/workspace/impala-asf-master-core-s3-data-cache/Impala-Toolchain/toolchain-packages-gcc10.4.0/thrift-0.16.0-p6/python/lib/python2.7/site-packages/thrift/transport/TSocket.py",
>  line 137, in open
>     handle.connect(sockaddr)
>   File 
> "/data/jenkins/workspace/impala-asf-master-core-s3-data-cache/Impala-Toolchain/toolchain-packages-gcc10.4.0/python-2.7.16/lib/python2.7/socket.py",
>  line 228, in meth
>     return getattr(self._sock,name)(*args)
> error: [Errno 111] Connection refused
> -- 2024-02-07 11:06:41,049 INFO     MainThread: Closing active operation
> -- connecting to localhost:28000 with impyla
> -- 2024-02-07 11:06:41,073 INFO     MainThread: Closing active operation
> SET 
> client_identifier=query_test/test_scanners.py::TestScannersAllTableFormats::()::test_scanners[protocol:beeswax|table_format:parquet/none|exec_option:{'test_replan':1;'batch_size':0;'num_nodes':0;'disable_codegen_rows_threshold':0;'disable_codegen':False;'abort_on_error':1;;
> -- executing against localhost:21000
> use functional_parquet;
> -- 2024-02-07 11:06:41,085 INFO     MainThread: Started query 
> 8c4dd086f2a4904d:45aea02c00000000
> SET 
> client_identifier=query_test/test_scanners.py::TestScannersAllTableFormats::()::test_scanners[protocol:beeswax|table_format:parquet/none|exec_option:{'test_replan':1;'batch_size':0;'num_nodes':0;'disable_codegen_rows_threshold':0;'disable_codegen':False;'abort_on_error':1;;
> SET test_replan=1;
> SET mt_dop=1;
> SET batch_size=0;
> SET num_nodes=0;
> SET disable_codegen_rows_threshold=0;
> SET disable_codegen=False;
> SET abort_on_error=1;
> SET exec_single_node_rows_threshold=0;
> -- 2024-02-07 11:06:41,086 INFO     MainThread: Loading query test file: 
> /data/jenkins/workspace/impala-asf-master-core-s3-data-cache/repos/Impala/testdata/workloads/functional-query/queries/QueryTest/scanners.test
> -- executing against localhost:21000
> select count(*),
>   sum(id), count(bool_col), sum(tinyint_col), sum(smallint_col),
>   sum(int_col), sum(bigint_col), max(float_col), max(double_col),
>   max(date_string_col), max(string_col), max(timestamp_col)
> from alltypesagg
> where id % 2 = 0 and day is not null;
> -- 2024-02-07 11:06:43,214 INFO     MainThread: Started query 
> fa4256b842b4ca06:d121dbbf00000000
> -- executing against localhost:21000
> select sum(t1.id), sum(t1.int_col),max(t1.date_string_col), max(t2.string_col)
> from alltypesagg t1
> inner join alltypesagg t2
>   on t1.id = t2.id and t1.day is not null and t2.day is not null;
> -- 2024-02-07 11:06:43,805 INFO     MainThread: Started query 
> d54c506ddf7aab85:f8a09ce600000000
> -- executing against localhost:21000
> select id, bool_col, int_col
> from alltypesagg where day is not null
> order by 1 desc, 2 desc, 3 desc
> limit 10;
> -- 2024-02-07 11:06:44,023 INFO     MainThread: Started query 
> 1e4c81bf3989f31f:c3b3b67000000000
> -- executing against localhost:21000
> select count(*)
> from nulltable;
> -- 2024-02-07 11:06:44,131 INFO     MainThread: Started query 
> ba48e2f6736341e1:a21ad19300000000
> -- executing against localhost:21000
> select count(*)
> from nulltable where b = '';
> -- 2024-02-07 11:06:44,191 INFO     MainThread: Started query 
> f6414ed0b07a6ab2:c229fef700000000
> -- executing against localhost:21000
> select a,b
> from nulltable where b = '';
> -- 2024-02-07 11:06:44,300 INFO     MainThread: Started query 
> 3d4db322071343d6:6f502d3f00000000
> -- executing against localhost:21000
> select count(*) from alltypes where rand() * 10 >= 0.0;
> -- 2024-02-07 11:06:44,408 INFO     MainThread: Started query 
> 294652aec4885cad:0849ff0800000000
> -- executing against localhost:21000
> select count(*) from alltypes where rand() * 10 < 0.0;
> -- 2024-02-07 11:06:44,516 INFO     MainThread: Started query 
> 7f466c573914447e:65c6ca1800000000
> -- executing against localhost:21000
> select count(*) from alltypes where rand() - year > month;
> -- 2024-02-07 11:06:44,624 INFO     MainThread: Started query 
> b641073231fd65d1:0f9c184200000000
> -- executing against localhost:21000
> select count(v.x) from alltypestiny t3 left outer join (
>   select true as x from alltypestiny t1 left outer join
>   alltypestiny t2 on (true)) v
> on (v.x = t3.bool_col) where t3.bool_col = true;
> -- 2024-02-07 11:06:44,737 INFO     MainThread: Started query 
> c44facd1194ba102:48e6531a00000000
> -- executing against localhost:21000
> select * from emptytable;
> -- 2024-02-07 11:06:44,955 INFO     MainThread: Started query 
> 404cbcb82628ae02:10b500a900000000
> -- executing against localhost:21000
> set max_scan_range_length=1;
> -- 2024-02-07 11:06:45,010 INFO     MainThread: Started query 
> 30456d0b30ca6353:9d00b58700000000
> -- executing against localhost:21000
> select count(*) from alltypessmall;
> -- 2024-02-07 11:06:45,014 INFO     MainThread: Started query 
> 6f47560afcee03e9:e6a9bc0f00000000
> -- executing against localhost:21000
> SET MAX_SCAN_RANGE_LENGTH="0";
> -- 2024-02-07 11:06:45,675 INFO     MainThread: Started query 
> da4cb660d639198e:3797933000000000
> -- 2024-02-07 11:06:45,688 ERROR    MainThread: Comparing QueryTestResults 
> (expected vs actual):
> 100 != 378250{noformat}



