If you'd like to contribute a patch to Impala, but aren't sure what you want to work on, you can look at Impala's newbie issues: https://issues.apache.org/jira/issues/?filter=12341668. You can find detailed instructions on submitting patches at https://cwiki.apache.org/confluence/display/IMPALA/Contributing+to+Impala. This is a walkthrough of a ticket a new contributor could take on, with hopefully enough detail to get you going but not so much as to take away the fun.
How can we fix https://issues.apache.org/jira/browse/IMPALA-5440, "Add planner tests with extreme statistics values"? The comments on the ticket suggest a number of approaches, some of them rather ambitious for a new contributor, so let's talk about a smaller chunk of it.
This ticket was filed in response to https://issues.apache.org/jira/browse/IMPALA-5282, in which an arithmetic overflow caused an exception in the frontend (which does parsing, analysis, and planning for queries). Take a look at the patch that fixed the issue, https://gerrit.cloudera.org/#/c/7084. It doesn't include any new tests, which is why IMPALA-5440 was filed. You can see this in the comments on the patch: "For now, I feel pretty good about the computePerHostResources() with respect to overflow since I read all the code carefully. We should still have tests to not break it sometime later. I filed IMPALA-5440 to address the long-standing bug in test coverage."
Reading the comments on a patch is a good way to understand why something in Impala is the way it is. All recent Impala patches have a line at the bottom of the commit message with the URL of the code review, so you can do archaeology for information that wasn't included in the patch itself. All code review comments are also sent to https://lists.apache.org/[email protected], which you can subscribe to in the same way you subscribed to this list, by mailing [email protected].
In this case, the problem to address is arithmetic overflow in the frontend. The previous patch shows many places where overflow is checked, and you may be able to add new tests for each line in that patch. For now, let's just work on two categories of overflow: cardinality estimates and memory estimates. To execute a query efficiently, Impala's planner estimates the number of rows that will be produced by different parts of the query. If a cardinality estimate overflows, the planner will estimate a negative number of rows!
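If it helps to see why an overflowed estimate comes out negative, here is a small standalone Java sketch. It is not Impala's code: the class and method names are made up for illustration, and the row counts are arbitrary. It compares plain long multiplication, which silently wraps around, with a saturating version built on the standard Math.multiplyExact.

```java
public class CardinalityOverflowSketch {
    // Plain long multiplication: on overflow the result silently wraps around.
    static long naiveMultiply(long a, long b) {
        return a * b;
    }

    // Saturating multiplication, intended for non-negative inputs such as row
    // counts: Math.multiplyExact throws ArithmeticException on overflow, and
    // in that case we clamp the result to Long.MAX_VALUE.
    static long saturatingMultiply(long a, long b) {
        try {
            return Math.multiplyExact(a, b);
        } catch (ArithmeticException e) {
            return Long.MAX_VALUE;
        }
    }

    public static void main(String[] args) {
        // Hypothetical row counts, chosen so the true product (1.6e19)
        // exceeds Long.MAX_VALUE (about 9.22e18).
        long leftRows = 4_000_000_000L;
        long rightRows = 4_000_000_000L;
        System.out.println("naive:      " + naiveMultiply(leftRows, rightRows));
        System.out.println("saturating: " + saturatingMultiply(leftRows, rightRows));
    }
}
```

The naive product prints as a negative number, while the saturating one stays pinned at the largest long value. A planner test for overflow is essentially checking that estimates behave like the second method and never like the first.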
To see if you can get arithmetic overflow, start up impala-shell.sh and set explain_level=2. This will show the planner's estimates of the number of rows each part of a query produces. Then explain the plans for some cross joins:
  use tpch;
  explain select * from lineitem a;
  explain select * from lineitem a, lineitem b;
  explain select * from lineitem a, lineitem b, lineitem c;
  ...
At some point in that sequence, you will see that the cardinality estimate reaches a ceiling, even though those queries would actually produce more and more rows with each cross join. This is because the overflow check is working and capping the cardinality estimate at the largest long value, 2^63 - 1.
To see how to test this, take a look at fe/src/test/java/org/apache/impala/planner/PlannerTest.java. Each of the tests in that file references a file in testdata/workloads/functional-planner/queries/PlannerTest/. To find a test that can check that cardinality is bounded, search for the string "cardinality" in the PlannerTest directory. Find the corresponding test method in PlannerTest.java, then write a similar test file and test method.
Have fun, and don't hesitate to ask on [email protected] if you get stuck and need help!
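As a footnote to the cross-join experiment above, here is a rough Java sketch of the arithmetic behind the ceiling. It is only a model, not what the planner actually runs: the class and method names are made up, and the row count of 6,001,215 for lineitem is an assumption (the TPC-H scale factor 1 size), so check it against your own data.

```java
public class CrossJoinCeilingSketch {
    // Multiply two non-negative row counts, clamping to Long.MAX_VALUE
    // instead of wrapping around when the product overflows.
    static long saturatingMultiply(long a, long b) {
        try {
            return Math.multiplyExact(a, b);
        } catch (ArithmeticException e) {
            return Long.MAX_VALUE;
        }
    }

    // Returns the modeled cardinality of cross-joining 1, 2, ..., maxTables
    // copies of a table with tableRows rows (index i holds i + 1 copies).
    static long[] crossJoinEstimates(long tableRows, int maxTables) {
        long[] estimates = new long[maxTables];
        long running = 1;
        for (int i = 0; i < maxTables; i++) {
            running = saturatingMultiply(running, tableRows);
            estimates[i] = running;
        }
        return estimates;
    }

    public static void main(String[] args) {
        long lineitemRows = 6_001_215L; // assumed row count, see note above
        long[] estimates = crossJoinEstimates(lineitemRows, 4);
        for (int i = 0; i < estimates.length; i++) {
            System.out.println((i + 1) + " table(s): " + estimates[i]);
        }
    }
}
```

Under that assumed row count, the two-table product still fits in a long, but the three-table product (roughly 2.16e20) does not, so the modeled estimate stays at 2^63 - 1 from the third table onward.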
