Yes. Testing all the timezones is not needed. Xiao
On Mon, Oct 8, 2018 at 8:36 AM Maxim Gekk <maxim.g...@databricks.com> wrote:

> Hi All,
>
> I believe we should also take into account what we test. For example, I
> don't think it makes sense to check all timezones for the JSON/CSV
> functions/datasources, because those timezones are just passed to external
> libraries. The same code is involved in testing each of the 650 timezones,
> so we basically just spend time and resources on testing the external
> libraries.
>
> I mean the PRs: https://github.com/apache/spark/pull/22657 and
> https://github.com/apache/spark/pull/22379#discussion_r223039662
>
> Maxim Gekk
> Technical Solutions Lead
> Databricks B. V. <http://databricks.com/>
>
> On Mon, Oct 8, 2018 at 4:49 PM Sean Owen <sro...@gmail.com> wrote:
>
>> If the problem is simply reducing the wall-clock time of tests, then
>> even before we get to this question, I'm advocating:
>>
>> 1) Try simple parallelization of tests within the suite. In this
>> instance there's no reason not to run these in parallel and get an 8x
>> or 16x speedup from the cores. This assumes, I believe correctly, that
>> the machines aren't generally near 100% utilization.
>>
>> 2) Explicitly choose a smaller, more directed set of cases to test.
>>
>> Randomly choosing test cases with a fixed seed is basically 2), but
>> without choosing test cases for a particular reason. You can vary the
>> seed, but as a rule the same random subset of tests is always chosen.
>> That could be fine if there's no reason at all to prefer some cases
>> over others, but I am guessing any wild guess at the most important
>> subset of cases to test is better than random.
>>
>> I'm trying 1) right now instead in these several cases.
>>
>> On Mon, Oct 8, 2018 at 9:24 AM Xiao Li <lix...@databricks.com> wrote:
>>
>>> For this specific case, I do not think we should test all the
>>> timezones. If this were fast, I would be fine leaving it unchanged.
>>> However, it is very slow, so I would even prefer reducing the tested
>>> timezones to a smaller number, or just hardcoding some specific time
>>> zones.
>>>
>>> In general, I like Reynold's idea of including the seed value and
>>> adding the seed to the test case name. This can help us reproduce
>>> failures.
>>>
>>> Xiao
>>>
>>> On Mon, Oct 8, 2018 at 7:08 AM Reynold Xin <r...@databricks.com> wrote:
>>>
>>>> I'm personally not a big fan of doing it that way in the PR. It is
>>>> perfectly fine to employ randomized tests, and in this case it might
>>>> even be fine to just pick a couple of different timezones the way it
>>>> happened in the PR, but we should:
>>>>
>>>> 1. Document in a code comment why we did it that way.
>>>>
>>>> 2. Use a seed and log the seed, so any test failure can be reproduced
>>>> deterministically. For this one, it'd be better to pick the seed from
>>>> an environment variable. If the env variable is not set, fall back to
>>>> a random seed.
>>>>
>>>> On Mon, Oct 8, 2018 at 3:05 PM Sean Owen <sro...@gmail.com> wrote:
>>>>
>>>>> Recently, I've seen 3 pull requests that try to speed up a test
>>>>> suite that tests a bunch of cases by randomly choosing different
>>>>> subsets of cases to test on each Jenkins run.
>>>>>
>>>>> There's disagreement about whether this is a good approach to
>>>>> improving test runtime. Here's a discussion on one that was
>>>>> committed:
>>>>> https://github.com/apache/spark/pull/22631/files#r223190476
>>>>>
>>>>> I'm flagging it for more input.

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
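For reference, Reynold's suggestion (seed the random subset from an environment variable, fall back to a random seed, and log it so failures reproduce) could be sketched roughly like this. This is a minimal standalone sketch, not the actual Spark test code; the `SPARK_TEST_SEED` variable name and the `SeededSubset` helper are hypothetical:

```scala
import scala.util.Random

object SeededSubset {
  // Resolve the seed from a hypothetical SPARK_TEST_SEED environment
  // variable if set; otherwise fall back to a fresh random seed.
  def resolveSeed(): Long =
    sys.env.get("SPARK_TEST_SEED").map(_.toLong)
      .getOrElse(new Random().nextLong())

  // Deterministic subset for a given seed: shuffle, then take n cases.
  // The same seed always yields the same subset.
  def chooseSubset[T](cases: Seq[T], n: Int, seed: Long): Seq[T] =
    new Random(seed).shuffle(cases).take(n)

  def main(args: Array[String]): Unit = {
    val zones = java.util.TimeZone.getAvailableIDs.toSeq
    val seed = resolveSeed()
    // Logging the seed (e.g. embedding it in the test case name, as
    // suggested above) lets a CI failure be reproduced locally by
    // exporting SPARK_TEST_SEED=<seed>.
    val subset = chooseSubset(zones, 10, seed)
    println(s"Testing ${subset.size} of ${zones.length} timezones with seed=$seed")
  }
}
```

Printing the seed, or putting it in the suite/test name as Xiao suggests, is what makes the otherwise non-deterministic run reproducible.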
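Sean's option 1) — parallelizing the per-timezone cases within a suite — can be sketched with plain `Future`s. This is a simplified illustration, not how Spark's suites actually parallelize (ScalaTest has its own mechanisms); the per-timezone check body is only a placeholder:

```scala
import java.util.TimeZone
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ParallelTimezones {
  // Run a per-timezone check for every zone concurrently (the global
  // execution context fans work out over the available cores) and
  // collect the IDs that fail. The check here is only a placeholder
  // standing in for a real test assertion.
  def failingZones(zones: Seq[String]): Seq[String] = {
    val checks = zones.map { id =>
      Future {
        // getTimeZone falls back to GMT for unknown IDs, so a round-trip
        // mismatch signals a bad ID in this toy check.
        if (TimeZone.getTimeZone(id).getID == id) None else Some(id)
      }
    }
    Await.result(Future.sequence(checks), 2.minutes).flatten
  }

  def main(args: Array[String]): Unit = {
    val zones = TimeZone.getAvailableIDs.toSeq
    val failures = failingZones(zones)
    println(s"${failures.size} failing zones out of ${zones.size}")
  }
}
```

Because each case is independent, this keeps full coverage of all ~650 timezones while dividing the wall-clock time by roughly the core count, which is the appeal of option 1) over testing a random subset.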