On Thu, Apr 1, 2010 at 12:03 AM, Shai Erera <ser...@gmail.com> wrote:
> Hi
>
> I'd like to summarize a discussion I had w/ Robert and Mike last night on
> IRC, about the parallelism of tasks in Benchmark:
>
> For some reason, ever since parallel tasks were introduced, when I run 'ant
> test' from the contrib/benchmark folder (or the root), the tests just hang
> at some point, after WriteLineDocTaskTest finishes. What's very weird is
> that it seems I'm the only one experiencing this, and so for a long time I
> thought it's just a problem w/ my environment ... until yesterday when I did
> a fresh checkout of trunk, to a fresh folder and project, and the tests
> still got stuck.
>
> Thread dump does not show anything relevant to Lucene code, but rather to
> Ant. The main thread is waiting on
> org/apache/tools/ant/taskdefs/Parallel.spinThreads, another on
> org/apache/tools/ant/taskdefs/Execute.waitFor and two others on
> java/io/FileInputStream.read. But nothing is related to Lucene code,
> directly. Also annoyingly, but conveniently for debugging that issue, it
> happens very consistently on my machine - sometimes the test passes, but
> 90% of the runs hang.
> Running w/ -Drunsequential=1 consistently succeeds.
>
> We've explored different ways to understand the cause of the problem, and
> came across several improvements and a workaround, but unfortunately not
> a definite resolution:
>
> * As a last resort, we can add a runsequential property to benchmark's
> build.xml, which forces Benchmark tests to run sequentially. Since that's a
> tiny package which takes a few seconds to run anyway, and parallelism
> doesn't improve much (it actually runs slower, when it passes, on my
> machine: parallel=15 sec, seq=11 sec), this might be acceptable.
>
> * Moving the junit temp files (such as that flag file) to the temp
> directory each test uses. This is actually a good thing to do anyway (thanks
> Robert for spotting that), because it avoids accidental commits of such
> files :), and doesn't clutter the main environment. We've done that
> because when I hit Ctrl+C to stop one of the runs which hung, we received an
> FNFE on a junit flag file: "file is being accessed by another process"
> (something like that), and thought this is related to the hangs I'm seeing.
> Anyway, multiple JVMs attempt to access this file concurrently, which seems
> bad.
>
> * Explore the JUnit Formatter code under src/test, since it uses file
> locking. I've disabled locks (using NoLockFactory), however the test still
> hung.
>
> * Change common-build.xml threadsPerProcessor to '1' instead of '2'. We
> think that might be a good thing to do anyway - if people run on machines
> with just one CPU, threading is not expected to help much, as opposed to
> running on multiple CPUs. But we don't want to enforce it on anyone, so
> we're thinking of changing the default to '1', and introducing a property
> 'threadsPerProcessor' which users will be able to set explicitly.
> ** Surprisingly, when I set it to '1' or '10' (I run on a dual-core Thinkpad
> W500), the test consistently passes - it just doesn't like the value '2'. At
> least it passed as long as I ran it; maybe a thread hang is lurking for me
> around the corner somewhere.
>
> * We made sure the benchmark tests indeed read/write the test data files
> from/to unique directories. But like I said - there is no hang in Lucene
> code reported in the thread dump.
>
> It was very late last night when we stopped, and my eyes were tired, so I
> didn't summarize it right away. Robert, I hope I've captured everything we
> did, if not please add.
>
> Anyone's got any suggestions? It's unfortunate that I'm the only one
> running into this problem, because whatever the suggestions are, you'll
> probably need me to confirm them :). And I'm going away for 3 days (camping
> - no internet ... well at least no laptop :)), so unless someone has a
> suggestion within the coming few hours, we can continue that when I get
> back.
>
> Shai
>

I think you got everything.
I reopened the JIRA issue too (LUCENE-1709) and listed the things we can do
for sure now, such as lowering threadsPerProcessor (and allowing someone to
use a system property to override this) and fixing junit temp files to be in
the temp directory.

Additionally I would like to fix the ant library problem as mentioned there.
It works great from the command-line but we should improve this for
IDE-users, so they do not see a compile error.

I am personally for the idea of adding the runsequential property to
benchmark's build.xml, to force it to run serially. While I am unable to
reproduce your problem, it does not surprise me, as I had a tough time trying
to prevent benchmark tests from stepping on each other's toes.

--
Robert Muir
rcm...@gmail.com
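
For reference, the threadsPerProcessor change discussed above could look
roughly like the Ant fragment below. This is only a sketch, not the actual
common-build.xml: the project name, the target name and the echo tasks are
placeholders made up for illustration. It relies on Ant properties being
immutable once set, so a value passed as -DthreadsPerProcessor=N on the
command line takes precedence over the default declared in the build file.

```xml
<project name="parallel-sketch" default="test-sketch">
  <!-- Default of 1. This declaration is ignored if the property was already
       set, e.g. by running "ant -DthreadsPerProcessor=2 test-sketch". -->
  <property name="threadsPerProcessor" value="1"/>

  <!-- Hypothetical target: the echo tasks stand in for the real test runs. -->
  <target name="test-sketch">
    <parallel threadsPerProcessor="${threadsPerProcessor}">
      <echo message="test batch 1"/>
      <echo message="test batch 2"/>
    </parallel>
  </target>
</project>
```

With this layout, someone who wants the old behaviour can pass
-DthreadsPerProcessor=2 without editing any build file, while everyone else
gets the more conservative default.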