Agree with Sean again; I am seeing these race conditions with HDFS very consistently.
See below:

testDistributedLanczosSolverEVJCLI(org.apache.mahout.math.hadoop.decomposer.TestDistributedLanczosSolverCLI)  Time elapsed: 66.102 sec  <<< ERROR!
java.lang.IllegalStateException: java.io.IOException: The distributed cache object file:/tmp/mahout-TestDistributedLanczosSolverCLI-7471344972447511552/tmp2/1370579360893548000/DistributedMatrix.times.inputVector/1370579360894119000 changed during the job from 6/7/13 12:29 AM to 6/7/13 12:29 AM
    at org.apache.hadoop.filecache.TrackerDistributedCacheManager.downloadCacheObject(TrackerDistributedCacheManager.java:401)
    at org.apache.hadoop.filecache.TrackerDistributedCacheManager.localizePublicCacheObject(TrackerDistributedCacheManager.java:475)
    at org.apache.hadoop.filecache.TrackerDistributedCacheManager.getLocalCache(TrackerDistributedCacheManager.java:191)
    at org.apache.hadoop.filecache.TaskDistributedCacheManager.setupCache(TaskDistributedCacheManager.java:182)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.<init>(LocalJobRunner.java:124)
    at org.apache.hadoop.mapred.LocalJobRunner.submitJob(LocalJobRunner.java:437)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:983)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:912)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1149)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:912)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:886)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1323)
    at org.apache.mahout.math.hadoop.DistributedRowMatrix.timesSquared(DistributedRowMatrix.java:279)
    at org.apache.mahout.math.decomposer.lanczos.LanczosSolver.solve(LanczosSolver.java:110)
    at org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver.run(DistributedLanczosSolver.java:207)
    at org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver.run(DistributedLanczosSolver.java:159)
    at org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver.run(DistributedLanczosSolver.java:118)
    at org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver$DistributedLanczosSolverJob.run(DistributedLanczosSolver.java:290)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.mahout.math.hadoop.decomposer.TestDistributedLanczosSolverCLI.testDistributedLanczosSolverEVJCLI(TestDistributedLanczosSolverCLI.java:128)

completeJobToyExample(org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJobTest)  Time elapsed: 141.7 sec  <<< ERROR!
java.io.IOException: The distributed cache object file:///Users/smarthi/opensourceprojects/Mahout/core/target/mahout-ParallelALSFactorizationJobTest-7107915352722998272/tmp/U-0/part-m-00000 changed during the job from 6/7/13 12:19 AM to 6/7/13 12:19 AM
    at org.apache.hadoop.filecache.TrackerDistributedCacheManager.downloadCacheObject(TrackerDistributedCacheManager.java:401)
    at org.apache.hadoop.filecache.TrackerDistributedCacheManager.localizePublicCacheObject(TrackerDistributedCacheManager.java:475)
    at org.apache.hadoop.filecache.TrackerDistributedCacheManager.getLocalCache(TrackerDistributedCacheManager.java:191)
    at org.apache.hadoop.filecache.TaskDistributedCacheManager.setupCache(TaskDistributedCacheManager.java:182)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.<init>(LocalJobRunner.java:124)
    at org.apache.hadoop.mapred.LocalJobRunner.submitJob(LocalJobRunner.java:437)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:983)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:912)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1149)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:912)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:500)
    at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:530)
    at org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob.runSolver(ParallelALSFactorizationJob.java:329)
    at org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob.run(ParallelALSFactorizationJob.java:188)
    at org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJobTest.explicitExample(ParallelALSFactorizationJobTest.java:105)
    at org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJobTest.completeJobToyExample(ParallelALSFactorizationJobTest.java:64)

I am reverting back to the old configuration to get a clean build, not to mention that my Macbook Pro (i7 processor, 4-core, 8GB RAM, 5400rpm HDD) is being deep fried :) when I attempt running tests.
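If we do want to keep the parallel runs, one direction might be to give every test class its own Hadoop scratch space instead of the shared /tmp defaults. The sketch below is only illustrative: the IsolatedTestConf class and its create() helper do not exist in our tree, and I am assuming the standard Hadoop 1.x property names (hadoop.tmp.dir, mapred.local.dir, mapred.system.dir). I am not sure it cures the distributed-cache timestamp check by itself, but it should at least stop parallel test classes from colliding on the same local paths.

import java.io.File;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;

/**
 * Sketch only (hypothetical, not in our tree): point each test class at a
 * private scratch directory so that test classes running in parallel do not
 * share hadoop.tmp.dir / mapred.local.dir under /tmp.
 */
public final class IsolatedTestConf {

  private IsolatedTestConf() {}

  public static Configuration create(String testName) throws IOException {
    File scratch = new File(System.getProperty("java.io.tmpdir"),
        "mahout-" + testName + '-' + System.nanoTime());
    if (!scratch.mkdirs()) {
      throw new IOException("Could not create " + scratch);
    }
    Configuration conf = new Configuration();
    // Keep local-mode state (staging dirs, distributed cache copies, task
    // output) under the per-test scratch directory instead of shared defaults.
    conf.set("hadoop.tmp.dir", scratch.getAbsolutePath());
    conf.set("mapred.local.dir", new File(scratch, "mapred-local").getAbsolutePath());
    conf.set("mapred.system.dir", new File(scratch, "mapred-system").getAbsolutePath());
    return conf;
  }
}

Tests would then build their jobs from something like IsolatedTestConf.create(getClass().getSimpleName()) rather than a bare new Configuration().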
________________________________
From: Sean Owen <sro...@gmail.com>
To: Mahout Dev List <dev@mahout.apache.org>
Sent: Friday, June 7, 2013 8:51 AM
Subject: Re: Random Errors

Having looked at it recently -- no, the parallelism is per-class, just for this reason. I suspect the problem is a race condition vis-a-vis HDFS. Usually some operation like a delete is visible a moment later when a job starts, but maybe not always. It could also be some internal source of randomness somewhere in a library that can't be controlled externally, but I find that an unlikely explanation for this.

On Fri, Jun 7, 2013 at 1:03 PM, Sebastian Schelter <ssc.o...@googlemail.com> wrote:
> I'm also getting errors on a test when executing all tests. Don't get
> the error when I run the test in the IDE or via mvn on the command line.
>
> Do we now also have intra-test-class parallelism? If yes, is there a way
> to disable this?
>
> --sebastian
>
>
> On 07.06.2013 09:11, Ted Dunning wrote:
>> This last one is actually more like a non-deterministic test that probably
>> needs a restart strategy to radically decrease the probability of failure
>> or needs a slightly more relaxed threshold.
>>
>> On Fri, Jun 7, 2013 at 7:32 AM, Grant Ingersoll <gsing...@apache.org> wrote:
>>
>>> Here's another one:
>>> testClustering(org.apache.mahout.clustering.streaming.cluster.BallKMeansTest)  Time elapsed: 2.817 sec  <<< FAILURE!
>>> java.lang.AssertionError: expected:<625.0> but was:<753.0>
>>>     at org.junit.Assert.fail(Assert.java:88)
>>>     at org.junit.Assert.failNotEquals(Assert.java:743)
>>>     at org.junit.Assert.assertEquals(Assert.java:494)
>>>     at org.junit.Assert.assertEquals(Assert.java:592)
>>>     at org.apache.mahout.clustering.streaming.cluster.BallKMeansTest.testClustering(BallKMeansTest.java:119)
>>>
>>> I suspect that we still have issues with the parallel testing, as it doesn't
>>> show up in repeated runs and it isn't consistent.
>>>
>>> On Jun 7, 2013, at 6:10 AM, Grant Ingersoll <gsing...@apache.org> wrote:
>>>
>>>> testTranspose(org.apache.mahout.math.hadoop.TestDistributedRowMatrix)  Time elapsed: 1.569 sec  <<< ERROR!
>>>> org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory file:/tmp/mahout-TestDistributedRowMatrix-8146721276637462528/testdata/transpose-24 already exists
>>>>     at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:121)
>>>>     at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:951)
>>>>     at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:912)
>>>>     at java.security.AccessController.doPrivileged(Native Method)
>>>>     at javax.security.auth.Subject.doAs(Subject.java:396)
>>>>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1149)
>>>>     at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:912)
>>>>     at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:886)
>>>>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1323)
>>>>     at org.apache.mahout.math.hadoop.DistributedRowMatrix.transpose(DistributedRowMatrix.java:238)
>>>>     at org.apache.mahout.math.hadoop.TestDistributedRowMatrix.testTranspose(TestDistributedRowMatrix.java:87)
>>>>
>>>> Anyone seen this? I'm guessing there are some conflicts due to the order
>>>> the methods are run in.
>>>
>>> --------------------------------------------
>>> Grant Ingersoll | @gsingers
>>> http://www.lucidworks.com
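On Ted's point above about a restart strategy for the genuinely non-deterministic tests (like the BallKMeans expected:<625.0> but was:<753.0> failure), a JUnit rule along the lines below might push the failure probability down without relaxing the threshold. Again just a sketch, assuming JUnit 4.9+ for TestRule; the Retry class is hypothetical, not something we already ship.

import org.junit.rules.TestRule;
import org.junit.runner.Description;
import org.junit.runners.model.Statement;

/**
 * Sketch of a restart strategy: rerun a flaky test a few times and fail only
 * if every attempt fails. Meant for tests whose per-run failure probability
 * is small but nonzero (e.g. randomized clustering quality checks).
 * Assumes maxAttempts >= 1.
 */
public class Retry implements TestRule {

  private final int maxAttempts;

  public Retry(int maxAttempts) {
    this.maxAttempts = maxAttempts;
  }

  @Override
  public Statement apply(final Statement base, final Description description) {
    return new Statement() {
      @Override
      public void evaluate() throws Throwable {
        Throwable last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
          try {
            base.evaluate();
            return; // passed on this attempt
          } catch (Throwable t) {
            last = t;
            System.err.println(description.getDisplayName()
                + ": attempt " + attempt + " of " + maxAttempts + " failed");
          }
        }
        throw last; // every attempt failed
      }
    };
  }
}

A flaky test would then declare @Rule public Retry retry = new Retry(3); assuming independent runs, a 1-in-20 per-run failure rate drops to roughly (1/20)^3, i.e. about 1 in 8000.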