We're experiencing a recurring problem and wondering if others have seen this:
We're mapping a table with about 2 million rows across 100 regions on 40 nodes. In each map, we do a random read against the same table. We're encountering a situation that looks a lot like deadlock.

When the job is launched, some of the tasktrackers appear to block on the first random read. The only trace we get is an eventual UnknownScannerException in the RegionServer log, at which point the task is actually reported as successfully completed by MapReduce (1 row processed). There is no error in the task's log, and the job completes as SUCCESSFUL with an incomplete number of rows. In the worst case, we've seen ALL the tasktrackers hit this problem: the job completes successfully with 100 rows processed (1 per region).

When we remove the code that does the random read in the map, there are no problems.

Anyone? This is driving me crazy because I can't reproduce it locally (it only seems to be a problem in a distributed environment with many nodes) and because there is no stacktrace besides the scanner exception, which is clearly a symptom, not a cause.

j
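P.S. For clarity, the map is essentially doing the following. This is an illustrative sketch only, not our actual code; it assumes the org.apache.hadoop.hbase.mapreduce.TableMapper API, and the table name and row-key scheme are made up:

```java
import java.io.IOException;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: a mapper fed by a scan over a table that also issues a
// random Get against that SAME table from inside map().
public class RandomReadMapper extends TableMapper<ImmutableBytesWritable, Result> {

  private HTable table;  // client handle to the same table the job is scanning

  @Override
  protected void setup(Context context) throws IOException {
    // "the_table" is a placeholder name
    table = new HTable(context.getConfiguration(), "the_table");
  }

  @Override
  protected void map(ImmutableBytesWritable row, Result value, Context context)
      throws IOException, InterruptedException {
    // The random read happens while the scanner for this region is
    // still open -- this is where the tasks appear to block.
    byte[] randomRow = pickRandomRow();
    Result r = table.get(new Get(randomRow));
    // ... process r and the scanned row, then emit ...
  }

  // Illustrative row-key scheme: ~2M rows keyed "row-0" .. "row-1999999"
  private byte[] pickRandomRow() {
    return Bytes.toBytes("row-" + (long) (Math.random() * 2000000L));
  }
}
```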