Could it just be that Cassandra has changed the way its splits are generated? Were the Cassandra client libs changed at any point? Have you looked at its InputFormat's source?
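
One way to narrow it down (a minimal sketch, not from the original thread): iterate the job's Configuration just before submission and print every mapred.* key, which shows whether mapred.map.tasks is already set when the job goes out. Newer Hadoop releases also expose Configuration.getPropertySources(key) to report which resource supplied a value, but that method is not in the 0.20-era API, so this sketch sticks to plain iteration:

    import java.util.Map;

    import org.apache.hadoop.conf.Configuration;

    public class ConfProbe {
        // Print every mapred.* entry of the job's Configuration so a value
        // such as mapred.map.tasks=20 can be spotted before submission.
        // Configuration is Iterable over its key/value pairs.
        public static void dumpMapredKeys(Configuration conf) {
            for (Map.Entry<String, String> entry : conf) {
                if (entry.getKey().startsWith("mapred")) {
                    System.out.println(entry.getKey() + " = " + entry.getValue());
                }
            }
        }
    }

Calling dumpMapredKeys(job.getConfiguration()) from the same jar you normally run, right before submission, would separate a config-file source from something set programmatically (e.g. by a changed InputFormat dependency on the classpath).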
On 04-Nov-2011, at 10:05 PM, Brendan W. wrote:

> Plain Java MR, using the Cassandra InputFormat to read out of Cassandra.
>
> Perhaps somebody hacked the InputFormat code on me...
>
> But what's weird is that the parameter mapred.map.tasks didn't appear in
> the job confs before at all. Now it does, with a value of 20 (which happens
> to be the number of machines in the cluster), and that's without the jobs
> or the mapred-site.xml files themselves changing.
>
> The input split size is set specifically in the jobs and has not been
> changed (except that I subsequently fiddled with it a little to see if it
> affected the fact that I was getting 20 splits, and it didn't affect
> that...just the split size, not the number).
>
> After I submit the job, I get a message "TOTAL NUMBER OF SPLIT = 20"
> before a list of the input splits...sort of looks like a hack, but I can't
> find where it is.
>
> On Fri, Nov 4, 2011 at 11:58 AM, Harsh J <[email protected]> wrote:
>
>> Brendan,
>>
>> Are these jobs (whose split behavior has changed) via Hive/etc. or plain
>> Java MR?
>>
>> In case it's the former, do you have users using newer versions of them?
>>
>> On 04-Nov-2011, at 8:03 PM, Brendan W. wrote:
>>
>>> Hi,
>>>
>>> On my cluster of 20 machines, I used to run jobs (via "hadoop jar ...")
>>> that would spawn around 4000 map tasks. Now when I run the same jobs,
>>> that number is 20; and I notice that in the job configuration, the
>>> parameter mapred.map.tasks is set to 20, whereas it never used to be
>>> present in the configuration file at all.
>>>
>>> Changing the input split size in the job doesn't affect this--I get the
>>> split size I ask for, but the *number* of input splits is still capped at
>>> 20--i.e., the job isn't reading all of my data.
>>>
>>> The mystery to me is where this parameter could be getting set. It is not
>>> present in the mapred-site.xml file in <hadoop home>/conf on any machine
>>> in the cluster, and it is not being set in the job (I'm running out of
>>> the same jar I always did; no updates).
>>>
>>> Is there *anywhere* else this parameter could possibly be getting set?
>>> I've stopped and restarted map-reduce on the cluster with no effect...it's
>>> getting re-read in from somewhere, but I can't figure out where.
>>>
>>> Thanks a lot,
>>>
>>> Brendan
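
For reference, a minimal sketch of how the split size is typically set for a Cassandra-backed job, assuming the 2011-era org.apache.cassandra.hadoop API (the keyspace and column family names here are placeholders). It also illustrates why fiddling with the split size left the split count untouched: the count is computed separately by ColumnFamilyInputFormat.getSplits() from the cluster's token ranges.

    import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
    import org.apache.cassandra.hadoop.ConfigHelper;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class CassandraJobSetup {
        public static Job configure() throws Exception {
            Job job = new Job(new Configuration(), "cassandra-mr-example");
            job.setInputFormatClass(ColumnFamilyInputFormat.class);

            // Placeholder keyspace / column family names.
            ConfigHelper.setInputColumnFamily(job.getConfiguration(),
                    "Keyspace1", "Standard1");

            // Rows per split: this controls split *size* only. The *number*
            // of splits is decided by ColumnFamilyInputFormat.getSplits(),
            // which carves up each node's token range -- which is why
            // changing this value changed the split size but not the count.
            ConfigHelper.setInputSplitSize(job.getConfiguration(), 65536);

            return job;
        }
    }

If getSplits() now returns exactly one split per node, diffing the Cassandra client jar on the job's classpath against the one the jobs were originally built with would confirm (or rule out) the changed-library theory above.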
