Could it just be that Cassandra has changed the way its splits are 
generated? Were the Cassandra client libs changed at any point? Have you 
looked at its input formats' sources?
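
One quick thing worth ruling out: confirm which jar the InputFormat class
is actually being loaded from at runtime, in case a patched copy snuck onto
the classpath. A minimal check (assuming the stock class name
org.apache.cassandra.hadoop.ColumnFamilyInputFormat):

    // Print the jar/path the class was loaded from; a surprise location
    // here would point at a locally hacked build.
    Class<?> c = Class.forName(
        "org.apache.cassandra.hadoop.ColumnFamilyInputFormat");
    System.out.println(c.getProtectionDomain().getCodeSource().getLocation());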

On 04-Nov-2011, at 10:05 PM, Brendan W. wrote:

> Plain Java MR, using the Cassandra InputFormat to read out of Cassandra.
> 
> Perhaps somebody hacked the InputFormat code on me...
> 
> But what's weird is that the parameter mapred.map.tasks didn't appear in
> the job confs before at all.  Now it does, with a value of 20 (which
> happens to be the number of machines in the cluster), and that's without
> the jobs or the mapred-site.xml files themselves changing.
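> 
> To see exactly what gets submitted, I've started dumping the resolved
> configuration right before submit (this assumes the new Job API; with the
> old API, JobConf extends Configuration, so writeXml works the same):
> 
>     // Write the fully-resolved job conf as XML; mapred.map.tasks and
>     // its effective value show up here if anything set it.
>     job.getConfiguration().writeXml(System.out);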
> 
> The input split size is set explicitly in the jobs and has not been
> changed (I did subsequently fiddle with it a little to see whether it
> affected the fact that I was getting 20 splits, and it didn't...it changed
> only the split size, not the number of splits).
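> 
> For reference, the jobs set it roughly like this (ConfigHelper is
> Cassandra's org.apache.cassandra.hadoop.ConfigHelper; the 65536 is just an
> illustrative value, not what I actually use):
> 
>     // Rows per split handed to the Cassandra InputFormat.
>     ConfigHelper.setInputSplitSize(job.getConfiguration(), 65536);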
> 
> After I submit the job, I get a message "TOTAL NUMBER OF SPLIT = 20"
> just before the list of input splits...it sort of looks like a hack, but
> I can't find where it is.
> 
> On Fri, Nov 4, 2011 at 11:58 AM, Harsh J <[email protected]> wrote:
> 
>> Brendan,
>> 
>> Are these jobs (whose split behavior has changed) run via Hive/etc., or
>> plain Java MR?
>> 
>> In case it's the former, do you have users running newer versions of them?
>> 
>> On 04-Nov-2011, at 8:03 PM, Brendan W. wrote:
>> 
>>> Hi,
>>> 
>>> On my cluster of 20 machines, I used to run jobs (via "hadoop jar ...")
>>> that would spawn around 4000 map tasks.  Now when I run the same jobs,
>>> that number is 20; and I notice that in the job configuration the
>>> parameter mapred.map.tasks is set to 20, whereas it never used to be
>>> present in the configuration file at all.
>>> 
>>> Changing the input split size in the job doesn't affect this--I get the
>>> split size I ask for, but the *number* of input splits is still capped
>>> at 20--i.e., the job isn't reading all of my data.
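>>> 
>>> (For what it's worth, my understanding is that with the old mapred API
>>> the framework passes mapred.map.tasks into getSplits() as a hint, so an
>>> InputFormat that honors the hint would cap the count at exactly 20.
>>> Sketch only; the helpers below are hypothetical, not Cassandra's code:)
>>> 
>>>     // Old-API contract: numSplits is the mapred.map.tasks hint.
>>>     public InputSplit[] getSplits(JobConf job, int numSplits)
>>>             throws IOException {
>>>         List<InputSplit> natural = computeNaturalSplits(job); // hypothetical
>>>         return coalesce(natural, numSplits);                  // hypothetical
>>>     }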
>>> 
>>> The mystery to me is where this parameter could be getting set.  It is
>>> not present in the mapred-site.xml file in <hadoop home>/conf on any
>>> machine in the cluster, and it is not being set in the job (I'm running
>>> out of the same jar I always did; no updates).
>>> 
>>> Is there *anywhere* else this parameter could possibly be getting set?
>>> I've stopped and restarted map-reduce on the cluster with no
>>> effect...it's getting re-read in from somewhere, but I can't figure out
>>> where.
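>>> 
>>> One more check I can try: print what a bare client-side JobConf sees,
>>> since JobConf pulls mapred-default.xml and mapred-site.xml off the
>>> classpath (a minimal sketch):
>>> 
>>>     import org.apache.hadoop.mapred.JobConf;
>>> 
>>>     public class WhereFrom {
>>>         public static void main(String[] args) {
>>>             // If this prints 20, the value comes from a config file on
>>>             // the client classpath; if null, it's set at submit time.
>>>             JobConf conf = new JobConf();
>>>             System.out.println("mapred.map.tasks = "
>>>                 + conf.get("mapred.map.tasks"));
>>>         }
>>>     }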
>>> 
>>> Thanks a lot,
>>> 
>>> Brendan
>> 
>> 
