Hi, I have a largish job running that, due to the quirks of the third-party input format I'm using, has 280,000 map tasks. (I know that's far from ideal, but it'll do for me.)
I'm passing this data (the Common Crawl web crawl dataset) through a visible-text-from-HTML extraction library (boilerpipe), which is struggling with _1_ particular task. It hits a sequence of records that are _insanely_ slow to parse for some reason: rather than a few minutes per split, it had taken 7+ hours before I started explicitly trying to fail the task (hadoop job -fail-task). Since I'm running with bad record skipping enabled, I was hoping I could issue -fail-task a few times and ride over the bad records, but it looks like there are quite a few of them in there. Since it's only 1 of the 280,000 splits, I'm actually happy to just give up on it entirely.

Now, if this were a map-only job I'd just kill it, since I'd already have the output of the other 279,999 tasks. This job has a no-op reduce step, though, because I wanted to take the chance to compact the output into a much smaller number of sequence files (I regret that decision now). As such, I can't just kill the job, since I'd lose the rest of the processed data (if I understand correctly?).

So, does anyone know a way to just abandon the entire split?

Cheers,
Mat
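P.S. In case it helps, the relevant bits of my job setup look roughly like this. This is reconstructed from memory, so treat the skip-record numbers and the reduce count as placeholders rather than my real values:

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SkipBadRecords;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class ExtractJob {
  public static JobConf configure() {
    JobConf conf = new JobConf(ExtractJob.class);

    // Bad record skipping: drop into skip mode after two failed
    // attempts, and allow skipping a run of records around a bad one.
    // (Actual numbers are placeholders, not what I'm running with.)
    SkipBadRecords.setAttemptsToStartSkipping(conf, 2);
    SkipBadRecords.setMapperMaxSkipRecords(conf, 1000L);

    // No-op reduce, purely to compact the map output into a much
    // smaller number of sequence files.
    conf.setReducerClass(IdentityReducer.class);
    conf.setNumReduceTasks(100);

    return conf;
  }
}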