Hi Shevek/others,

I tried this. The first job created about 78 files of 15 MB each. I then ran a second map-only job with IdentityMapper and -Dmapred.min.split.size=1073741824, but it did not make the output files 1 GB each; the output was the same as above, i.e. 78 files of 15 MB each.

Is there a way to combine the above files into files of 1 GB each?

Thanks,
-JJ

On Fri, Oct 28, 2011 at 9:53 AM, Shevek <[email protected]> wrote:
> If you run it as a pure map job, it will do it per split. If you run it as a
> single reducer job, it will do it overall. However, one starts to suspect
> that by the time you've paid that extra cost, you might as well reconsider
> your downstream process and the reason for this subdivision.
>
> S.
>
> On 27 October 2011 23:07, Mapred Learn <[email protected]> wrote:
>
> > Hi Shevek,
> > Thanks for the explanation!
> >
> > Can you point me to some documentation for specifying size in output
> > format?
> >
> > If I set the size to 200 MB, would it do this per split or overall?
> > I mean, would I end up with a 200 MB and a 50 MB file from the 1st mapper,
> > and then, say, a 200 MB and a 10 MB file from the 2nd mapper, and so on?
> > Or will I get 200 MB files only?
> >
> > On Wed, Oct 26, 2011 at 10:48 AM, Shevek <[email protected]> wrote:
> >
> > > You can control the input to a computer program, but not (arbitrarily) how
> > > much output it generates. The only way to generate output files of a fixed
> > > size is to write a custom output format which shifts to a new filename every
> > > time that size is exceeded, but you will still get some small bits left
> > > over. The plumbing in this is pretty ugly, and I would not recommend it
> > > casually.
> > >
> > > You may be able to write a second map-only job which reprocesses the output
> > > from the first job in chunks of X bytes, and just writes them out. Use an
> > > IdentityMapper and set the split size. I have not tried this at home.
> > >
> > > S.
> > >
> > > On 26 October 2011 07:03, Mapred Learn <[email protected]> wrote:
> > >
> > > > > Hi,
> > > > > I am trying to create output files of fixed size by using:
> > > > > -Dmapred.max.split.size=6442450812 (6 GB)
> > > > >
> > > > > But the problem is that the input data size and metadata vary, and I
> > > > > have to adjust the above value manually to achieve a fixed size.
> > > > >
> > > > > Is there a way I can programmatically determine a split size that would
> > > > > yield fixed-size output files, e.g. 200 MB each?
> > > > >
> > > > > Thanks,
> > > > > JJ
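
The arithmetic behind the question can be sketched as below. This is only a back-of-the-envelope helper and not something from the thread: the function name `estimate_output_files` is hypothetical, and it assumes the job's output is about the same size as its input (as with an identity job) and that records spread evenly across the output files. It shows how many output files (for example, how many reducers) a given target size would imply for the 78 files of 15 MB described above.

```python
MB = 1024 * 1024
GB = 1024 * MB


def estimate_output_files(file_sizes, target_size):
    """Estimate how many output files of roughly target_size bytes are needed.

    Assumption (not from the thread): output size is about equal to input
    size and data is distributed evenly across the output files.
    """
    if target_size <= 0:
        raise ValueError("target_size must be positive")
    total = sum(file_sizes)
    # Integer ceiling division; always ask for at least one output file.
    return max(1, -(-total // target_size))


# The situation described in the thread: 78 files of 15 MB each.
sizes = [15 * MB] * 78

print(sum(sizes) // MB)                          # total input in MB
print(estimate_output_files(sizes, 1 * GB))      # files needed at a 1 GB target
print(estimate_output_files(sizes, 200 * MB))    # files needed at a 200 MB target
```

With these numbers the total is 1170 MB, so a 1 GB target cannot be met exactly: one file would be close to full and the second would hold the remainder, which matches the earlier remark that some small bits are always left over.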
