Actually, the structure of the output directories is quite complex.

Directory A has 1, 2, 3 as output files.
Directory B has 1, 2, 3, 4 as output files.
Directory C has 1, 2, 3, 4, 5 as output files.

The directory structure, simply:

2011 |- A |- 1
     |    |- 2
     |    |- 3
     |- B |- 1
     |    |- 2
     |    |- 3
     |    |- 4
     |- C |- 1
          |- 2
          |- 3
          |- 4
          |- 5

So, as a result, I want to count the rows of A's 1, 2, 3, B's 1, 2, 3, 4, and C's 1, 2, 3, 4, 5.
Yes, I need to get 12 different (or possibly equal) counts.

In this case, do you think your suggested solutions are suitable?


On Mar 8, 2011, at 9:29 PM, James Seigel wrote:

> Simplest case: if you need a sum of the lines for A, B, and C,
> look at the output that is normally generated, which tells you "Reduce
> output records". This can be accessed, as the others are telling
> you, as a counter, which you could read and explicitly print out,
> or simply check with your eyes in the job summary when it is done.
> 
> Cheers
> James.
> 
> On Tue, Mar 8, 2011 at 3:29 AM, Harsh J <[email protected]> wrote:
>> I think the previous reply wasn't very accurate. So you need a count
>> per-file? One way I can think of doing that, via the job itself, is to
>> use Counter to count the "name of the output + the task's ID". But it
>> would not be a good solution if there are several hundreds of tasks.
>> 
>> A distributed count can be performed on a single file, however, using
>> an identity mapper + null output and then looking at map-input-records
>> counter after completion.
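(The identity-map trick described above amounts to: run each file through mappers that read every record but emit nothing, then read the map-input-records counter when the job finishes. A toy simulation of that counting step, with the splits and loop structure standing in for hypothetical map tasks:)

```python
def count_records(splits):
    """Simulate the map-input-records counter: each 'map task' receives
    one input split, increments the counter per record, and emits no
    output (the null-output part of the trick)."""
    counter = 0
    for split in splits:       # one iteration ~ one map task
        for _record in split:  # identity map: read, count, discard
            counter += 1
    return counter
```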
>> 
>> On Tue, Mar 8, 2011 at 3:54 PM, Harsh J <[email protected]> wrote:
>>> Count them as you sink using the Counters functionality of Hadoop
>>> Map/Reduce (If you're using MultipleOutputs, it has a way to enable
>>> counters for each name used). You can then aggregate related counters
>>> post-job, if needed.
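(Putting the two suggestions together: if each task bumps a counter named after the output file plus its own task ID, the per-file totals can be summed after the job completes. A sketch of that post-job aggregation step, assuming a hypothetical counter-name scheme of the form `<file>-<taskid>`; the names here are illustrative, not a Hadoop convention:)

```python
from collections import defaultdict

def aggregate_counters(counters):
    """Sum per-task counters such as 'A/1-m_000' into per-file totals.
    `counters` maps counter name -> value, as read from a finished job;
    the part after the last '-' is treated as the task ID and dropped."""
    totals = defaultdict(int)
    for name, value in counters.items():
        file_name, _, _task_id = name.rpartition("-")
        totals[file_name] += value
    return dict(totals)
```

(If MultipleOutputs is in use, it also has a way to enable a counter per named output, which would make the raw per-name values available without a custom naming scheme.)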
>>> 
>>> On Tue, Mar 8, 2011 at 3:11 PM, Jun Young Kim <[email protected]> wrote:
>>>> Hi.
>>>> 
>>>> my hadoop application generated several output files by a single job.
>>>> (for example, A, B, C are generated as a result)
>>>> 
>>>> after finishing a job, I want to count each files' row counts.
>>>> 
>>>> is there any way to count each files?
>>>> 
>>>> thanks.
