Ramil's original problem is not how to read all those tens
of gigabytes of text data, but the simpler problem of keeping
a count of the number of lines read. If wc used int for that count
(fortunately it does not), it could count only up to about 2 billion
lines, but he expects to read up to 100 billion lines. Note that he
does not need to keep the lines in memory; he only needs to count them.
I believe that wc is even overkill, since the following simple
code will do the job:

int c;  /* int, not char, so that EOF can be told apart from data */
unsigned long long n = 0;
while( (c = getchar()) != EOF) {
  if(c == '\n') ++n;
}
return n;

With this code, you never hold the file in memory. The only
buffering is the small stdio buffer (usually 4k) that the C library
uses behind the getchar() macro.
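
If the per-character loop turns out to be too slow on multi-gigabyte files, the same count can be done by reading fixed-size blocks. What follows is only a rough sketch under my own assumptions: the function name count_lines and the 64 KiB block size are made up for illustration and are not taken from the coreutils source. The counter is uintmax_t, which is at least 64 bits wide, so 100 billion fits easily.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not from coreutils): count '\n' bytes in a stream,
   reading fixed-size blocks so that memory use stays constant. */
uintmax_t count_lines(FILE *fp)
{
    static char buf[1 << 16];   /* 64 KiB block; the size is an arbitrary choice */
    uintmax_t n = 0;            /* at least 64 bits wide */
    size_t got;

    assert(fp != NULL);
    while ((got = fread(buf, 1, sizeof buf, fp)) > 0) {
        const char *p = buf;
        const char *end = buf + got;
        /* memchr finds the next newline in the part of the block not yet scanned */
        while ((p = memchr(p, '\n', (size_t)(end - p))) != NULL) {
            ++n;
            ++p;   /* continue just past the newline that was found */
        }
    }
    return n;
}
```

To use it on standard input, call count_lines(stdin) and print the result with the %ju conversion, which is the printf format for uintmax_t. Like wc -l, this counts newline characters, so a final line without a trailing newline is not counted.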


Pablo M.
***

--- On Wed, 2/18/09, Drexx Laggui [personal] <[email protected]> wrote:

> From: Drexx Laggui [personal] <[email protected]>
> Subject: Re: [plug] Maximum number of lines wc can count
> To: "Philippine Linux Users' Group (PLUG) Technical Discussion List" 
> <[email protected]>
> Date: Wednesday, February 18, 2009, 7:38 PM
> 18Feb2009 (UTC +8)
> 
> On Wed, Feb 18, 2009 at 7:08 PM, Pablo Manalastas <[email protected]> wrote:
> > uintmax_t can't count to 100 billion. But since you have coreutils
> > source, you can always recompile a version of wc using type
> > long long for total_lines.  Open source is really great!
> 
> Just an idea: why not re-compile wc.c to include code that callocs a
> buffer and reads from it? Just like what "less" does. That way, you
> don't need GBs of RAM to load and process the large text files in
> their entirety. Compare what "less" does to how "more" performs, and
> you can see what I'm imagining.
> 
> 
> > --- On Mon, 2/16/09, Ramil Galib <[email protected]> wrote:
> >> From: Ramil Galib <[email protected]>
> >> Subject: Re: [plug] Maximum number of lines wc can count
> >> To: "Philippine Linux Users' Group (PLUG) Technical Discussion List" <[email protected]>
> >> Date: Monday, February 16, 2009, 4:31 PM
> >>
> >> Since Google can't answer me, I downloaded the coreutils
> >> source package to which wc belongs.
> >> The variable total_lines is of type uintmax_t.
> >> Now I'm getting somewhere.
> >>
> >> On 2/16/09, Roger Filomeno <[email protected]> wrote:
> >> It should, since the whole file is not actually loaded into memory.
> >> It's just going to take a lot of I/O.
> >>
> >> > On Mon, Feb 16, 2009 at 12:27 PM, Ramil Galib <[email protected]> wrote:
> >> >
> >> >> I have text files ranging from 5 to 10 GB.  I want to use wc -l to
> >> >> count the number of lines of these files. I expect the number of lines
> >> >> to be from 50 to 100 billion.
> >> >> Can wc handle this?
> >> >> Hope you can help.
> >> >> Thanks!

_________________________________________________
Philippine Linux Users' Group (PLUG) Mailing List
http://lists.linux.org.ph/mailman/listinfo/plug
Searchable Archives: http://archives.free.net.ph
