On 6/8/06, James G. Sack (jim) <[EMAIL PROTECTED]> wrote:
John Oliver wrote:
> On Thu, Jun 08, 2006 at 01:49:16PM -0700, Michael O'Keefe wrote:
>> John Oliver wrote:
>>> So I wrote a bash script that goes through Apache log files, looks for
>>> certain elements, counts how many times each element appears, and prints
>>> out the result. It took 4 hours to run against ~470MB of logs, which
>>> were accumulated in seven days with not much going on, compared to what
>>> we expect to see in the coming months.
>> I suppose it really depends on what you're counting, and how you're
>> counting it.
>>
>> I only have about 30MiB of logs to run my scans through to find (for
>> example) the most common referrers, google search terms, unique IP
>> connects etc...
>> But it only takes a few seconds to get me the results.
>> Your logs, though, are more than 15 times larger, and I don't know what
>> you are looking for.
>
> I start by grepping for lines that include "project.xml", and then use
> grep -v to drop lines that include a couple of other strings.
> Everything that's left goes through a couple of cuts to get the field I
> want. That output is sorted and run through uniq to find out how many
> different elements there are, and then I use a loop with the results of
> uniq to go back through the sorted list to count how many times each
> element appears.
>
> FWIW, there are about 2.2 million lines in my sample.
>
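If I'm reading that description right, the script does something on the
order of this (just a guess at the shape; the real exclusion strings and
cut fields aren't given above):

  # hypothetical reconstruction -- pattern, exclusions, and field are placeholders
  grep 'project.xml' access_log* | grep -v 'exclude_one' | grep -v 'exclude_two' \
      | cut -d' ' -f7 | sort > sorted.txt

  # one full scan of sorted.txt per unique element
  for element in $(uniq sorted.txt); do
      echo "$(grep -cxF "$element" sorted.txt) $element"
  done

If that's anywhere close, the loop re-reads the whole sorted file once per
unique element, so the work grows as (number of lines) x (number of unique
elements) -- that alone could account for most of the four hours.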
You surely want to avoid multiple scans!
There are other alternatives, but perl would seem to be a good choice --
you can do the counting with a single hash (which takes care of the
uniqueness requirement automatically) and, if needed, sort the results at
the end.
With this much data you should probably also think about simple substring
searches, if regular expressions are not _really_ required.
But the most important goal is to gather all the raw information in one
pass over the original data.
Note: if your output data set does NOT fit in RAM, then you have to do
something else entirely <heh>.
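As a rough sketch of the one-pass approach (the exclusion patterns, the
field index, and the access_log* file names are all stand-ins for whatever
your script actually uses):

  perl -ne '
      next unless /project\.xml/;
      next if /exclude_one|exclude_two/;   # stand-ins for the grep -v strings
      $count{ (split)[6] }++;              # (split)[6] is a guess at the field you cut out
      END {
          printf "%8d %s\n", $count{$_}, $_
              for sort { $count{$b} <=> $count{$a} } keys %count;
      }
  ' access_log*

One read of the logs, one hash in memory, one small sort at the end.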
I would hope that everything is done in one pipeline and not as
separate runs. But then my answer would be "don't use grep, grep,
cut, cut, sort, uniq, something else" -- that is too many processes.
At worst, it sounds like a job for gawk (replacing grep, grep, cut, cut)
piped to the standard idiom: sort | uniq -c
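Something along these lines, with made-up exclusion patterns and a guessed
field number standing in for the real ones:

  gawk '/project\.xml/ && !/exclude_one/ && !/exclude_two/ { print $7 }' access_log* \
      | sort | uniq -c | sort -rn

That is a single pass over the raw logs, and the only thing that gets
sorted is the extracted field.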
carl
--
carl lowenstein marine physical lab u.c. san diego
[EMAIL PROTECTED]
--
[email protected]
http://www.kernel-panic.org/cgi-bin/mailman/listinfo/kplug-list