John Oliver wrote:
> On Thu, Jun 08, 2006 at 01:49:16PM -0700, Michael O'Keefe wrote:
>> John Oliver wrote:
>>> So I wrote a bash script that goes through Apache log files, looks for
>>> certain elements, counts how many times each element appears, and prints
>>> out the result.  It took 4 hours to run against ~470MB of logs, which
>>> were accumulated in seven days with not much going on, compared to what
>>> we expect to see in the coming months.
>> I suppose it really depends on what you're counting, and how you're 
>> counting it.
>>
>> I only have about 30MiB of logs to run my scans through to find (for 
>> example) the most common referrers, google search terms, unique IP 
>> connects etc...
>> But it only takes a few seconds to get me the results.
>> But your logs are more than 15 times larger, and I don't know what you 
>> are looking for
> 
> I start by grepping for lines that include "project.xml", and then grep
> -v lines that include a couple of other strings of characters.
> Everything that's left goes through a couple of cuts to get the field I
> want.  That output is sorted and run through uniq to find out how many
> different elements there are, and then I use a loop with the results of
> uniq to go back through the sorted list to count how many times each
> element appears.
> 
> FWIW, there are about 2.2 million lines in my sample.
> 

You surely want to avoid multiple scans!

There are other alternatives, but perl would seem to be a good choice --
you can do the counting with a single hash (its keys automatically give
you the unique elements), and if needed, sort the results at the end.
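
A rough sketch of the single-hash idea, written in Python rather than
perl (a Counter, which is a dict, stands in for the perl hash). The
field position, the space delimiter, and the sample log lines are all
assumptions made up for illustration; the cut fields used by the
original script are not shown in this thread.

```python
from collections import Counter

def count_field(lines, field_index=6, delimiter=" "):
    """Count how often each value of one delimited field appears.

    One pass over the input; the Counter keys are the unique values.
    field_index=6 is an assumption (the request path in a common
    Apache log layout); adjust it to the real format.
    """
    counts = Counter()
    for line in lines:
        parts = line.rstrip("\n").split(delimiter)
        if len(parts) > field_index:  # skip short or malformed lines
            counts[parts[field_index]] += 1
    return counts

# Made-up sample lines, for illustration only.
sample = [
    '10.0.0.1 - - [08/Jun/2006:13:49:16 -0700] "GET /a/project.xml HTTP/1.1" 200 512',
    '10.0.0.2 - - [08/Jun/2006:13:49:17 -0700] "GET /b/project.xml HTTP/1.1" 200 512',
    '10.0.0.3 - - [08/Jun/2006:13:49:18 -0700] "GET /a/project.xml HTTP/1.1" 200 512',
]

counts = count_field(sample)

# Sort only once, at the end, over the (much smaller) set of unique values.
for value, n in counts.most_common():
    print(n, value)
```

For a real log, passing an open file object instead of the sample list
(for example, count_field(open("access.log")), with a hypothetical file
name) would read it line by line without loading it all into memory.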

With this much data you should probably also consider simple substring
searches if regular expressions are not _really_ required.
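
As a sketch of what a substring-only filter might look like in Python
(not perl): "project.xml" comes from the description above, but the
excluded strings below are hypothetical placeholders, since the actual
ones were not named.

```python
def wanted(line, include="project.xml", exclude=("robots.txt", "favicon.ico")):
    """Keep a line using plain substring tests, with no regex involved.

    The exclude defaults are hypothetical placeholders.
    """
    if include not in line:
        return False
    # Reject the line if any excluded substring appears in it.
    return not any(s in line for s in exclude)
```

Something like this could be applied to each line inside the same
single loop that does the counting, so the filtering costs no extra
pass over the data.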

..but the most important goal is to get all the raw information in one
pass of the original data.

Note: if your output data set does NOT fit in RAM, then you still have
to do something else <heh>.

..jim


-- 
[email protected]
http://www.kernel-panic.org/cgi-bin/mailman/listinfo/kplug-list
