John Oliver wrote:
> On Thu, Jun 08, 2006 at 01:49:16PM -0700, Michael O'Keefe wrote:
>> John Oliver wrote:
>>> So I wrote a bash script that goes through Apache log files, looks for
>>> certain elements, counts how many times each element appears, and
>>> prints out the result. It took 4 hours to run against ~470MB of logs,
>>> which were accumulated in seven days with not much going on, compared
>>> to what we expect to see in the coming months.
>>
>> I suppose it really depends on what you're counting, and how you're
>> counting it.
>>
>> I only have about 30MiB of logs to run my scans through to find (for
>> example) the most common referrers, google search terms, unique IP
>> connects etc...
>> But it only takes a few seconds to get me the results.
>> But your logs are more than 15 times larger, and I don't know what you
>> are looking for
>
> I start by grepping for lines that include "project.xml", and then grep
> -v lines that include a couple of other strings of characters.
> Everything that's left goes through a couple of cuts to get the field I
> want. That output is sorted and run through uniq to find out how many
> different elements there are, and then I use a loop with the results of
> uniq to go back through the sorted list to count how many times each
> element appears.
>
> FWIW, there are about 2.2 million lines in my sample.
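[Editor's note: for reference, a rough reconstruction of the multi-pass
pipeline described above -- the exclusion strings and cut fields are
guesses, not John's actual script:]

  grep 'project.xml' access_log \
      | grep -v 'exclude-this' \
      | grep -v 'exclude-that' \
      | cut -d'"' -f2 | cut -d' ' -f2 \
      | sort > elements.sorted

  # One extra scan of the sorted list for *each* unique element is what
  # makes the job so slow: roughly one pass per distinct value.
  for element in $(uniq elements.sorted); do
      printf '%s %s\n' "$(grep -Fxc "$element" elements.sorted)" "$element"
  done

[A plain "sort | uniq -c" would already collapse that counting loop; the
reply below goes further and drops the repeated scans entirely.]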
You surely want to avoid multiple scans! There are other alternatives,
but perl would seem to be a good choice -- you can do the counting
(which automatically handles the uniqueness requirement) with a single
hash, and, if needed, sort the results at the end. With this much data
you should maybe also consider simple substring searches if regular
expressions aren't _really_ required.

..but the most important goal is to get all the raw information in one
pass over the original data.

Note: if your output data set does NOT fit in RAM, then you still have
to do something else <heh>.

..jim

--
[email protected]
http://www.kernel-panic.org/cgi-bin/mailman/listinfo/kplug-list
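[Editor's note: a minimal sketch of the single-hash, single-pass
approach Jim describes. The "project.xml" match comes from the thread;
the exclusion strings and the field index are placeholders for whatever
the original cuts extracted:]

  #!/usr/bin/perl
  # One pass over the logs: filter, extract, and count with a single hash.
  use strict;
  use warnings;

  my %count;

  while (my $line = <>) {
      # index() is a plain substring test -- no regex engine involved.
      next unless index($line, 'project.xml') >= 0;
      next if index($line, 'exclude-this') >= 0;
      next if index($line, 'exclude-that') >= 0;

      # Roughly what the "couple of cuts" did: grab one whitespace field.
      my @fields  = split ' ', $line;
      my $element = $fields[6];      # e.g. the request path in a combined log
      next unless defined $element;

      $count{$element}++;            # hash keys are unique by definition
  }

  # Sort by count, descending, only at the very end.
  for my $element (sort { $count{$b} <=> $count{$a} } keys %count) {
      print "$count{$element}\t$element\n";
  }

[Run as "perl count.pl access_log*"; every line is examined exactly
once, and the hash does the uniq'ing, so the whole job is essentially
one read of the 470MB.]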
