From our experience it seems that LZO has better space/compute tradeoffs than
gzip. Compression does usually help, as it also reduces the amount of data to
be read from disk and thus gets rid of a major bottleneck.
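As a rough sketch (assuming the hadoop-lzo codec is installed on every node of your cluster; the codec class name below is the usual one but may differ with your install), compressed output can be turned on per-session like this:

```sql
-- Sketch: enable compressed job output, assuming hadoop-lzo is deployed.
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzoCodec;
-- Compressing map output as well cuts shuffle I/O:
SET mapred.compress.map.output=true;
```

Queries that write new tables or partitions will then produce LZO-compressed files, trading a little CPU for much less disk and network traffic.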

Ashish

________________________________
From: Vijay [mailto:[email protected]]
Sent: Thursday, August 20, 2009 4:45 PM
To: [email protected]
Subject: Using hive for (advanced) access log analysis

Hi, I'm quite new to Hive and so far everything has been working very well. I
was able to set up a small VM-based cluster, ingest a lot of our access logs,
and generate some pretty neat reports, mostly to do with patterns of URLs, etc.
I'm looking for advice on some more advanced forms of analysis from people who
might have already done similar analysis.

1. First off, many of our daily logs are about 1GB raw in size, around 120MB
compressed (gzip). I'm keeping the compressed files in Hive. For numbers like
these, is that good or bad? Obviously Hive has to decompress every file for
every query, so maybe it's not a good idea? Of course, there is a space/speed
trade-off as well.
2. What are some ideas for doing session-based analysis? For example, most 
visited urls, average visit length, and other kinds of "analytics" stuff. Are 
there any useful recipes that people can share here?

Thanks in advance,
Vijay
