From our experience, it seems that lzo has a better space/compute tradeoff than gzip. Compression does usually help, as it also reduces the amount of data to be read from the disk and thus gets rid of a major bottleneck.

Ashish

________________________________
From: Vijay [mailto:[email protected]]
Sent: Thursday, August 20, 2009 4:45 PM
To: [email protected]
Subject: Using hive for (advanced) access log analysis

Hi,

I'm quite new to hive and so far everything has been working very well. I've been able to set up a small vm-based cluster, ingest a lot of our access logs and generate some pretty neat reports, mostly to do with patterns of urls, etc. I'm looking for some advice on some more advanced forms of analysis from people who might have already done similar analysis.

1. First off, many of our daily logs are about 1GB raw in size, around 120MB compressed (gzip). I'm keeping the compressed files in hive. For these kinds of numbers, is that good or bad? Obviously for every query hive has to decompress every file, so maybe it's not a good idea? Of course, there is a space/speed tradeoff as well.

2. What are some ideas for doing session-based analysis? For example, most visited urls, average visit length, and other kinds of "analytics" stuff. Are there any useful recipes that people can share here?

Thanks in advance,
Vijay
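
On the session-based analysis in question 2, here is a minimal sketch of one common approach, written in plain Python rather than against any Hive API. It assumes each log hit has already been parsed into a (visitor, timestamp, url) tuple, that the client IP is a good enough visitor id, and that a visit ends after 30 minutes of inactivity. The function names, the sample hits and the 30-minute gap are all illustrative assumptions, not anything taken from the logs described above.

```python
from collections import Counter, defaultdict
from datetime import datetime, timedelta

# Assumed inactivity timeout; tune it for your own traffic.
SESSION_GAP = timedelta(minutes=30)


def sessionize(hits, gap=SESSION_GAP):
    """Group (visitor, timestamp, url) hits into per-visitor sessions.

    A new session starts whenever two consecutive hits from the same
    visitor are more than `gap` apart.
    """
    by_visitor = defaultdict(list)
    for visitor, ts, url in hits:
        by_visitor[visitor].append((ts, url))

    sessions = []
    for visitor, events in by_visitor.items():
        events.sort()  # order each visitor's hits by timestamp
        current = [events[0]]
        for prev, event in zip(events, events[1:]):
            if event[0] - prev[0] > gap:
                sessions.append((visitor, current))
                current = []
            current.append(event)
        sessions.append((visitor, current))
    return sessions


def average_visit_seconds(sessions):
    """Mean time between the first and last hit of each session."""
    durations = [(evs[-1][0] - evs[0][0]).total_seconds() for _, evs in sessions]
    return sum(durations) / len(durations) if durations else 0.0


def top_urls(sessions, n=3):
    """Most frequently requested urls across all sessions."""
    counts = Counter(url for _, evs in sessions for _, url in evs)
    return counts.most_common(n)


# Hypothetical sample data, standing in for parsed access-log rows.
hits = [
    ("10.0.0.1", datetime(2009, 8, 20, 10, 0), "/home"),
    ("10.0.0.1", datetime(2009, 8, 20, 10, 5), "/products"),
    ("10.0.0.1", datetime(2009, 8, 20, 12, 0), "/home"),
    ("10.0.0.2", datetime(2009, 8, 20, 10, 1), "/home"),
]
sessions = sessionize(hits)
print(len(sessions), average_visit_seconds(sessions), top_urls(sessions, 1))
```

Note that a single-hit session counts as zero seconds here, so the average understates real visit length; whether to drop such sessions is a judgment call. The same grouping logic should carry over to a cluster as long as all hits for one visitor reach the same worker in timestamp order.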
