Does it really though? When dealing with compressed files, is all the decompression done in-memory? Is that realistic for all sizes of files? If not, then doesn't the file have to be decompressed first to disk?
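For reference, the setup being discussed is a Hive table pointed directly at a directory of gzipped logs, roughly like the sketch below; the paths and column names here are made up, not from the thread. As far as I know, Hadoop's TextInputFormat recognizes the .gz suffix and decompresses each file as a stream while the map task reads it, so the decompressed data is not staged to disk first; the catch is that each gzip file is consumed by a single mapper, because gzip is not splittable.

  -- Hypothetical external table over a directory containing *.gz files.
  -- TextInputFormat detects the .gz extension and decompresses the stream
  -- on the fly inside the map task; the decompressed bytes are held only
  -- in memory as they are read, not written back out to disk.
  CREATE EXTERNAL TABLE access_log (
    ip STRING,
    ts BIGINT,          -- epoch seconds
    url STRING,
    referrer STRING,
    user_agent STRING
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  STORED AS TEXTFILE
  LOCATION '/data/access_logs/';

  -- Queries read the compressed files directly:
  SELECT url, COUNT(1) AS hits
  FROM access_log
  GROUP BY url
  ORDER BY hits DESC
  LIMIT 10;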
On Fri, Aug 21, 2009 at 4:05 PM, Ashish Thusoo <[email protected]> wrote:

> From our experience it seems that lzo has better space/compute tradeoffs
> than gzip. Compression does usually help, as it also reduces the amount of
> data to be read from the disk and thus gets rid of a major bottleneck.
>
> Ashish
>
> ------------------------------
> From: Vijay [mailto:[email protected]]
> Sent: Thursday, August 20, 2009 4:45 PM
> To: [email protected]
> Subject: Using hive for (advanced) access log analysis
>
> Hi, I'm quite new to hive and so far everything has been working very well.
> I've been able to set up a small vm-based cluster, ingest a lot of our
> access logs, and generate some pretty neat reports, mostly to do with
> patterns of urls, etc. I'm looking for some advice on some more advanced
> forms of analysis from people who might have already done similar analysis.
>
> 1. First off, many of our daily logs are about 1GB raw in size, around
> 120MB compressed (gzip). I'm keeping the compressed files in hive. For
> these kinds of numbers, is that good or bad? Obviously for every query hive
> has to decompress every file, so maybe it's not a good idea? Of course,
> there is a space/speed trade-off as well.
>
> 2. What are some ideas for doing session-based analysis? For example, most
> visited urls, average visit length, and other kinds of "analytics" stuff.
> Are there any useful recipes that people can share here?
>
> Thanks in advance,
> Vijay
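On the two questions in Vijay's mail, a couple of hedged sketches; the table and column names are assumptions carried over from the sketch above, not anything from the thread. On the compression side, Hive can also compress its own intermediate and final output once the hadoop-lzo libraries are installed on the cluster (they ship separately from Hadoop for licensing reasons):

  SET hive.exec.compress.intermediate=true;
  SET hive.exec.compress.output=true;
  SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;

For session-based analysis, one rough recipe that needs no custom code is to approximate a visit as all hits from one ip/user_agent pair within a calendar day, then aggregate over those visits:

  -- Most visited urls:
  SELECT url, COUNT(1) AS hits
  FROM access_log
  GROUP BY url
  ORDER BY hits DESC
  LIMIT 100;

  -- Average visit length in seconds, where a "visit" is crudely defined as
  -- all hits from one ip/user_agent pair on one day. This is only a stand-in
  -- for real 30-minute-inactivity sessionization, which would need a custom
  -- map/reduce script via TRANSFORM.
  SELECT AVG(visit_seconds)
  FROM (
    SELECT ip, user_agent, to_date(from_unixtime(ts)) AS day,
           MAX(ts) - MIN(ts) AS visit_seconds
    FROM access_log
    GROUP BY ip, user_agent, to_date(from_unixtime(ts))
  ) visits;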
