On Sat, Aug 22, 2009 at 9:25 PM, Vijay <[email protected]> wrote:
> Does it really though? When dealing with compressed files, is all the
> decompression done in memory? Is that realistic for all sizes of files? If
> not, then doesn't the file have to be decompressed to disk first?
>
> On Fri, Aug 21, 2009 at 4:05 PM, Ashish Thusoo <[email protected]> wrote:
>>
>> From our experience it seems that lzo has better space/compute tradeoffs
>> than gzip. Compression does usually help, as it also reduces the amount of
>> data to be read from the disk and thus gets rid of a major bottleneck.
>>
>> Ashish
>> ________________________________
>> From: Vijay [mailto:[email protected]]
>> Sent: Thursday, August 20, 2009 4:45 PM
>> To: [email protected]
>> Subject: Using hive for (advanced) access log analysis
>>
>> Hi, I'm quite new to hive and so far everything has been working very
>> well. I was able to set up a small vm-based cluster, ingest a lot of our
>> access logs, and generate some pretty neat reports, mostly to do with
>> patterns of urls, etc. I'm looking for some advice on some more advanced
>> forms of analysis from people who might have already done similar analysis.
>>
>> 1. First off, many of our daily logs are about 1GB raw in size, around
>> 120MB compressed (gzip). I'm keeping the compressed files in hive. For
>> these kinds of numbers, is that good or bad? Obviously, for every query
>> hive has to decompress every file, so maybe it's not a good idea? Of
>> course, there is a space/speed trade-off as well.
>> 2. What are some ideas for doing session-based analysis? For example, most
>> visited urls, average visit length, and other kinds of "analytics" stuff.
>> Are there any useful recipes that people can share here?
>>
>> Thanks in advance,
>> Vijay
>
The answer is that some compression formats compress the data in blocks and can therefore be split; LZO is one of those formats. Decompression happens on the fly, block by block, as the data is read, so you do NOT need to decompress the entire file, either in memory or to disk, before processing it.
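
In case a concrete example helps, here is a rough sketch of what the setup can look like. It assumes the hadoop-lzo (hadoop-gpl-compression) native libraries are installed on the cluster; the table name, columns, and paths below are just placeholders, and the codec properties are normally set once in core-site.xml rather than per session:

  -- Make the LZO codec known to the jobs (usually done in core-site.xml)
  SET io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec,com.hadoop.compression.lzo.LzopCodec;

  -- Optionally compress Hive's own output with LZO as well
  SET hive.exec.compress.output=true;
  SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;

  -- A plain text table; the codec is picked from the file extension (.gz, .lzo, ...)
  CREATE TABLE access_logs (
    ip     STRING,
    ts     STRING,
    url    STRING,
    status INT
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  STORED AS TEXTFILE;

  LOAD DATA LOCAL INPATH '/var/log/access/2009-08-20.log.lzo' INTO TABLE access_logs;

  -- Decompression happens inside the map tasks as the splits are read;
  -- nothing gets unpacked to disk first.
  SELECT url, COUNT(1) AS hits
  FROM access_logs
  GROUP BY url
  ORDER BY hits DESC
  LIMIT 20;

One caveat: to have a single large .lzo file actually split across several mappers, the file also needs an index (the LzoIndexer class that ships with hadoop-lzo builds one); without it, each file is still handled by one mapper, just like a .gz file, but it is still decompressed on the fly.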
