On Sat, Aug 22, 2009 at 9:25 PM, Vijay<[email protected]> wrote:
> Does it really though? When dealing with compressed files, is all the
> decompression done in-memory? Is that realistic for all sizes of files? If
> not, then doesn't the file have to be decompressed first to disk?
>
> On Fri, Aug 21, 2009 at 4:05 PM, Ashish Thusoo <[email protected]> wrote:
>>
>> From our experience it seems that lzo has better space/compute tradeoffs
>> than gzip. Compression usually helps, as it also reduces the amount of
>> data to be read from the disk and thus gets rid of a major bottleneck.
>>
>> Ashish
>> ________________________________
>> From: Vijay [mailto:[email protected]]
>> Sent: Thursday, August 20, 2009 4:45 PM
>> To: [email protected]
>> Subject: Using hive for (advanced) access log analysis
>>
>> Hi, I'm quite new to hive and so far everything has been working very
>> well. I was able to set up a small vm-based cluster, ingest a lot of our
>> access logs, and generate some pretty neat reports, mostly to do with
>> patterns of urls, etc. I'm looking for advice on some more advanced forms
>> of analysis from people who may have already done similar work.
>>
>> 1. First off, many of our daily logs are about 1GB raw in size, around
>> 120MB compressed (gzip). I'm keeping the compressed files in hive. For
>> these kinds of numbers, is that good or bad? Obviously hive has to
>> decompress every file for every query, so maybe it's not a good idea? Of
>> course, there is a space/speed trade-off as well.
>> 2. What are some ideas for doing session-based analysis? For example, most
>> visited urls, average visit length, and other kinds of "analytics" stuff.
>> Are there any useful recipes that people can share here?
>>
>> Thanks in advance,
>> Vijay
>
>

The answer to this is that some compression formats compress the data in
blocks, so the file can be split. LZO is one of those formats. As a
result, decompression is done on the fly as the blocks are read. You do
NOT need to decompress the entire file to process it.
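
To make the "on the fly" part concrete, here is a rough sketch of how a
compressed file can be read as a stream through Hadoop's
CompressionCodecFactory (the class name, the path argument, and the
line-counting loop are just placeholders I made up for illustration):

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class StreamingDecompressDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path(args[0]);   // e.g. an access_log.gz or .lzo file
        FileSystem fs = path.getFileSystem(conf);

        // Pick a codec from the file extension (.gz, .lzo, ...);
        // null means the file is plain, uncompressed text.
        CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(path);

        // Wrapping the raw stream decompresses bytes as they are read;
        // the decompressed file is never materialized on disk.
        BufferedReader reader = new BufferedReader(new InputStreamReader(
                codec == null ? fs.open(path)
                              : codec.createInputStream(fs.open(path))));
        long lines = 0;
        while (reader.readLine() != null) {
            lines++;
        }
        reader.close();
        System.out.println("lines: " + lines);
    }
}

If I understand it correctly, this is basically what each map task does for
its split, so the decompression cost is paid as the data streams through
rather than as a separate pass over the whole file.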
