There are various ways to massage logs such as Apache's into a Hadoop/Hive/Pig-usable
form (Pig even has a built-in CombinedLogLoader), but has anyone taken
the road of simply changing their Apache log format so that it is more
amenable in its raw form? Assuming you only process logs written after
such a config change (and no other process is consuming them,
and change management is not too hard, ...), you can replace the legacy combined log
format with one where every field is broken out cleanly and tab is the
only delimiter needed, e.g.

"%h\t%l\t%u\t%{%Y-%m-%d
%H:%M:%S}t\t%m\t%U\t%q\t%H\t%>s\t%b\t%D\t%{Referer}i\t%{User-agent}i\t%{MyCookie}C"
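
In httpd.conf this would be wired up roughly as follows (just a sketch; the
"tsv" nickname and the log path are placeholders I'm making up):

  # Hypothetical httpd.conf snippet -- nickname and path are placeholders
  LogFormat "%h\t%l\t%u\t%{%Y-%m-%d %H:%M:%S}t\t%m\t%U\t%q\t%H\t%>s\t%b\t%D\t%{Referer}i\t%{User-agent}i\t%{MyCookie}C" tsv
  CustomLog logs/access_tsv.log tsv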

where %r = %m\t%U\t%q\t%H (in almost all cases), the timestamp isn't
localized, and any tab inside a logged value is escaped, so a literal tab
only ever appears as the delimiter. This would be immediately usable in Hive
with a tab-delimited table definition and every field in its own predefined
column, along the lines of the sketch below.
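
A minimal sketch of such a table definition, assuming the format above (the
column names are mine, pick whatever suits you):

  -- Hypothetical Hive table matching the tab-separated format above
  CREATE TABLE access_log (
    remote_host     STRING,
    remote_logname  STRING,   -- %l, usually "-"
    remote_user     STRING,
    request_time    STRING,   -- %Y-%m-%d %H:%M:%S
    method          STRING,
    url_path        STRING,
    query_string    STRING,
    protocol        STRING,
    status          INT,
    bytes_sent      STRING,   -- %b logs "-" for zero bytes, so STRING is safer than INT
    serve_time_us   INT,      -- %D, microseconds
    referer         STRING,
    user_agent      STRING,
    my_cookie       STRING
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  STORED AS TEXTFILE;

Then the raw files can be loaded (or pointed at via an EXTERNAL table
LOCATION) and queried by column name with no intermediate parsing step.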

You are tying the two fixed schemas together, which loses flexibility, but
log format changes of this nature should be rare and you can mandate that new
fields are only ever appended at the end (as sketched below). I'd take the
efficiency of not writing a loosely standard format that always needs to be
massaged, in exchange for focusing only on the meaningful transformations such
as geo-IP lookup. Though if you don't have a choice, Hadoop can clearly plow
through the raw format regardless.
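
To make the append-only point concrete: if, say, %{X-Forwarded-For}i were
later appended to the LogFormat, the table could follow with something like
(again a sketch, the column name is made up):

  -- Append-only schema change; older rows simply read NULL for the new trailing column
  ALTER TABLE access_log ADD COLUMNS (forwarded_for STRING);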

On Fri, Jul 24, 2009 at 5:27 AM, Saurabh Nanda <[email protected]> wrote:

> Hi Zheng,
>
> Thanks for the reply, but I gave up on UDFs & SerDe and resorted to custom
> map/reduce scripts instead. In case you're interested, I've written about my
> Hive experience at
> http://nandz.blogspot.com/2009/07/using-hive-for-weblog-analysis.html
>
> Saurabh.
>
>
> On Thu, Jul 23, 2009 at 2:15 AM, Zheng Shao <[email protected]> wrote:
>
>> Hi Saurabh,
>>
>> Sorry for the late reply.
>>
>> You can create a table using this:
>> https://issues.apache.org/jira/browse/HIVE-637
>> And then use the newly added UDF:
>> https://issues.apache.org/jira/browse/HIVE-642
>> to read in the data.
>>
>> In this way, you won't need to write any Java code. Let us know if you
>> have any questions.
>>
>>
>> In the longer term, we want to let our users write a SerDe for that.
>> The benefit of SerDe is that you will be able to use column names,
>> instead of
>> split(blob, "\t")[0], split(blob, "\t")[1], split(blob, "\t")[2], etc.
>>
>> I didn't get time to write the SerDe how-to last week. Will start to
>> write it today.
>> The how-to will go into contrib directory (see
>> https://issues.apache.org/jira/browse/HIVE-639 ) and with some
>> examples.
>>
>> Zheng
>>
>> On Thu, Jul 16, 2009 at 1:17 AM, Saurabh Nanda<[email protected]>
>> wrote:
>> >
>> >
>> >> So, I'm back to square one. Is there *any* way I can do this using Hive
>> >> alone? I'm fine with running the data through multiple passes, putting
>> it in
>> >> temporary tables, if need be. Should I be looking at UDF or SerDe to
>> achieve
>> >> this?
>> >
>> > One way, I'm trying out is to have multiple UDFs, each taking the raw
>> log
>> > entry as input and returning a specific field. For example,
>> > extract_ip_address, extract_apache_uid, extract_uri, etc.
>> >
>> > Anything simpler?
>> >
>> > Saurabh.
>> > --
>> > http://nandz.blogspot.com
>> > http://foodieforlife.blogspot.com
>> >
>>
>>
>>
>> --
>> Yours,
>> Zheng
>>
>
>
>
> --
> http://nandz.blogspot.com
> http://foodieforlife.blogspot.com
>
