@Carl: What would be the best way to check the TaskTracker? I loaded
up the HWI and didn't see any jobs running. Could it be that it
wasn't finding the correct metastore (I was getting a Jetty error when
I went to List Tables)? Or is there another way?
Also, the file is just text, not gzip, but I see your point about the
file being unsplittable. Could the SerDe I was using have something to
do with it being unsplittable (see below)?
@Edward: I didn't try lowering the number of tasks—I was assuming that
since only 1 core was ever getting used there was only 1 map task
running. (BTW, what's the best way to see the current values for
these settings?) Is that a valid assumption to make?
I did manage to have success after changing the SerDe for the source
table. Before I was using:
CREATE TABLE apachelog (
  ipaddress STRING, ident STRING, user STRING, finishtime STRING,
  request STRING, returncode INT, size INT, referer STRING, agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDe'
WITH SERDEPROPERTIES (
  'serialization.format'='org.apache.hadoop.hive.serde2.thrift.TCTLSeparatedProtocol',
  'quote.delim'='("|\\[|\\])',
  'field.delim'=' ',
  'serialization.null.format'='-')
STORED AS TEXTFILE;
Based on the examples I found here:
http://www.johnandcailin.com/blog/cailin/exploring-apache-log-files-using-hive-and-hadoop
Now, I'm preprocessing using a script that reads an access log from
STDIN and writes a tab-formatted version to STDOUT without any quote
characters (eventually I'll look into Hadoop streaming when I'm
dealing with bigger files). Then I create a table in Hive over the
tab version using:
CREATE TABLE access_log (
  ip STRING, ident STRING, user STRING, time STRING,
  method STRING, resource STRING, protocol STRING,
  status INT, length INT, referer STRING, agent STRING )
ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t"
LINES TERMINATED BY "\n"
STORED AS TEXTFILE;
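For anyone curious, the preprocessing script is roughly this kind of thing (a simplified sketch, not my exact script; the regex assumes the standard Apache "combined" log format and drops lines that don't parse):

```python
import re
import sys

# Standard Apache "combined" log line:
#   ip ident user [time] "method resource protocol" status size "referer" "agent"
LOG_RE = re.compile(
    r'(\S+) (\S+) (\S+) \[([^\]]+)\] '
    r'"(\S+) (\S+) (\S+)" (\S+) (\S+) "([^"]*)" "([^"]*)"'
)

def to_tsv(line):
    """Return the line as 11 tab-separated fields, or None if it doesn't parse."""
    m = LOG_RE.match(line)
    if m is None:
        return None
    return '\t'.join(m.groups())

if __name__ == '__main__':
    for line in sys.stdin:
        row = to_tsv(line.rstrip('\n'))
        if row is not None:
            sys.stdout.write(row + '\n')
```

The 11 groups line up with the 11 columns in the access_log table above.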
The time difference is massive: the preprocessing script takes about
4 minutes to run, and the CTAS query I posted at the beginning of the
thread now runs in about a minute.
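That gap seems believable just from the per-line parsing cost: the delimited table only has to split on tabs, while the old setup was doing regex-style tokenizing on every row. A rough illustration of the difference (plain Python, not Hive internals, so the absolute numbers mean nothing; only the ratio is the point):

```python
import re
import timeit

RAW = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
       '"GET /products/42 HTTP/1.0" 200 2326 '
       '"http://example.com/" "Mozilla/4.08"')

# Regex for the combined log format: one match per line, 11 groups.
LOG_RE = re.compile(
    r'(\S+) (\S+) (\S+) \[([^\]]+)\] '
    r'"(\S+) (\S+) (\S+)" (\S+) (\S+) "([^"]*)" "([^"]*)"'
)

# The same record, pre-converted to tab-separated form.
TSV = '\t'.join(LOG_RE.match(RAW).groups())

regex_s = timeit.timeit(lambda: LOG_RE.match(RAW).groups(), number=200_000)
split_s = timeit.timeit(lambda: TSV.split('\t'), number=200_000)
print('regex parse: %.3fs, tab split: %.3fs' % (regex_s, split_s))
```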
So I'm wondering: is it safe to assume that my previous SerDe was the
problem? Which SerDe does "ROW FORMAT DELIMITED FIELDS..." use (and is
it just syntactic sugar for a SERDE... SERDEPROPERTIES declaration, or
is there more going on there)?
As an observation: I think my problem came from my RDBMS background.
I improperly assumed that by creating a table, Hive would somehow
optimize the underlying data (and therefore surely that couldn't be
the problem with this query!). I later read the part about Hive not
taking ownership of the data. Not sure if I have a point here; just
wondering whether other people hit similar problems when they started
out.
Thanks for all your help, everyone.
–Andrew
On Tue, Nov 17, 2009 at 5:08 PM, Edward Capriolo <[email protected]> wrote:
> On Tue, Nov 17, 2009 at 2:24 PM, Andrew O'Brien <[email protected]>
> wrote:
>> Hi everyone,
>>
>> So I'm evaluating Hive for an Apache access log processing job (who
>> isn't? ;) and for testing I've got a logfile that's about 1 million
>> lines/245MB. I've loaded it into a table and now I want to extract
>> out some ids from the request urls and filter out any requests without
>> any ids. Here's the query I'm running:
>>
>> CREATE TABLE access_with_company_and_product AS
>> SELECT * FROM (
>> SELECT ipaddress, ident, user, finishtime,
>> request, returncode, size, referer, agent,
>> regexp_extract(request, '/products/(\\d+)', 1) AS product_id,
>> regexp_extract(request, '/companies/(\\d+)', 1) AS company_id
>> FROM apachelog
>> ) hit WHERE hit.product_id IS NOT NULL OR hit.company_id IS NOT NULL;
>>
>> It's been going for about 3 hours now and it says it's only 2% through
>> the map. So I'm wondering: is this the normal rate, or am I doing
>> something particularly inefficient here? Or have I missed a
>> configuration setting?
>>
>> I'm on a 2.53 GHz Core 2 Duo MacBook Pro with 4GB RAM running the
>> stock configuration (Hive trunk, I'm pretty sure). At any one point,
>> it appears that only 1 core is really running at full and I've had at
>> least a couple hundred MB of memory free the whole time.
>>
>> Any advice would be very appreciated.
>>
>> –Andrew
>>
>
> Also, with small datasets it can be better to force the number of maps
> and reducers lower. Sometimes maps:1 and reduces:1 is better than the
> default:
>
> set mapred.map.tasks=1;
> set mapred.reduce.tasks=1;
>
> Or try 5 and 3; you need to find appropriate values based on the input size.
>