@Carl: What would be the best way to check the TaskTracker? I loaded
up the HWI and didn't see any jobs running. Could it be that it
wasn't finding the correct metastore (I was getting a Jetty error when
I went to List Tables)? Or is there another way?
Also, the file is just text, not gzip, but I see your point about the
file being unsplittable. Could the SerDe I was using have something to
do with it being unsplittable (see below)?
@Edward: I didn't try lowering the number of tasks—I was assuming that
since only 1 core was ever getting used there was only 1 map task
running. (BTW, what's the best way to see the current values for
these settings?) Is that a valid assumption to make?
I did manage to have success after changing the SerDe for the source
table. Before I was using:
CREATE TABLE apachelog (
  ipaddress STRING, ident STRING, user STRING, finishtime STRING,
  request STRING, returncode INT, size INT, referer STRING, agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDe'
WITH SERDEPROPERTIES (
  'serialization.format'='org.apache.hadoop.hive.serde2.thrift.TCTLSeparatedProtocol',
  'quote.delim'='("|\\[|\\])',
  'field.delim'=' ',
  'serialization.null.format'='-')
STORED AS TEXTFILE;
Based on the examples I found here:
http://www.johnandcailin.com/blog/cailin/exploring-apache-log-files-using-hive-and-hadoop
Now, I'm preprocessing using a script that reads an access log from
STDIN and writes a tab-formatted version to STDOUT without any quote
characters (eventually I'll look into Hadoop streaming when I'm
dealing with bigger files). Then I create a table in Hive over the
tab version using:
CREATE TABLE access_log (
  ip STRING, ident STRING, user STRING, time STRING,
  method STRING, resource STRING, protocol STRING,
  status INT, length INT, referer STRING, agent STRING )
ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t"
LINES TERMINATED BY "\n"
STORED AS TEXTFILE;
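For anyone curious, the preprocessing script is roughly this kind of thing (a simplified sketch, not my exact script; the regex assumes the standard Apache "combined" log format and drops lines that don't parse):

```python
import re
import sys

# Standard Apache "combined" log line:
#   ip ident user [time] "method resource protocol" status size "referer" "agent"
LOG_RE = re.compile(
    r'(\S+) (\S+) (\S+) \[([^\]]+)\] '
    r'"(\S+) (\S+) (\S+)" (\S+) (\S+) "([^"]*)" "([^"]*)"'
)

def to_tsv(line):
    """Return the line as 11 tab-separated fields, or None if it doesn't parse."""
    m = LOG_RE.match(line)
    if m is None:
        return None
    return '\t'.join(m.groups())

if __name__ == '__main__':
    for line in sys.stdin:
        row = to_tsv(line.rstrip('\n'))
        if row is not None:
            sys.stdout.write(row + '\n')
```

The 11 groups line up with the 11 columns in the access_log table above.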
The time difference is massive: the preprocessing script takes about
4 minutes to run, and the CTAS query I posted at the beginning of the
thread now runs in about a minute.
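That gap seems believable just from the per-line parsing cost: the delimited table only has to split on tabs, while the old setup was doing regex-style tokenizing on every row. A rough illustration of the difference (plain Python, not Hive internals, so the absolute numbers mean nothing; only the ratio is the point):

```python
import re
import timeit

RAW = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
       '"GET /products/42 HTTP/1.0" 200 2326 '
       '"http://example.com/" "Mozilla/4.08"')

# Regex for the combined log format: one match per line, 11 groups.
LOG_RE = re.compile(
    r'(\S+) (\S+) (\S+) \[([^\]]+)\] '
    r'"(\S+) (\S+) (\S+)" (\S+) (\S+) "([^"]*)" "([^"]*)"'
)

# The same record, pre-converted to tab-separated form.
TSV = '\t'.join(LOG_RE.match(RAW).groups())

regex_s = timeit.timeit(lambda: LOG_RE.match(RAW).groups(), number=200_000)
split_s = timeit.timeit(lambda: TSV.split('\t'), number=200_000)
print('regex parse: %.3fs, tab split: %.3fs' % (regex_s, split_s))
```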
So I'm wondering: is it safe to assume that my previous SerDe was the
problem? Which SerDe does "ROW FORMAT DELIMITED FIELDS..." use (and is
it just syntactic sugar for a SERDE... SERDEPROPERTIES declaration, or
is there more going on there)?
As an observation: I think my problem came from my RDBMS background.
I improperly assumed that by creating a table, Hive would somehow
optimize the underlying data (and therefore surely that couldn't be
the problem with this query!). I later read the part about Hive not
taking ownership of the data. Not sure if I have a point here; just
wondering whether other people hit similar problems when they started
out.
Thanks for all your help, everyone.
–Andrew
On Tue, Nov 17, 2009 at 5:08 PM, Edward Capriolo <[email protected]> wrote:
> On Tue, Nov 17, 2009 at 2:24 PM, Andrew O'Brien <[email protected]>
> wrote:
>> Hi everyone,
>>
>> So I'm evaluating Hive for an Apache access log processing job (who
>> isn't? ;) and for testing I've got a logfile that's about 1 million
>> lines/245MB. I've loaded it into a table and now I want to extract
>> out some ids from the request urls and filter out any requests without
>> any ids. Here's the query I'm running:
>>
>> CREATE TABLE access_with_company_and_product AS
>> SELECT * FROM (
>> SELECT ipaddress, ident, user, finishtime,
>> request, returncode, size, referer, agent,
>> regexp_extract(request, '/products/(\\d+)', 1) AS product_id,
>> regexp_extract(request, '/companies/(\\d+)', 1) AS company_id
>> FROM apachelog
>> ) hit WHERE hit.product_id IS NOT NULL OR hit.company_id IS NOT NULL;
>>
>> It's been going for about 3 hours now and it says it's only 2% through
>> the map. So I'm wondering: is this the normal rate, or am I doing
>> something particularly inefficient here? Or have I missed a
>> configuration setting?
>>
>> I'm on a 2.53 GHz Core 2 Duo MacBook Pro with 4GB RAM running the
>> stock configuration (Hive trunk, I'm pretty sure). At any one point,
>> it appears that only 1 core is really running at full and I've had at
>> least a couple hundred MB of memory free the whole time.
>>
>> Any advice would be very appreciated.
>>
>> –Andrew
>>
>
> Also, with small datasets it can be better to force the number of maps
> and reducers lower. Sometimes maps:1 and reduces:1 is better than the
> default:
>
> set mapred.map.tasks=1;
> set mapred.reduce.tasks=1;
>
> Or try 5 and 3; you need to find appropriate values based on the input size.
>