On Tue, Nov 17, 2009 at 2:24 PM, Andrew O'Brien <[email protected]> wrote:
> Hi everyone,
>
> So I'm evaluating Hive for an Apache access log processing job (who
> isn't? ;) and for testing I've got a logfile that's about 1 million
> lines/245MB.  I've loaded it into a table and now I want to extract
> out some ids from the request urls and filter out any requests without
> any ids.  Here's the query I'm running:
>
> CREATE TABLE access_with_company_and_product AS
> SELECT * FROM (
>  SELECT ipaddress, ident, user, finishtime,
>    request, returncode, size, referer, agent,
>  regexp_extract(request, '/products/(\\d+)', 1) AS product_id,
>  regexp_extract(request, '/companies/(\\d+)', 1) AS company_id
>  FROM apachelog
> ) hit WHERE hit.product_id IS NOT NULL OR hit.company_id IS NOT NULL;
>
> It's been going for about 3 hours now and says it's only 2% through
> the map.  So I'm wondering is this the normal rate or am I doing
> something particularly inefficient here?  Or have I missed a
> configuration setting?
>
> I'm on a 2.53 GHz Core 2 Duo MacBook Pro with 4GB RAM running the
> stock configuration (Hive trunk, I'm pretty sure).  At any one point,
> it appears that only 1 core is really running at full and I've had at
> least a couple hundred MB of memory free the whole time.
>
> Any advice would be very appreciated.
>
> –Andrew
>

Also, with small datasets it can be better to force the number of map
and reduce tasks lower. Sometimes one map and one reduce task is
better than the default:

set mapred.map.tasks=1;
set mapred.reduce.tasks=1;

Or try something like 5 and 3; you need to find appropriate values
based on the input size.
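For example, in the Hive CLI these settings apply for the rest of the
session, so you can set them just before running the CTAS query
(values here are illustrative, not tuned for your data):

set mapred.map.tasks=1;
set mapred.reduce.tasks=1;
-- then run the CREATE TABLE ... AS SELECT ... query as before

One caveat: mapred.map.tasks is only a hint to Hadoop; the actual
number of maps is driven by the input splits (block size and
mapred.min.split.size), so it may not be honored exactly, while
mapred.reduce.tasks is used as given.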
