On Tue, Nov 17, 2009 at 2:24 PM, Andrew O'Brien <[email protected]> wrote:
> Hi everyone,
>
> So I'm evaluating Hive for an Apache access log processing job (who
> isn't? ;) and for testing I've got a logfile that's about 1 million
> lines/245MB. I've loaded it into a table and now I want to extract
> out some ids from the request urls and filter out any requests without
> any ids. Here's the query I'm running:
>
> CREATE TABLE access_with_company_and_product AS
> SELECT * FROM (
>   SELECT ipaddress, ident, user, finishtime,
>          request, returncode, size, referer, agent,
>          regexp_extract(request, '/products/(\\d+)', 1) AS product_id,
>          regexp_extract(request, '/companies/(\\d+)', 1) AS company_id
>   FROM apachelog
> ) hit WHERE hit.product_id IS NOT NULL OR hit.company_id IS NOT NULL;
>
> It's been going for about 3 hours now and says it's only 2% through
> the map. So I'm wondering: is this the normal rate, or am I doing
> something particularly inefficient here? Or have I missed a
> configuration setting?
>
> I'm on a 2.53 GHz Core 2 Duo MacBook Pro with 4GB RAM running the
> stock configuration (Hive trunk, I'm pretty sure). At any one point,
> it appears that only 1 core is really running at full, and I've had at
> least a couple hundred MB of memory free the whole time.
>
> Any advice would be very appreciated.
>
> –Andrew
Also, with small datasets it can be better to force the number of maps and reducers lower. Sometimes 1 map and 1 reduce is better than the default:

    set mapred.map.tasks=1;
    set mapred.reduce.tasks=1;

Or try something like 5 maps and 3 reduces; you need to find appropriate values based on the input size.
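Put together, the suggestion might look like this in a Hive session (a sketch only: the table and query are taken from the original message, and the task counts of 1/1 are just a starting point to tune against your input size):

```sql
-- Force small task counts for a small (~245MB) input; adjust as needed.
set mapred.map.tasks=1;
set mapred.reduce.tasks=1;

CREATE TABLE access_with_company_and_product AS
SELECT * FROM (
  SELECT ipaddress, ident, user, finishtime,
         request, returncode, size, referer, agent,
         regexp_extract(request, '/products/(\\d+)', 1) AS product_id,
         regexp_extract(request, '/companies/(\\d+)', 1) AS company_id
  FROM apachelog
) hit
WHERE hit.product_id IS NOT NULL OR hit.company_id IS NOT NULL;
```

The SET commands only apply for the current session, so it's easy to experiment with different values and re-run the query.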
