Hi Andrew,

It's possible that the map phase is taking so long because only a single map task is running on your system. If that is the case, it's probably because the apachelog table is an external table that references a non-splittable file, for example a gzipped logfile: Hadoop cannot divide such a file among several map tasks, so one task has to process all of it. I recommend checking the tasktracker to see how many map tasks are running. If you see more than one map task, you can rule this problem out.
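To illustrate why a gzipped logfile counts as non-splittable, here is a minimal Python sketch. It is not Hive or Hadoop code, and the sample log line and the helper name can_decode_from are made up for the illustration. It only shows that a gzip stream can be decoded from its first byte but not from an arbitrary offset in the middle, which is what a second map task would have to do.

```python
import gzip
import zlib


def can_decode_from(blob: bytes, offset: int) -> bool:
    """Return True if gzip decoding succeeds when started at `offset`."""
    try:
        gzip.decompress(blob[offset:])
        return True
    except (OSError, EOFError, zlib.error):
        # Starting mid-stream, the decoder finds no valid gzip header
        # and lacks the compression state built up from earlier bytes.
        return False


# Hypothetical stand-in for an Apache access log.
log = b"GET /products/42 HTTP/1.1\n" * 10000
blob = gzip.compress(log)

print(can_decode_from(blob, 0))               # decode from the start
print(can_decode_from(blob, len(blob) // 2))  # decode from the middle
```

Under these assumptions the first call should succeed and the second should fail. An uncompressed text file has no such restriction, because a reader can start at any byte offset and skip ahead to the next newline.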
Thanks.

Carl

On Tue, Nov 17, 2009 at 11:24 AM, Andrew O'Brien <[email protected]> wrote:
> Hi everyone,
>
> So I'm evaluating Hive for an Apache access log processing job (who
> isn't? ;) and for testing I've got a logfile that's about 1 million
> lines/245MB. I've loaded it into a table and now I want to extract
> out some ids from the request urls and filter out any requests without
> any ids. Here's the query I'm running:
>
> CREATE TABLE access_with_company_and_product AS
> SELECT * FROM (
>   SELECT ipaddress, ident, user, finishtime,
>     request, returncode, size, referer, agent,
>     regexp_extract(request, '/products/(\\d+)', 1) AS product_id,
>     regexp_extract(request, '/companies/(\\d+)', 1) AS company_id
>   FROM apachelog
> ) hit WHERE hit.product_id IS NOT NULL OR hit.company_id IS NOT NULL;
>
> It's been going for about 3 hours now and says it's only 2% through
> the map. So I'm wondering is this the normal rate or am I doing
> something particularly inefficient here? Or have I missed a
> configuration setting?
>
> I'm on a 2.53 GHz Core 2 Duo MacBook Pro with 4GB RAM running the
> stock configuration (Hive trunk, I'm pretty sure). At any one point,
> it appears that only 1 core is really running at full and I've had at
> least a couple hundred MB of memory free the whole time.
>
> Any advice would be very appreciated.
>
> –Andrew
>
