Hi everyone,

So I'm evaluating Hive for an Apache access log processing job (who
isn't? ;) and for testing I've got a logfile that's about 1 million
lines/245MB.  I've loaded it into a table and now I want to extract
product and company ids from the request URLs and filter out any
requests that have neither.  Here's the query I'm running:

CREATE TABLE access_with_company_and_product AS
SELECT * FROM (
  SELECT ipaddress, ident, user, finishtime,
    request, returncode, size, referer, agent,
    regexp_extract(request, '/products/(\\d+)', 1) AS product_id,
    regexp_extract(request, '/companies/(\\d+)', 1) AS company_id
  FROM apachelog
) hit
WHERE hit.product_id IS NOT NULL OR hit.company_id IS NOT NULL;

It's been going for about 3 hours now and says it's only 2% through
the map phase.  So I'm wondering: is this a normal rate, or am I doing
something particularly inefficient here?  Or have I missed a
configuration setting?

I'm on a 2.53 GHz Core 2 Duo MacBook Pro with 4GB RAM running the
stock configuration (Hive trunk, I'm pretty sure).  At any given point,
it appears that only 1 core is really running at full load, and I've
had at least a couple hundred MB of memory free the whole time.

Any advice would be much appreciated.

–Andrew
