Hi everyone,

So I'm evaluating Hive for an Apache access log processing job (who
isn't? ;) and for testing I've got a log file of about 1 million
lines (245MB). I've loaded it into a table, and now I want to extract
some IDs from the request URLs and filter out any requests that don't
contain one. Here's the query I'm running:

CREATE TABLE access_with_company_and_product AS
SELECT * FROM (
  SELECT ipaddress, ident, user, finishtime,
         request, returncode, size, referer, agent,
         regexp_extract(request, '/products/(\\d+)', 1) AS product_id,
         regexp_extract(request, '/companies/(\\d+)', 1) AS company_id
  FROM apachelog
) hit
WHERE hit.product_id IS NOT NULL OR hit.company_id IS NOT NULL;

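To make the intent concrete, here is a minimal Python sketch of the same extract-and-filter logic. The sample request strings and the helper names (extract_id, filter_hits) are made up for illustration and are not part of my Hive setup. It assumes a request that matches neither pattern counts as having no ID; Hive's regexp_extract may return an empty string rather than NULL in that case, so the IS NOT NULL test above may not behave like this sketch.

```python
import re

# Same patterns as in the Hive query (there, \\d is the escaped form of \d).
PRODUCT_RE = re.compile(r"/products/(\d+)")
COMPANY_RE = re.compile(r"/companies/(\d+)")


def extract_id(pattern, request):
    """Return the first captured ID as a string, or None if nothing matches."""
    match = pattern.search(request)
    return match.group(1) if match else None


def filter_hits(requests):
    """Keep only requests that carry a product ID or a company ID."""
    hits = []
    for request in requests:
        product_id = extract_id(PRODUCT_RE, request)
        company_id = extract_id(COMPANY_RE, request)
        if product_id is not None or company_id is not None:
            hits.append((request, product_id, company_id))
    return hits


# Hypothetical request strings, only for illustration.
sample_requests = [
    "GET /products/42 HTTP/1.1",
    "GET /companies/7/products/13 HTTP/1.1",
    "GET /index.html HTTP/1.1",
]
hits = filter_hits(sample_requests)
```

With these three sample requests, only the first two should survive the filter.
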
The Hive query has been running for about 3 hours now and reports
that it's only 2% through the map phase. Is this a normal rate, or am
I doing something particularly inefficient here? Or have I missed a
configuration setting?

I'm on a 2.53 GHz Core 2 Duo MacBook Pro with 4GB RAM running the
stock configuration (Hive trunk, I'm pretty sure). At any given time,
it appears that only 1 core is really running at full load, and I've
had at least a couple hundred MB of memory free the whole time.

Any advice would be much appreciated.
–Andrew