The best way to check the task tracker?
Use the Hadoop JobTracker web interface, drill down into the job, and
see how many tasks it is trying to spawn for this job.

What's the best way to see the current values for these settings?
The defaults are whatever is specified in your Hadoop configuration:
mapred.map.tasks
mapred.reduce.tasks
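
From the Hive CLI you can print a property's current value by issuing
set with just the property name; set -v dumps every Hadoop/Hive
property. For example:

hive> set mapred.map.tasks;
hive> set mapred.reduce.tasks;
hive> set -v;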

This may not be your problem, but I will explain when it COULD be a
problem. Suppose one has a rather large cluster (30+ nodes) and the
default settings are
mapred.map.tasks=60
mapred.reduce.tasks=20
dfs.block.size=128MB

Let's say you are processing 64 MB of data.

Hadoop/Hive might decide to split that 64 MB of data up into 60 map
tasks. That is really inefficient, as each task is spawned to operate
on only a small piece of data (split). Then it has to shuffle and sort
all those splits for the reduce phase.
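
If you hit this, one way to rein it in (a sketch, assuming the older
mapred.* property names from this era) is to raise the minimum split
size, or force the task counts down, before running the query:

set mapred.min.split.size=134217728;
set mapred.map.tasks=1;
set mapred.reduce.tasks=1;

Note mapred.map.tasks is only a hint to the InputFormat, while
mapred.min.split.size actually constrains how splits are computed:
with a 128 MB minimum, the 64 MB input above becomes a single split.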

I do not think that is your problem, though. Based on your last email
it seems that the SerDe might have a bug / performance issue. Someone
also mentioned the operations may not be optimal.

> As an observation: I think my problem was that, coming from an RDBMS
> background, I improperly assumed that by creating a table, Hive would
> somehow optimize the underlying data

Hive does not optimize the data; the data stays in its original form.
For more performance you can query the source data into other tables,
using bucketing and possibly compression to achieve better
performance/size. Hive aims to work with any data through the SerDe,
which stands for Serializer/Deserializer, rather than force data into
a 'Hive format'.
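
For example, a bucketed, compressed copy of a raw log table might look
like this (a sketch only: the column list is abbreviated, and property
names can differ across Hive versions):

CREATE TABLE access_log_bucketed (
  ip STRING, time STRING, resource STRING, status INT)
CLUSTERED BY (ip) INTO 16 BUCKETS
STORED AS SEQUENCEFILE;

set hive.exec.compress.output=true;
set mapred.reduce.tasks=16;
INSERT OVERWRITE TABLE access_log_bucketed
SELECT ip, time, resource, status FROM access_log
CLUSTER BY ip;

Setting the reducer count to the bucket count and using CLUSTER BY is
what routes each row to its bucket; queries that sample or join on ip
can then read only the relevant buckets.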

The approach is more brute-force than a traditional RDBMS. Low-latency
queries on small data sets suffer, but they should not suffer as much
as in your original problem.


On Wed, Nov 18, 2009 at 11:52 AM, Andrew O'Brien
<[email protected]> wrote:
> @Carl: What would be the best way to check the tasktracker?  I loaded
> up the HWI and didn't see any jobs running.  Could it be that it
> wasn't finding the correct metastore (I was getting a jetty error when
> I went to List Tables)?  Or is there another way?
>
> Also the file is just text, not gzip, but I see your point about the
> file being unsplittable.  Could the SerDe I was using have something to
> do with it being unsplittable (see below)?
>
> @Edward: I didn't try lowering the number of tasks—I was assuming that
> since only 1 core was ever getting used there was only 1 map task
> running.  (BTW, what's the best way to see the current values for
> these settings?)  Is that a valid assumption to make?
>
> I did manage to have success after changing the SerDe for the source
> table.  Before I was using:
>
> CREATE TABLE apachelog (
>  ipaddress STRING, ident STRING, user STRING, finishtime STRING,
>  request string, returncode INT, size INT, referer STRING, agent STRING)
> ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDe'
> WITH SERDEPROPERTIES (
>  'serialization.format'='org.apache.hadoop.hive.serde2.thrift.TCTLSeparatedProtocol',
>  'quote.delim'='("|\\[|\\])',
>  'field.delim'=' ',
>  'serialization.null.format'='-')
> STORED AS TEXTFILE;
>
> Based on the examples I found here:
> http://www.johnandcailin.com/blog/cailin/exploring-apache-log-files-using-hive-and-hadoop
>
> Now, I'm preprocessing using a script that reads an access log from
> STDIN and writes a tab-formatted version to STDOUT without any quote
> characters (eventually I'll look into Hadoop streaming when I'm
> dealing with bigger files).  Then I create a table in Hive using the
> tab version using:
>
> CREATE TABLE access_log (
>    ip STRING, ident STRING, user STRING, time STRING,
>    method STRING, resource STRING, protocol STRING,
>    status INT, length INT, referer STRING, agent STRING )
> ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t" LINES TERMINATED BY
> "\n" STORED AS TEXTFILE
>
> Massive time difference—about 4 minutes to run the preprocess script
> and my CTAS query that I posted in the beginning of the thread ran in
> about a minute.
>
> So I'm wondering: is it safe to assume that my previous SerDe was the
> problem?  Which SerDe does "FORMAT DELIMITED FIELDS..." use (and is it
> just syntactic sugar for a SERDE... SERDEPROPERTIES declaration or is
> there more going on there)?
>
> As an observation: I think my problem was that, coming from an RDBMS
> background, I improperly assumed that by creating a table, Hive would
> somehow optimize the underlying data (and therefore surely that
> couldn't be the problem with this query!).  I later read the part
> about Hive not taking ownership of the data.  Not sure if I have a
> point here—just wondering if other people had similar problems when
> they started out.
>
> Thanks for all your help, everyone.
> –Andrew
>
> On Tue, Nov 17, 2009 at 5:08 PM, Edward Capriolo <[email protected]> 
> wrote:
>> On Tue, Nov 17, 2009 at 2:24 PM, Andrew O'Brien <[email protected]> 
>> wrote:
>>> Hi everyone,
>>>
>>> So I'm evaluating Hive for an Apache access log processing job (who
>>> isn't? ;) and for testing I've got a logfile that's about 1 million
>>> lines/245MB.  I've loaded it into a table and now I want to extract
>>> out some ids from the request urls and filter out any requests without
>>> any ids.  Here's the query I'm running:
>>>
>>> CREATE TABLE access_with_company_and_product AS
>>> SELECT * FROM (
>>>  SELECT ipaddress, ident, user, finishtime,
>>>    request, returncode, size, referer, agent,
>>>  regexp_extract(request, '/products/(\\d+)', 1) AS product_id,
>>>  regexp_extract(request, '/companies/(\\d+)', 1) AS company_id
>>>  FROM apachelog
>>> ) hit WHERE hit.product_id IS NOT NULL OR hit.company_id IS NOT NULL;
>>>
>>> It's been going for about 3 hours now and says it's only 2% through
>>> the map.  So I'm wondering is this the normal rate or am I doing
>>> something particularly inefficient here?  Or have I missed a
>>> configuration setting?
>>>
>>> I'm on a 2.53 GHz Core 2 Duo MacBook Pro with 4GB RAM running the
>>> stock configuration (Hive trunk, I'm pretty sure).  At any one point,
>>> it appears that only 1 core is really running at full and I've had at
>>> least a couple hundred MB of memory free the whole time.
>>>
>>> Any advice would be very appreciated.
>>>
>>> –Andrew
>>>
>>
>> Also, with small datasets it can be better to force the number of maps
>> and reducers lower. Sometimes maps:1 and reduces:1 is better than the
>> default.
>>
>> set mapred.map.tasks=1
>> set mapred.reduce.tasks=1
>>
>> Or try 5 and 3; you need to find appropriate values based on the input size.
>>
>
>
>
> --
>
> –Andrew
>
