I don't see a clear solution in that mailing list thread: simply
keeping a TaskTracker.Child running longer won't solve the problem
nicely, because tasks from different jobs need different classpaths,
and I guess this is only supported in later versions of Hadoop.

One simple way to go is to add the jars to hadoop-env.sh (which will
add those jars to the classpath of the TaskTracker). This is not a
nice solution, but it does give us all of the performance gain no
matter which Hadoop version we are using.

I think a better solution would be to add an option,
"mapred.local.classpath", to JobConf that specifies the paths of jars
already present on the machines in the cluster. This should be done on
the Hadoop side, at the beginning of the main function in
TaskTracker.Child (if TaskTracker.Child is reused, then we need to
reset the classpath each time it runs a new task).
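
To make the idea concrete, here is a rough sketch in plain Java. It
is only an illustration under several assumptions:
"mapred.local.classpath" is just the option proposed above and does
not exist in Hadoop, java.util.Properties stands in for JobConf, the
class and method names are made up, and the jar paths are
placeholders. Building a fresh class loader for every task is one
possible way to reset the classpath when a child JVM is reused.

```java
import java.io.File;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

final class LocalClasspathSketch {
    // Hypothetical key from this mail; NOT an existing Hadoop option.
    static final String KEY = "mapred.local.classpath";

    // Split a path-separator-delimited list of local jar paths into URLs.
    static URL[] toUrls(String value) {
        List<URL> urls = new ArrayList<URL>();
        if (value == null) {
            return new URL[0];
        }
        for (String entry : value.split(File.pathSeparator)) {
            String path = entry.trim();
            if (path.isEmpty()) {
                continue; // ignore empty entries such as "a::b"
            }
            try {
                urls.add(new File(path).toURI().toURL());
            } catch (MalformedURLException e) {
                throw new IllegalArgumentException("Bad classpath entry: " + path, e);
            }
        }
        return urls.toArray(new URL[0]);
    }

    // Build a new loader per task so that jars from a previous job
    // do not leak into the next one when the child JVM is reused.
    static URLClassLoader loaderForTask(Properties jobConf, ClassLoader parent) {
        return new URLClassLoader(toUrls(jobConf.getProperty(KEY)), parent);
    }

    public static void main(String[] args) {
        Properties conf = new Properties(); // stand-in for JobConf
        conf.setProperty(KEY, "/opt/hive/lib/example_exec.jar"
                + File.pathSeparator + "/opt/hive/auxlib/example_udf.jar");

        ClassLoader base = LocalClasspathSketch.class.getClassLoader();
        URLClassLoader taskLoader = loaderForTask(conf, base);

        // The child would install this before running the task code.
        Thread.currentThread().setContextClassLoader(taskLoader);
        System.out.println("classpath entries: " + taskLoader.getURLs().length);
    }
}
```

Keeping the base loader as the parent and swapping only the per-task
loader means the jars never have to be shipped with the job, which is
where the saving over the distributed cache would come from.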

What do you think?

Zheng

On Thu, Jul 30, 2009 at 11:54 AM, Edward Capriolo<[email protected]> wrote:
> On Fri, Jul 24, 2009 at 1:45 PM, Edward Capriolo<[email protected]> wrote:
>> On Fri, Jul 24, 2009 at 1:36 PM, Zheng Shao<[email protected]> wrote:
>>> Hive only needs to be installed at the node that runs the hive query.
>>> All the jars will be sent to the hadoop JobClient via -libjars. The
>>> code is in ExecDriver.java.
>>>
>>> In hadoop 0.17, I don't think there is a way to add a path to the
>>> classpath for a job (unless we put it in hadoop-env.sh and start
>>> TaskTracker with that path). Are there any changes in the later
>>> versions?
>>>
>>>
>>>
>>> Zheng
>>>
>>>
>>>
>>> On 7/24/09, Edward Capriolo <[email protected]> wrote:
>>>> I have been following some threads on the hadoop mailing list about
>>>> speeding up MR jobs. I have a few questions I am sure I can find the
>>>> answer to if I dig into the source code but I thought I could get a
>>>> quick answer.
>>>>
>>>> 1 ADD JAR 'myfile.jar'  uses the distributed cache. Using the
>>>> distributed cache has some overhead. I know if I create an auxlibs
>>>> directory under hive root, they will be added to libjars on startup.
>>>> If I add my jar to auxlibs on all my nodes, will a UDF in the jar be
>>>> available during subsequent jobs? Or is it only necessary to add those
>>>> jars to the auxlib on the node I start the job from?
>>>>
>>>> 2 Dealing with the entire hive install. How much of the hive install
>>>> really needs to be replicated on each datanode? If we used
>>>> distributed cache for everything the jobs would have unneeded
>>>> overhead, but hive would be 'installed on demand' from the client.
>>>>
>>>> Thanks,
>>>> Edward
>>>>
>>>
>>> --
>>> Sent from Gmail for mobile | mobile.google.com
>>>
>>> Yours,
>>> Zheng
>>>
>>
>> Zheng,
>>
>> A thread from the hadoop list piqued my interest. Search for
>> "hadoop jobs take long time to setup"
>>
>> http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200906.mbox/%[email protected]%3e
>>
>> Can hive benefit?
>> Edward
>>
>
> Could we use something like this for a performance increase? With the
> assumption that the jars are present on all task-trackers could we
> have an alternate invocation script such as bin/hive-local?
>
> Edward
>



-- 
Yours,
Zheng
