If you are running a 32-bit JVM then, depending on the OS, you can't go beyond roughly 3 GB of heap.
If you are on Windows, 2 GB of heap is your best bet for a 32-bit JVM.
Try this:
Edit conf/hadoop-env.sh and add
export HADOOP_CLIENT_OPTS="-Xmx3G ${HADOOP_CLIENT_OPTS}"
and then run your hadoop commands.
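To verify that the setting actually reached the client JVM, a quick check is something like the sketch below (the class name HeapCheck is just a placeholder); run it once with plain java and once via bin/hadoop jar, and compare the heap each launcher grants:

import java.text.NumberFormat;

public class HeapCheck {
    public static void main(String[] args) {
        // Maximum heap this JVM will attempt to use, i.e. the effective -Xmx.
        long maxBytes = Runtime.getRuntime().maxMemory();
        System.out.println("Max heap available to this JVM: "
                + NumberFormat.getInstance().format(maxBytes) + " bytes");
    }
}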
-Bharath
From: Shi Yu <[email protected]>
To: [email protected]
Cc:
Sent: Wednesday, October 13, 2010 4:18:17 PM
Subject: Re: load a serialized object in hadoop
Hi, I tried the following five ways:
Approach 1: in command line
HADOOP_CLIENT_OPTS=-Xmx4000m bin/hadoop jar WordCount.jar OOloadtest
Approach 2: I added the following element to the hadoop-site.xml file.
Each time I changed it, I stopped and restarted hadoop on all the nodes.
...
<property>
<name>HADOOP_CLIENT_OPTS</name>
<value>-Xmx4000m</value>
</property>
run the command
$bin/hadoop jar WordCount.jar OOloadtest
Approach 3: I changed it to this:
...
<property>
<name>HADOOP_CLIENT_OPTS</name>
<value>4000m</value>
</property>
....
Then run the command:
$bin/hadoop jar WordCount.jar OOloadtest
Approach 4: To make sure, I replaced the "m" suffix with the number of bytes:
...
<property>
<name>HADOOP_CLIENT_OPTS</name>
<value>4000000000</value>
</property>
....
Then run the command:
$bin/hadoop jar WordCount.jar OOloadtest
All four of these approaches lead to the same "Java heap space" error.
java.lang.OutOfMemoryError: Java heap space
    at java.lang.AbstractStringBuilder.<init>(AbstractStringBuilder.java:45)
    at java.lang.StringBuilder.<init>(StringBuilder.java:68)
    at java.io.ObjectInputStream$BlockDataInputStream.readUTFBody(ObjectInputStream.java:2997)
    at java.io.ObjectInputStream$BlockDataInputStream.readUTF(ObjectInputStream.java:2818)
    at java.io.ObjectInputStream.readString(ObjectInputStream.java:1599)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1320)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:351)
    at java.util.HashMap.readObject(HashMap.java:1028)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:974)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1846)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1753)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1329)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:351)
    at ObjectManager.loadObject(ObjectManager.java:42)
    at OOloadtest.main(OOloadtest.java:21)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
    at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
    at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
Approach 5:
For comparison, I called the Java command directly as follows (there is a
timer showing how long it takes if the serialized object is successfully
loaded):
$java -Xms3G -Xmx3G -classpath
.:WordCount.jar:hadoop-0.19.2-core.jar:lib/log4j-1.2.15.jar OOloadtest
return:
object loaded, timing (hms): 0 hour(s) 1 minute(s) 12 second(s) 162 millisecond(s)
What was the problem with my command? Where can I find the documentation
about HADOOP_CLIENT_OPTS? Have you tried the same thing and found that it works?
Shi
On 2010-10-13 16:28, Luke Lu wrote:
> On Wed, Oct 13, 2010 at 2:21 PM, Shi Yu<[email protected]> wrote:
>
>> Hi, thanks for the advice. I tried with your settings,
>> $ bin/hadoop jar Test.jar OOloadtest -D HADOOP_CLIENT_OPTS=-Xmx4000m
>> but there is still no effect. Or is this a system variable? Should I export
>> it? How do I configure it?
>>
> HADOOP_CLIENT_OPTS is an environment variable, so you should run it as
> HADOOP_CLIENT_OPTS=-Xmx1000m bin/hadoop jar Test.jar OOloadtest
>
> if you use sh-derivative shells (bash, ksh, etc.); prepend env for other shells.
>
> __Luke
>
>
>
>> Shi
>>
>> java -Xms3G -Xmx3G -classpath
>> .:WordCount.jar:hadoop-0.19.2-core.jar:lib/log4j-1.2.15.jar:lib/commons-collections-3.2.1.jar:lib/stanford-postagger-2010-05-26.jar
>> OOloadtest
>>
>>
>> On 2010-10-13 15:28, Luke Lu wrote:
>>
>>> On Wed, Oct 13, 2010 at 12:27 PM, Shi Yu<[email protected]> wrote:
>>>
>>>
>>>> I haven't implemented anything in map/reduce yet for this issue. I just
>>>> tried to invoke the same java class using the bin/hadoop command. The
>>>> thing is that a very simple program can be executed in Java, but is not
>>>> doable via the bin/hadoop command.
>>>>
>>>>
>>> If you are just trying to use bin/hadoop jar your.jar command, your
>>> code runs in a local client jvm and mapred.child.java.opts has no
>>> effect. You should run it with HADOOP_CLIENT_OPTS=-Xmx1000m bin/hadoop
>>> jar your.jar
>>>
>>>
>>>
>>>> I think if I couldn't get through the first stage, then even if I had a
>>>> map/reduce program it would also fail. I am using Hadoop 0.19.2. Thanks.
>>>>
>>>> Best Regards,
>>>>
>>>> Shi
>>>>
>>>> On 2010-10-13 14:15, Luke Lu wrote:
>>>>
>>>>
>>>>> Can you post your mapper/reducer implementation? Or are you using
>>>>> hadoop streaming, for which mapred.child.java.opts doesn't apply to
>>>>> the jvm you care about? BTW, what's the hadoop version you're using?
>>>>>
>>>>> On Wed, Oct 13, 2010 at 11:45 AM, Shi Yu<[email protected]> wrote:
>>>>>
>>>>>
>>>>>
>>>>>> Here is my code. There is no Map/Reduce in it. I could run this code
>>>>>> using java -Xmx1000m; however, when using bin/hadoop -D
>>>>>> mapred.child.java.opts=-Xmx3000M, it gives a "Java heap space" error. I
>>>>>> have tried other programs in Hadoop with the same settings, so the
>>>>>> memory is available on my machines.
>>>>>>
>>>>>>
>>>>>> public static void main(String[] args) {
>>>>>>     try {
>>>>>>         String myFile = "xxx.dat";
>>>>>>         FileInputStream fin = new FileInputStream(myFile);
>>>>>>         ObjectInputStream ois = new ObjectInputStream(fin);
>>>>>>         // readObject() returns Object, so cast it back to the HashMap type
>>>>>>         HashMap<?, ?> margintagMap = (HashMap<?, ?>) ois.readObject();
>>>>>>         ois.close();
>>>>>>         fin.close();
>>>>>>     } catch (Exception e) {
>>>>>>         e.printStackTrace();
>>>>>>     }
>>>>>> }
>>>>>>
>>>>>> On 2010-10-13 13:30, Luke Lu wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>> On Wed, Oct 13, 2010 at 8:04 AM, Shi Yu<[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> As a follow-up to my own question, I think invoking the JVM in
>>>>>>>> Hadoop requires much more memory than an ordinary JVM.
>>>>>>>>
>>>>>>> That's simply not true. The default mapreduce task Xmx is 200M, which
>>>>>>> is much smaller than the standard jvm default of 512M, and most users
>>>>>>> don't need to increase it. Please post the code reading the object (in
>>>>>>> hdfs?) in your tasks.
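For reference, if the serialized object were stored in HDFS rather than on the local filesystem, loading it could look roughly like the sketch below; the class name, path argument, and generic types are assumptions for illustration, not code from this thread:

import java.io.ObjectInputStream;
import java.util.HashMap;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsObjectLoader {
    // Reads a serialized HashMap from the given HDFS path (path is hypothetical).
    public static HashMap<?, ?> load(Configuration conf, String hdfsPath) throws Exception {
        FileSystem fs = FileSystem.get(conf);
        ObjectInputStream ois = new ObjectInputStream(fs.open(new Path(hdfsPath)));
        try {
            return (HashMap<?, ?>) ois.readObject();
        } finally {
            ois.close();
        }
    }
}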
>>>>>>>
>>>>>>>> I found that instead of serializing the object, maybe I could create
>>>>>>>> a MapFile as an index to permit lookups by key in Hadoop. I have also
>>>>>>>> compared the performance of MongoDB and Memcached. I will let you know
>>>>>>>> the result after I try the MapFile approach.
>>>>>>>>
>>>>>>>> Shi
>>>>>>>>
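For reference, a minimal sketch of the MapFile idea mentioned above, assuming Text keys and values and the org.apache.hadoop.io.MapFile API; the directory name and entries are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapFileExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        String dir = "lookup.map";  // placeholder directory for the MapFile

        // Write entries; MapFile requires keys to be appended in sorted order.
        MapFile.Writer writer = new MapFile.Writer(conf, fs, dir, Text.class, Text.class);
        writer.append(new Text("apple"), new Text("1"));
        writer.append(new Text("banana"), new Text("2"));
        writer.close();

        // Look up a single key without loading the whole map into memory.
        MapFile.Reader reader = new MapFile.Reader(fs, dir, conf);
        Text value = new Text();
        reader.get(new Text("banana"), value);
        System.out.println("banana -> " + value);
        reader.close();
    }
}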
>>>>>>>> On 2010-10-12 21:59, M. C. Srivas wrote:
>>>>>>>>>
>>>>>>>>> On Tue, Oct 12, 2010 at 4:50 AM, Shi Yu<[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I want to load a serialized HashMap object in hadoop. The file of the
>>>>>>>>>> stored object is 200M. I could read that object efficiently in Java by
>>>>>>>>>> setting -Xmx to 1000M. However, in hadoop I could never load it into
>>>>>>>>>> memory. The code is very simple (it just reads the ObjectInputStream)
>>>>>>>>>> and there is no map/reduce implemented yet. I set
>>>>>>>>>> mapred.child.java.opts=-Xmx3000M and still get
>>>>>>>>>> "java.lang.OutOfMemoryError: Java heap space". Could anyone explain a
>>>>>>>>>> little bit how memory is allocated to the JVM in hadoop? Why does
>>>>>>>>>> hadoop take up so much memory? If a program requires 1G of memory on a
>>>>>>>>>> single node, how much memory does it require (generally) in Hadoop?
>>>>>>>>>
>>>>>>>>> The JVM reserves swap space in advance, at the time of launching the
>>>>>>>>> process. If your swap is too low (or you do not have any swap
>>>>>>>>> configured), you will hit this.
>>>>>>>>>
>>>>>>>>> Or, you are on a 32-bit machine, in which case 3G is not possible in
>>>>>>>>> the JVM.
>>>>>>>>>
>>>>>>>>> -Srivas.
>>>>>>>>>
>>>>>>>>>> Thanks.
>>>>>>>>>>
>>>>>>>>>> Shi
>>>>>>>>>>
>>>>>>>>>> --
>>>>>> --
>>>>>> Postdoctoral Scholar
>>>>>> Institute for Genomics and Systems Biology
>>>>>> Department of Medicine, the University of Chicago
>>>>>> Knapp Center for Biomedical Discovery
>>>>>> 900 E. 57th St. Room 10148
>>>>>> Chicago, IL 60637, US
>>>>>> Tel: 773-702-6799