Just took a look at the bin/hadoop of your particular version (http://svn.apache.org/viewvc/hadoop/common/tags/release-0.19.2/bin/hadoop?revision=796970&view=markup). It looks like HADOOP_CLIENT_OPTS doesn't work with the jar command in that script, which is fixed in later versions.
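The relevant dispatch logic looks roughly like this (paraphrased rather than quoted verbatim from the 0.19.2 script, so the exact lines may differ):

  # (earlier branches elided) client commands append HADOOP_CLIENT_OPTS
  # to the java options...
  elif [ "$COMMAND" = "fs" ] ; then
    CLASS=org.apache.hadoop.fs.FsShell
    HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
  # ...but the jar branch only sets the class to run, so the variable is
  # silently dropped:
  elif [ "$COMMAND" = "jar" ] ; then
    CLASS=org.apache.hadoop.util.RunJar
  fi

  # the final command line always includes HADOOP_OPTS, which is why the
  # workaround below works:
  exec "$JAVA" $JAVA_HEAP_MAX $HADOOP_OPTS -classpath "$CLASSPATH" $CLASS "$@"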
So try HADOOP_OPTS=-Xmx1000M bin/hadoop ... instead. It works because it translates to the same java command line that already worked for you :)
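To put the failing and the working invocations side by side (same jar and heap size as in your examples below):

  # silently ignored by the 0.19.2 script, as shown above:
  HADOOP_CLIENT_OPTS=-Xmx4000m bin/hadoop jar WordCount.jar OOloadtest

  # honored, because HADOOP_OPTS always reaches the java command line:
  HADOOP_OPTS=-Xmx4000m bin/hadoop jar WordCount.jar OOloadtest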
__Luke

On Wed, Oct 13, 2010 at 4:18 PM, Shi Yu <[email protected]> wrote:
> Hi, I tried the following five ways:
>
> Approach 1: on the command line
> HADOOP_CLIENT_OPTS=-Xmx4000m bin/hadoop jar WordCount.jar OOloadtest
>
> Approach 2: I added the following element to hadoop-site.xml. Each time I
> changed it, I stopped and restarted hadoop on all the nodes.
> ...
> <property>
>   <name>HADOOP_CLIENT_OPTS</name>
>   <value>-Xmx4000m</value>
> </property>
> ...
>
> then ran the command:
> $ bin/hadoop jar WordCount.jar OOloadtest
>
> Approach 3: I changed it to
> ...
> <property>
>   <name>HADOOP_CLIENT_OPTS</name>
>   <value>4000m</value>
> </property>
> ...
>
> then ran the command:
> $ bin/hadoop jar WordCount.jar OOloadtest
>
> Approach 4: to make sure, I changed the "m" to plain digits:
> ...
> <property>
>   <name>HADOOP_CLIENT_OPTS</name>
>   <value>4000000000</value>
> </property>
> ...
>
> then ran the command:
> $ bin/hadoop jar WordCount.jar OOloadtest
>
> All four approaches end in the same "Java heap space" error:
>
> java.lang.OutOfMemoryError: Java heap space
>     at java.lang.AbstractStringBuilder.<init>(AbstractStringBuilder.java:45)
>     at java.lang.StringBuilder.<init>(StringBuilder.java:68)
>     at java.io.ObjectInputStream$BlockDataInputStream.readUTFBody(ObjectInputStream.java:2997)
>     at java.io.ObjectInputStream$BlockDataInputStream.readUTF(ObjectInputStream.java:2818)
>     at java.io.ObjectInputStream.readString(ObjectInputStream.java:1599)
>     at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1320)
>     at java.io.ObjectInputStream.readObject(ObjectInputStream.java:351)
>     at java.util.HashMap.readObject(HashMap.java:1028)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:597)
>     at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:974)
>     at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1846)
>     at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1753)
>     at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1329)
>     at java.io.ObjectInputStream.readObject(ObjectInputStream.java:351)
>     at ObjectManager.loadObject(ObjectManager.java:42)
>     at OOloadtest.main(OOloadtest.java:21)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:597)
>     at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
>     at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>     at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
>
> Approach 5: in comparison, I called the java command directly as follows
> (there is a counter showing how much time it takes when the serialized
> object is successfully loaded):
>
> $ java -Xms3G -Xmx3G -classpath .:WordCount.jar:hadoop-0.19.2-core.jar:lib/log4j-1.2.15.jar OOloadtest
>
> It returns:
> object loaded, timing (hms): 0 hour(s) 1 minute(s) 12 second(s) 162 millisecond(s)
>
> What was the problem with my command? Where can I find documentation about
> HADOOP_CLIENT_OPTS? Have you tried the same thing and found it works?
>
> Shi
>
> On 2010-10-13 16:28, Luke Lu wrote:
>> On Wed, Oct 13, 2010 at 2:21 PM, Shi Yu <[email protected]> wrote:
>>> Hi, thanks for the advice. I tried with your settings,
>>> $ bin/hadoop jar Test.jar OOloadtest -D HADOOP_CLIENT_OPTS=-Xmx4000m
>>> and still no effect. Or is this a system variable? Should I export it?
>>> How do I configure it?
>>
>> HADOOP_CLIENT_OPTS is an environment variable, so you should run it as
>> HADOOP_CLIENT_OPTS=-Xmx1000m bin/hadoop jar Test.jar OOloadtest
>> if you use sh-derivative shells (bash, ksh etc.); prepend env for other
>> shells.
>>
>> __Luke
>>
>>> Shi
>>>
>>> java -Xms3G -Xmx3G -classpath .:WordCount.jar:hadoop-0.19.2-core.jar:lib/commons-collections-3.2.1.jar:lib/log4j-1.2.15.jar:lib/stanford-postagger-2010-05-26.jar OOloadtest
>>>
>>> On 2010-10-13 15:28, Luke Lu wrote:
>>>> On Wed, Oct 13, 2010 at 12:27 PM, Shi Yu <[email protected]> wrote:
>>>>> I haven't implemented anything in map/reduce yet for this issue. I
>>>>> just tried to invoke the same java class using the bin/hadoop
>>>>> command. The thing is, a very simple program can be executed with
>>>>> java, but not with the bin/hadoop command.
>>>>
>>>> If you are just trying the bin/hadoop jar your.jar command, your code
>>>> runs in a local client jvm and mapred.child.java.opts has no effect.
>>>> You should run it with HADOOP_CLIENT_OPTS=-Xmx1000m bin/hadoop jar
>>>> your.jar
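A side note on the two knobs, since they look similar: HADOOP_CLIENT_OPTS is an environment variable for the local client jvm, so it belongs on the command line or in hadoop-env.sh, never in hadoop-site.xml; -D and the config file set job configuration properties, which is why approaches 2-4 above have no effect:

  HADOOP_CLIENT_OPTS=-Xmx1000m bin/hadoop jar your.jar

mapred.child.java.opts, by contrast, is a configuration property for the task jvms forked on the cluster for a submitted job, and it never affects the client jvm. As a property it would be set like this:

  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx1000m</value>
  </property>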
>>>>> I think if I couldn't get through the first stage, even if I had a
>>>>> map/reduce program it would also fail. I am using Hadoop 0.19.2.
>>>>> Thanks.
>>>>>
>>>>> Best Regards,
>>>>>
>>>>> Shi
>>>>>
>>>>> On 2010-10-13 14:15, Luke Lu wrote:
>>>>>> Can you post your mapper/reducer implementation? Or are you using
>>>>>> hadoop streaming, for which mapred.child.java.opts doesn't apply to
>>>>>> the jvm you care about? BTW, what's the hadoop version you're using?
>>>>>>
>>>>>> On Wed, Oct 13, 2010 at 11:45 AM, Shi Yu <[email protected]> wrote:
>>>>>>> Here is my code. There is no Map/Reduce in it. I can run this code
>>>>>>> using java -Xmx1000m; however, when using bin/hadoop -D
>>>>>>> mapred.child.java.opts=-Xmx3000M it still hits the heap space
>>>>>>> error. I have tried other programs in Hadoop with the same
>>>>>>> settings, so the memory is available on my machines.
>>>>>>>
>>>>>>> public static void main(String[] args) {
>>>>>>>     try {
>>>>>>>         String myFile = "xxx.dat";
>>>>>>>         FileInputStream fin = new FileInputStream(myFile);
>>>>>>>         ois = new ObjectInputStream(fin);
>>>>>>>         margintagMap = ois.readObject();
>>>>>>>         ois.close();
>>>>>>>         fin.close();
>>>>>>>     } catch (Exception e) {
>>>>>>>         //
>>>>>>>     }
>>>>>>> }
>>>>>>>
>>>>>>> On 2010-10-13 13:30, Luke Lu wrote:
>>>>>>>> On Wed, Oct 13, 2010 at 8:04 AM, Shi Yu <[email protected]> wrote:
>>>>>>>>> As a follow-up to my own question, I think invoking the JVM in
>>>>>>>>> Hadoop requires much more memory than an ordinary JVM.
>>>>>>>>
>>>>>>>> That's simply not true. The default mapreduce task Xmx is 200M,
>>>>>>>> which is much smaller than the standard jvm default 512M, and most
>>>>>>>> users don't need to increase it. Please post the code reading the
>>>>>>>> object (in hdfs?) in your tasks.
>>>>>>>>
>>>>>>>>> I found that instead of serializing the object, maybe I could
>>>>>>>>> create a MapFile as an index to permit lookups by key in Hadoop.
>>>>>>>>> I have also compared the performance of MongoDB and Memcache. I
>>>>>>>>> will let you know the result after I try the MapFile approach.
>>>>>>>>>
>>>>>>>>> Shi
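The MapFile idea should work well here, since it leaves the data on disk and reads only the entries you ask for instead of deserializing a 200M HashMap onto the heap. A minimal sketch against the 0.19-era MapFile API, assuming Text keys and values (the directory name and sample entries are made up):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.io.MapFile;
  import org.apache.hadoop.io.Text;

  public class MapFileLookup {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      String dir = "margintag.map";  // hypothetical map directory

      // Build the index once; MapFile requires keys appended in sorted order.
      MapFile.Writer writer =
          new MapFile.Writer(conf, fs, dir, Text.class, Text.class);
      writer.append(new Text("aardvark"), new Text("NN"));
      writer.append(new Text("zebra"), new Text("NN"));
      writer.close();

      // Look up individual keys without holding the whole map in memory.
      MapFile.Reader reader = new MapFile.Reader(fs, dir, conf);
      Text value = new Text();
      if (reader.get(new Text("zebra"), value) != null) {
        System.out.println("zebra -> " + value);
      }
      reader.close();
    }
  }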
>>>>>>>>> On 2010-10-12 21:59, M. C. Srivas wrote:
>>>>>>>>>> On Tue, Oct 12, 2010 at 4:50 AM, Shi Yu <[email protected]> wrote:
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> I want to load a serialized HashMap object in hadoop. The file
>>>>>>>>>>> of the stored object is 200M. I can read that object
>>>>>>>>>>> efficiently in Java by setting -Xmx to 1000M. However, in
>>>>>>>>>>> hadoop I can never load it into memory. The code is very simple
>>>>>>>>>>> (just read the ObjectInputStream) and there is as yet no
>>>>>>>>>>> map/reduce implemented. I set
>>>>>>>>>>> mapred.child.java.opts=-Xmx3000M and still get the
>>>>>>>>>>> "java.lang.OutOfMemoryError: Java heap space". Could anyone
>>>>>>>>>>> explain a little bit how memory is allocated to the JVM in
>>>>>>>>>>> hadoop? Why does hadoop take up so much memory? If a program
>>>>>>>>>>> requires 1G of memory on a single node, how much memory does it
>>>>>>>>>>> require (generally) in Hadoop?
>>>>>>>>>>
>>>>>>>>>> The JVM reserves swap space in advance, at the time of launching
>>>>>>>>>> the process. If your swap is too low (or you do not have any
>>>>>>>>>> swap configured), you will hit this.
>>>>>>>>>>
>>>>>>>>>> Or, you are on a 32-bit machine, in which case 3G is not
>>>>>>>>>> possible in the JVM.
>>>>>>>>>>
>>>>>>>>>> -Srivas.
>>>>>>>>>>
>>>>>>>>>>> Thanks.
>>>>>>>>>>>
>>>>>>>>>>> Shi
>>>>>>>
>>>>>>> --
>>>>>>> Postdoctoral Scholar
>>>>>>> Institute for Genomics and Systems Biology
>>>>>>> Department of Medicine, the University of Chicago
>>>>>>> Knapp Center for Biomedical Discovery
>>>>>>> 900 E. 57th St. Room 10148
>>>>>>> Chicago, IL 60637, US
>>>>>>> Tel: 773-702-6799
