Hi Avery,

I finally succeeded in running the benchmark. The problem was not the port, but the IP resolution.
After removing the mapping from 127.0.0.1 to the node names in the /etc/hosts files, it worked like a charm! I guess Hadoop has a different code path to determine which IP it should listen on, so normal Hadoop jobs worked with the previous network configuration. Thanks for your help!

Inci

On Dec 2, 2011, at 11:06 AM, Avery Ching wrote:

> You can actually set the starting RPC port to change it from 30000 by adding
> the appropriate configuration (i.e. hadoop jar
> giraph-0.70-jar-with-dependencies.jar
> org.apache.giraph.benchmark.PageRankBenchmark -Dgiraph.rpcInitialPort=<your
> starting port> -e 1 -s 3 -v -V 500 -w 5).
>
> I would ensure that those ports are open for communication from one node
> in your cluster to another. I don't think that anyone else has run
> into this problem yet...
>
> Since the job does take some time to fail, you might want to start it up and
> then try to telnet to its RPC port from another machine in the cluster and
> see if that succeeds.
>
> Hope that helps,
>
> Avery
>
> On 12/1/11 11:04 PM, Inci Cetindil wrote:
>> I have tried it with various numbers of workers and it only worked with 1
>> worker.
>>
>> I am not running multiple Giraph jobs at the same time; does it always use
>> ports 30000 and up? I checked the used ports with the "netstat" command and
>> didn't see any of the ports 30000-30005.
>>
>> Inci
>>
>> On Dec 1, 2011, at 7:03 PM, Avery Ching wrote:
>>
>>> Hmmm... this is unusual. I wonder if it is tied to the weird number of
>>> tasks you are getting. Can you try it with various numbers of workers
>>> (i.e. 1, 2) and see if it works?
>>>
>>> To me, the connection refused error indicates that perhaps the server
>>> failed to bind to its port (are you running multiple Giraph jobs at the
>>> same time?) or the server died.
>>>
>>> Avery
>>>
>>> On 12/1/11 5:33 PM, Inci Cetindil wrote:
>>>> I am sure the machines can communicate with each other and the ports are
>>>> not blocked.
>>>> I can run the word count Hadoop job without any problem on these
>>>> machines. My Hadoop version is 0.20.203.0.
>>>>
>>>> Thanks,
>>>> Inci
>>>>
>>>> On Dec 1, 2011, at 3:57 PM, Avery Ching wrote:
>>>>
>>>>> Thanks for the logs. I see a lot of issues like the following:
>>>>>
>>>>> 2011-12-01 00:04:46,241 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: rainbow-01/192.168.100.1:30004. Already tried 0 time(s).
>>>>> 2011-12-01 00:04:47,243 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: rainbow-01/192.168.100.1:30004. Already tried 1 time(s).
>>>>> 2011-12-01 00:04:48,245 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: rainbow-01/192.168.100.1:30004. Already tried 2 time(s).
>>>>> 2011-12-01 00:04:49,247 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: rainbow-01/192.168.100.1:30004. Already tried 3 time(s).
>>>>> 2011-12-01 00:04:50,249 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: rainbow-01/192.168.100.1:30004. Already tried 4 time(s).
>>>>> 2011-12-01 00:04:51,251 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: rainbow-01/192.168.100.1:30004. Already tried 5 time(s).
>>>>> 2011-12-01 00:04:52,253 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: rainbow-01/192.168.100.1:30004. Already tried 6 time(s).
>>>>> 2011-12-01 00:04:53,255 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: rainbow-01/192.168.100.1:30004. Already tried 7 time(s).
>>>>> 2011-12-01 00:04:54,256 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: rainbow-01/192.168.100.1:30004. Already tried 8 time(s).
>>>>> 2011-12-01 00:04:55,258 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: rainbow-01/192.168.100.1:30004. Already tried 9 time(s).
>>>>> 2011-12-01 00:04:55,261 WARN org.apache.giraph.comm.BasicRPCCommunications: connectAllRPCProxys: Failed on attempt 0 of 5 to connect to (id=0,cur=Worker(hostname=rainbow-01, MRpartition=4, port=30004),prev=null,ckpt_file=null)
>>>>> java.net.ConnectException: Call to rainbow-01/192.168.100.1:30004 failed on connection exception: java.net.ConnectException: Connection refused
>>>>>
>>>>> Are you sure that your machines can communicate with each other? Are
>>>>> ports 30000 and up blocked? And you're right, you should have had only
>>>>> 6 tasks. What version of Hadoop is this on?
>>>>>
>>>>> Avery
>>>>>
>>>>> On 12/1/11 2:43 PM, Inci Cetindil wrote:
>>>>>> Hi Avery,
>>>>>>
>>>>>> I attached the logs for the first attempts. The weird thing is that even
>>>>>> though I specified the number of workers as 5, I had 8 mapper tasks.
>>>>>> You can see in the logs that tasks 6 and 7 failed immediately. Do you
>>>>>> have any explanation for this behavior? Normally I should have 6 tasks,
>>>>>> right?
>>>>>>
>>>>>> Thanks,
>>>>>> Inci
>>>>>>
>>>>>> On Dec 1, 2011, at 11:00 AM, Avery Ching wrote:
>>>>>>
>>>>>>> Hi Inci,
>>>>>>>
>>>>>>> I am not sure what's wrong. I ran the exact same command on a freshly
>>>>>>> checked-out version of Giraph without any trouble.
>>>>>>> Here's my output:
>>>>>>>
>>>>>>> hadoop jar target/giraph-0.70-jar-with-dependencies.jar org.apache.giraph.benchmark.PageRankBenchmark -e 1 -s 3 -v -V 500 -w 5
>>>>>>> Using org.apache.giraph.benchmark.PageRankBenchmark$PageRankVertex
>>>>>>> 11/12/01 10:58:05 WARN bsp.BspOutputFormat: checkOutputSpecs: ImmutableOutputCommiter will not check anything
>>>>>>> 11/12/01 10:58:05 INFO mapred.JobClient: Running job: job_201112011054_0003
>>>>>>> 11/12/01 10:58:06 INFO mapred.JobClient: map 0% reduce 0%
>>>>>>> 11/12/01 10:58:23 INFO mapred.JobClient: map 16% reduce 0%
>>>>>>> 11/12/01 10:58:35 INFO mapred.JobClient: map 100% reduce 0%
>>>>>>> 11/12/01 10:58:40 INFO mapred.JobClient: Job complete: job_201112011054_0003
>>>>>>> 11/12/01 10:58:40 INFO mapred.JobClient: Counters: 31
>>>>>>> 11/12/01 10:58:40 INFO mapred.JobClient: Job Counters
>>>>>>> 11/12/01 10:58:40 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=77566
>>>>>>> 11/12/01 10:58:40 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
>>>>>>> 11/12/01 10:58:40 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
>>>>>>> 11/12/01 10:58:40 INFO mapred.JobClient: Launched map tasks=6
>>>>>>> 11/12/01 10:58:40 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
>>>>>>> 11/12/01 10:58:40 INFO mapred.JobClient: Giraph Timers
>>>>>>> 11/12/01 10:58:40 INFO mapred.JobClient: Total (milliseconds)=13468
>>>>>>> 11/12/01 10:58:40 INFO mapred.JobClient: Superstep 3 (milliseconds)=41
>>>>>>> 11/12/01 10:58:40 INFO mapred.JobClient: Setup (milliseconds)=11691
>>>>>>> 11/12/01 10:58:40 INFO mapred.JobClient: Shutdown (milliseconds)=73
>>>>>>> 11/12/01 10:58:40 INFO mapred.JobClient: Vertex input superstep (milliseconds)=369
>>>>>>> 11/12/01 10:58:40 INFO mapred.JobClient: Superstep 0 (milliseconds)=674
>>>>>>> 11/12/01 10:58:40 INFO mapred.JobClient: Superstep 2 (milliseconds)=519
>>>>>>> 11/12/01 10:58:40 INFO mapred.JobClient: Superstep 1 (milliseconds)=96
>>>>>>> 11/12/01 10:58:40 INFO mapred.JobClient: Giraph Stats
>>>>>>> 11/12/01 10:58:40 INFO mapred.JobClient: Aggregate edges=500
>>>>>>> 11/12/01 10:58:40 INFO mapred.JobClient: Superstep=4
>>>>>>> 11/12/01 10:58:40 INFO mapred.JobClient: Last checkpointed superstep=2
>>>>>>> 11/12/01 10:58:40 INFO mapred.JobClient: Current workers=5
>>>>>>> 11/12/01 10:58:40 INFO mapred.JobClient: Current master task partition=0
>>>>>>> 11/12/01 10:58:40 INFO mapred.JobClient: Sent messages=0
>>>>>>> 11/12/01 10:58:40 INFO mapred.JobClient: Aggregate finished vertices=500
>>>>>>> 11/12/01 10:58:40 INFO mapred.JobClient: Aggregate vertices=500
>>>>>>> 11/12/01 10:58:40 INFO mapred.JobClient: File Output Format Counters
>>>>>>> 11/12/01 10:58:40 INFO mapred.JobClient: Bytes Written=0
>>>>>>> 11/12/01 10:58:40 INFO mapred.JobClient: FileSystemCounters
>>>>>>> 11/12/01 10:58:40 INFO mapred.JobClient: FILE_BYTES_READ=590
>>>>>>> 11/12/01 10:58:40 INFO mapred.JobClient: HDFS_BYTES_READ=264
>>>>>>> 11/12/01 10:58:40 INFO mapred.JobClient: FILE_BYTES_WRITTEN=129240
>>>>>>> 11/12/01 10:58:40 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=55080
>>>>>>> 11/12/01 10:58:40 INFO mapred.JobClient: File Input Format Counters
>>>>>>> 11/12/01 10:58:40 INFO mapred.JobClient: Bytes Read=0
>>>>>>> 11/12/01 10:58:40 INFO mapred.JobClient: Map-Reduce Framework
>>>>>>> 11/12/01 10:58:40 INFO mapred.JobClient: Map input records=6
>>>>>>> 11/12/01 10:58:40 INFO mapred.JobClient: Spilled Records=0
>>>>>>> 11/12/01 10:58:40 INFO mapred.JobClient: Map output records=0
>>>>>>> 11/12/01 10:58:40 INFO mapred.JobClient: SPLIT_RAW_BYTES=264
>>>>>>>
>>>>>>> Would it be possible to send me the logs from the first attempts for
>>>>>>> every map task? i.e. from
>>>>>>> Task attempt_201111302343_0002_m_000000_0
>>>>>>> Task attempt_201111302343_0002_m_000001_0
>>>>>>> Task attempt_201111302343_0002_m_000002_0
>>>>>>> Task attempt_201111302343_0002_m_000003_0
>>>>>>> Task attempt_201111302343_0002_m_000004_0
>>>>>>> Task attempt_201111302343_0002_m_000005_0
>>>>>>>
>>>>>>> I think that could help us find the issue.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Avery
>>>>>>>
>>>>>>> On 12/1/11 1:17 AM, Inci Cetindil wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I'm running the PageRank benchmark example on a cluster with 1 master
>>>>>>>> + 5 slave nodes. I first tried it with a large number of vertices;
>>>>>>>> when that failed, I decided to get it running with 500 vertices and
>>>>>>>> 5 workers first. However, it doesn't work even for 500 vertices.
>>>>>>>> I am using the latest version of Giraph from trunk and running the
>>>>>>>> following command:
>>>>>>>>
>>>>>>>> hadoop jar giraph-0.70-jar-with-dependencies.jar
>>>>>>>> org.apache.giraph.benchmark.PageRankBenchmark -e 1 -s 3 -v -V 500 -w 5
>>>>>>>>
>>>>>>>> I attached the error message that I am receiving. Please let me know
>>>>>>>> if I am missing something.
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>> Inci
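The root cause Inci found can be sketched in a few lines: if /etc/hosts maps a node's name to 127.0.0.1, a server that binds to whatever address its hostname resolves to ends up listening only on loopback, and workers on other machines get exactly the "Connection refused" seen in the logs above. This is a hypothetical Python illustration, not Giraph or Hadoop code; the helper names and ports are made up for the sketch.

```python
import socket

def bind_rpc_port(listen_addr: str, port: int = 0) -> socket.socket:
    """Bind and listen the way an RPC server would; port 0 = ephemeral."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind((listen_addr, port))
    s.listen(5)
    return s

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Avery's telnet check, in code: can we complete a TCP connect?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Bad /etc/hosts entry -> hostname resolves to loopback -> the server is
# reachable only from its own machine, invisible to the rest of the cluster.
loopback_server = bind_rpc_port("127.0.0.1")
addr, port = loopback_server.getsockname()
print(addr)                          # 127.0.0.1
print(port_open("127.0.0.1", port))  # True -- but only from this host

# Binding to the wildcard address accepts connections on every interface,
# which is what fixing the hosts file effectively restores.
wildcard_server = bind_rpc_port("0.0.0.0")
print(wildcard_server.getsockname()[0])  # 0.0.0.0
```

The `port_open` helper is just Avery's telnet suggestion in code form: pointing it at a worker's RPC port from another machine in the cluster distinguishes a loopback-bound or blocked port from a server that died.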
