[jira] Created: (HDFS-1239) All datanodes are bad in 2nd phase
All datanodes are bad in 2nd phase
----------------------------------

Key: HDFS-1239
URL: https://issues.apache.org/jira/browse/HDFS-1239
Project: Hadoop HDFS
Issue Type: Bug
Components: hdfs client
Affects Versions: 0.20.1
Reporter: Thanh Do

- Setups:
number of datanodes = 2
replication factor = 2
type of failure: transient fault (a java i/o call throws an exception or returns false)
number of failures = 2
when/where failures happen = during the 2nd phase of the pipeline; each happens at one datanode when it tries to perform I/O (e.g. dataoutputstream.flush())

- Details:
This is similar to HDFS-1237. In this case, node1 throws an exception that makes the client create a pipeline with only node2 and then redo the whole thing, which hits another failure. At this point the client considers all datanodes bad and never retries the whole operation again (i.e. it never asks the namenode for a new set of datanodes). In HDFS-1237 the bug is due to a permanent disk fault; in this case it is a transient error.

This bug was found by our Failure Testing Service framework:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
For questions, please email us: Thanh Do (than...@cs.wisc.edu) and Haryadi Gunawi (hary...@eecs.berkeley.edu)
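A minimal sketch of the retry shape the report argues for, in plain Java: once the current pipeline is exhausted by a transient fault, go back to the namenode for a fresh set of datanodes a bounded number of times before failing the write. All types and method names here (Namenode, allocatePipeline, Transport.push) are hypothetical stand-ins, not the real DFSClient API.

{code}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the real DFSClient: after in-pipeline retries are
// exhausted by a transient fault, re-ask the namenode for a fresh pipeline
// instead of permanently declaring every datanode bad.
public class PipelineRetrySketch {
  interface Namenode {
    List<String> allocatePipeline(List<String> excluded) throws IOException;
  }
  interface Transport {
    void push(List<String> pipeline, byte[] packet) throws IOException;
  }

  static void write(Namenode nn, Transport t, byte[] packet) throws IOException {
    List<String> excluded = new ArrayList<String>();
    for (int attempt = 0; attempt < 3; attempt++) {  // outer loop: re-ask the NN
      List<String> pipeline = nn.allocatePipeline(excluded);
      try {
        t.push(pipeline, packet);                    // 2nd phase: stream the data
        return;                                      // success
      } catch (IOException transientFault) {
        excluded.addAll(pipeline);                   // avoid getting the same nodes back
      }
    }
    throw new IOException("write failed after 3 fresh pipelines");
  }
}
{code}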
[jira] Commented: (HDFS-1071) savenamespace should write the fsimage to all configured fs.name.dir in parallel
[ https://issues.apache.org/jira/browse/HDFS-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879778#action_12879778 ]

dhruba borthakur commented on HDFS-1071:
----------------------------------------

hi konstantin, it appears that Dmytro's last comment addresses all of your questions.

> savenamespace should write the fsimage to all configured fs.name.dir in parallel
>
> Key: HDFS-1071
> URL: https://issues.apache.org/jira/browse/HDFS-1071
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: name-node
> Reporter: dhruba borthakur
> Assignee: Dmytro Molkov
> Attachments: HDFS-1071.2.patch, HDFS-1071.3.patch, HDFS-1071.4.patch, HDFS-1071.patch
>
> If you have a large number of files in HDFS, the fsimage file is very big. When the namenode restarts, it writes a copy of the fsimage to all directories configured in fs.name.dir. This takes a long time, especially if there are many directories in fs.name.dir. Make the NN write the fsimage to all these directories in parallel.
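A hedged sketch of what "write in parallel" could look like: one task per configured directory submitted to a thread pool, with per-directory failures surfaced individually. The saveImageTo() helper and the current/fsimage path are stand-ins for the real FSImage writer, not the patch's actual code.

{code}
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hedged sketch of the improvement under discussion: write the same fsimage
// bytes into every configured fs.name.dir concurrently instead of serially.
public class ParallelSaveSketch {
  static void saveAll(List<File> nameDirs, final byte[] image) throws InterruptedException {
    ExecutorService pool = Executors.newFixedThreadPool(nameDirs.size());
    List<Future<?>> saves = new ArrayList<Future<?>>();
    for (final File dir : nameDirs) {
      saves.add(pool.submit(new Runnable() {
        public void run() {
          saveImageTo(new File(dir, "current/fsimage"), image);
        }
      }));
    }
    for (Future<?> f : saves) {
      try {
        f.get();                 // block until this directory's copy is durable
      } catch (ExecutionException e) {
        // in the real NN a failed storage dir is dropped, not fatal to the save
        System.err.println("save failed: " + e.getCause());
      }
    }
    pool.shutdown();
  }

  static void saveImageTo(File f, byte[] image) {
    // placeholder: open f, write image, flush and sync
  }
}
{code}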
[jira] Commented: (HDFS-947) The namenode should redirect a hftp request to read a file to the datanode that has the maximum number of local replicas
[ https://issues.apache.org/jira/browse/HDFS-947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879779#action_12879779 ]

dhruba borthakur commented on HDFS-947:
---------------------------------------

+1 code looks good to me.

> The namenode should redirect a hftp request to read a file to the datanode that has the maximum number of local replicas
>
> Key: HDFS-947
> URL: https://issues.apache.org/jira/browse/HDFS-947
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: dhruba borthakur
> Assignee: Dmytro Molkov
> Attachments: HDFS-947.2.patch, HDFS-947.patch, hftpRedirection.patch
>
> A client that uses the Hftp protocol to read a file is redirected by the namenode to a random datanode. It would be nice if the client gets redirected to a datanode that has the maximum number of local replicas of the blocks of the file.
[jira] Commented: (HDFS-599) Improve Namenode robustness by prioritizing datanode heartbeats over client requests
[ https://issues.apache.org/jira/browse/HDFS-599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879782#action_12879782 ]

dhruba borthakur commented on HDFS-599:
---------------------------------------

dmytro: can you please run the Hudson tests manually and post the results here? Thanks.

> Improve Namenode robustness by prioritizing datanode heartbeats over client requests
>
> Key: HDFS-599
> URL: https://issues.apache.org/jira/browse/HDFS-599
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: name-node
> Reporter: dhruba borthakur
> Assignee: Dmytro Molkov
> Fix For: 0.22.0
> Attachments: HDFS-599.3.patch, HDFS-599.patch
>
> The namenode processes RPC requests from clients that are reading/writing to files as well as heartbeats/block reports from datanodes. Sometimes, for various reasons (Java GC runs, inconsistent performance of the NFS filer that stores the HDFS transaction logs, etc.), the namenode encounters transient slowness. For example, if the device that stores the HDFS transaction logs becomes sluggish, the Namenode's ability to process RPCs slows down to a certain extent. During this time, the RPCs from clients as well as the RPCs from datanodes suffer in similar fashion. If the underlying problem becomes worse, the NN's ability to process a heartbeat from a DN is severely impacted, causing the NN to declare that the DN is dead. Then the NN starts replicating blocks that used to reside on the now-declared-dead datanode. This adds extra load to the NN. Then the now-declared-dead datanode finally re-establishes contact with the NN, and sends a block report. The block report processing on the NN is another heavyweight activity, thus causing more load to the already overloaded namenode.
>
> My proposal is that the NN should try its best to continue processing RPCs from datanodes and give lesser priority to serving client requests. The Datanode RPCs are integral to the consistency and performance of the Hadoop file system, and it is better to protect them at all costs. This will ensure that the NN recovers from the hiccup much faster than it does now.
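A hedged sketch of the prioritization idea, not Hadoop's actual RPC server: keep datanode calls and client calls in separate queues and let handler threads always drain the datanode queue first, so heartbeats and block reports still get processed while client load piles up.

{code}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hedged sketch of the proposal: two call queues, datanode RPCs served first.
public class PriorityRpcSketch {
  private final BlockingQueue<Runnable> datanodeCalls = new LinkedBlockingQueue<Runnable>();
  private final BlockingQueue<Runnable> clientCalls = new LinkedBlockingQueue<Runnable>();

  public void submitDatanodeCall(Runnable call) { datanodeCalls.add(call); }
  public void submitClientCall(Runnable call) { clientCalls.add(call); }

  // Body of each RPC handler thread.
  void handlerLoop() throws InterruptedException {
    while (!Thread.currentThread().isInterrupted()) {
      Runnable next = datanodeCalls.poll();  // datanode RPCs take priority
      if (next == null) {
        next = clientCalls.poll();           // otherwise serve a client call
      }
      if (next == null) {
        Thread.sleep(1);                     // both queues empty; avoid busy spin
        continue;
      }
      next.run();
    }
  }
}
{code}

Note that a strict preference like this can starve clients under sustained datanode load; a real implementation would presumably bound the priority, e.g. serve at least one client call for every N datanode calls.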
[jira] Commented: (HDFS-1239) All datanodes are bad in 2nd phase
[ https://issues.apache.org/jira/browse/HDFS-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879786#action_12879786 ]

dhruba borthakur commented on HDFS-1239:
----------------------------------------

if a client has written some data to a set of replicas for that block and then all the replicas go bad, then the client gets an IO error and stops writing any more data to that file. what is your proposed fix? can you please explain, thanks.

> All datanodes are bad in 2nd phase
> Key: HDFS-1239
> URL: https://issues.apache.org/jira/browse/HDFS-1239
[jira] Commented: (HDFS-1114) Reducing NameNode memory usage by an alternate hash table
[ https://issues.apache.org/jira/browse/HDFS-1114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879836#action_12879836 ]

Tsz Wo (Nicholas), SZE commented on HDFS-1114:
----------------------------------------------

> What about -XX:+UseCompressedOops ?

This is a good point. Is there a way to determine whether UseCompressedOops is set at runtime?

> Reducing NameNode memory usage by an alternate hash table
>
> Key: HDFS-1114
> URL: https://issues.apache.org/jira/browse/HDFS-1114
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: name-node
> Reporter: Tsz Wo (Nicholas), SZE
> Assignee: Tsz Wo (Nicholas), SZE
> Attachments: GSet20100525.pdf, gset20100608.pdf, h1114_20100607.patch, h1114_20100614b.patch, h1114_20100615.patch, h1114_20100616b.patch
>
> NameNode uses a java.util.HashMap to store BlockInfo objects. When there are many blocks in HDFS, this map uses a lot of memory in the NameNode. We may optimize the memory usage by a light weight hash table implementation.
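On HotSpot JVMs there is a runtime answer to that question: the com.sun.management HotSpotDiagnosticMXBean (JDK 6 and later) exposes VM flag values, including UseCompressedOops. Other JVMs will not register this bean, and getVMOption() throws IllegalArgumentException for flags the VM does not know.

{code}
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

// HotSpot-only check for whether compressed oops are in effect at runtime.
public class CompressedOopsCheck {
  public static void main(String[] args) throws Exception {
    HotSpotDiagnosticMXBean diag = ManagementFactory.newPlatformMXBeanProxy(
        ManagementFactory.getPlatformMBeanServer(),
        "com.sun.management:type=HotSpotDiagnostic",
        HotSpotDiagnosticMXBean.class);
    String value = diag.getVMOption("UseCompressedOops").getValue();
    System.out.println("UseCompressedOops = " + value);
  }
}
{code}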
[jira] Updated: (HDFS-1114) Reducing NameNode memory usage by an alternate hash table
[ https://issues.apache.org/jira/browse/HDFS-1114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz Wo (Nicholas), SZE updated HDFS-1114:
-----------------------------------------

Attachment: h1114_20100617.patch

h1114_20100617.patch: the UnsupportedOperationException thrown in put(..) should be NullPointerException.

> Reducing NameNode memory usage by an alternate hash table
> Key: HDFS-1114
> URL: https://issues.apache.org/jira/browse/HDFS-1114
[jira] Updated: (HDFS-1240) TestDFSShell failing in branch-20
[ https://issues.apache.org/jira/browse/HDFS-1240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon updated HDFS-1240:
------------------------------

Attachment: hdfs-1240.txt

Here's a patch that fixes the issue. (I reran TestDFSShell and TestEditLogRace manually; will rerun the rest of the unit tests on my internal hudson in a minute.)

> TestDFSShell failing in branch-20
> Key: HDFS-1240
> URL: https://issues.apache.org/jira/browse/HDFS-1240
[jira] Created: (HDFS-1240) TestDFSShell failing in branch-20
TestDFSShell failing in branch-20
---------------------------------

Key: HDFS-1240
URL: https://issues.apache.org/jira/browse/HDFS-1240
Project: Hadoop HDFS
Issue Type: Bug
Components: test
Affects Versions: 0.20-append, 0.20.3
Reporter: Todd Lipcon
Assignee: Todd Lipcon
Priority: Critical
Attachments: hdfs-1240.txt

After the backport of HDFS-909 into branch-20, TestDFSShell fails since it relies on resetting the base dir for the minicluster through a system property. The backport changed MiniDFSCluster to read the property from an initializer instead of from the constructor.
[jira] Commented: (HDFS-1234) Datanode 'alive' but with its disk failed, Namenode thinks it's alive
[ https://issues.apache.org/jira/browse/HDFS-1234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879848#action_12879848 ]

Allen Wittenauer commented on HDFS-1234:
----------------------------------------

In 0.20.1, the datanode process should die on a failed read or write. Eventually the namenode will mark it as dead after lack of heartbeats. Are you actually testing trunk?

> Datanode 'alive' but with its disk failed, Namenode thinks it's alive
>
> Key: HDFS-1234
> URL: https://issues.apache.org/jira/browse/HDFS-1234
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: name-node
> Affects Versions: 0.20.1
> Reporter: Thanh Do
>
> - Summary: Datanode 'alive' but with its disk failed, Namenode still thinks it's alive
>
> - Setups:
> + Replication = 1
> + # available datanodes = 2
> + # disks / datanode = 1
> + # failures = 1
> + Failure type = bad disk
> + When/where failure happens = first phase of the pipeline
>
> - Details:
> In this experiment we have two datanodes, each with one disk. If one datanode has a failed disk (but the node is still alive), the datanode does not keep track of this. From the perspective of the namenode, that datanode is still alive, and thus the namenode gives the same datanode back to the client. The client retries 3 times by asking the namenode for a new set of datanodes, always gets the same datanode, and every time it tries to write there it gets an exception.
>
> This bug was found by our Failure Testing Service framework:
> http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
> For questions, please email us: Thanh Do (than...@cs.wisc.edu) and Haryadi Gunawi (hary...@eecs.berkeley.edu)
[jira] Commented: (HDFS-1234) Datanode 'alive' but with its disk failed, Namenode thinks it's alive
[ https://issues.apache.org/jira/browse/HDFS-1234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879849#action_12879849 ]

Allen Wittenauer commented on HDFS-1234:
----------------------------------------

See also HDFS-138 and HDFS-457.

> Datanode 'alive' but with its disk failed, Namenode thinks it's alive
> Key: HDFS-1234
> URL: https://issues.apache.org/jira/browse/HDFS-1234
[jira] Commented: (HDFS-1219) Data Loss due to edits log truncation
[ https://issues.apache.org/jira/browse/HDFS-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879851#action_12879851 ]

Allen Wittenauer commented on HDFS-1219:
----------------------------------------

Then how would the world know how awesome their framework is?

> Data Loss due to edits log truncation
>
> Key: HDFS-1219
> URL: https://issues.apache.org/jira/browse/HDFS-1219
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: name-node
> Affects Versions: 0.20.2
> Reporter: Thanh Do
>
> We found this problem almost at the same time as the HDFS developers. Basically, the edits log is truncated before fsimage.ckpt is renamed to fsimage. Hence, any crash that happens after the truncation but before the renaming will lead to a data loss. A detailed description can be found here:
> https://issues.apache.org/jira/browse/HDFS-955
>
> This bug was found by our Failure Testing Service framework:
> http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
> For questions, please email us: Thanh Do (than...@cs.wisc.edu) and Haryadi Gunawi (hary...@eecs.berkeley.edu)
[jira] Resolved: (HDFS-1233) Bad retry logic at DFSClient
[ https://issues.apache.org/jira/browse/HDFS-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon resolved HDFS-1233.
-------------------------------

Resolution: Won't Fix

This is a known deficiency; I don't think anyone has plans to fix it. Any cluster that has multiple disks per DN likely has multiple DNs too.

> Bad retry logic at DFSClient
>
> Key: HDFS-1233
> URL: https://issues.apache.org/jira/browse/HDFS-1233
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: hdfs client
> Affects Versions: 0.20.1
> Reporter: Thanh Do
>
> - Summary: failover bug, bad retry logic at DFSClient, cannot failover to the 2nd disk
>
> - Setups:
> + # available datanodes = 1
> + # disks / datanode = 2
> + # failures = 1
> + failure type = bad disk
> + When/where failure happens = (see below)
>
> - Details:
> The setup is: 1 datanode, 1 replica, and each datanode has 2 disks (Disk1 and Disk2). We injected a single disk failure to see if we can fail over to the second disk or not.
>
> If a persistent disk failure happens during createBlockOutputStream (the first phase of pipeline creation) (e.g. say DN1-Disk1 is bad), then createBlockOutputStream (cbos) will get an exception and it will retry! When it retries it will get the same DN1 from the namenode, and then DN1 will call DN.writeBlock(), FSVolume.createTmpFile, and finally getNextVolume(), which has a moving volume number. Thus, on the second try, the write will successfully go to the second disk. So essentially createBlockOutputStream is wrapped in a do/while(retry && --count >= 0). The first cbos will fail, the second will be successful in this particular scenario.
>
> NOW, say cbos is successful, but the failure is persistent. Then the retry is in a different while loop. First, hasError is set to true in RP.run (responder packet). Thus, DataStreamer.run() will go back to the loop: while(!closed && clientRunning && !lastPacketInBlock). This second iteration of the loop will call processDatanodeError because hasError has been set to true. In processDatanodeError (pde), the client sees that this is the only datanode in the pipeline, and hence it considers the node bad, although actually only 1 disk is bad! Hence, pde throws an IOException suggesting that all the datanodes (in this case, only DN1) in the pipeline are bad, and the exception is thrown to the client. But if that exception is, say, caught by the outermost do/while(retry && --count >= 0), then this outer retry will be successful (as suggested in the previous paragraph).
>
> In summary, if in a deployment scenario we only have one datanode that has multiple disks, and one disk goes bad, then the current retry logic at the DFSClient side is not robust enough to mask the failure from the client.
>
> This bug was found by our Failure Testing Service framework:
> http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
> For questions, please email us: Thanh Do (than...@cs.wisc.edu) and Haryadi Gunawi (hary...@eecs.berkeley.edu)
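A hedged reconstruction of the retry shape described above, with a stand-in BlockWriter interface rather than the real DFSClient internals: the bounded do/while masks a bad disk during pipeline *creation* because each fresh attempt can land on the datanode's next volume, while an error in the *streaming* phase never re-enters this loop and instead marks the lone datanode bad.

{code}
import java.io.IOException;

// Sketch of the setup-phase retry only; the streaming-phase error path
// described in the report bypasses this loop entirely.
public class RetryShapeSketch {
  interface BlockWriter {
    void createBlockOutputStream() throws IOException;
  }

  static void setupWithRetry(BlockWriter w, int count) throws IOException {
    boolean retry;
    do {
      retry = false;
      try {
        w.createBlockOutputStream();   // a fresh attempt can land on Disk2
        return;                        // success
      } catch (IOException e) {
        retry = true;                  // setup failure: try again
      }
    } while (retry && --count >= 0);
    throw new IOException("could not create block output stream");
  }
}
{code}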
[jira] Commented: (HDFS-1221) NameNode unable to start due to stale edits log after a crash
[ https://issues.apache.org/jira/browse/HDFS-1221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879857#action_12879857 ]

Allen Wittenauer commented on HDFS-1221:
----------------------------------------

Shouldn't having multiple namedirs defined (i.e., following best practices) make this failure case highly improbable?

> NameNode unable to start due to stale edits log after a crash
>
> Key: HDFS-1221
> URL: https://issues.apache.org/jira/browse/HDFS-1221
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: name-node
> Affects Versions: 0.20.1
> Reporter: Thanh Do
>
> - Summary: If a crash happens during FSEditLog.createEditLogFile(), the edits log file on disk may be stale. During the next reboot, the NameNode will get an exception when parsing the edits file, because of the stale data, leading to an unsuccessful reboot. Note: this is just one example. Since the edits log (and fsimage) does not have a checksum, it is vulnerable to corruption too.
>
> - Details:
> The steps to create a new edits log (which we infer from the HDFS code) are:
> 1) truncate the file to zero size
> 2) write FSConstants.LAYOUT_VERSION to the buffer
> 3) insert the end-of-file marker OP_INVALID at the end of the buffer
> 4) preallocate 1MB of data, and fill the data with 0
> 5) flush the buffer to disk
>
> Note that only in steps 1, 4 and 5 is the data on disk actually changed. Now, suppose a crash happens after step 4, but before step 5. In the next reboot, the NameNode will fetch this edits log file (which contains all 0s). The first thing parsed is the LAYOUT_VERSION, which is 0. This is OK, because the NameNode has code to handle that case (but we expect LAYOUT_VERSION to be -18, don't we). Now it parses the operation code, which happens to be 0. Unfortunately, since 0 is the value of OP_ADD, the NameNode expects some parameters corresponding to that operation. The NameNode then calls readString to read the path, which throws an exception, leading to a failed reboot.
>
> This bug was found by our Failure Testing Service framework:
> http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
> For questions, please email us: Thanh Do (than...@cs.wisc.edu) and Haryadi Gunawi (hary...@eecs.berkeley.edu)
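The parse-of-zeros failure is easy to reproduce in isolation. Below is a small illustration under the report's assumptions about the on-disk layout (a 4-byte layout version followed by 1-byte opcodes, with 0 taken to mean OP_ADD); it is not the real FSEditLog loader.

{code}
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

// An all-zero file -- what is left on disk if the crash lands after
// preallocation (step 4) but before the flush (step 5) -- parses as layout
// version 0 followed by an OP_ADD record that was never written.
public class StaleEditsSketch {
  static final byte OP_ADD = 0;  // opcode value assumed from the report

  public static void main(String[] args) throws IOException {
    byte[] staleEdits = new byte[1024 * 1024];            // preallocated zeros
    DataInputStream in = new DataInputStream(new ByteArrayInputStream(staleEdits));
    int layoutVersion = in.readInt();                     // 0, not the expected -18
    System.out.println("layout version = " + layoutVersion);
    byte op = in.readByte();                              // 0 again
    if (op == OP_ADD) {
      // The real loader would now call readString() for the path and throw,
      // failing the reboot. A checksum, or writing the end-of-file marker
      // before preallocating, would let it reject the file cleanly instead.
      System.out.println("opcode 0 misread as OP_ADD");
    }
  }
}
{code}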
[jira] Resolved: (HDFS-1232) Corrupted block if a crash happens before writing to checksumOut but after writing to dataOut
[ https://issues.apache.org/jira/browse/HDFS-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon resolved HDFS-1232.
-------------------------------

Resolution: Duplicate

This has already been discussed elsewhere. The primary assumption is that a pipeline has more than one DN in it, and this is unlikely to happen on all of the DNs simultaneously. So one replica will get corrupt, but we have others that are fine.

> Corrupted block if a crash happens before writing to checksumOut but after writing to dataOut
>
> Key: HDFS-1232
> URL: https://issues.apache.org/jira/browse/HDFS-1232
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: data-node
> Affects Versions: 0.20.1
> Reporter: Thanh Do
>
> - Summary: a block is corrupted if a crash happens before the write to the checksum file but after the write to the data file.
>
> - Setup:
> + # available datanodes = 1
> + # disks / datanode = 1
> + # failures = 1
> + failure type = crash
> + When/where failure happens = (see below)
>
> - Details:
> The order of processing a packet during a client write/append at the datanode is: first forward the packet downstream, then write the data to the block file, and finally write to the checksum file. Hence if a crash happens BEFORE the write to the checksum file but AFTER the write to the data file, the block is corrupted. Worse, if this is the only available replica, the block is lost. We also found this problem in the case where there are 3 replicas for a particular block and two failures happen during an append (see HDFS-1231).
>
> This bug was found by our Failure Testing Service framework:
> http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
> For questions, please email us: Thanh Do (than...@cs.wisc.edu) and Haryadi Gunawi (hary...@eecs.berkeley.edu)
[jira] Resolved: (HDFS-1234) Datanode 'alive' but with its disk failed, Namenode thinks it's alive
[ https://issues.apache.org/jira/browse/HDFS-1234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon resolved HDFS-1234.
-------------------------------

Resolution: Duplicate

Resolved by HDFS-630.

> Datanode 'alive' but with its disk failed, Namenode thinks it's alive
> Key: HDFS-1234
> URL: https://issues.apache.org/jira/browse/HDFS-1234
[jira] Resolved: (HDFS-1235) Namenode returning the same Datanode to client, due to infrequent heartbeat
[ https://issues.apache.org/jira/browse/HDFS-1235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon resolved HDFS-1235.
-------------------------------

Resolution: Duplicate

Fixed by HDFS-630.

> Namenode returning the same Datanode to client, due to infrequent heartbeat
>
> Key: HDFS-1235
> URL: https://issues.apache.org/jira/browse/HDFS-1235
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: name-node
> Reporter: Thanh Do
>
> This bug has been reported. Basically, since a datanode's heartbeat messages are infrequent (~ every 10 minutes), the NameNode keeps giving the client the same datanode even if that datanode is dead. We want to point out that the client waits 6 seconds before retrying, which amounts to long and useless retries in this scenario, because within 6 seconds the namenode hasn't declared the datanode dead.
>
> Overall this happens when a datanode dies during the first phase of the pipeline (file setup). If a datanode dies during the second phase (byte transfer), the DFSClient can still proceed with the other surviving datanodes (which is consistent with what the Hadoop books always say -- the write should proceed as long as at least one good datanode remains). But unfortunately this specification does not hold during the first phase of the pipeline.
>
> Overall we suggest that the namenode take the client's view of unreachable datanodes into consideration. That is, if a client says that it cannot reach DN-X, then the namenode might give the client a node other than X (but the namenode does not have to declare X dead).
>
> This bug was found by our Failure Testing Service framework:
> http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
> For questions, please email us: Thanh Do (than...@cs.wisc.edu) and Haryadi Gunawi (hary...@eecs.berkeley.edu)
[jira] Resolved: (HDFS-1237) Client logic for 1st phase and 2nd phase failover are different
[ https://issues.apache.org/jira/browse/HDFS-1237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon resolved HDFS-1237.
-------------------------------

Resolution: Invalid

If both DNs crash in a pipeline of 2 DNs, of course the pipeline does not recover. The likelihood of correlated failure of all nodes in a pipeline is very small since one of the replicas is off-rack. Please reopen if you think there's _any_ action the client could take to recover when the entire pipeline has crashed.

> Client logic for 1st phase and 2nd phase failover are different
>
> Key: HDFS-1237
> URL: https://issues.apache.org/jira/browse/HDFS-1237
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: hdfs client
> Affects Versions: 0.20.1
> Reporter: Thanh Do
>
> - Setup:
> number of datanodes = 4
> replication factor = 2 (2 datanodes in the pipeline)
> number of failures injected = 2
> failure type: crash
> Where/when failures happen: there are two scenarios. First, two datanodes crash at the same time in the first phase of the pipeline. Second, two datanodes crash in the second phase of the pipeline.
>
> - Details:
> In this setting, we set the datanode's heartbeat interval to the namenode to 1 second. This is just to show that if the NN has declared a datanode dead, the DFSClient will not get that dead datanode from the server. Here are our observations:
>
> 1. If the two crashes happen during the first phase, the client will wait for 6 seconds (which is enough time for the NN to detect the dead datanodes in this setting). After waiting for 6 seconds, the client asks the NN again, the NN is able to give it two fresh healthy datanodes, and the experiment is successful!
>
> 2. BUT, if the two crashes happen during the second phase (e.g. renameTo), the client *never waits for 6 secs*, which implies that the client logic for the 1st phase and the 2nd phase are different. What happens here is that the DFSClient gives up and (we believe) never falls back to the outer while loop to contact the NN again. So the two crashes in this second phase are not masked properly, and the write operation fails.
>
> In summary, scenario (1) is good, but scenario (2) is not successful. This shows a bad retry logic during the second phase. (We note again that we changed the setup a bit by setting the DN's heartbeat interval to 1 second. If we used the default interval, scenario (1) would fail too, because the NN would give the client the same dead datanodes.)
>
> This bug was found by our Failure Testing Service framework:
> http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
> For questions, please email us: Thanh Do (than...@cs.wisc.edu) and Haryadi Gunawi (hary...@eecs.berkeley.edu)
[jira] Commented: (HDFS-1114) Reducing NameNode memory usage by an alternate hash table
[ https://issues.apache.org/jira/browse/HDFS-1114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879864#action_12879864 ]

Suresh Srinivas commented on HDFS-1114:
---------------------------------------

# For figuring out 64 bit, should we consider the max heap size? If the max heap size is > 2G, consider it a 64 bit machine. Since the max heap size on 32 bit machines varies from 1.4G to 2G, machines in that range could be wrongly classified as 32 bit. Is this an alternative worth considering?
# Minor: "print detail" to "print detailed"
# Minor: for end-of-line comments, should there be a space after //? Java coding conventions explicitly do not talk about this though. Currently there are 3043 comments with a space after // and 384 without :-)
# Minor: in the exception tests, what I meant in my previous comment was that you are better off printing to the log in Assert.fail(). Printing a log line when the expected thing happens is not that useful. That said, this is minor; you can leave it as it is.
# I am not sure what the point of commenting out the 5 hour test is. When do we expect it to be uncommented and run? Should it be moved to some other test that is run as a smoke test for release qualification?

> Reducing NameNode memory usage by an alternate hash table
> Key: HDFS-1114
> URL: https://issues.apache.org/jira/browse/HDFS-1114
[jira] Updated: (HDFS-1114) Reducing NameNode memory usage by an alternate hash table
[ https://issues.apache.org/jira/browse/HDFS-1114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz Wo (Nicholas), SZE updated HDFS-1114:
-----------------------------------------

Status: Open (was: Patch Available)

> Reducing NameNode memory usage by an alternate hash table
> Key: HDFS-1114
> URL: https://issues.apache.org/jira/browse/HDFS-1114
[jira] Commented: (HDFS-947) The namenode should redirect a hftp request to read a file to the datanode that has the maximum number of local replicas
[ https://issues.apache.org/jira/browse/HDFS-947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879891#action_12879891 ]

Dmytro Molkov commented on HDFS-947:
------------------------------------

I ran hadoopQA locally since Hudson keeps ignoring this jira:

     [exec] +1 overall.
     [exec]
     [exec]     +1 @author. The patch does not contain any @author tags.
     [exec]
     [exec]     +1 tests included. The patch appears to include 4 new or modified tests.
     [exec]
     [exec]     +1 javadoc. The javadoc tool did not generate any warning messages.
     [exec]
     [exec]     +1 javac. The applied patch does not increase the total number of javac compiler warnings.
     [exec]
     [exec]     +1 findbugs. The patch does not introduce any new Findbugs warnings.
     [exec]
     [exec]     +1 release audit. The applied patch does not increase the total number of release audit warnings.

The tests also ran fine.

> The namenode should redirect a hftp request to read a file to the datanode that has the maximum number of local replicas
> Key: HDFS-947
> URL: https://issues.apache.org/jira/browse/HDFS-947
[jira] Updated: (HDFS-1114) Reducing NameNode memory usage by an alternate hash table
[ https://issues.apache.org/jira/browse/HDFS-1114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz Wo (Nicholas), SZE updated HDFS-1114:
-----------------------------------------

Attachment: h1114_20100617b.patch

h1114_20100617b.patch: slightly changed the comments and removed unnecessary spaces. I did not change the capacity calculation because the current computation is conservative on the special cases.

> Reducing NameNode memory usage by an alternate hash table
> Key: HDFS-1114
> URL: https://issues.apache.org/jira/browse/HDFS-1114
[jira] Updated: (HDFS-1114) Reducing NameNode memory usage by an alternate hash table
[ https://issues.apache.org/jira/browse/HDFS-1114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz Wo (Nicholas), SZE updated HDFS-1114:
-----------------------------------------

Status: Patch Available (was: Open)

Try resubmitting.

> Reducing NameNode memory usage by an alternate hash table
> Key: HDFS-1114
> URL: https://issues.apache.org/jira/browse/HDFS-1114
[jira] Commented: (HDFS-947) The namenode should redirect a hftp request to read a file to the datanode that has the maximum number of local replicas
[ https://issues.apache.org/jira/browse/HDFS-947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879895#action_12879895 ]

dhruba borthakur commented on HDFS-947:
---------------------------------------

Thanks dmytro, i will commit it.

> The namenode should redirect a hftp request to read a file to the datanode that has the maximum number of local replicas
> Key: HDFS-947
> URL: https://issues.apache.org/jira/browse/HDFS-947
[jira] Created: (HDFS-1241) Possible deadlock between LeaseManager and FSDirectory
Possible deadlock between LeaseManager and FSDirectory
------------------------------------------------------

Key: HDFS-1241
URL: https://issues.apache.org/jira/browse/HDFS-1241
Project: Hadoop HDFS
Issue Type: Bug
Components: name-node
Affects Versions: 0.20.2, 0.21.0
Reporter: Todd Lipcon

LeaseManager.findPath() locks LeaseManager, then FSDirectory by calling getFileINode. FSDirectory.unprotectedDelete locks itself and then calls LeaseManager.removeLeaseWithPrefixPath. This cycle could deadlock.
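The two-line report compresses a classic lock-order inversion. A self-contained toy that reproduces the shape of it, with plain monitors standing in for the real LeaseManager and FSDirectory locks (not the NameNode code):

{code}
// Thread A takes the locks in one order, thread B in the other;
// with unlucky timing both block forever.
public class DeadlockSketch {
  static final Object leaseManager = new Object();
  static final Object fsDirectory = new Object();

  public static void main(String[] args) {
    new Thread(new Runnable() {
      public void run() {                    // models LeaseManager.findPath()
        synchronized (leaseManager) {
          pause();
          synchronized (fsDirectory) {       // models getFileINode()
            System.out.println("A done");
          }
        }
      }
    }).start();
    new Thread(new Runnable() {
      public void run() {                    // models FSDirectory.unprotectedDelete()
        synchronized (fsDirectory) {
          pause();
          synchronized (leaseManager) {      // models removeLeaseWithPrefixPath()
            System.out.println("B done");
          }
        }
      }
    }).start();
  }

  static void pause() {                      // widen the race window
    try { Thread.sleep(100); } catch (InterruptedException ignored) { }
  }
}
{code}

The usual fix for this class of bug is a single global lock order -- every code path acquires the two locks in the same sequence.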
[jira] Updated: (HDFS-1241) Possible deadlock between LeaseManager and FSDirectory
[ https://issues.apache.org/jira/browse/HDFS-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon updated HDFS-1241:
------------------------------

Attachment: leasemanager.png

> Possible deadlock between LeaseManager and FSDirectory
> Key: HDFS-1241
> URL: https://issues.apache.org/jira/browse/HDFS-1241
[jira] Commented: (HDFS-752) Add interface classification stable scope to HDFS
[ https://issues.apache.org/jira/browse/HDFS-752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879947#action_12879947 ]

Suresh Srinivas commented on HDFS-752:
--------------------------------------

In my previous proposal, classes related to internal protocols and the classes used by them were marked private stable, and the rest were marked private unstable. I want to change the protocol classes' classification to private *evolving*. Until Avro is used to ensure backward compatibility, these classes cannot be marked stable. Any class that is not tagged with an interface classification is private unstable; given that, I am not planning to add interface tagging to such classes. Tom and Sanjay, let me know what you guys think.

> Add interface classification stable scope to HDFS
>
> Key: HDFS-752
> URL: https://issues.apache.org/jira/browse/HDFS-752
> Project: Hadoop HDFS
> Issue Type: New Feature
> Affects Versions: 0.21.0, 0.22.0
> Reporter: Suresh Srinivas
> Assignee: Suresh Srinivas
> Fix For: 0.21.0, 0.22.0
> Attachments: hdfs.interface.txt
>
> This jira addresses adding interface classification for the classes in hadoop hdfs, based on the mechanism described in Hadoop-5073.
[jira] Updated: (HDFS-1206) TestFiHFlush fails intermittently
[ https://issues.apache.org/jira/browse/HDFS-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz Wo (Nicholas), SZE updated HDFS-1206:
-----------------------------------------

Summary: TestFiHFlush fails intermittently (was: TestFiHFlush depends on BlocksMap implementation)

Talked to Cos. TestFiHFlush has some known problem.

> TestFiHFlush fails intermittently
>
> Key: HDFS-1206
> URL: https://issues.apache.org/jira/browse/HDFS-1206
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: test
> Reporter: Tsz Wo (Nicholas), SZE
>
> When I was testing HDFS-1114, the patch passed all tests except TestFiHFlush. Then, I tried to print out some debug messages; however, TestFiHFlush succeeded after I added the messages. TestFiHFlush probably depends on the speed of BlocksMap: if BlocksMap is slow enough, then it will pass.
[jira] Commented: (HDFS-1114) Reducing NameNode memory usage by an alternate hash table
[ https://issues.apache.org/jira/browse/HDFS-1114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879949#action_12879949 ]

Tsz Wo (Nicholas), SZE commented on HDFS-1114:
----------------------------------------------

Thanks Suresh. Hudson is not responding. Ran the tests locally:
{noformat}
     [exec] +1 overall.
     [exec]
     [exec]     +1 @author. The patch does not contain any @author tags.
     [exec]
     [exec]     +1 tests included. The patch appears to include 17 new or modified tests.
     [exec]
     [exec]     +1 javadoc. The javadoc tool did not generate any warning messages.
     [exec]
     [exec]     +1 javac. The applied patch does not increase the total number of javac compiler warnings.
     [exec]
     [exec]     +1 findbugs. The patch does not introduce any new Findbugs warnings.
     [exec]
     [exec]     +1 release audit. The applied patch does not increase the total number of release audit warnings.
{noformat}
Passed all tests except TestFiHFlush, which sometimes fails; see HDFS-1206.

> Reducing NameNode memory usage by an alternate hash table
> Key: HDFS-1114
> URL: https://issues.apache.org/jira/browse/HDFS-1114
[jira] Updated: (HDFS-1114) Reducing NameNode memory usage by an alternate hash table
[ https://issues.apache.org/jira/browse/HDFS-1114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz Wo (Nicholas), SZE updated HDFS-1114:
-----------------------------------------

Status: Resolved (was: Patch Available)
Hadoop Flags: [Reviewed]
Fix Version/s: 0.22.0
Resolution: Fixed

I have committed this.

> Reducing NameNode memory usage by an alternate hash table
> Key: HDFS-1114
> URL: https://issues.apache.org/jira/browse/HDFS-1114
[jira] Commented: (HDFS-752) Add interface classification stable scope to HDFS
[ https://issues.apache.org/jira/browse/HDFS-752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879969#action_12879969 ]

Tom White commented on HDFS-752:
--------------------------------

In the common JIRA we tagged every class with public Java visibility, so I think it makes sense to do so here too.

> Add interface classification stable scope to HDFS
> Key: HDFS-752
> URL: https://issues.apache.org/jira/browse/HDFS-752
[jira] Commented: (HDFS-752) Add interface classification stable scope to HDFS
[ https://issues.apache.org/jira/browse/HDFS-752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879974#action_12879974 ]

Suresh Srinivas commented on HDFS-752:
--------------------------------------

When you found a class that should not have been public, did you change it to private? Did you change contrib classes?

> Add interface classification stable scope to HDFS
> Key: HDFS-752
> URL: https://issues.apache.org/jira/browse/HDFS-752
[jira] Commented: (HDFS-1114) Reducing NameNode memory usage by an alternate hash table
[ https://issues.apache.org/jira/browse/HDFS-1114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879975#action_12879975 ]

Tsz Wo (Nicholas), SZE commented on HDFS-1114:
----------------------------------------------

Ran some benchmarks. When the modulus is large, which means that the number of collisions is small, LightWeightGSet is much better than GSetByHashMap.

|| datasize || modulus || GSetByHashMap || LightWeightGSet ||
| 65536 | 1025 | 219 | 234 |
| 65536 | 1048577 | 516 | 296 |
| 65536 | 1073741825 | 500 | 281 |
| 262144 | 1025 | 1422 | 1531 |
| 262144 | 1048577 | 3078 | 2156 |
| 262144 | 1073741825 | 3094 | 2281 |
| 1048576 | 1025 | 7172 | 7313 |
| 1048576 | 1048577 | 13531 | 9844 |
| 1048576 | 1073741825 | 14485 | 10718 |

> Reducing NameNode memory usage by an alternate hash table
> Key: HDFS-1114
> URL: https://issues.apache.org/jira/browse/HDFS-1114
[jira] Commented: (HDFS-752) Add interface classification stable scope to HDFS
[ https://issues.apache.org/jira/browse/HDFS-752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879982#action_12879982 ]

Tom White commented on HDFS-752:
--------------------------------

> When you found a class that should not have been public, did you change it to private?

We didn't change Java visibility, we only added annotations (e.g. @InterfaceAudience.Private).

> Did you change contrib classes?

No.

> Add interface classification stable scope to HDFS
> Key: HDFS-752
> URL: https://issues.apache.org/jira/browse/HDFS-752
[jira] Commented: (HDFS-1114) Reducing NameNode memory usage by an alternate hash table
[ https://issues.apache.org/jira/browse/HDFS-1114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879988#action_12879988 ]

Hudson commented on HDFS-1114:
------------------------------

Integrated in Hadoop-Hdfs-trunk-Commit #311 (See [http://hudson.zones.apache.org/hudson/job/Hadoop-Hdfs-trunk-Commit/311/])
HDFS-1114. Implement LightWeightGSet for BlocksMap in order to reduce NameNode memory footprint.

> Reducing NameNode memory usage by an alternate hash table
> Key: HDFS-1114
> URL: https://issues.apache.org/jira/browse/HDFS-1114
[jira] Commented: (HDFS-1114) Reducing NameNode memory usage by an alternate hash table
[ https://issues.apache.org/jira/browse/HDFS-1114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879989#action_12879989 ]

Tsz Wo (Nicholas), SZE commented on HDFS-1114:
----------------------------------------------

Comparing memory footprint on a 32-bit VM over 1,000,000 elements:
{noformat}
 num   #instances   #bytes   class name
----------------------------------------
  1:   140 24000960   java.util.HashMap$Entry
  2:   100 2400       org.apache.hadoop.hdfs.util.TestGSet$IntElement
  3:   238390960      [Ljava.util.HashMap$Entry;

HashMap: 53.78 MB

 num   #instances   #bytes   class name
----------------------------------------
  1:   100 2400       org.apache.hadoop.hdfs.util.TestGSet$IntElement
  2:   14194320       [Lorg.apache.hadoop.hdfs.util.LightWeightGSet$LinkedElement;

LightWeightGSet: 26.89 MB
{noformat}

> Reducing NameNode memory usage by an alternate hash table
> Key: HDFS-1114
> URL: https://issues.apache.org/jira/browse/HDFS-1114
[jira] Commented: (HDFS-1114) Reducing NameNode memory usage by an alternate hash table
[ https://issues.apache.org/jira/browse/HDFS-1114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879990#action_12879990 ]

Tsz Wo (Nicholas), SZE commented on HDFS-1114:
----------------------------------------------

Note that we should subtract 4*1000000 ~= 4MB from the HashMap figure, since HashMap does not require the reference for LightWeightGSet.LinkedElement.

> Reducing NameNode memory usage by an alternate hash table
> Key: HDFS-1114
> URL: https://issues.apache.org/jira/browse/HDFS-1114
[jira] Created: (HDFS-1242) 0.20 append: Add test for appendFile() race solved in HDFS-142
0.20 append: Add test for appendFile() race solved in HDFS-142
--------------------------------------------------------------

Key: HDFS-1242
URL: https://issues.apache.org/jira/browse/HDFS-1242
Project: Hadoop HDFS
Issue Type: Test
Affects Versions: 0.20-append
Reporter: Todd Lipcon
Assignee: Todd Lipcon
Fix For: 0.20-append

This is a unit test that didn't make it into branch-0.20-append, but is worth having in TestFileAppend4.
[jira] Updated: (HDFS-1242) 0.20 append: Add test for appendFile() race solved in HDFS-142
[ https://issues.apache.org/jira/browse/HDFS-1242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon updated HDFS-1242:
------------------------------

Attachment: hdfs-1242.txt

> 0.20 append: Add test for appendFile() race solved in HDFS-142
> Key: HDFS-1242
> URL: https://issues.apache.org/jira/browse/HDFS-1242
[jira] Updated: (HDFS-1243) 0.20 append: Replication tests in TestFileAppend4 should not expect immediate replication
[ https://issues.apache.org/jira/browse/HDFS-1243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon updated HDFS-1243:
------------------------------

Attachment: hdfs-1243.txt

This patch got rid of the occasional spurious failures on my hudson.

> 0.20 append: Replication tests in TestFileAppend4 should not expect immediate replication
>
> Key: HDFS-1243
> URL: https://issues.apache.org/jira/browse/HDFS-1243
> Project: Hadoop HDFS
> Issue Type: Test
> Components: test
> Affects Versions: 0.20-append
> Reporter: Todd Lipcon
> Assignee: Todd Lipcon
> Priority: Minor
> Fix For: 0.20-append
> Attachments: hdfs-1243.txt
>
> The replicationTest() cases in TestFileAppend4 currently assume that the file has both valid replicas immediately after the file is completed. However, the datanodes may take some milliseconds to report the replica - we should only expect 1 replica (dfs.replication.min) immediately after close, and we should allow up to a second or so before asserting that we reach replication 2.
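A hedged sketch of the polling pattern the issue calls for -- assert replication against a deadline instead of immediately after close(). ReplicationProbe and getReplicationOf() are stand-ins for however the test queries block locations, not the actual TestFileAppend4 code.

{code}
import java.io.IOException;

// Poll until the expected replication is reported or the deadline passes.
public class WaitForReplicationSketch {
  interface ReplicationProbe {
    int getReplicationOf(String path) throws IOException;
  }

  static void waitForReplication(ReplicationProbe probe, String path,
                                 int expected, long timeoutMs)
      throws IOException, InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (System.currentTimeMillis() < deadline) {
      if (probe.getReplicationOf(path) >= expected) {
        return;                       // datanodes have reported the replicas
      }
      Thread.sleep(50);               // give the DNs a moment to report
    }
    throw new AssertionError("replication " + expected + " not reached in "
        + timeoutMs + " ms");
  }
}
{code}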
[jira] Created: (HDFS-1244) Misc improvements to TestFileAppend2
Misc improvements to TestFileAppend2 Key: HDFS-1244 URL: https://issues.apache.org/jira/browse/HDFS-1244 Project: Hadoop HDFS Issue Type: Test Components: test Affects Versions: 0.20-append, 0.22.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Priority: Minor Fix For: 0.20-append, 0.22.0 Attachments: hdfs-1244-0.20-append.txt I've made a bunch of improvements to TestFileAppend2: - Now has a main() with various command line options to change the workload (number of DNs, number of threads, etc) - Sleeps for less time in between operations to catch races around close/reopen - Updates to JUnit 4 style, adds timeouts - Improves error messages on failure -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HDFS-1244) Misc improvements to TestFileAppend2
[ https://issues.apache.org/jira/browse/HDFS-1244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon updated HDFS-1244: -- Attachment: hdfs-1244-0.20-append.txt Misc improvements to TestFileAppend2 Key: HDFS-1244 URL: https://issues.apache.org/jira/browse/HDFS-1244 Project: Hadoop HDFS Issue Type: Test Components: test Affects Versions: 0.20-append, 0.22.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Priority: Minor Fix For: 0.20-append, 0.22.0 Attachments: hdfs-1244-0.20-append.txt I've made a bunch of improvements to TestFileAppend2: - Now has a main() with various command line options to change the workload (number of DNs, number of threads, etc) - Sleeps for less time in between operations to catch races around close/reopen - Updates to JUnit 4 style, adds timeouts - Improves error messages on failure -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-1244) Misc improvements to TestFileAppend2
[ https://issues.apache.org/jira/browse/HDFS-1244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12879997#action_12879997 ] Todd Lipcon commented on HDFS-1244: --- As an example, I have a Hudson job running: HADOOP_OPTS=-ea HADOOP_ROOT_LOGGER=DEBUG,console bin/hadoop org.apache.hadoop.hdfs.TestFileAppend2 --numDataNodes 1 --numThreads 40 --appendsPerThread 2000 --numFiles 1 HADOOP_OPTS=-ea HADOOP_ROOT_LOGGER=DEBUG,console bin/hadoop org.apache.hadoop.hdfs.TestFileAppend2 --numDataNodes 2 --numThreads 40 --appendsPerThread 2000 --numFiles 1 and it's found a couple of bugs in the 0.20 append code. Misc improvements to TestFileAppend2 Key: HDFS-1244 URL: https://issues.apache.org/jira/browse/HDFS-1244 Project: Hadoop HDFS Issue Type: Test Components: test Affects Versions: 0.20-append, 0.22.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Priority: Minor Fix For: 0.20-append, 0.22.0 Attachments: hdfs-1244-0.20-append.txt I've made a bunch of improvements to TestFileAppend2: - Now has a main() with various command line options to change the workload (number of DNs, number of threads, etc) - Sleeps for less time in between operations to catch races around close/reopen - Updates to JUnit 4 style, adds timeouts - Improves error messages on failure -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-599) Improve Namenode robustness by prioritizing datanode heartbeats over client requests
[ https://issues.apache.org/jira/browse/HDFS-599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12880007#action_12880007 ] Dmytro Molkov commented on HDFS-599: All tests passed for me Improve Namenode robustness by prioritizing datanode heartbeats over client requests Key: HDFS-599 URL: https://issues.apache.org/jira/browse/HDFS-599 Project: Hadoop HDFS Issue Type: Improvement Components: name-node Reporter: dhruba borthakur Assignee: Dmytro Molkov Fix For: 0.22.0 Attachments: HDFS-599.3.patch, HDFS-599.patch The namenode processes RPC requests from clients that are reading and writing files, as well as heartbeats/block reports from datanodes. Sometimes, for various reasons (Java GC runs, inconsistent performance of the NFS filer that stores HDFS transaction logs, etc.), the namenode encounters transient slowness. For example, if the device that stores the HDFS transaction logs becomes sluggish, the Namenode's ability to process RPCs slows down to a certain extent. During this time, the RPCs from clients as well as the RPCs from datanodes suffer in similar fashion. If the underlying problem becomes worse, the NN's ability to process a heartbeat from a DN is severely impacted, thus causing the NN to declare that the DN is dead. Then the NN starts replicating blocks that used to reside on the now-declared-dead datanode. This adds extra load to the NN. Then the now-declared-dead datanode finally re-establishes contact with the NN and sends a block report. The block report processing on the NN is another heavyweight activity, thus causing more load to the already overloaded namenode. My proposal is that the NN should try its best to continue processing RPCs from datanodes and give lesser priority to serving client requests. The datanode RPCs are integral to the consistency and performance of the Hadoop file system, and it is better to protect them at all costs. This will ensure that the NN recovers from the hiccup much faster than it does now. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
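To make the proposal concrete, one way to give datanode RPCs priority is a two-level call queue that handler threads drain datanode-first. The sketch below is illustrative only; it is not the attached patch, and all names are invented:

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

/** Illustrative two-level call queue, not the attached HDFS-599 patch. */
public class PrioritizedCallQueue {
  private final BlockingQueue<Runnable> datanodeCalls = new LinkedBlockingQueue<Runnable>();
  private final BlockingQueue<Runnable> clientCalls = new LinkedBlockingQueue<Runnable>();

  public void add(Runnable call, boolean fromDatanode) {
    (fromDatanode ? datanodeCalls : clientCalls).add(call);
  }

  /** Handler threads loop on this: heartbeats/block reports win over client RPCs. */
  public Runnable take() throws InterruptedException {
    while (true) {
      Runnable call = datanodeCalls.poll();      // datanode RPCs first
      if (call == null) {
        call = clientCalls.poll();               // then client RPCs
      }
      if (call != null) {
        return call;
      }
      // Nothing queued: block briefly on the high-priority queue, then
      // loop so freshly arrived client calls are still picked up.
      call = datanodeCalls.poll(10, TimeUnit.MILLISECONDS);
      if (call != null) {
        return call;
      }
    }
  }
}

Under this scheme client RPCs are delayed but never dropped when the NN is slow, while heartbeats and block reports keep flowing, which is the behavior the proposal argues for.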
[jira] Created: (HDFS-1245) Plugable block id generation
Plugable block id generation - Key: HDFS-1245 URL: https://issues.apache.org/jira/browse/HDFS-1245 Project: Hadoop HDFS Issue Type: New Feature Components: name-node Reporter: Dmytro Molkov The idea is to have a way to easily create block id generation engines that may fit a certain purpose. One of them could be HDFS-898, started by Konstantin, but potentially others. We chatted with Dhruba about this for a while and came up with the following approach: there should be a BlockIDGenerator interface that has the following methods: void blockAdded(Block) void blockRemoved(Block) Block nextBlock() The first two methods are needed for block generation engines that hold a certain state. During a restart, when the namenode reads the fsimage, it will notify the generator about all the blocks it reads from the image, and during runtime the namenode will notify the generator about block removals on file deletion. The instance of the generator will also have a reference to the block registry, the interface that BlockManager implements. The only method there is blockExists(Block), so that the current random block id generation can be implemented, since it needs to check with the block manager whether the id is already present. What does the community think about this proposal? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
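Transcribed into code, the proposed interfaces could look like the sketch below. The method names come from the proposal itself; the class names, the minimal Block stand-in, and the random implementation are assumptions for illustration:

import java.util.Random;

class Block {                          // minimal stand-in for o.a.h.hdfs.protocol.Block
  final long id;
  Block(long id) { this.id = id; }
}

interface BlockRegistry {              // the registry interface BlockManager would implement
  boolean blockExists(Block b);
}

interface BlockIDGenerator {
  void blockAdded(Block b);            // replayed for every block read from the fsimage
  void blockRemoved(Block b);          // called when blocks go away on file deletion
  Block nextBlock();
}

/** The current random-id behavior, recast as one pluggable engine. */
class RandomBlockIDGenerator implements BlockIDGenerator {
  private final BlockRegistry registry;
  private final Random rand = new Random();

  RandomBlockIDGenerator(BlockRegistry registry) {
    this.registry = registry;
  }

  public void blockAdded(Block b) { }   // stateless: nothing to track
  public void blockRemoved(Block b) { } // stateless: nothing to track

  public Block nextBlock() {
    Block b;
    do {
      b = new Block(rand.nextLong());   // retry until the id is unused
    } while (registry.blockExists(b));
    return b;
  }
}

A stateful engine such as the sequential-id scheme of HDFS-898 would instead use blockAdded() during image loading to learn the highest id in use.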
[jira] Updated: (HDFS-1111) getCorruptFiles() should give some hint that the list is not complete
[ https://issues.apache.org/jira/browse/HDFS-1111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rodrigo Schmidt updated HDFS-1111: -- Status: Patch Available (was: Open) getCorruptFiles() should give some hint that the list is not complete - Key: HDFS-1111 URL: https://issues.apache.org/jira/browse/HDFS-1111 Project: Hadoop HDFS Issue Type: New Feature Reporter: Rodrigo Schmidt Assignee: Rodrigo Schmidt Attachments: HADFS-1111.0.patch The list of corrupt files returned by the namenode doesn't say anything if the number of corrupted files is larger than the call output limit (which means the list is not complete). There should be a way to hint incompleteness to clients. A simple hack would be to add an extra entry to the returned array with the value null. Clients could interpret this as a sign that there are other corrupt files in the system. We should also do some rephrasing of the fsck output to make it more confident when the list is complete and less confident when the list is known to be incomplete. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
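The null-entry hack is easy to picture in code. The helper names below are invented for illustration and are not the attached patch:

/** Illustrative helpers for the null-entry incompleteness hint. */
class CorruptFilesHint {

  /** Server side: append a null entry when the list was cut off at the output limit. */
  static String[] hintTruncation(String[] corruptFiles, boolean truncated) {
    if (!truncated) {
      return corruptFiles;
    }
    String[] withHint = new String[corruptFiles.length + 1];
    System.arraycopy(corruptFiles, 0, withHint, 0, corruptFiles.length);
    withHint[corruptFiles.length] = null;   // the incompleteness marker
    return withHint;
  }

  /** Client side: a trailing null means there are more corrupt files than were returned. */
  static boolean isComplete(String[] returned) {
    return returned.length == 0 || returned[returned.length - 1] != null;
  }
}

fsck could then call isComplete() to decide whether to report "these are all the corrupt files" or only "at least these files are corrupt".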
[jira] Updated: (HDFS-1111) getCorruptFiles() should give some hint that the list is not complete
[ https://issues.apache.org/jira/browse/HDFS-1111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rodrigo Schmidt updated HDFS-1111: -- Status: Open (was: Patch Available) getCorruptFiles() should give some hint that the list is not complete - Key: HDFS-1111 URL: https://issues.apache.org/jira/browse/HDFS-1111 Project: Hadoop HDFS Issue Type: New Feature Reporter: Rodrigo Schmidt Assignee: Rodrigo Schmidt Attachments: HADFS-1111.0.patch The list of corrupt files returned by the namenode doesn't say anything if the number of corrupted files is larger than the call output limit (which means the list is not complete). There should be a way to hint incompleteness to clients. A simple hack would be to add an extra entry to the returned array with the value null. Clients could interpret this as a sign that there are other corrupt files in the system. We should also do some rephrasing of the fsck output to make it more confident when the list is complete and less confident when the list is known to be incomplete. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HDFS-1071) savenamespace should write the fsimage to all configured fs.name.dir in parallel
[ https://issues.apache.org/jira/browse/HDFS-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmytro Molkov updated HDFS-1071: Status: Open (was: Patch Available) savenamespace should write the fsimage to all configured fs.name.dir in parallel Key: HDFS-1071 URL: https://issues.apache.org/jira/browse/HDFS-1071 Project: Hadoop HDFS Issue Type: Improvement Components: name-node Reporter: dhruba borthakur Assignee: Dmytro Molkov Attachments: HDFS-1071.2.patch, HDFS-1071.3.patch, HDFS-1071.4.patch, HDFS-1071.patch If you have a large number of files in HDFS, the fsimage file is very big. When the namenode restarts, it writes a copy of the fsimage to all directories configured in fs.name.dir. This takes a long time, especially if there are many directories in fs.name.dir. Make the NN write the fsimage to all these directories in parallel. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HDFS-1246) Manual tool to test sync against a real cluster
[ https://issues.apache.org/jira/browse/HDFS-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon updated HDFS-1246: -- Attachment: hdfs-1246.txt Attaching patch for branch-0.20-append. The code is a bit messy; I'm not sure we actually want to commit it as is, but I figured others may find it useful to see. This could be made into a unit test, but I was afraid of orphaning processes, etc., so right now it only runs manually against a real cluster. Manual tool to test sync against a real cluster --- Key: HDFS-1246 URL: https://issues.apache.org/jira/browse/HDFS-1246 Project: Hadoop HDFS Issue Type: Test Components: test Affects Versions: 0.20-append Reporter: Todd Lipcon Assignee: Todd Lipcon Priority: Minor Fix For: 0.20-append Attachments: hdfs-1246.txt Contributing a tool I've built that writes data against a real cluster, calling sync as fast as it can, and then kill -9s the writer and verifies the data can be recovered. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
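A hypothetical skeleton of such a writer process (not the attached hdfs-1246.txt): append and sync in a tight loop, print the durable record count, and let a driver script kill -9 the JVM and check what is recoverable:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Hypothetical writer skeleton for a sync/kill -9 test driver. */
public class SyncWriter {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FSDataOutputStream out = fs.create(new Path(args[0]));
    long records = 0;
    while (true) {                 // runs until the driver kills the process
      out.writeLong(records);
      out.sync();                  // the 0.20-append sync(); hflush() in later APIs
      records++;
      System.out.println("synced " + records);  // the driver records the last line
    }
  }
}

After the kill -9, the driver would reopen the file, wait out lease recovery, and assert that at least as many longs as the last printed count are readable.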
[jira] Created: (HDFS-1247) Improvements to HDFS-1204 test
Improvements to HDFS-1204 test -- Key: HDFS-1247 URL: https://issues.apache.org/jira/browse/HDFS-1247 Project: Hadoop HDFS Issue Type: Test Components: test Affects Versions: 0.20-append Reporter: Todd Lipcon Assignee: Todd Lipcon Fix For: 0.20-append The test from HDFS-1204 currently generates some warnings when compiling. Here's a small patch to clean up the test. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HDFS-1247) Improvements to HDFS-1204 test
[ https://issues.apache.org/jira/browse/HDFS-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon updated HDFS-1247: -- Attachment: hdfs-1247.txt The patch switches to just using Mockito's verification support rather than an Answer, and fixes the compile warnings. Improvements to HDFS-1204 test -- Key: HDFS-1247 URL: https://issues.apache.org/jira/browse/HDFS-1247 Project: Hadoop HDFS Issue Type: Test Components: test Affects Versions: 0.20-append Reporter: Todd Lipcon Assignee: Todd Lipcon Fix For: 0.20-append Attachments: hdfs-1247.txt The test from HDFS-1204 currently generates some warnings when compiling. Here's a small patch to clean up the test. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
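The switch is easy to illustrate generically; the Listener type below is invented for this sketch and is not the actual HDFS-1204 test code:

import static org.mockito.Mockito.*;

public class VerifyInsteadOfAnswer {
  // Listener is invented for this sketch; the real patch edits the HDFS-1204 test.
  interface Listener {
    void onEvent(String name);
  }

  public static void main(String[] args) {
    Listener listener = mock(Listener.class);

    listener.onEvent("block-received");   // stands in for the code under test

    // Instead of installing an Answer that records the call into a flag
    // and asserting on the flag later, let Mockito do the bookkeeping:
    verify(listener).onEvent("block-received");
    verifyNoMoreInteractions(listener);
  }
}

verify() also produces a far better failure message than a hand-rolled flag, since Mockito reports which interactions actually happened.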
[jira] Created: (HDFS-1248) Misc cleanup/logging improvements for branch-20-append
Misc cleanup/logging improvements for branch-20-append -- Key: HDFS-1248 URL: https://issues.apache.org/jira/browse/HDFS-1248 Project: Hadoop HDFS Issue Type: Improvement Components: data-node, name-node, test Affects Versions: 0.20-append Reporter: Todd Lipcon Assignee: Todd Lipcon Priority: Minor Fix For: 0.20-append Attachments: hdfs-1248.txt Last remaining bits of my append branch that didn't fit elsewhere in JIRA (just misc cleanup) - Slight cleanup to recoverFile() function in TFA4 - Improve error messages on OP_READ_BLOCK - Some comment cleanup in FSNamesystem - Remove toInodeUnderConstruction (not used) - Add some checks for null blocks to avoid NPE - Only log inconsistent size warnings at WARN level for non-under-construction blocks. - Redundant addStoredBlock calls are also not worthy of WARN level - Add some extra information to a warning in ReplicationTargetChooser This may need HDFS-1057 to be committed first to apply. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HDFS-1248) Misc cleanup/logging improvements for branch-20-append
[ https://issues.apache.org/jira/browse/HDFS-1248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon updated HDFS-1248: -- Attachment: hdfs-1248.txt Misc cleanup/logging improvements for branch-20-append -- Key: HDFS-1248 URL: https://issues.apache.org/jira/browse/HDFS-1248 Project: Hadoop HDFS Issue Type: Improvement Components: data-node, name-node, test Affects Versions: 0.20-append Reporter: Todd Lipcon Assignee: Todd Lipcon Priority: Minor Fix For: 0.20-append Attachments: hdfs-1248.txt Last remaining bits of my append branch that didn't fit elsewhere in JIRA (just misc cleanup) - Slight cleanup to recoverFile() function in TFA4 - Improve error messages on OP_READ_BLOCK - Some comment cleanup in FSNamesystem - Remove toInodeUnderConstruction (not used) - Add some checks for null blocks to avoid NPE - Only log inconsistent size warnings at WARN level for non-under-construction blocks. - Redundant addStoredBlock calls are also not worthy of WARN level - Add some extra information to a warning in ReplicationTargetChooser This may need HDFS-1057 to be committed first to apply. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.