[jira] [Updated] (HDFS-2077) 1073: address checkpoint upload when one of the storage dirs is failed
[ https://issues.apache.org/jira/browse/HDFS-2077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon updated HDFS-2077: -- Attachment: hdfs-2077.txt Updated patch to fix the above dumb bug. > 1073: address checkpoint upload when one of the storage dirs is failed > -- > > Key: HDFS-2077 > URL: https://issues.apache.org/jira/browse/HDFS-2077 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: name-node >Affects Versions: Edit log branch (HDFS-1073) >Reporter: Todd Lipcon >Assignee: Todd Lipcon > Fix For: Edit log branch (HDFS-1073) > > Attachments: hdfs-2077.txt, hdfs-2077.txt > > > This JIRA addresses the following case: > - NN is running with 2 storage dirs > - 1 of the dirs fails > - 2NN makes a checkpoint > Currently, if GetImageServlet fails to open _any_ of the local files to > receive a checkpoint, it will fail the entire checkpoint upload process. > Instead, it should continue to receive checkpoints in the non-failed > directories. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
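The fix described above amounts to per-directory fault isolation during checkpoint upload. The sketch below is illustrative only, not the actual GetImageServlet code; all class and method names are hypothetical:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CheckpointUploadSketch {

    // Write the uploaded image bytes to every storage directory's stream,
    // skipping directories that fail instead of aborting the whole upload.
    // Returns the streams that were written successfully.
    public static List<OutputStream> writeToAll(byte[] data, List<OutputStream> dirs) {
        List<OutputStream> ok = new ArrayList<>();
        for (OutputStream out : dirs) {
            try {
                out.write(data);
                ok.add(out);
            } catch (IOException e) {
                // failed dir: skip it and keep uploading to the healthy ones
            }
        }
        if (ok.isEmpty()) {
            throw new IllegalStateException("checkpoint upload failed in every storage dir");
        }
        return ok;
    }

    // Helper used for demonstration: a stream that simulates a failed drive.
    public static OutputStream failingStream() {
        return new OutputStream() {
            @Override
            public void write(int b) throws IOException {
                throw new IOException("simulated disk failure");
            }
        };
    }

    public static void main(String[] args) {
        ByteArrayOutputStream good = new ByteArrayOutputStream();
        List<OutputStream> ok = writeToAll(new byte[] {1, 2, 3},
                Arrays.asList(failingStream(), good));
        System.out.println("succeeded in " + ok.size() + " of 2 dirs");
    }
}
```

The key point is that the upload only fails as a whole when no directory accepted the checkpoint.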
[jira] [Updated] (HDFS-2080) Speed up DFS read path
[ https://issues.apache.org/jira/browse/HDFS-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon updated HDFS-2080: -- Attachment: hdfs-2080.txt Updated patch to fix a couple of the issues mentioned above:
- fix a couple of tests which used {{new Socket}} directly instead of the {{SocketFactory}} -- thus they didn't have associated Channels and BlockReader failed
- fix BlockReader to handle EOF correctly (fixes TestClientBlockVerification)
- fix TestSeekBug to use readFully where necessary
The append-related bug still exists, but this patch should be useful enough for people to play around with if interested. > Speed up DFS read path > -- > > Key: HDFS-2080 > URL: https://issues.apache.org/jira/browse/HDFS-2080 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs client >Affects Versions: 0.23.0 >Reporter: Todd Lipcon >Assignee: Todd Lipcon > Fix For: 0.23.0 > > Attachments: hdfs-2080.txt, hdfs-2080.txt > > > I've developed a series of patches that speeds up the HDFS read path by a > factor of about 2.5x (~300M/sec to ~800M/sec for localhost reading from > buffer cache) and also will make it easier to allow for advanced users (eg > hbase) to skip a buffer copy. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
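The first test fix above hinges on a JDK detail worth spelling out: a {{Socket}} constructed directly has no associated NIO channel, whereas a socket obtained from a {{SocketChannel}} does, and a channel-based read path needs the latter. A small self-contained illustration (not HDFS code):

```java
import java.io.IOException;
import java.net.Socket;
import java.nio.channels.SocketChannel;

public class SocketChannelCheck {

    // A channel-based BlockReader needs Socket.getChannel() to be non-null.
    public static boolean hasChannel(Socket s) {
        return s.getChannel() != null;
    }

    // Socket created with `new Socket()` -- the broken pattern in the tests.
    public static boolean plainSocketHasChannel() {
        try (Socket s = new Socket()) {
            return hasChannel(s);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    // Socket obtained from a SocketChannel -- what a channel-aware
    // SocketFactory would hand out.
    public static boolean channelBackedSocketHasChannel() {
        try (SocketChannel ch = SocketChannel.open()) {
            return hasChannel(ch.socket());
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("new Socket() has channel: " + plainSocketHasChannel());
        System.out.println("SocketChannel socket has channel: " + channelBackedSocketHasChannel());
    }
}
```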
[jira] [Updated] (HDFS-2093) 1073: Handle case where an entirely empty log is left during NN crash
[ https://issues.apache.org/jira/browse/HDFS-2093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon updated HDFS-2093: -- Attachment: hdfs-2093.txt Updated patch addresses the case where the above happens and there is only one storage dir. I took the conservative route and consider the single empty segment corrupt, since it's a very rare failure, more likely to occur due to drive corruption than to a well-timed crash. > 1073: Handle case where an entirely empty log is left during NN crash > - > > Key: HDFS-2093 > URL: https://issues.apache.org/jira/browse/HDFS-2093 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: name-node >Affects Versions: Edit log branch (HDFS-1073) >Reporter: Todd Lipcon >Assignee: Todd Lipcon > Fix For: Edit log branch (HDFS-1073) > > Attachments: hdfs-2093.txt, hdfs-2093.txt > > > In fault-testing the HDFS-1073 branch, I saw the following situation: > - NN has two storage directories, but one is in failed state > - NN starts to roll edits logs to edits_inprogress_5160285 > - NN then crashes > - on restart, it detects the truncated log, but since it has 0 txns, it > finalizes it to the nonsense log name edits_5160285-5160284. > - It then starts logs again at edits_inprogress_5160285. > - After this point, no checkpoints or future NN startups succeed since there > are two logs starting with the same txid -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
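The rule the patch enforces can be illustrated roughly like this (names are hypothetical; the real logic lives in the HDFS-1073 branch). A finalized segment covers txids [start, start + numTxns - 1], so finalizing a zero-transaction segment would yield an end txid one less than the start txid, the nonsense name seen in the bug report:

```java
public class EditSegmentSketch {

    public enum State { VALID, CORRUPT }

    // An in-progress segment recovered with zero transactions is treated as
    // corrupt rather than finalized.
    public static State classify(long startTxId, int numTxns) {
        return numTxns == 0 ? State.CORRUPT : State.VALID;
    }

    // A finalized segment is named for the txid range it covers.
    public static String finalizedName(long startTxId, int numTxns) {
        if (numTxns == 0) {
            // blindly finalizing would yield e.g. edits_5160285-5160284
            throw new IllegalArgumentException("refusing to finalize an empty segment");
        }
        return String.format("edits_%d-%d", startTxId, startTxId + numTxns - 1);
    }

    public static void main(String[] args) {
        System.out.println(classify(5160285L, 0));      // CORRUPT
        System.out.println(finalizedName(5160285L, 3)); // edits_5160285-5160287
    }
}
```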
[jira] [Commented] (HDFS-2084) Sometimes backup node/secondary name node stops with exception
[ https://issues.apache.org/jira/browse/HDFS-2084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13052360#comment-13052360 ] Vitalii Tymchyshyn commented on HDFS-2084: -- I don't know how it happens, but it happens often for me. I have a copy of the namenode state that reproduces this problem on start, which I can share. In general, it would be great to have an option to perform a checkpoint that skips invalid records of any kind, because right now any such record makes the namenode unusable on restart, while with such an option you would at most lose a bit of information. > Sometimes backup node/secondary name node stops with exception > -- > > Key: HDFS-2084 > URL: https://issues.apache.org/jira/browse/HDFS-2084 > Project: Hadoop HDFS > Issue Type: Bug > Components: name-node >Affects Versions: 0.21.0 > Environment: FreeBSD >Reporter: Vitalii Tymchyshyn > Attachments: patch.diff > > > 2011-06-17 11:43:23,096 ERROR > org.apache.hadoop.hdfs.server.namenode.Checkpointer: Throwable Exception in > doCheckpoint: > java.lang.NullPointerException > at > org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedSetTimes(FSDirectory.java:1765) > at > org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedSetTimes(FSDirectory.java:1753) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadEditRecords(FSEditLog.java:708) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:411) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:378) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1209) > at > org.apache.hadoop.hdfs.server.namenode.BackupStorage.loadCheckpoint(BackupStorage.java:158) > at > org.apache.hadoop.hdfs.server.namenode.Checkpointer.doCheckpoint(Checkpointer.java:243) > at > org.apache.hadoop.hdfs.server.namenode.Checkpointer.run(Checkpointer.java:141) -- This message is automatically generated by JIRA. 
For more information on JIRA, see: http://www.atlassian.com/software/jira
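The option requested above would amount to a best-effort replay loop. A rough, hypothetical sketch (not existing HDFS code; edit operations are modeled as plain Runnables):

```java
import java.util.Arrays;
import java.util.List;

public class EditReplaySketch {

    // Replay a list of edit operations. With skipInvalid=false (today's
    // behaviour) the first bad record aborts everything; with
    // skipInvalid=true the bad record is dropped and replay continues.
    public static int replay(List<Runnable> records, boolean skipInvalid) {
        int applied = 0;
        for (Runnable op : records) {
            try {
                op.run();
                applied++;
            } catch (RuntimeException e) {
                if (!skipInvalid) {
                    throw e;
                }
                // else: log and skip the invalid record
            }
        }
        return applied;
    }

    public static void main(String[] args) {
        List<Runnable> records = Arrays.asList(
                () -> {},
                () -> { throw new RuntimeException("invalid record"); },
                () -> {});
        System.out.println("applied with skip: " + replay(records, true));
    }
}
```

The trade-off is exactly the one the commenter names: a skipped record is lost information, but the namenode can still come up.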
[jira] [Updated] (HDFS-2093) 1073: Handle case where an entirely empty log is left during NN crash
[ https://issues.apache.org/jira/browse/HDFS-2093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon updated HDFS-2093: -- Attachment: hdfs-2093.txt Attached patch considers such logs as corrupt at startup time. Thus in the situation above, where the only log we have is this corrupted one, it will refuse to let the NN start, with a nice message explaining that the logs starting at this txid are corrupt with no txns. The operator can then double-check whether a different storage drive which possibly went missing might have better logs, etc, before starting NN. > 1073: Handle case where an entirely empty log is left during NN crash > - > > Key: HDFS-2093 > URL: https://issues.apache.org/jira/browse/HDFS-2093 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: name-node >Affects Versions: Edit log branch (HDFS-1073) >Reporter: Todd Lipcon >Assignee: Todd Lipcon > Fix For: Edit log branch (HDFS-1073) > > Attachments: hdfs-2093.txt > > > In fault-testing the HDFS-1073 branch, I saw the following situation: > - NN has two storage directories, but one is in failed state > - NN starts to roll edits logs to edits_inprogress_5160285 > - NN then crashes > - on restart, it detects the truncated log, but since it has 0 txns, it > finalizes it to the nonsense log name edits_5160285-5160284. > - It then starts logs again at edits_inprogress_5160285. > - After this point, no checkpoints or future NN startups succeed since there > are two logs starting with the same txid -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (HDFS-2093) 1073: Handle case where an entirely empty log is left during NN crash
1073: Handle case where an entirely empty log is left during NN crash - Key: HDFS-2093 URL: https://issues.apache.org/jira/browse/HDFS-2093 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Todd Lipcon Assignee: Todd Lipcon -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-2093) 1073: Handle case where an entirely empty log is left during NN crash
[ https://issues.apache.org/jira/browse/HDFS-2093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13052347#comment-13052347 ] Todd Lipcon commented on HDFS-2093: --- The really good news here is that the robustness of this design even in the presence of bugs proved itself - I removed the nonsense log file and the NN started with no corruption, 2NN checkpointed happily, no data lost. > 1073: Handle case where an entirely empty log is left during NN crash > - > > Key: HDFS-2093 > URL: https://issues.apache.org/jira/browse/HDFS-2093 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: name-node >Affects Versions: Edit log branch (HDFS-1073) >Reporter: Todd Lipcon >Assignee: Todd Lipcon > Fix For: Edit log branch (HDFS-1073) > > > In fault-testing the HDFS-1073 branch, I saw the following situation: > - NN has two storage directories, but one is in failed state > - NN starts to roll edits logs to edits_inprogress_5160285 > - NN then crashes > - on restart, it detects the truncated log, but since it has 0 txns, it > finalizes it to the nonsense log name edits_5160285-5160284. > - It then starts logs again at edits_inprogress_5160285. > - After this point, no checkpoints or future NN startups succeed since there > are two logs starting with the same txid -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-2093) 1073: Handle case where an entirely empty log is left during NN crash
[ https://issues.apache.org/jira/browse/HDFS-2093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon updated HDFS-2093: -- Component/s: name-node Description: In fault-testing the HDFS-1073 branch, I saw the following situation: - NN has two storage directories, but one is in failed state - NN starts to roll edits logs to edits_inprogress_5160285 - NN then crashes - on restart, it detects the truncated log, but since it has 0 txns, it finalizes it to the nonsense log name edits_5160285-5160284. - It then starts logs again at edits_inprogress_5160285. - After this point, no checkpoints or future NN startups succeed since there are two logs starting with the same txid Affects Version/s: Edit log branch (HDFS-1073) Fix Version/s: Edit log branch (HDFS-1073) > 1073: Handle case where an entirely empty log is left during NN crash > - > > Key: HDFS-2093 > URL: https://issues.apache.org/jira/browse/HDFS-2093 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: name-node >Affects Versions: Edit log branch (HDFS-1073) >Reporter: Todd Lipcon >Assignee: Todd Lipcon > Fix For: Edit log branch (HDFS-1073) > > > In fault-testing the HDFS-1073 branch, I saw the following situation: > - NN has two storage directories, but one is in failed state > - NN starts to roll edits logs to edits_inprogress_5160285 > - NN then crashes > - on restart, it detects the truncated log, but since it has 0 txns, it > finalizes it to the nonsense log name edits_5160285-5160284. > - It then starts logs again at edits_inprogress_5160285. > - After this point, no checkpoints or future NN startups succeed since there > are two logs starting with the same txid -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (HDFS-2094) Add metrics for write pipeline failures
Add metrics for write pipeline failures --- Key: HDFS-2094 URL: https://issues.apache.org/jira/browse/HDFS-2094 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 0.23.0 Reporter: Bharath Mundlapudi Assignee: Bharath Mundlapudi Fix For: 0.23.0 The write pipeline can fail for various reasons, such as RPC connection issues, disk problems, etc. I am proposing to add metrics to detect write pipeline issues. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
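A minimal sketch of what such counters could look like (names are made up; a real implementation would plug into Hadoop's metrics framework rather than use bare atomics):

```java
import java.util.concurrent.atomic.AtomicLong;

public class PipelineMetricsSketch {
    // Hypothetical counters for pipeline health.
    private final AtomicLong failures = new AtomicLong();
    private final AtomicLong recoveries = new AtomicLong();

    public void onPipelineFailure() { failures.incrementAndGet(); }
    public void onPipelineRecovery() { recoveries.incrementAndGet(); }

    public long failures() { return failures.get(); }
    public long recoveries() { return recoveries.get(); }

    public static void main(String[] args) {
        PipelineMetricsSketch m = new PipelineMetricsSketch();
        m.onPipelineFailure();
        m.onPipelineRecovery();
        System.out.println(m.failures() + " failures, " + m.recoveries() + " recoveries");
    }
}
```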
[jira] [Created] (HDFS-2092) Remove configuration object reference in DFSClient
Remove configuration object reference in DFSClient -- Key: HDFS-2092 URL: https://issues.apache.org/jira/browse/HDFS-2092 Project: Hadoop HDFS Issue Type: Bug Components: hdfs client Affects Versions: 0.23.0 Reporter: Bharath Mundlapudi Assignee: Bharath Mundlapudi Fix For: 0.23.0 At present, DFSClient stores a reference to the configuration object. Since these configuration objects can be pretty big at times, they can bloat processes that hold multiple DFSClient objects, such as the TaskTracker. This is an attempt to remove the reference to the conf object in DFSClient. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
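The proposed refactoring boils down to copying the handful of values the client needs out of the configuration at construction time, so the big Configuration object can be garbage-collected. An illustrative sketch; the field names, key names, and defaults are assumptions for demonstration, not the actual HDFS configuration keys, and a plain Map stands in for Hadoop's Configuration:

```java
import java.util.HashMap;
import java.util.Map;

public class DfsClientConfSketch {
    public final int socketTimeoutMs;
    public final long blockSize;
    public final int writePacketSize;

    // Copy out only the values the client needs; no reference to the
    // full configuration map is retained after construction.
    public DfsClientConfSketch(Map<String, String> conf) {
        this.socketTimeoutMs =
                Integer.parseInt(conf.getOrDefault("dfs.client.socket-timeout", "60000"));
        this.blockSize =
                Long.parseLong(conf.getOrDefault("dfs.blocksize", "67108864"));
        this.writePacketSize =
                Integer.parseInt(conf.getOrDefault("dfs.client-write-packet-size", "65536"));
    }

    public static void main(String[] args) {
        DfsClientConfSketch c = new DfsClientConfSketch(new HashMap<>());
        System.out.println(c.blockSize);
    }
}
```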
[jira] [Resolved] (HDFS-2091) Hadoop does not scale as expected
[ https://issues.apache.org/jira/browse/HDFS-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon resolved HDFS-2091. --- Resolution: Invalid Hi Alberto. This is the bug tracker rather than a place for questions. You might try the mapreduce-user mailing list. > Hadoop does not scale as expected > - > > Key: HDFS-2091 > URL: https://issues.apache.org/jira/browse/HDFS-2091 > Project: Hadoop HDFS > Issue Type: Bug > Environment: Linux, 8 nodes. > Reporter: Alberto Andreotti > Original Estimate: 504h > Remaining Estimate: 504h > > The more nodes I add to this application, the slower it goes. This is the app's map,
> public void map(IntWritable linearPos, FloatWritable heat, Context context) throws IOException, InterruptedException {
>    int myLinearPos = linearPos.get();
>    //Distribute my value to the previous and the next
>    linearPos.set(myLinearPos - 1);
>    context.write(linearPos, heat);
>    linearPos.set(myLinearPos + 1);
>    context.write(linearPos, heat);
>    //Distribute my value to the cells above and below
>    linearPos.set(myLinearPos - MatrixData.Length());
>    context.write(linearPos, heat);
>    linearPos.set(myLinearPos + MatrixData.Length());
>    context.write(linearPos, heat);
> }//end map
> and this is the reduce,
> public void reduce(IntWritable linearPos, Iterable<FloatWritable> fwValues, Context context) throws IOException, InterruptedException {
>    //Handle first and last "cold" boundaries
>    if(linearPos.get()<0 || linearPos.get()>MatrixData.LinearSize()){
>      return;
>    }
>    if(linearPos.get()==MatrixData.HeatSourceLinearPos()){
>      context.write(linearPos, new FloatWritable(MatrixData.HeatSourceTemperature()));
>      return;
>    }
>    float result = 0.0f;
>    //Add all the values
>    for(FloatWritable heat : fwValues) {
>      result += heat.get();
>    }
>    context.write(linearPos, new FloatWritable(result/4));
> }
> For example, with 6 nodes I get a running time of 15 minutes, and with 4 nodes I get a running time of 8 minutes!
> This is how I generated the input,
> public static void main(String[] args) throws IOException {
>    //Write file in the local dir
>    String uri = "/home/beto/mySeq";
>    Configuration conf = new Configuration();
>    FileSystem fs = FileSystem.get(URI.create(uri), conf);
>    Path path = new Path(uri);
>    IntWritable key = new IntWritable();
>    FloatWritable value = new FloatWritable(0.0f);
>    SequenceFile.Writer writer = null;
>    try {
>      writer = SequenceFile.createWriter(fs, conf, path, key.getClass(), value.getClass());
>      int step = MatrixData.LinearSize()/10;
>      int limit = step;
>      for (int i = 0; i <= MatrixData.LinearSize(); i++) {
>        key.set(i);
>        if(i>limit){
>          System.out.println("*");
>          limit += step;
>        }
>        if(i==MatrixData.HeatSourceLinearPos()) {
>          writer.append(key, new FloatWritable(MatrixData.HeatSourceTemperature()));
>          continue;
>        }
>        writer.append(key, value);
>      }
>    } finally {
>      IOUtils.closeStream(writer);
>    }
> }
> I'm basically solving a heat transfer problem in a square section. Pretty simple. The input data is stored as (key, value) pairs, read in this way, processed, and written again in the same format.
> Any thoughts?
> Alberto. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (HDFS-2091) Hadoop does not scale as expected
Hadoop does not scale as expected - Key: HDFS-2091 URL: https://issues.apache.org/jira/browse/HDFS-2091 Project: Hadoop HDFS Issue Type: Bug Environment: Linux, 8 nodes. Reporter: Alberto Andreotti The more nodes I add to this application, the slower it goes. This is the app's map,
public void map(IntWritable linearPos, FloatWritable heat, Context context) throws IOException, InterruptedException {
   int myLinearPos = linearPos.get();
   //Distribute my value to the previous and the next
   linearPos.set(myLinearPos - 1);
   context.write(linearPos, heat);
   linearPos.set(myLinearPos + 1);
   context.write(linearPos, heat);
   //Distribute my value to the cells above and below
   linearPos.set(myLinearPos - MatrixData.Length());
   context.write(linearPos, heat);
   linearPos.set(myLinearPos + MatrixData.Length());
   context.write(linearPos, heat);
}//end map
and this is the reduce,
public void reduce(IntWritable linearPos, Iterable<FloatWritable> fwValues, Context context) throws IOException, InterruptedException {
   //Handle first and last "cold" boundaries
   if(linearPos.get()<0 || linearPos.get()>MatrixData.LinearSize()){
     return;
   }
   if(linearPos.get()==MatrixData.HeatSourceLinearPos()){
     context.write(linearPos, new FloatWritable(MatrixData.HeatSourceTemperature()));
     return;
   }
   float result = 0.0f;
   //Add all the values
   for(FloatWritable heat : fwValues) {
     result += heat.get();
   }
   context.write(linearPos, new FloatWritable(result/4));
}
For example, with 6 nodes I get a running time of 15 minutes, and with 4 nodes I get a running time of 8 minutes!
This is how I generated the input,
public static void main(String[] args) throws IOException {
   //Write file in the local dir
   String uri = "/home/beto/mySeq";
   Configuration conf = new Configuration();
   FileSystem fs = FileSystem.get(URI.create(uri), conf);
   Path path = new Path(uri);
   IntWritable key = new IntWritable();
   FloatWritable value = new FloatWritable(0.0f);
   SequenceFile.Writer writer = null;
   try {
     writer = SequenceFile.createWriter(fs, conf, path, key.getClass(), value.getClass());
     int step = MatrixData.LinearSize()/10;
     int limit = step;
     for (int i = 0; i <= MatrixData.LinearSize(); i++) {
       key.set(i);
       if(i>limit){
         System.out.println("*");
         limit += step;
       }
       if(i==MatrixData.HeatSourceLinearPos()) {
         writer.append(key, new FloatWritable(MatrixData.HeatSourceTemperature()));
         continue;
       }
       writer.append(key, value);
     }
   } finally {
     IOUtils.closeStream(writer);
   }
}
I'm basically solving a heat transfer problem in a square section. Pretty simple. The input data is stored as (key, value) pairs, read in this way, processed, and written again in the same format.
Any thoughts?
Alberto. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-1568) Improve DataXceiver error logging
[ https://issues.apache.org/jira/browse/HDFS-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13052339#comment-13052339 ] Hadoop QA commented on HDFS-1568:
-1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12483041/HDFS-1568-6.patch against trunk revision 1137675.
+1 @author. The patch does not contain any @author tags.
-1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.
+1 system test framework. The patch passed system test framework compile.
Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/803//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/803//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/803//console
This message is automatically generated. 
> Improve DataXceiver error logging > - > > Key: HDFS-1568 > URL: https://issues.apache.org/jira/browse/HDFS-1568 > Project: Hadoop HDFS > Issue Type: Improvement > Components: data-node >Affects Versions: 0.23.0 >Reporter: Todd Lipcon >Assignee: Joey Echeverria >Priority: Minor > Labels: newbie > Attachments: HDFS-1568-1.patch, HDFS-1568-3.patch, HDFS-1568-4.patch, > HDFS-1568-5.patch, HDFS-1568-6.patch, HDFS-1568-output-changes.patch > > > In supporting customers we often see things like SocketTimeoutExceptions or > EOFExceptions coming from DataXceiver, but the logging isn't very good. For > example, if we get an IOE while setting up a connection to the downstream > mirror in writeBlock, the IP of the downstream mirror isn't logged on the DN > side. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-2090) BackupNode fails when log is streamed due checksum error
[ https://issues.apache.org/jira/browse/HDFS-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13052322#comment-13052322 ] André Oriani commented on HDFS-2090: According to my investigation, and with the help of Ivan Kelly from Yahoo, the commit below introduced the bug: {panel:borderStyle=solid} Commit 27b956fa62ce9b467ab7dd287dd6dcd5ab6a0cb3 Author: Hairong Kuang Date: Mon Apr 11 17:15:27 2011 +0000 HDFS-1630. Support fsedits checksum. Contrbuted by Hairong Kuang. git-svn-id: https://svn.apache.org/repos/asf/hadoop/hdfs/trunk@1091131 13f79535-47bb-0310-9956-ffa450edef68 {panel} PS: This is a GitHub commit. > BackupNode fails when log is streamed due checksum error > - > > Key: HDFS-2090 > URL: https://issues.apache.org/jira/browse/HDFS-2090 > Project: Hadoop HDFS > Issue Type: Bug > Components: name-node >Affects Versions: 0.23.0 >Reporter: André Oriani > > *Reproduction steps:* > 1) An HDFS cluster is up and running > 2) A backupnode is up, running, and registered to the namenode > 3) Do a write operation like copying a file to the FS. 
> *Expected Result:* No exception is thrown > *Actual Result:* An exception is thrown due to a checksum error in the streamed > log: > {panel:title=log| borderStyle=solid} > 11/06/15 17:52:22 INFO ipc.Server: IPC Server handler 1 on 50100, call > journal(NamenodeRegistration(localhost:8020, role=NameNode), 101, 164, > [B@3951f910), rpc version=1, client version=5, methodsFingerPrint=302283637 > from 192.168.1.102:56780: error: java.io.IOException: Error replaying edit > log at offset 13 > Recent opcode offsets: 1 > java.io.IOException: Error replaying edit log at offset 13 > Recent opcode offsets: 1 > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:514) > at > org.apache.hadoop.hdfs.server.namenode.BackupImage.journal(BackupImage.java:242) > at > org.apache.hadoop.hdfs.server.namenode.BackupNode.journal(BackupNode.java:251) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > at java.lang.reflect.Method.invoke(Method.java:597) > at > org.apache.hadoop.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:422) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1496) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1492) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:396) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1131) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1490) > Caused by: org.apache.hadoop.fs.ChecksumException: Transaction 1 is corrupt. 
> Calculated checksum is -2116249809 but read checksum 0 > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.validateChecksum(FSEditLogLoader.java:546) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:490) > ... 13 more > {panel} -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
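The failing check can be illustrated with a stand-in checksum. This is not the real FSEditLogLoader code: HDFS-1630 computes a CRC over each transaction with a pure-Java CRC32 implementation, and `java.util.zip.CRC32` stands in for it here. The symptom in the stack trace is a computed value compared against a stored checksum of 0:

```java
import java.util.zip.CRC32;

public class TxChecksumSketch {

    // Checksum over a transaction's serialized bytes.
    public static int checksumOf(byte[] txBytes) {
        CRC32 crc = new CRC32();
        crc.update(txBytes, 0, txBytes.length);
        return (int) crc.getValue();
    }

    // The shape of the check that fails on the BackupNode: the streamed
    // transaction apparently arrives carrying a stored checksum of 0.
    public static void validate(byte[] txBytes, int readChecksum) {
        int calculated = checksumOf(txBytes);
        if (calculated != readChecksum) {
            throw new IllegalStateException("Transaction is corrupt. Calculated checksum is "
                    + calculated + " but read checksum " + readChecksum);
        }
    }

    public static void main(String[] args) {
        byte[] tx = {1, 2, 3, 4};
        validate(tx, checksumOf(tx)); // matching checksum: passes
        try {
            validate(tx, 0);          // stored checksum 0: fails, as on the BackupNode
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```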
[jira] [Created] (HDFS-2090) BackupNode fails when log is streamed due checksum error
BackupNode fails when log is streamed due checksum error - Key: HDFS-2090 URL: https://issues.apache.org/jira/browse/HDFS-2090 Project: Hadoop HDFS Issue Type: Bug Components: name-node Affects Versions: 0.23.0 Reporter: André Oriani *Reproduction steps:* 1) An HDFS cluster is up and running 2) A backupnode is up, running, and registered to the namenode 3) Do a write operation like copying a file to the FS. *Expected Result:* No exception is thrown *Actual Result:* An exception is thrown due to a checksum error in the streamed log: {panel:title=log| borderStyle=solid} 11/06/15 17:52:22 INFO ipc.Server: IPC Server handler 1 on 50100, call journal(NamenodeRegistration(localhost:8020, role=NameNode), 101, 164, [B@3951f910), rpc version=1, client version=5, methodsFingerPrint=302283637 from 192.168.1.102:56780: error: java.io.IOException: Error replaying edit log at offset 13 Recent opcode offsets: 1 java.io.IOException: Error replaying edit log at offset 13 Recent opcode offsets: 1 at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:514) at org.apache.hadoop.hdfs.server.namenode.BackupImage.journal(BackupImage.java:242) at org.apache.hadoop.hdfs.server.namenode.BackupNode.journal(BackupNode.java:251) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:422) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1496) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1492) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1131) at 
org.apache.hadoop.ipc.Server$Handler.run(Server.java:1490) Caused by: org.apache.hadoop.fs.ChecksumException: Transaction 1 is corrupt. Calculated checksum is -2116249809 but read checksum 0 at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.validateChecksum(FSEditLogLoader.java:546) at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:490) ... 13 more {panel} -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
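For readers unfamiliar with the per-transaction checksumming that HDFS-1630 added (and that this trace fails in), here is a minimal, hypothetical sketch; the class and method names are invented for illustration, not the actual FSEditLogLoader code. Each serialized op is followed by a CRC; a reader that computes a nonzero value while the stored checksum is 0, as in the log above, suggests the checksum was never written on the path that streamed the edits.

```java
import java.io.IOException;
import java.util.zip.CRC32;

public class EditLogChecksum {
    // Compute a CRC32 over a serialized transaction payload.
    static long checksumOf(byte[] txnPayload) {
        CRC32 crc = new CRC32();
        crc.update(txnPayload, 0, txnPayload.length);
        return crc.getValue();
    }

    // Mirror of the validateChecksum() failure in the trace: a mismatch
    // between the calculated and stored values marks the transaction corrupt.
    static void validate(byte[] txnPayload, long readChecksum) throws IOException {
        long calculated = checksumOf(txnPayload);
        if (calculated != readChecksum) {
            throw new IOException("Transaction is corrupt. Calculated checksum is "
                + (int) calculated + " but read checksum " + readChecksum);
        }
    }
}
```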
[jira] [Commented] (HDFS-2086) If the include hosts list contains host name, after restarting namenode, datanodes registrant is denied
[ https://issues.apache.org/jira/browse/HDFS-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13052314#comment-13052314 ] Hadoop QA commented on HDFS-2086: - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12483221/HDFS-2086.patch against trunk revision 1137675. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 4 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these core unit tests: org.apache.hadoop.hdfs.server.namenode.TestStartup +1 contrib tests. The patch passed contrib unit tests. +1 system test framework. The patch passed system test framework compile. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/802//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/802//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/802//console This message is automatically generated. > If the include hosts list contains host name, after restarting namenode, > datanodes registrant is denied > > > Key: HDFS-2086 > URL: https://issues.apache.org/jira/browse/HDFS-2086 > Project: Hadoop HDFS > Issue Type: Bug > Components: name-node >Affects Versions: 0.23.0 >Reporter: Tanping Wang >Assignee: Tanping Wang > Fix For: 0.23.0 > > Attachments: HDFS-2086.patch > > > As the title describes the problem: if the include host list contains host > name, after restarting namenodes, the datanodes registrant is denied by > namenodes. 
This is because after namenode is restarted, the still alive data > node will try to register itself with the namenode and it identifies itself > with its *IP address*. However, namenode only allows all the hosts in its > hosts list to registrant and all of them are hostnames. So namenode would > deny the datanode registration. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-2087) Add methods to DataTransferProtocol interface
[ https://issues.apache.org/jira/browse/HDFS-2087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz Wo (Nicholas), SZE updated HDFS-2087: - Attachment: h2087_20110620.patch h2087_20110620.patch: added {{readBlock(..)}} only for illustrating the idea. > Add methods to DataTransferProtocol interface > - > > Key: HDFS-2087 > URL: https://issues.apache.org/jira/browse/HDFS-2087 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: data-node, hdfs client >Reporter: Tsz Wo (Nicholas), SZE >Assignee: Tsz Wo (Nicholas), SZE > Attachments: h2087_20110620.patch > > > The {{DataTransferProtocol}} interface is currently empty. The {{Sender}} > and {{Receiver}} define similar methods individually. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-2077) 1073: address checkpoint upload when one of the storage dirs is failed
[ https://issues.apache.org/jira/browse/HDFS-2077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13052305#comment-13052305 ] Todd Lipcon commented on HDFS-2077: --- just found a good bug in this while doing some fault testing... in reportErrorOnFile, it will mis-ascribe an error sometimes if one namenode directory is a prefix of the other... eg if the storage dirs are /data/name and /data/name2, it will ascribe an error in /data/name2/... to /data/name. > 1073: address checkpoint upload when one of the storage dirs is failed > -- > > Key: HDFS-2077 > URL: https://issues.apache.org/jira/browse/HDFS-2077 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: name-node >Affects Versions: Edit log branch (HDFS-1073) >Reporter: Todd Lipcon >Assignee: Todd Lipcon > Fix For: Edit log branch (HDFS-1073) > > Attachments: hdfs-2077.txt > > > This JIRA addresses the following case: > - NN is running with 2 storage dirs > - 1 of the dirs fails > - 2NN makes a checkpoint > Currently, if GetImageServlet fails to open _any_ of the local files to > receive a checkpoint, it will fail the entire checkpoint upload process. > Instead, it should continue to receive checkpoints in the non-failed > directories. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
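The prefix pitfall Todd describes can be sketched as follows; the helper names here are hypothetical, not the actual reportErrorOnFile code. A plain String.startsWith match ascribes a file under /data/name2 to /data/name, while appending the path separator before comparing requires a full path-component match.

```java
import java.io.File;

public class StorageDirMatcher {
    // Buggy form: "/data/name2/current/edits".startsWith("/data/name") is
    // true, so the error would be ascribed to the wrong storage directory.
    static boolean naiveContains(File file, File dir) {
        return file.getAbsolutePath().startsWith(dir.getAbsolutePath());
    }

    // Fixed form: require the prefix to end at a path-component boundary.
    static boolean contains(File file, File dir) {
        String dirPath = dir.getAbsolutePath();
        if (!dirPath.endsWith(File.separator)) {
            dirPath += File.separator;
        }
        return file.getAbsolutePath().startsWith(dirPath);
    }
}
```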
[jira] [Commented] (HDFS-2082) SecondaryNameNode web interface doesn't show the right info
[ https://issues.apache.org/jira/browse/HDFS-2082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13052303#comment-13052303 ] Aaron T. Myers commented on HDFS-2082: -- I'm pretty confident this test failure was spurious. The test just passed locally on my box. Curiously, I've never seen {{TestSetTimes}} fail though. > SecondaryNameNode web interface doesn't show the right info > --- > > Key: HDFS-2082 > URL: https://issues.apache.org/jira/browse/HDFS-2082 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 0.22.0, 0.23.0 >Reporter: Aaron T. Myers >Assignee: Aaron T. Myers > Fix For: 0.23.0 > > Attachments: hdfs-2082.0.patch, hdfs-2082.1.patch, hdfs-2082.2.patch, > hdfs-2082.3.patch > > > HADOOP-3741 introduced some useful info to the 2NN web UI. This broke when > security was added. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-2082) SecondaryNameNode web interface doesn't show the right info
[ https://issues.apache.org/jira/browse/HDFS-2082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13052299#comment-13052299 ] Hadoop QA commented on HDFS-2082: - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12483014/hdfs-2082.3.patch against trunk revision 1137675. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 2 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these core unit tests: org.apache.hadoop.hdfs.TestSetTimes +1 contrib tests. The patch passed contrib unit tests. +1 system test framework. The patch passed system test framework compile. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/801//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/801//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/801//console This message is automatically generated. > SecondaryNameNode web interface doesn't show the right info > --- > > Key: HDFS-2082 > URL: https://issues.apache.org/jira/browse/HDFS-2082 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 0.22.0, 0.23.0 >Reporter: Aaron T. Myers >Assignee: Aaron T. Myers > Fix For: 0.23.0 > > Attachments: hdfs-2082.0.patch, hdfs-2082.1.patch, hdfs-2082.2.patch, > hdfs-2082.3.patch > > > HADOOP-3741 introduced some useful info to the 2NN web UI. This broke when > security was added. -- This message is automatically generated by JIRA. 
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-2086) If the include hosts list contains host name, after restarting namenode, datanodes registrant is denied
[ https://issues.apache.org/jira/browse/HDFS-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13052297#comment-13052297 ] Jitendra Nath Pandey commented on HDFS-2086: 1. inHostsList and inExcludeHostsList do the same thing on two different lists. Both can use a single method that also takes the list as an argument. 2. Do we really need to look into hostsList for both node.getName and iaddr.getHostName? I understand node.getName may actually be returning the ip:port, but for IP iaddr.getHostAddress is more reliable. Caveat with the latter approach: Can we assume ipAddr and node (DatanodeID) will always be for the same host? Minor: Indentation in checkIncludeListForDead. > If the include hosts list contains host name, after restarting namenode, > datanodes registrant is denied > > > Key: HDFS-2086 > URL: https://issues.apache.org/jira/browse/HDFS-2086 > Project: Hadoop HDFS > Issue Type: Bug > Components: name-node >Affects Versions: 0.23.0 >Reporter: Tanping Wang >Assignee: Tanping Wang > Fix For: 0.23.0 > > Attachments: HDFS-2086.patch > > > As the title describes the problem: if the include host list contains host > name, after restarting namenodes, the datanodes registrant is denied by > namenodes. This is because after namenode is restarted, the still alive data > node will try to register itself with the namenode and it identifies itself > with its *IP address*. However, namenode only allows all the hosts in its > hosts list to registrant and all of them are hostnames. So namenode would > deny the datanode registration. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
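Review point 1 could look roughly like the following; the signature is an assumption for illustration, not the actual patch. A single list-parameterized check replaces the two near-identical methods, and checking both the hostname and the IP form also touches on point 2's question about which identity a datanode presents.

```java
import java.util.Set;

public class HostListCheck {
    // One method serves both the include and the exclude list; the caller
    // passes whichever list it cares about. Matching either the hostname or
    // the IP makes the check robust to how the node identified itself.
    static boolean inList(Set<String> hostList, String hostName, String ipAddr) {
        return hostList.contains(hostName) || hostList.contains(ipAddr);
    }
}
```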
[jira] [Updated] (HDFS-2086) If the include hosts list contains host name, after restarting namenode, datanodes registrant is denied
[ https://issues.apache.org/jira/browse/HDFS-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jitendra Nath Pandey updated HDFS-2086: --- Status: Patch Available (was: Open) > If the include hosts list contains host name, after restarting namenode, > datanodes registrant is denied > > > Key: HDFS-2086 > URL: https://issues.apache.org/jira/browse/HDFS-2086 > Project: Hadoop HDFS > Issue Type: Bug > Components: name-node >Affects Versions: 0.23.0 >Reporter: Tanping Wang >Assignee: Tanping Wang > Fix For: 0.23.0 > > Attachments: HDFS-2086.patch > > > As the title describes the problem: if the include host list contains host > name, after restarting namenodes, the datanodes registrant is denied by > namenodes. This is because after namenode is restarted, the still alive data > node will try to register itself with the namenode and it identifies itself > with its *IP address*. However, namenode only allows all the hosts in its > hosts list to registrant and all of them are hostnames. So namenode would > deny the datanode registration. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (HDFS-2089) new hadoop-config.sh doesn't manage classpath for HADOOP_CONF_DIR correctly
new hadoop-config.sh doesn't manage classpath for HADOOP_CONF_DIR correctly --- Key: HDFS-2089 URL: https://issues.apache.org/jira/browse/HDFS-2089 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 0.23.0 Reporter: Todd Lipcon Fix For: 0.23.0 Since the introduction of the RPM packages, hadoop-config.sh incorrectly puts $HADOOP_HDFS_HOME/conf on the classpath regardless of whether HADOOP_CONF_DIR is already defined in the environment. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-2086) If the include hosts list contains host name, after restarting namenode, datanodes registrant is denied
[ https://issues.apache.org/jira/browse/HDFS-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tanping Wang updated HDFS-2086: --- Attachment: HDFS-2086.patch > If the include hosts list contains host name, after restarting namenode, > datanodes registrant is denied > > > Key: HDFS-2086 > URL: https://issues.apache.org/jira/browse/HDFS-2086 > Project: Hadoop HDFS > Issue Type: Bug > Components: name-node >Affects Versions: 0.23.0 >Reporter: Tanping Wang >Assignee: Tanping Wang > Fix For: 0.23.0 > > Attachments: HDFS-2086.patch > > > As the title describes the problem: if the include host list contains host > name, after restarting namenodes, the datanodes registrant is denied by > namenodes. This is because after namenode is restarted, the still alive data > node will try to register itself with the namenode and it identifies itself > with its *IP address*. However, namenode only allows all the hosts in its > hosts list to registrant and all of them are hostnames. So namenode would > deny the datanode registration. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-2088) Move edits log archiving logic into FSEditLog/JournalManager
[ https://issues.apache.org/jira/browse/HDFS-2088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13052265#comment-13052265 ] Todd Lipcon commented on HDFS-2088: --- btw, the above patch sequences after the following: HDFS-2074, HDFS-2085, HDFS-2026, HDFS-2077, HDFS-2078. > Move edits log archiving logic into FSEditLog/JournalManager > > > Key: HDFS-2088 > URL: https://issues.apache.org/jira/browse/HDFS-2088 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: name-node >Affects Versions: Edit log branch (HDFS-1073) >Reporter: Todd Lipcon >Assignee: Todd Lipcon > Fix For: Edit log branch (HDFS-1073) > > Attachments: hdfs-2088.txt > > > Currently the logic to archive edits logs is File-specific which presents > some issues for Ivan's work. Since it relies on inspecting storage > directories using NNStorage.inspectStorageDirs, it also misses directories > that the image layer considers "failed" which results in edits logs piling up > in these kinds of directories. This JIRA is similar to HDFS-2018 but only > deals with archival for now. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-2088) Move edits log archiving logic into FSEditLog/JournalManager
[ https://issues.apache.org/jira/browse/HDFS-2088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon updated HDFS-2088: -- Attachment: hdfs-2088.txt Patch does the following: - once StorageArchivalManager determines the minimum txid that needs to be retained, it simply passes it along to FSEditLog.archiveLogsOlderThan. - FSEditLog now propagates this through to all of the journal managers - refactors some code in FSImageTransactionalStorageInspector into a static method {{matchEditLogs}} so that FileJournalManager can share it. This will eventually move into FileJournalManager itself like Ivan did in HDFS-2018, once the load-time stuff gets split up. - adds a functional test to show that edits logs keep getting archived in an edits directory even if it's considered "failed" as an image directory > Move edits log archiving logic into FSEditLog/JournalManager > > > Key: HDFS-2088 > URL: https://issues.apache.org/jira/browse/HDFS-2088 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: name-node >Affects Versions: Edit log branch (HDFS-1073) >Reporter: Todd Lipcon >Assignee: Todd Lipcon > Fix For: Edit log branch (HDFS-1073) > > Attachments: hdfs-2088.txt > > > Currently the logic to archive edits logs is File-specific which presents > some issues for Ivan's work. Since it relies on inspecting storage > directories using NNStorage.inspectStorageDirs, it also misses directories > that the image layer considers "failed" which results in edits logs piling up > in these kinds of directories. This JIRA is similar to HDFS-2018 but only > deals with archival for now. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
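The flow described in the patch summary can be sketched as below; the JournalManager interface and signatures are simplified assumptions for illustration, not the patch itself.

```java
import java.util.ArrayList;
import java.util.List;

public class ArchivalSketch {
    interface JournalManager {
        // Each manager discards finalized logs lying entirely below this txid.
        void purgeLogsOlderThan(long minTxIdToKeep);
    }

    static class FSEditLog {
        private final List<JournalManager> journals = new ArrayList<>();

        void addJournal(JournalManager jm) { journals.add(jm); }

        // The archival manager computes the minimum txid to retain and simply
        // hands it to the edit log, which fans it out to every journal
        // manager -- including ones whose directory the image layer
        // considers failed.
        void archiveLogsOlderThan(long minTxIdToKeep) {
            for (JournalManager jm : journals) {
                jm.purgeLogsOlderThan(minTxIdToKeep);
            }
        }
    }
}
```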
[jira] [Updated] (HDFS-2088) Move edits log archiving logic into FSEditLog/JournalManager
[ https://issues.apache.org/jira/browse/HDFS-2088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon updated HDFS-2088: -- Component/s: name-node Description: Currently the logic to archive edits logs is File-specific which presents some issues for Ivan's work. Since it relies on inspecting storage directories using NNStorage.inspectStorageDirs, it also misses directories that the image layer considers "failed" which results in edits logs piling up in these kinds of directories. This JIRA is similar to HDFS-2018 but only deals with archival for now. Affects Version/s: Edit log branch (HDFS-1073) Fix Version/s: Edit log branch (HDFS-1073) > Move edits log archiving logic into FSEditLog/JournalManager > > > Key: HDFS-2088 > URL: https://issues.apache.org/jira/browse/HDFS-2088 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: name-node >Affects Versions: Edit log branch (HDFS-1073) >Reporter: Todd Lipcon >Assignee: Todd Lipcon > Fix For: Edit log branch (HDFS-1073) > > > Currently the logic to archive edits logs is File-specific which presents > some issues for Ivan's work. Since it relies on inspecting storage > directories using NNStorage.inspectStorageDirs, it also misses directories > that the image layer considers "failed" which results in edits logs piling up > in these kinds of directories. This JIRA is similar to HDFS-2018 but only > deals with archival for now. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-2018) Move all journal stream management code into one place
[ https://issues.apache.org/jira/browse/HDFS-2018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13052262#comment-13052262 ] Todd Lipcon commented on HDFS-2018: --- Some comments on this patch: - general idea seems right... - I think some of it overlaps with HDFS-2085 and HDFS-2074, which are awaiting review. Can you take a look at those? - for getEditLogManifest I think you need to support the case that different journal managers will have different sets of logs, but we need to be able to transfer all of them. ie imagine the case with two edits directories where one fails, comes back, then the other fails. In that case you need to interleave copying txns from both of them when transferring edits to the 2NN. - I just opened HDFS-2088 and about to put a patch up there in a few minutes. That deals with the archiving logic and makes some similar changes (eg refactoring some stuff out of FSImageTransactionalStorageInspector into FileJournalManager) Let me see if I can merge some of your work into my branch -- sorry that I'm a few patches ahead of what's been committed. > Move all journal stream management code into one place > -- > > Key: HDFS-2018 > URL: https://issues.apache.org/jira/browse/HDFS-2018 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Ivan Kelly >Assignee: Ivan Kelly > Fix For: Edit log branch (HDFS-1073) > > Attachments: HDFS-2018.diff, HDFS-2018.diff, HDFS-2018.diff > > > Currently in the HDFS-1073 branch, the code for creating output streams is in > FileJournalManager and the code for input streams is in the inspectors. This > change does a number of things. > - Input and Output streams are now created by the JournalManager. > - FSImageStorageInspectors now deals with URIs when referring to edit logs > - Recovery of inprogress logs is performed by counting the number of > transactions instead of looking at the length of the file. 
> The patch for this applies on top of the HDFS-1073 branch + HDFS-2003 patch. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
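The interleaving concern in the review above can be illustrated with a toy merge; all names here are invented, not the HDFS-2018 code. If one edits directory fails, recovers, and then the other fails, no single directory holds every segment, so a manifest has to pick contiguous segments from whichever journal has them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class ManifestMerge {
    // A finalized log segment, identified by its transaction-id range.
    static final class Segment {
        final long firstTxId, lastTxId;
        Segment(long firstTxId, long lastTxId) {
            this.firstTxId = firstTxId;
            this.lastTxId = lastTxId;
        }
    }

    // Build one manifest from several journals' segment lists: index every
    // segment by its starting txid, then walk forward taking only segments
    // that continue the range contiguously.
    static List<Segment> mergeManifests(List<List<Segment>> perJournal, long fromTxId) {
        TreeMap<Long, Segment> byStart = new TreeMap<>();
        for (List<Segment> segs : perJournal) {
            for (Segment s : segs) {
                byStart.putIfAbsent(s.firstTxId, s);
            }
        }
        List<Segment> out = new ArrayList<>();
        long next = fromTxId;
        for (Segment s : byStart.tailMap(next).values()) {
            if (s.firstTxId == next) {  // contiguous coverage only
                out.add(s);
                next = s.lastTxId + 1;
            }
        }
        return out;
    }
}
```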
[jira] [Created] (HDFS-2088) Move edits log archiving logic into FSEditLog/JournalManager
Move edits log archiving logic into FSEditLog/JournalManager Key: HDFS-2088 URL: https://issues.apache.org/jira/browse/HDFS-2088 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Todd Lipcon Assignee: Todd Lipcon -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-2083) Adopt JMXJsonServlet into HDFS in order to query statistics
[ https://issues.apache.org/jira/browse/HDFS-2083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13052258#comment-13052258 ] Suresh Srinivas commented on HDFS-2083: --- # Minor: Make this a static string? "/jmx?qry=Hadoop:service=NameNode,name=NameNodeInfo" # Why do you need to create a new string in readOutput() {{out.append(new String(buffer, 0, len));}} # Every time you need a property, you are querying the mbean. Can you do it only once and hold on to the response? # NamenodeMXBeanHelper javadoc needs to be updated (It still talks about JMX; also look at other references that talk about JMX access) # Stream from URLConnection needs to be closed. > Adopt JMXJsonServlet into HDFS in order to query statistics > --- > > Key: HDFS-2083 > URL: https://issues.apache.org/jira/browse/HDFS-2083 > Project: Hadoop HDFS > Issue Type: New Feature >Affects Versions: 0.23.0 >Reporter: Tanping Wang >Assignee: Tanping Wang > Fix For: 0.23.0 > > Attachments: HDFS-2083.patch > > > HADOOP-7144 added JMXJsonServlet into Common. It gives the capability to > query statistics and metrics exposed via JMX to be queried through HTTP. We > adopt this into HDFS. This provides the alternative solution to HDFS-1874. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
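Review points 2 and 5 could be addressed along these lines; this is a hypothetical sketch, not the attached patch. Taking a Reader lets the URLConnection stream be wrapped by the caller and guarantees it is closed.

```java
import java.io.IOException;
import java.io.Reader;

public class JmxResponseReader {
    // Append characters straight from the buffer -- no intermediate
    // new String(buffer, 0, len) per read (point 2) -- and close the
    // stream in finally so it is released even on error (point 5).
    static String readOutput(Reader in) throws IOException {
        StringBuilder out = new StringBuilder();
        try {
            char[] buffer = new char[4096];
            int len;
            while ((len = in.read(buffer)) > 0) {
                out.append(buffer, 0, len);
            }
        } finally {
            in.close();
        }
        return out.toString();
    }
}
```

Point 3 then amounts to calling readOutput once per page and parsing every needed property out of that single cached response string.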
[jira] [Commented] (HDFS-2084) Sometimes backup node/secondary name node stops with exception
[ https://issues.apache.org/jira/browse/HDFS-2084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13052248#comment-13052248 ] Konstantin Shvachko commented on HDFS-2084: --- Looks like the checkpoint contains a record which tries to set time on a non-existing file. This should not happen. So the question is how did it happen? If it's a bug we should fix the cause. I don't see how, but if it's a legal scenario, then we can suppress NPE as you suggest. > Sometimes backup node/secondary name node stops with exception > -- > > Key: HDFS-2084 > URL: https://issues.apache.org/jira/browse/HDFS-2084 > Project: Hadoop HDFS > Issue Type: Bug > Components: name-node >Affects Versions: 0.21.0 > Environment: FreeBSD >Reporter: Vitalii Tymchyshyn > Attachments: patch.diff > > > 2011-06-17 11:43:23,096 ERROR > org.apache.hadoop.hdfs.server.namenode.Checkpointer: Throwable Exception in > doCheckpoint: > java.lang.NullPointerException > at > org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedSetTimes(FSDirectory.java:1765) > at > org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedSetTimes(FSDirectory.java:1753) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadEditRecords(FSEditLog.java:708) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:411) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:378) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1209) > at > org.apache.hadoop.hdfs.server.namenode.BackupStorage.loadCheckpoint(BackupStorage.java:158) > at > org.apache.hadoop.hdfs.server.namenode.Checkpointer.doCheckpoint(Checkpointer.java:243) > at > org.apache.hadoop.hdfs.server.namenode.Checkpointer.run(Checkpointer.java:141) -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (HDFS-2087) Add methods to DataTransferProtocol interface
Add methods to DataTransferProtocol interface - Key: HDFS-2087 URL: https://issues.apache.org/jira/browse/HDFS-2087 Project: Hadoop HDFS Issue Type: Sub-task Components: data-node, hdfs client Reporter: Tsz Wo (Nicholas), SZE Assignee: Tsz Wo (Nicholas), SZE The {{DataTransferProtocol}} interface is currently empty. The {{Sender}} and {{Receiver}} define similar methods individually. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-2086) If the include hosts list contains host name, after restarting namenode, datanodes registrant is denied
[ https://issues.apache.org/jira/browse/HDFS-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13052244#comment-13052244 ] Tanping Wang commented on HDFS-2086: There is a second part to this problem. When the namenode checks datanode status by calling FSNameSystem#getDatanodeListForReport, it goes over its include hosts list and tries to determine whether all the hosts in the include list are registered. If not, the host is added to the dead list. In our case, as mentioned earlier, after the namenode restarted, the datanode registers itself with its *IP address*. But the include list still contains its *host name*. So the host is not recognized by the namenode and is added to the dead list when the namenode reports datanode status. > If the include hosts list contains host name, after restarting namenode, > datanodes registrant is denied > > > Key: HDFS-2086 > URL: https://issues.apache.org/jira/browse/HDFS-2086 > Project: Hadoop HDFS > Issue Type: Bug > Components: name-node >Affects Versions: 0.23.0 >Reporter: Tanping Wang > Fix For: 0.23.0 > > > As the tile describes the problem: if the include host list contains host > name, after restarting namenodes, the datanodes registrant is denied by > namenodes. This is because after namenode is restarted, the still alive data > node will try to register itself with the namenode and it identifies itself > with its *IP address*. However, namenode only allows all the hosts in its > hosts list to registrant and all of them are hostnames. So namenode would > deny the datanode registration. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
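The mismatch can be illustrated by normalizing both sides to an address before comparing; this is an illustrative sketch with invented names, not the attached patch.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class HostIdentity {
    // The include list may hold a hostname while a re-registering datanode
    // presents its IP address, so an exact string compare fails. Resolving
    // both sides to an IP makes the comparison stable either way.
    static boolean sameHost(String includeEntry, String datanodeAddr) {
        try {
            return InetAddress.getByName(includeEntry).getHostAddress()
                .equals(InetAddress.getByName(datanodeAddr).getHostAddress());
        } catch (UnknownHostException e) {
            return false; // an unresolvable entry never matches
        }
    }
}
```

Note that this trades a string compare for a possible DNS lookup, which is why caching or resolving the list once at load time would matter in practice.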
[jira] [Updated] (HDFS-2086) If the include hosts list contains host name, after restarting namenode, datanodes registrant is denied
[ https://issues.apache.org/jira/browse/HDFS-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tanping Wang updated HDFS-2086: --- Description: As the title describes the problem: if the include host list contains host name, after restarting namenodes, the datanodes registrant is denied by namenodes. This is because after namenode is restarted, the still alive data node will try to register itself with the namenode and it identifies itself with its *IP address*. However, namenode only allows all the hosts in its hosts list to registrant and all of them are hostnames. So namenode would deny the datanode registration. was: As the tile describes the problem: if the include host list contains host name, after restarting namenodes, the datanodes registrant is denied by namenodes. This is because after namenode is restarted, the still alive data node will try to register itself with the namenode and it identifies itself with its *IP address*. However, namenode only allows all the hosts in its hosts list to registrant and all of them are hostnames. So namenode would deny the datanode registration. > If the include hosts list contains host name, after restarting namenode, > datanodes registrant is denied > > > Key: HDFS-2086 > URL: https://issues.apache.org/jira/browse/HDFS-2086 > Project: Hadoop HDFS > Issue Type: Bug > Components: name-node >Affects Versions: 0.23.0 >Reporter: Tanping Wang >Assignee: Tanping Wang > Fix For: 0.23.0 > > > As the title describes the problem: if the include host list contains host > name, after restarting namenodes, the datanodes registrant is denied by > namenodes. This is because after namenode is restarted, the still alive data > node will try to register itself with the namenode and it identifies itself > with its *IP address*. However, namenode only allows all the hosts in its > hosts list to registrant and all of them are hostnames. So namenode would > deny the datanode registration. 
-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-2086) If the include hosts list contains host name, after restarting namenode, datanodes registrant is denied
[ https://issues.apache.org/jira/browse/HDFS-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tanping Wang updated HDFS-2086: --- Component/s: name-node Affects Version/s: 0.23.0 Fix Version/s: 0.23.0 Assignee: Tanping Wang > If the include hosts list contains host name, after restarting namenode, > datanodes registrant is denied > > > Key: HDFS-2086 > URL: https://issues.apache.org/jira/browse/HDFS-2086 > Project: Hadoop HDFS > Issue Type: Bug > Components: name-node >Affects Versions: 0.23.0 >Reporter: Tanping Wang >Assignee: Tanping Wang > Fix For: 0.23.0 > > > As the title describes the problem: if the include host list contains host > names, then after the namenode is restarted, datanode registration is denied. > This is because, after the namenode restarts, each still-alive datanode tries > to register itself with the namenode and identifies itself by its *IP > address*. However, the namenode only admits hosts that appear in its include > list, and those entries are all hostnames. So the namenode denies the > datanode's registration. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (HDFS-2086) If the include hosts list contains host name, after restarting namenode, datanodes registrant is denied
If the include hosts list contains host name, after restarting namenode, datanodes registrant is denied Key: HDFS-2086 URL: https://issues.apache.org/jira/browse/HDFS-2086 Project: Hadoop HDFS Issue Type: Bug Reporter: Tanping Wang As the title describes the problem: if the include host list contains host names, then after the namenode is restarted, datanode registration is denied. This is because, after the namenode restarts, each still-alive datanode tries to register itself with the namenode and identifies itself by its *IP address*. However, the namenode only admits hosts that appear in its include list, and those entries are all hostnames. So the namenode denies the datanode's registration. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
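A fix along the lines the report implies would normalize both forms before the include-list membership check, so a datanode that registers by IP still matches a hostname entry. The sketch below is purely illustrative (Python for brevity; `is_allowed` and `resolve` are hypothetical names, not HDFS APIs):

```python
def is_allowed(node_addr, include_list, resolve):
    """Check whether a registering node is admitted by the include list.

    node_addr may be an IP or a hostname; include_list entries likewise.
    resolve maps a hostname to its IP (identity for unknown names), standing
    in for a DNS lookup.
    """
    # An empty include list means no restriction.
    if not include_list:
        return True
    # Compare on both the raw form and its resolved counterpart, so a
    # datanode identifying itself by IP still matches a hostname entry.
    candidates = {node_addr, resolve(node_addr)}
    return any(entry in candidates or resolve(entry) in candidates
               for entry in include_list)
```

With a hypothetical mapping `{"dn1.example.com": "10.0.0.5"}`, a datanode registering as `10.0.0.5` is admitted by an include list containing only `dn1.example.com`, which is exactly the case that fails after a namenode restart.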
[jira] [Commented] (HDFS-2018) Move all journal stream management code into one place
[ https://issues.apache.org/jira/browse/HDFS-2018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13052183#comment-13052183 ] Hadoop QA commented on HDFS-2018: - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12483194/HDFS-2018.diff against trunk revision 1137675. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 20 new or modified tests. -1 patch. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/800//console This message is automatically generated. > Move all journal stream management code into one place > -- > > Key: HDFS-2018 > URL: https://issues.apache.org/jira/browse/HDFS-2018 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Ivan Kelly >Assignee: Ivan Kelly > Fix For: Edit log branch (HDFS-1073) > > Attachments: HDFS-2018.diff, HDFS-2018.diff, HDFS-2018.diff > > > Currently in the HDFS-1073 branch, the code for creating output streams is in > FileJournalManager and the code for input streams is in the inspectors. This > change does a number of things. > - Input and Output streams are now created by the JournalManager. > - FSImageStorageInspectors now deals with URIs when referring to edit logs > - Recovery of inprogress logs is performed by counting the number of > transactions instead of looking at the length of the file. > The patch for this applies on top of the HDFS-1073 branch + HDFS-2003 patch. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-2018) Move all journal stream management code into one place
[ https://issues.apache.org/jira/browse/HDFS-2018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13052180#comment-13052180 ] Ivan Kelly commented on HDFS-2018: -- Forgot to mention, the cleanup I plan to do is to remove LoadPlan as it's not needed anymore. Also testing, as this needs a good few tests to verify the functionality. > Move all journal stream management code into one place > -- > > Key: HDFS-2018 > URL: https://issues.apache.org/jira/browse/HDFS-2018 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Ivan Kelly >Assignee: Ivan Kelly > Fix For: Edit log branch (HDFS-1073) > > Attachments: HDFS-2018.diff, HDFS-2018.diff, HDFS-2018.diff > > > Currently in the HDFS-1073 branch, the code for creating output streams is in > FileJournalManager and the code for input streams is in the inspectors. This > change does a number of things. > - Input and Output streams are now created by the JournalManager. > - FSImageStorageInspectors now deals with URIs when referring to edit logs > - Recovery of inprogress logs is performed by counting the number of > transactions instead of looking at the length of the file. > The patch for this applies on top of the HDFS-1073 branch + HDFS-2003 patch. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-2018) Move all journal stream management code into one place
[ https://issues.apache.org/jira/browse/HDFS-2018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan Kelly updated HDFS-2018: - Attachment: HDFS-2018.diff Another rough patch, which I'll clean up tomorrow. In this patch, multiple journal streams can be used in loading. PreTransaction stuff is confined solely to FSImage; JournalManagers never know about it. Also, file and image loading are completely separate now. > Move all journal stream management code into one place > -- > > Key: HDFS-2018 > URL: https://issues.apache.org/jira/browse/HDFS-2018 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Ivan Kelly >Assignee: Ivan Kelly > Fix For: Edit log branch (HDFS-1073) > > Attachments: HDFS-2018.diff, HDFS-2018.diff, HDFS-2018.diff > > > Currently in the HDFS-1073 branch, the code for creating output streams is in > FileJournalManager and the code for input streams is in the inspectors. This > change does a number of things. > - Input and Output streams are now created by the JournalManager. > - FSImageStorageInspectors now deals with URIs when referring to edit logs > - Recovery of inprogress logs is performed by counting the number of > transactions instead of looking at the length of the file. > The patch for this applies on top of the HDFS-1073 branch + HDFS-2003 patch. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
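The refactoring HDFS-2018 describes, with both stream types owned by a single JournalManager and log files named by transaction range, might look roughly like the following sketch. This is Python for brevity (the real code is Java) and the method and file-name conventions are taken from the issue description, not from the actual patch:

```python
import os


class FileJournalManager:
    """Illustrative sketch: one class owns the creation of both edit-log
    output streams and input streams, instead of splitting them between
    FileJournalManager and the storage inspectors."""

    def __init__(self, storage_dir):
        self.storage_dir = storage_dir

    def start_log_segment(self, first_txid):
        # New segments are written under an "inprogress" name.
        path = os.path.join(self.storage_dir,
                            "edits_%d_inprogress" % first_txid)
        return open(path, "wb")

    def get_input_stream(self, first_txid, last_txid):
        # Finalized segments are named by their transaction range.
        path = os.path.join(self.storage_dir,
                            "edits_%d-%d" % (first_txid, last_txid))
        return open(path, "rb")
```

The point of the design is that callers only ever ask the JournalManager for a stream; how segments are laid out on disk (or in another journal implementation entirely) stays hidden behind it.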
[jira] [Commented] (HDFS-941) Datanode xceiver protocol should allow reuse of a connection
[ https://issues.apache.org/jira/browse/HDFS-941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13052166#comment-13052166 ] Konstantin Shvachko commented on HDFS-941: -- Answers to some issues raised here: Stack> RM says what's in a release and no one else. We can still talk about the technical merits of the implementation, can't we? Todd> nrFiles <= nrNodes means full locality, right? No. In DFSIO there is no locality, since the files that DFSIO reads/writes are not the input of the MR job. Their names are. The reason here is to make sure the job completes in one wave of mappers, and to minimize contention on the drives between tasks. I was trying to avoid making this issue yet another discussion about DFSIO, because the objective here is to verify that the patch does not introduce a performance regression for sequential IOs. If the benchmark I proposed doesn't work for you guys, you can propose a different one. Dhruba, Todd, Nicholas> TestDFSIO exhibits very high variance, and its results are dependent on mapreduce's scheduling. DFSIO does not depend on the MR scheduling. It depends on the OS memory cache. Cluster nodes these days run with 16, 32 GB RAM. So a 10GB file can be almost entirely cached by the OS. When you repeatedly run DFSIO, you are not measuring cold IO but RAM access and communication. And the high variation is explained by the fact that some data is cached and some is not. For example DFSIO -write is usually very stable with std.dev < 1. This is because it deals with cold writes. For DFSIO -read you need to choose a file size larger than your RAM. With sequential reads the OS cache works as an LRU, so if your file is larger than RAM, the OS cache will "forget" blocks from the head of the file by the time you get to reading the tail. And when you start reading the file again, the cache will release the oldest pages, which correspond to the higher offsets in the file. So it is going to be a cold read. 
I had to go to 100GB files, which brought std.dev to < 2, and variation in throughput was around 3%. Alternatively you can clear the Linux cache on all DataNodes. Nicholas> it is hard to explain what the "Throughput" and "Average IO rate" really mean. [This post|http://old.nabble.com/Re%3A-TestDFSIO-delivers-bad-values-of-%22throughput%22-and-%22average-IO-rate%22-p21322404.html] has the definitions. Nicholas, I agree with you that the results you are posting don't make sense. The point, though, is not to dismiss the benchmark but to find the conditions under which it reliably measures what you need. > Datanode xceiver protocol should allow reuse of a connection > > > Key: HDFS-941 > URL: https://issues.apache.org/jira/browse/HDFS-941 > Project: Hadoop HDFS > Issue Type: Improvement > Components: data-node, hdfs client >Affects Versions: 0.22.0 >Reporter: Todd Lipcon >Assignee: bc Wong > Fix For: 0.22.0 > > Attachments: 941.22.txt, 941.22.txt, 941.22.v2.txt, 941.22.v3.txt, > HDFS-941-1.patch, HDFS-941-2.patch, HDFS-941-3.patch, HDFS-941-3.patch, > HDFS-941-4.patch, HDFS-941-5.patch, HDFS-941-6.22.patch, HDFS-941-6.patch, > HDFS-941-6.patch, HDFS-941-6.patch, fix-close-delta.txt, hdfs-941.txt, > hdfs-941.txt, hdfs-941.txt, hdfs-941.txt, hdfs941-1.png > > > Right now each connection into the datanode xceiver only processes one > operation. > In the case that an operation leaves the stream in a well-defined state (eg a > client reads to the end of a block successfully) the same connection could be > reused for a second operation. This should improve random read performance > significantly. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-941) Datanode xceiver protocol should allow reuse of a connection
[ https://issues.apache.org/jira/browse/HDFS-941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13052160#comment-13052160 ] Konstantin Shvachko commented on HDFS-941: -- I ran some tests myself over the weekend. The results are good. I am getting throughput of around 75-78 MB/sec on reads with a small (< 2) std. deviation in both cases. So I am +1 now on this patch. > Datanode xceiver protocol should allow reuse of a connection > > > Key: HDFS-941 > URL: https://issues.apache.org/jira/browse/HDFS-941 > Project: Hadoop HDFS > Issue Type: Improvement > Components: data-node, hdfs client >Affects Versions: 0.22.0 >Reporter: Todd Lipcon >Assignee: bc Wong > Fix For: 0.22.0 > > Attachments: 941.22.txt, 941.22.txt, 941.22.v2.txt, 941.22.v3.txt, > HDFS-941-1.patch, HDFS-941-2.patch, HDFS-941-3.patch, HDFS-941-3.patch, > HDFS-941-4.patch, HDFS-941-5.patch, HDFS-941-6.22.patch, HDFS-941-6.patch, > HDFS-941-6.patch, HDFS-941-6.patch, fix-close-delta.txt, hdfs-941.txt, > hdfs-941.txt, hdfs-941.txt, hdfs-941.txt, hdfs941-1.png > > > Right now each connection into the datanode xceiver only processes one > operation. > In the case that an operation leaves the stream in a well-defined state (eg a > client reads to the end of a block successfully) the same connection could be > reused for a second operation. This should improve random read performance > significantly. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-2085) 1073: finalize inprogress edit logs at startup
[ https://issues.apache.org/jira/browse/HDFS-2085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon updated HDFS-2085: -- Attachment: hdfs-2085.txt > 1073: finalize inprogress edit logs at startup > -- > > Key: HDFS-2085 > URL: https://issues.apache.org/jira/browse/HDFS-2085 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: name-node >Affects Versions: Edit log branch (HDFS-1073) >Reporter: Todd Lipcon >Assignee: Todd Lipcon > Fix For: Edit log branch (HDFS-1073) > > Attachments: hdfs-2085.txt > > > With HDFS-2074, the NameNode can read through any "in-progress" logs it finds > during startup to determine how many transactions they have. It can then > re-name the file from its inprogress name to its finalized name. For example, > if it finds a file edits_10_inprogress with 3 transactions, it can rename it > to edits_10-12 at startup. This means that other parts of the system like > edits-log-transfer don't need to worry about in-progress logs. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
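The startup-finalization step HDFS-2085 describes, counting the transactions in an in-progress log and renaming it to its range-based name, can be sketched as follows. This is an illustration only (Python, not the branch's Java code); `count_transactions` stands in for the HDFS-2074 log-scanning logic:

```python
import os
import re


def finalize_inprogress(log_dir, count_transactions):
    """Rename each edits_<N>_inprogress file to edits_<N>-<M>, where
    M = N + transaction_count - 1, matching the example in the issue
    (edits_10_inprogress with 3 transactions becomes edits_10-12)."""
    for name in os.listdir(log_dir):
        m = re.match(r"edits_(\d+)_inprogress$", name)
        if not m:
            continue
        first = int(m.group(1))
        count = count_transactions(os.path.join(log_dir, name))
        last = first + count - 1
        os.rename(os.path.join(log_dir, name),
                  os.path.join(log_dir, "edits_%d-%d" % (first, last)))
```

Once this runs at startup, everything downstream (checkpoint transfer included) only ever sees finalized, range-named segments.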
[jira] [Updated] (HDFS-2085) 1073: finalize inprogress edit logs at startup
[ https://issues.apache.org/jira/browse/HDFS-2085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon updated HDFS-2085: -- Description: With HDFS-2074, the NameNode can read through any "in-progress" logs it finds during startup to determine how many transactions they have. It can then re-name the file from its inprogress name to its finalized name. For example, if it finds a file edits_10_inprogress with 3 transactions, it can rename it to edits_10-12 at startup. This means that other parts of the system like edits-log-transfer don't need to worry about in-progress logs. > 1073: finalize inprogress edit logs at startup > -- > > Key: HDFS-2085 > URL: https://issues.apache.org/jira/browse/HDFS-2085 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: name-node >Affects Versions: Edit log branch (HDFS-1073) >Reporter: Todd Lipcon >Assignee: Todd Lipcon > Fix For: Edit log branch (HDFS-1073) > > > With HDFS-2074, the NameNode can read through any "in-progress" logs it finds > during startup to determine how many transactions they have. It can then > re-name the file from its inprogress name to its finalized name. For example, > if it finds a file edits_10_inprogress with 3 transactions, it can rename it > to edits_10-12 at startup. This means that other parts of the system like > edits-log-transfer don't need to worry about in-progress logs. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-2085) 1073: finalize inprogress edit logs at startup
[ https://issues.apache.org/jira/browse/HDFS-2085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon updated HDFS-2085: -- Component/s: name-node Affects Version/s: Edit log branch (HDFS-1073) Fix Version/s: Edit log branch (HDFS-1073) > 1073: finalize inprogress edit logs at startup > -- > > Key: HDFS-2085 > URL: https://issues.apache.org/jira/browse/HDFS-2085 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: name-node >Affects Versions: Edit log branch (HDFS-1073) >Reporter: Todd Lipcon >Assignee: Todd Lipcon > Fix For: Edit log branch (HDFS-1073) > > > With HDFS-2074, the NameNode can read through any "in-progress" logs it finds > during startup to determine how many transactions they have. It can then > re-name the file from its inprogress name to its finalized name. For example, > if it finds a file edits_10_inprogress with 3 transactions, it can rename it > to edits_10-12 at startup. This means that other parts of the system like > edits-log-transfer don't need to worry about in-progress logs. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (HDFS-2085) 1073: finalize inprogress edit logs at startup
1073: finalize inprogress edit logs at startup -- Key: HDFS-2085 URL: https://issues.apache.org/jira/browse/HDFS-2085 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Todd Lipcon Assignee: Todd Lipcon -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-2080) Speed up DFS read path
[ https://issues.apache.org/jira/browse/HDFS-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13052119#comment-13052119 ] Todd Lipcon commented on HDFS-2080: --- bq. Did you compare the performance of "software" version with zlib? zlib's implementation iirc is the straightforward byte-by-byte algorithm, whereas the "software" implementation here is the "slicing-by-8" algorithm, which generally performs much better. I didn't do a rigorous comparison, though I think I did notice a speedup when I switched from zlib to this implementation. bq. Although it's not in the patch, I am sure you have played with it. Is there anything you found useful in making this work? I did some hacking here: https://github.com/toddlipcon/cpp-dfsclient/blob/master/test_readblock.cc See the read_packet() function and the crc32cHardware64_3parallel(...) code. This code does run faster than the "naive" non-pipelined implementation, though I didn't do a rigorous benchmark here either. I figure it would be best to post the patch above before going all-out on optimization. A few other notes on the patch: - a few unit tests are failing because of bugs in the tests (eg not creating a socket with an associated Channel, or assuming read() will always return the requested size) - the use of native byte buffers could cause a leak - we need some kind of pooling/buffer reuse here to avoid the native memory leak Sadly this project is "for fun" for me at the moment, so I probably won't be able to circle back for a little while. I will try to post a patch tonight that addresses some of the above bugs, though. 
> Speed up DFS read path > -- > > Key: HDFS-2080 > URL: https://issues.apache.org/jira/browse/HDFS-2080 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs client >Affects Versions: 0.23.0 >Reporter: Todd Lipcon >Assignee: Todd Lipcon > Fix For: 0.23.0 > > Attachments: hdfs-2080.txt > > > I've developed a series of patches that speeds up the HDFS read path by a > factor of about 2.5x (~300M/sec to ~800M/sec for localhost reading from > buffer cache) and also will make it easier to allow for advanced users (eg > hbase) to skip a buffer copy. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
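For reference, the checksum under discussion is CRC-32C (Castagnoli polynomial 0x1EDC6F41, reflected form 0x82F63B78). A minimal bit-at-a-time version, useful mainly as a correctness oracle for the fast implementations, looks like this; the slicing-by-8 and SSE4.2 hardware variants mentioned above compute the same function far faster:

```python
def crc32c(data, crc=0):
    """Bit-at-a-time CRC-32C: reflected polynomial 0x82F63B78, with an
    initial value and final XOR of 0xFFFFFFFF. Reference-quality only;
    table-driven or hardware implementations are orders of magnitude
    faster on real packet sizes."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF
```

The standard check value applies: `crc32c(b"123456789")` is 0xE3069283, which is a quick sanity test for any alternative implementation.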
[jira] [Commented] (HDFS-2080) Speed up DFS read path
[ https://issues.apache.org/jira/browse/HDFS-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13052090#comment-13052090 ] Kihwal Lee commented on HDFS-2080: -- This is awesome! I will review the patch carefully, but I have a couple of questions for now. * Did you compare the performance of the "software" version with zlib? Just to make sure we fall back to a better one. If zlib's crc32 doesn't perform significantly better, using what we have will be simpler for supporting different polynomials. * I did a bit of experimentation with filling up the pipeline. When there is no data dependency, I get 1.17 cycles/Qword. By dividing the buffer into three chunks, I get about 1.6 - 1.7 cycles/Qword. This is before combining results and processing the remainder. I didn't tweak too much, so it might be possible to make it a bit better. Although it's not in the patch, I am sure you have played with it. Is there anything you found useful in making this work? > Speed up DFS read path > -- > > Key: HDFS-2080 > URL: https://issues.apache.org/jira/browse/HDFS-2080 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs client >Affects Versions: 0.23.0 >Reporter: Todd Lipcon >Assignee: Todd Lipcon > Fix For: 0.23.0 > > Attachments: hdfs-2080.txt > > > I've developed a series of patches that speeds up the HDFS read path by a > factor of about 2.5x (~300M/sec to ~800M/sec for localhost reading from > buffer cache) and also will make it easier to allow for advanced users (eg > hbase) to skip a buffer copy. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-1765) Block Replication should respect under-replication block priority
[ https://issues.apache.org/jira/browse/HDFS-1765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13052086#comment-13052086 ] Haryadi Gunawi commented on HDFS-1765: -- I agree with Hairong. Recently, I've been playing around with this, and found the same problem as shown in the attachment (underReplicatedQueue.pdf). At a high level, if the round-robin iterator is in queue-2 (the queue with priority=2), then the UR blocks in queue-0 must wait until the iterator wraps to queue-0 again. So, I assume, in the worst case, if queue-2 is long (as depicted in the graph), the UR blocks in queue-0 will take a very long time to be served! The setup of the figure: I have 20 nodes. Each node holds 3000 blocks. I fail 4 nodes. q-0: UR blocks with 1 replica q-2: UR blocks with 2 replicas pq: pending queue (I stopped the experiment in the middle, because the pattern is obvious) More details on why the round-robin iterator does not work: It is true that round-robin iterates through queue-0 first, but the replication monitor runs this logic: - choose a block B to be replicated - pick a source node S that still has B - BUT if S has already been chosen to replicate other blocks (i.e. S's rep stream is already larger than maxrepstream(2)), then increment the iterator (and thus this block B in queue-0 will not be served until the round-robin iterator wraps). And if the other queues (e.g. q1 and q2) are super long, then queue-0 might be starved for a long time. > Block Replication should respect under-replication block priority > - > > Key: HDFS-1765 > URL: https://issues.apache.org/jira/browse/HDFS-1765 > Project: Hadoop HDFS > Issue Type: Improvement > Components: name-node >Affects Versions: 0.23.0 >Reporter: Hairong Kuang >Assignee: Hairong Kuang > Fix For: 0.23.0 > > > Currently under-replicated blocks are assigned different priorities depending > on how many replicas a block has. However the replication monitor works on > blocks in a round-robin fashion. So the newly added high-priority blocks > won't get replicated until all low-priority blocks are done. One example is > that on the decommissioning datanode WebUI we often observe that "blocks with > only decommissioning replicas" do not get scheduled to replicate before other > blocks, thus risking data availability if the node is shut down for repair > before decommission completes. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
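The starvation described above can be avoided by always draining the higher-priority queues first rather than advancing a round-robin cursor. The toy sketch below shows that selection policy only; it deliberately omits the per-source replication-stream limit (`maxrepstream`) that complicates the real monitor, and is not the NameNode's actual code:

```python
from collections import deque


def choose_blocks(queues, limit):
    """Pick up to `limit` blocks, strictly preferring lower-numbered
    (higher-priority) queues. queues maps priority -> deque of blocks.
    A block in queue 0 can never wait behind a long queue 2."""
    chosen = []
    for priority in sorted(queues):
        q = queues[priority]
        while q and len(chosen) < limit:
            chosen.append(q.popleft())
        if len(chosen) >= limit:
            break
    return chosen
```

With `{0: ["b1"], 2: ["b2", "b3"]}` and a limit of 2, the single-replica block `b1` is always scheduled first, which is the property the decommissioning WebUI observation asks for.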
[jira] [Updated] (HDFS-1765) Block Replication should respect under-replication block priority
[ https://issues.apache.org/jira/browse/HDFS-1765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haryadi Gunawi updated HDFS-1765: - Attachment: underReplicatedQueue.pdf > Block Replication should respect under-replication block priority > - > > Key: HDFS-1765 > URL: https://issues.apache.org/jira/browse/HDFS-1765 > Project: Hadoop HDFS > Issue Type: Improvement > Components: name-node >Affects Versions: 0.23.0 >Reporter: Hairong Kuang >Assignee: Hairong Kuang > Fix For: 0.23.0 > > Attachments: underReplicatedQueue.pdf > > > Currently under-replicated blocks are assigned different priorities depending > on how many replicas a block has. However the replication monitor works on > blocks in a round-robin fashion. So the newly added high priority blocks > won't get replicated until all low-priority blocks are done. One example is > that on decommissioning datanode WebUI we often observe that "blocks with > only decommissioning replicas" do not get scheduled to replicate before other > blocks, so risking data availability if the node is shutdown for repair > before decommission completes. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-1568) Improve DataXceiver error logging
[ https://issues.apache.org/jira/browse/HDFS-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13052079#comment-13052079 ] Joey Echeverria commented on HDFS-1568: --- Thanks. > Improve DataXceiver error logging > - > > Key: HDFS-1568 > URL: https://issues.apache.org/jira/browse/HDFS-1568 > Project: Hadoop HDFS > Issue Type: Improvement > Components: data-node >Affects Versions: 0.23.0 >Reporter: Todd Lipcon >Assignee: Joey Echeverria >Priority: Minor > Labels: newbie > Attachments: HDFS-1568-1.patch, HDFS-1568-3.patch, HDFS-1568-4.patch, > HDFS-1568-5.patch, HDFS-1568-6.patch, HDFS-1568-output-changes.patch > > > In supporting customers we often see things like SocketTimeoutExceptions or > EOFExceptions coming from DataXceiver, but the logging isn't very good. For > example, if we get an IOE while setting up a connection to the downstream > mirror in writeBlock, the IP of the downstream mirror isn't logged on the DN > side. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-2084) Sometimes backup node/secondary name node stops with exception
[ https://issues.apache.org/jira/browse/HDFS-2084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13052048#comment-13052048 ] Hadoop QA commented on HDFS-2084: - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12483166/patch.diff against trunk revision 1137675. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. -1 patch. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/799//console This message is automatically generated. > Sometimes backup node/secondary name node stops with exception > -- > > Key: HDFS-2084 > URL: https://issues.apache.org/jira/browse/HDFS-2084 > Project: Hadoop HDFS > Issue Type: Bug > Components: name-node >Affects Versions: 0.21.0 > Environment: FreeBSD >Reporter: Vitalii Tymchyshyn > Attachments: patch.diff > > > 2011-06-17 11:43:23,096 ERROR > org.apache.hadoop.hdfs.server.namenode.Checkpointer: Throwable Exception in > doCheckpoint: > java.lang.NullPointerException > at > org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedSetTimes(FSDirectory.java:1765) > at > org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedSetTimes(FSDirectory.java:1753) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadEditRecords(FSEditLog.java:708) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:411) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:378) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1209) > at > org.apache.hadoop.hdfs.server.namenode.BackupStorage.loadCheckpoint(BackupStorage.java:158) > at > 
org.apache.hadoop.hdfs.server.namenode.Checkpointer.doCheckpoint(Checkpointer.java:243) > at > org.apache.hadoop.hdfs.server.namenode.Checkpointer.run(Checkpointer.java:141) -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-1568) Improve DataXceiver error logging
[ https://issues.apache.org/jira/browse/HDFS-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13052043#comment-13052043 ] Tsz Wo (Nicholas), SZE commented on HDFS-1568: -- The previous build failed unexpectedly. I have restarted a new build. > Improve DataXceiver error logging > - > > Key: HDFS-1568 > URL: https://issues.apache.org/jira/browse/HDFS-1568 > Project: Hadoop HDFS > Issue Type: Improvement > Components: data-node >Affects Versions: 0.23.0 >Reporter: Todd Lipcon >Assignee: Joey Echeverria >Priority: Minor > Labels: newbie > Attachments: HDFS-1568-1.patch, HDFS-1568-3.patch, HDFS-1568-4.patch, > HDFS-1568-5.patch, HDFS-1568-6.patch, HDFS-1568-output-changes.patch > > > In supporting customers we often see things like SocketTimeoutExceptions or > EOFExceptions coming from DataXceiver, but the logging isn't very good. For > example, if we get an IOE while setting up a connection to the downstream > mirror in writeBlock, the IP of the downstream mirror isn't logged on the DN > side. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-420) Fuse-dfs should cache fs handles
[ https://issues.apache.org/jira/browse/HDFS-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13052042#comment-13052042 ]

Hudson commented on HDFS-420:
-

Integrated in Hadoop-Hdfs-trunk-Commit #751 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/751/])
HDFS-420. Fuse-dfs should cache fs handles. Contributed by Brian Bockelman and Eli Collins

eli : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1137675
Files :
* /hadoop/common/trunk/hdfs/src/contrib/build-contrib.xml
* /hadoop/common/trunk/hdfs/src/contrib/fuse-dfs/src/fuse_impls_unlink.c
* /hadoop/common/trunk/hdfs/CHANGES.txt
* /hadoop/common/trunk/hdfs/src/contrib/fuse-dfs/src/fuse_impls_getattr.c
* /hadoop/common/trunk/hdfs/src/contrib/fuse-dfs/src/fuse_impls_release.c
* /hadoop/common/trunk/hdfs/src/contrib/fuse-dfs/src/fuse_impls_utimens.c
* /hadoop/common/trunk/hdfs/src/contrib/fuse-dfs/src/fuse_options.c
* /hadoop/common/trunk/hdfs/src/contrib/fuse-dfs/src/fuse_stat_struct.c
* /hadoop/common/trunk/hdfs/src/contrib/fuse-dfs/src/fuse_dfs_wrapper.sh
* /hadoop/common/trunk/hdfs/src/contrib/fuse-dfs/src/fuse_impls_rename.c
* /hadoop/common/trunk/hdfs/src/contrib/fuse-dfs/src/fuse_impls_mkdir.c
* /hadoop/common/trunk/hdfs/src/contrib/fuse-dfs/src/fuse_impls_statfs.c
* /hadoop/common/trunk/hdfs/src/contrib/fuse-dfs/src/fuse_impls_rmdir.c
* /hadoop/common/trunk/hdfs/src/contrib/fuse-dfs/build.xml
* /hadoop/common/trunk/hdfs/src/contrib/fuse-dfs/src/fuse_users.c
* /hadoop/common/trunk/hdfs/src/contrib/fuse-dfs/src/fuse_init.c
* /hadoop/common/trunk/hdfs/src/contrib/fuse-dfs/src/fuse_impls_access.c
* /hadoop/common/trunk/hdfs/src/contrib/fuse-dfs/configure.ac
* /hadoop/common/trunk/hdfs/src/contrib/fuse-dfs/src/fuse_impls_truncate.c
* /hadoop/common/trunk/hdfs/src/contrib/fuse-dfs/src/fuse_connect.c
* /hadoop/common/trunk/hdfs/src/contrib/fuse-dfs/src/fuse_impls_readdir.c
* /hadoop/common/trunk/hdfs/src/contrib/build.xml
* /hadoop/common/trunk/hdfs/src/contrib/fuse-dfs/src/fuse_impls_open.c
* /hadoop/common/trunk/hdfs/src/c++/libhdfs/hdfs.c
* /hadoop/common/trunk/hdfs/src/contrib/fuse-dfs/src/fuse_connect.h
* /hadoop/common/trunk/hdfs/src/contrib/fuse-dfs/src/fuse_dfs.c
* /hadoop/common/trunk/hdfs/src/contrib/fuse-dfs/src/fuse_impls_chmod.c
* /hadoop/common/trunk/hdfs/src/contrib/fuse-dfs/src/fuse_impls_chown.c
* /hadoop/common/trunk/hdfs/src/contrib/fuse-dfs/src/fuse_context_handle.h
* /hadoop/common/trunk/hdfs/src/contrib/fuse-dfs/src/fuse_dfs.h

> Fuse-dfs should cache fs handles
> --
>
> Key: HDFS-420
> URL: https://issues.apache.org/jira/browse/HDFS-420
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: contrib/fuse-dfs
> Affects Versions: 0.20.2
> Environment: Fedora core 10, x86_64, 2.6.27.7-134.fc10.x86_64 #1 SMP (AMD 64), gcc 4.3.2, java 1.6.0 (IcedTea6 1.4 (fedora-7.b12.fc10-x86_64) Runtime Environment (build 1.6.0_0-b12) OpenJDK 64-Bit Server VM (build 10.0-b19, mixed mode)
> Reporter: Dima Brodsky
> Assignee: Brian Bockelman
> Fix For: 0.23.0
>
> Attachments: fuse_dfs_020_memleaks.patch, fuse_dfs_020_memleaks_v3.patch, fuse_dfs_020_memleaks_v8.patch, hdfs-420-1.patch, hdfs-420-2.patch, hdfs-420-3.patch
>
> Fuse-dfs should cache fs handles on a per-user basis. This significantly increases performance, and has the side effect of fixing the current code, which leaks fs handles.
> The original bug description follows:
> I run the following test:
> 1. Run hadoop DFS in single node mode
> 2. start up fuse_dfs
> 3. copy my source tree, about 250 megs, into the DFS
>    cp -av * /mnt/hdfs/
> in /var/log/messages I keep seeing:
> Dec 22 09:02:08 bodum fuse_dfs: ERROR: hdfs trying to utime /bar/backend-trunk2/src/machinery/hadoop/output/2008/11/19 to 1229385138/1229963739
> and then eventually
> Dec 22 09:03:49 bodum fuse_dfs: ERROR: could not connect to dfs fuse_dfs.c:1333
> Dec 22 09:03:49 bodum fuse_dfs: ERROR: could not connect to dfs fuse_dfs.c:1333
> Dec 22 09:03:49 bodum fuse_dfs: ERROR: could not connect to dfs fuse_dfs.c:1037
> Dec 22 09:03:49 bodum fuse_dfs: ERROR: could not connect to dfs fuse_dfs.c:1333
> Dec 22 09:03:49 bodum fuse_dfs: ERROR: could not connect to dfs fuse_dfs.c:1037
> Dec 22 09:03:49 bodum fuse_dfs: ERROR: could not connect to dfs fuse_dfs.c:1333
> Dec 22 09:03:49 bodum fuse_dfs: ERROR: could not connect to dfs fuse_dfs.c:1209
> Dec 22 09:03:49 bodum fuse_dfs: ERROR: could not connect to dfs fuse_dfs.c:1037
> Dec 22 09:03:49 bodum fuse_dfs: ERROR: could not connect to dfs fuse_dfs.c:1037
> Dec 22 09:03:49 bodum fuse_dfs: ERROR: could not connect to dfs fuse_dfs.c:1037
> Dec 22 09:03:49 bodum fuse_dfs: ERROR: could not connect to dfs fuse_dfs.c:1037
> Dec 22 09:03:49 bodum fuse_dfs: ERROR: cou
[jira] [Updated] (HDFS-2084) Sometimes backup node/secondary name node stops with exception
[ https://issues.apache.org/jira/browse/HDFS-2084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vitalii Tymchyshyn updated HDFS-2084:
-

Attachment: patch.diff

This is a patch to skip such entries. Note that it is against my own copy of the 0.21 release tag, so the revisions are from my svn.

> Sometimes backup node/secondary name node stops with exception
> --
>
> Key: HDFS-2084
> URL: https://issues.apache.org/jira/browse/HDFS-2084
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: name-node
> Affects Versions: 0.21.0
> Environment: FreeBSD
> Reporter: Vitalii Tymchyshyn
> Attachments: patch.diff
>
> 2011-06-17 11:43:23,096 ERROR org.apache.hadoop.hdfs.server.namenode.Checkpointer: Throwable Exception in doCheckpoint:
> java.lang.NullPointerException
>     at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedSetTimes(FSDirectory.java:1765)
>     at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedSetTimes(FSDirectory.java:1753)
>     at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadEditRecords(FSEditLog.java:708)
>     at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:411)
>     at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:378)
>     at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1209)
>     at org.apache.hadoop.hdfs.server.namenode.BackupStorage.loadCheckpoint(BackupStorage.java:158)
>     at org.apache.hadoop.hdfs.server.namenode.Checkpointer.doCheckpoint(Checkpointer.java:243)
>     at org.apache.hadoop.hdfs.server.namenode.Checkpointer.run(Checkpointer.java:141)

--
This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
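The workaround described above (skipping edit-log entries whose target no longer exists, rather than hitting the NullPointerException in unprotectedSetTimes) can be sketched as follows. This is a hypothetical illustration under invented names (`SetTimesReplaySketch`, `replaySetTimes`, `Inode`), not the attached patch.diff or the real FSDirectory code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: replaying a "set times" edit against a namespace,
// skipping entries whose target is gone instead of dereferencing null.
public class SetTimesReplaySketch {
    static class Inode { long mtime; long atime; }

    private final Map<String, Inode> namespace = new HashMap<>();
    private int skipped = 0;

    public void addFile(String path) { namespace.put(path, new Inode()); }

    // Returns true if the edit was applied, false if it was skipped.
    public boolean replaySetTimes(String path, long mtime, long atime) {
        Inode inode = namespace.get(path);
        if (inode == null) {
            // Target no longer exists (e.g. deleted later in the log):
            // count and skip rather than throwing a NullPointerException.
            skipped++;
            return false;
        }
        inode.mtime = mtime;
        inode.atime = atime;
        return true;
    }

    public int skippedCount() { return skipped; }

    public static void main(String[] args) {
        SetTimesReplaySketch dir = new SetTimesReplaySketch();
        dir.addFile("/a/b");
        dir.replaySetTimes("/a/b", 1229385138L, 1229963739L); // applied
        dir.replaySetTimes("/a/gone", 1L, 2L);                // skipped, no NPE
        System.out.println("skipped=" + dir.skippedCount());  // prints skipped=1
    }
}
```

The trade-off is the one the "workaround" tag implies: replay keeps going, at the cost of silently dropping an edit whose target cannot be resolved.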
[jira] [Updated] (HDFS-2084) Sometimes backup node/secondary name node stops with exception
[ https://issues.apache.org/jira/browse/HDFS-2084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vitalii Tymchyshyn updated HDFS-2084:
-

Tags: workaround
Status: Patch Available (was: Open)

This is my patch to skip such problematic entries.

> Sometimes backup node/secondary name node stops with exception
> --
>
> Key: HDFS-2084
> URL: https://issues.apache.org/jira/browse/HDFS-2084
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: name-node
> Affects Versions: 0.21.0
> Environment: FreeBSD
> Reporter: Vitalii Tymchyshyn
>
> 2011-06-17 11:43:23,096 ERROR org.apache.hadoop.hdfs.server.namenode.Checkpointer: Throwable Exception in doCheckpoint:
> java.lang.NullPointerException
>     at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedSetTimes(FSDirectory.java:1765)
>     at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedSetTimes(FSDirectory.java:1753)
>     at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadEditRecords(FSEditLog.java:708)
>     at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:411)
>     at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:378)
>     at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1209)
>     at org.apache.hadoop.hdfs.server.namenode.BackupStorage.loadCheckpoint(BackupStorage.java:158)
>     at org.apache.hadoop.hdfs.server.namenode.Checkpointer.doCheckpoint(Checkpointer.java:243)
>     at org.apache.hadoop.hdfs.server.namenode.Checkpointer.run(Checkpointer.java:141)
[jira] [Created] (HDFS-2084) Sometimes backup node/secondary name node stops with exception
Sometimes backup node/secondary name node stops with exception
--

Key: HDFS-2084
URL: https://issues.apache.org/jira/browse/HDFS-2084
Project: Hadoop HDFS
Issue Type: Bug
Components: name-node
Affects Versions: 0.21.0
Environment: FreeBSD
Reporter: Vitalii Tymchyshyn

2011-06-17 11:43:23,096 ERROR org.apache.hadoop.hdfs.server.namenode.Checkpointer: Throwable Exception in doCheckpoint:
java.lang.NullPointerException
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedSetTimes(FSDirectory.java:1765)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedSetTimes(FSDirectory.java:1753)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadEditRecords(FSEditLog.java:708)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:411)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:378)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1209)
    at org.apache.hadoop.hdfs.server.namenode.BackupStorage.loadCheckpoint(BackupStorage.java:158)
    at org.apache.hadoop.hdfs.server.namenode.Checkpointer.doCheckpoint(Checkpointer.java:243)
    at org.apache.hadoop.hdfs.server.namenode.Checkpointer.run(Checkpointer.java:141)
[jira] [Updated] (HDFS-420) Fuse-dfs should cache fs handles
[ https://issues.apache.org/jira/browse/HDFS-420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eli Collins updated HDFS-420:
-

Resolution: Fixed
Hadoop Flags: [Reviewed]
Status: Resolved (was: Patch Available)

I've committed this. Thanks Brian and Todd.

> Fuse-dfs should cache fs handles
> --
>
> Key: HDFS-420
> URL: https://issues.apache.org/jira/browse/HDFS-420
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: contrib/fuse-dfs
> Affects Versions: 0.20.2
> Environment: Fedora core 10, x86_64, 2.6.27.7-134.fc10.x86_64 #1 SMP (AMD 64), gcc 4.3.2, java 1.6.0 (IcedTea6 1.4 (fedora-7.b12.fc10-x86_64) Runtime Environment (build 1.6.0_0-b12) OpenJDK 64-Bit Server VM (build 10.0-b19, mixed mode)
> Reporter: Dima Brodsky
> Assignee: Brian Bockelman
> Fix For: 0.23.0
>
> Attachments: fuse_dfs_020_memleaks.patch, fuse_dfs_020_memleaks_v3.patch, fuse_dfs_020_memleaks_v8.patch, hdfs-420-1.patch, hdfs-420-2.patch, hdfs-420-3.patch
>
> Fuse-dfs should cache fs handles on a per-user basis. This significantly increases performance, and has the side effect of fixing the current code, which leaks fs handles.
> The original bug description follows:
> I run the following test:
> 1. Run hadoop DFS in single node mode
> 2. start up fuse_dfs
> 3. copy my source tree, about 250 megs, into the DFS
>    cp -av * /mnt/hdfs/
> in /var/log/messages I keep seeing:
> Dec 22 09:02:08 bodum fuse_dfs: ERROR: hdfs trying to utime /bar/backend-trunk2/src/machinery/hadoop/output/2008/11/19 to 1229385138/1229963739
> and then eventually
> Dec 22 09:03:49 bodum fuse_dfs: ERROR: could not connect to dfs fuse_dfs.c:1333
> Dec 22 09:03:49 bodum fuse_dfs: ERROR: could not connect to dfs fuse_dfs.c:1333
> Dec 22 09:03:49 bodum fuse_dfs: ERROR: could not connect to dfs fuse_dfs.c:1037
> Dec 22 09:03:49 bodum fuse_dfs: ERROR: could not connect to dfs fuse_dfs.c:1333
> Dec 22 09:03:49 bodum fuse_dfs: ERROR: could not connect to dfs fuse_dfs.c:1037
> Dec 22 09:03:49 bodum fuse_dfs: ERROR: could not connect to dfs fuse_dfs.c:1333
> Dec 22 09:03:49 bodum fuse_dfs: ERROR: could not connect to dfs fuse_dfs.c:1209
> Dec 22 09:03:49 bodum fuse_dfs: ERROR: could not connect to dfs fuse_dfs.c:1037
> Dec 22 09:03:49 bodum fuse_dfs: ERROR: could not connect to dfs fuse_dfs.c:1037
> Dec 22 09:03:49 bodum fuse_dfs: ERROR: could not connect to dfs fuse_dfs.c:1037
> Dec 22 09:03:49 bodum fuse_dfs: ERROR: could not connect to dfs fuse_dfs.c:1037
> Dec 22 09:03:49 bodum fuse_dfs: ERROR: could not connect to dfs fuse_dfs.c:1037
> Dec 22 09:03:49 bodum fuse_dfs: ERROR: could not connect to dfs fuse_dfs.c:1037
> Dec 22 09:03:49 bodum fuse_dfs: ERROR: could not connect to dfs fuse_dfs.c:1037
> Dec 22 09:03:49 bodum fuse_dfs: ERROR: could not connect to dfs fuse_dfs.c:1037
> Dec 22 09:03:49 bodum fuse_dfs: ERROR: could not connect to dfs fuse_dfs.c:1037
> Dec 22 09:03:49 bodum fuse_dfs: ERROR: could not connect to dfs fuse_dfs.c:1209
> Dec 22 09:03:49 bodum fuse_dfs: ERROR: could not connect to dfs fuse_dfs.c:1037
> Dec 22 09:03:49 bodum fuse_dfs: ERROR: could not connect to dfs fuse_dfs.c:1037
> Dec 22 09:03:49 bodum fuse_dfs: ERROR: could not connect to dfs fuse_dfs.c:1037
> Dec 22 09:03:49 bodum fuse_dfs: ERROR: could not connect to dfs fuse_dfs.c:1333
> Dec 22 09:03:49 bodum fuse_dfs: ERROR: could not connect to dfs fuse_dfs.c:1209
> Dec 22 09:03:49 bodum fuse_dfs: ERROR: could not connect to dfs fuse_dfs.c:1037
> and the file system hangs. hadoop is still running and I don't see any errors in its logs. I have to unmount the dfs and restart fuse_dfs and then everything is fine again. At some point I see the following messages in the /var/log/messages:
> ERROR: dfs problem - could not close file_handle(139677114350528) for /bar/backend-trunk2/src/machinery/hadoop/input/2008/12/14/actionrecordlog-8339-93825052368848-1229278807.log fuse_dfs.c:1464
> Dec 22 09:04:49 bodum fuse_dfs: ERROR: dfs problem - could not close file_handle(139676770220176) for /bar/backend-trunk2/src/machinery/hadoop/input/2008/12/14/actionrecordlog-8140-93825025883216-1229278759.log fuse_dfs.c:1464
> Dec 22 09:05:13 bodum fuse_dfs: ERROR: dfs problem - could not close file_handle(139677114812832) for /bar/backend-trunk2/src/machinery/hadoop/input/2008/12/14/actionrecordlog-8138-93825070138960-1229251587.log fuse_dfs.c:1464
> Is this a known issue? Am I just flooding the system too much? All of this is being performed on a single, dual core, machine.
> Thanks!
> ttyl
> Dima
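The per-user handle caching this issue introduces can be illustrated with a minimal sketch. fuse-dfs itself is written in C (the committed change touches fuse_connect.c among other files), but the idea maps to any language; every name below (`HandleCacheSketch`, `FsHandle`, `connect`) is an invented stand-in for illustration, not the actual fuse-dfs or libhdfs API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical per-user handle cache: one shared filesystem handle per
// user, instead of a fresh connection per operation (which leaks handles).
public class HandleCacheSketch {
    // Counted so the effect of caching is observable.
    public static final AtomicInteger connectCalls = new AtomicInteger();

    // Stand-in for a real fs handle (an hdfsFS pointer, in libhdfs terms).
    public static class FsHandle {
        final String user;
        FsHandle(String user) { this.user = user; }
    }

    private final Map<String, FsHandle> cache = new ConcurrentHashMap<>();

    // Expensive in real life (connects to the cluster as the given user).
    private FsHandle connect(String user) {
        connectCalls.incrementAndGet();
        return new FsHandle(user);
    }

    // Reuses the cached handle for a user; connects at most once per user.
    public FsHandle get(String user) {
        return cache.computeIfAbsent(user, this::connect);
    }

    public static void main(String[] args) {
        HandleCacheSketch handles = new HandleCacheSketch();
        boolean same = handles.get("alice") == handles.get("alice");
        handles.get("bob");
        // prints: same=true connects=2
        System.out.println("same=" + same + " connects=" + connectCalls.get());
    }
}
```

Besides the performance win noted in the description, the cache fixes the leak as a side effect: handles are owned by the cache rather than created and abandoned per operation.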
[jira] [Updated] (HDFS-2034) length in getBlockRange becomes -ve when reading only from currently being written blk
[ https://issues.apache.org/jira/browse/HDFS-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John George updated HDFS-2034:
-

Status: Patch Available (was: Open)

> length in getBlockRange becomes -ve when reading only from currently being written blk
> --
>
> Key: HDFS-2034
> URL: https://issues.apache.org/jira/browse/HDFS-2034
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: John George
> Assignee: John George
> Priority: Minor
> Attachments: HDFS-2034-1.patch, HDFS-2034-1.patch, HDFS-2034-2.patch, HDFS-2034-3.patch, HDFS-2034-4.patch, HDFS-2034.patch
>
> This came up during HDFS-1907. Posting an example that Todd posted in HDFS-1907 that brought out this issue.
> {quote}
> Here's an example sequence to describe what I mean:
> 1. open file, write one and a half blocks
> 2. call hflush
> 3. another reader asks for the first byte of the second block
> {quote}
> In this case, since the offset is greater than the completed block length, the math in getBlockRange() of DFSInputStream.java will set "length" to a negative value.
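The sign problem described above reduces to one line of arithmetic: the length of the completed-blocks portion of a read is "completed length minus offset", which goes negative as soon as the read starts past the completed blocks, i.e. inside the block still being written. A minimal sketch, with invented method names that only mirror the math rather than reproduce the actual getBlockRange() code:

```java
// Hypothetical illustration of the sign bug: the "completed blocks"
// portion of a read that starts at `offset` bytes into the file.
public class BlockRangeSketch {
    // Buggy form: goes negative when offset > completedLength, i.e. the
    // read starts inside the block under construction.
    public static long completedPortion(long completedLength, long offset) {
        return completedLength - offset;
    }

    // Clamped form: never negative.
    public static long completedPortionClamped(long completedLength, long offset) {
        return Math.max(0L, completedLength - offset);
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024; // one completed 64 MB block
        long offset = blockSize + 1;        // read starts past the completed block
        // prints: buggy=-1 clamped=0
        System.out.println("buggy=" + completedPortion(blockSize, offset)
            + " clamped=" + completedPortionClamped(blockSize, offset));
    }
}
```

In Todd's sequence, hflush makes the second (partial) block readable while only the first block is completed, so any read addressed entirely to the partial block trips the buggy form.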