[jira] [Commented] (HDFS-7018) Implement hdfs.h interface in libhdfs3
[ https://issues.apache.org/jira/browse/HDFS-7018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14362157#comment-14362157 ] Thanh Do commented on HDFS-7018: Could you point me to the file that implements those APIs? I couldn't find Hdfs.cc or Hdfs.cpp in the libhdfs3 folder. Implement hdfs.h interface in libhdfs3 -- Key: HDFS-7018 URL: https://issues.apache.org/jira/browse/HDFS-7018 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs-client Reporter: Zhanwei Wang Assignee: Zhanwei Wang Fix For: HDFS-6994 Attachments: HDFS-7018-pnative.002.patch, HDFS-7018-pnative.003.patch, HDFS-7018-pnative.004.patch, HDFS-7018-pnative.005.patch, HDFS-7018.patch Implement C interface for libhdfs3 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7018) Implement hdfs.h interface in libhdfs3
[ https://issues.apache.org/jira/browse/HDFS-7018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14361843#comment-14361843 ] Thanh Do commented on HDFS-7018: This patch is not committed to the branch yet, right? Implement hdfs.h interface in libhdfs3 -- Key: HDFS-7018 URL: https://issues.apache.org/jira/browse/HDFS-7018 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs-client Reporter: Zhanwei Wang Assignee: Zhanwei Wang Fix For: HDFS-6994 Attachments: HDFS-7018-pnative.002.patch, HDFS-7018-pnative.003.patch, HDFS-7018-pnative.004.patch, HDFS-7018-pnative.005.patch, HDFS-7018.patch Implement C interface for libhdfs3 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7360) Test libhdfs3 against MiniDFSCluster
[ https://issues.apache.org/jira/browse/HDFS-7360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14356161#comment-14356161 ] Thanh Do commented on HDFS-7360: +1. Test libhdfs3 against MiniDFSCluster Key: HDFS-7360 URL: https://issues.apache.org/jira/browse/HDFS-7360 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs-client Reporter: Haohui Mai Assignee: Zhanwei Wang Priority: Critical Attachments: HDFS-7360-pnative.002.patch, HDFS-7360-pnative.003.patch, HDFS-7360-pnative.004.patch, HDFS-7360.patch Currently the branch has enough code to interact with HDFS servers. We should test the code against MiniDFSCluster to ensure the correctness of the code. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7188) support build libhdfs3 on windows
[ https://issues.apache.org/jira/browse/HDFS-7188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14343996#comment-14343996 ] Thanh Do commented on HDFS-7188: Thanks! Can you commit this patch? I'll wait for it to be committed before working on other issues, to keep the patch sizes down so that the diffs are easier to review. support build libhdfs3 on windows - Key: HDFS-7188 URL: https://issues.apache.org/jira/browse/HDFS-7188 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs-client Environment: Windows System, Visual Studio 2010 Reporter: Zhanwei Wang Assignee: Thanh Do Attachments: HDFS-7188-branch-HDFS-6994-0.patch, HDFS-7188-branch-HDFS-6994-1.patch, HDFS-7188-branch-HDFS-6994-2.patch, HDFS-7188-branch-HDFS-6994-3.patch libhdfs3 should work on windows -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7360) Test libhdfs3 against MiniDFSCluster
[ https://issues.apache.org/jira/browse/HDFS-7360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14344132#comment-14344132 ] Thanh Do commented on HDFS-7360: Hi [~cmccabe] and [~wangzw]. Along the lines of testing, I wonder how we should test Windows-related code. Does Hadoop support continuous integration for Windows? Test libhdfs3 against MiniDFSCluster Key: HDFS-7360 URL: https://issues.apache.org/jira/browse/HDFS-7360 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs-client Reporter: Haohui Mai Assignee: Zhanwei Wang Priority: Critical Attachments: HDFS-7360-pnative.002.patch, HDFS-7360.patch Currently the branch has enough code to interact with HDFS servers. We should test the code against MiniDFSCluster to ensure the correctness of the code. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HDFS-7870) remove libuuid dependency
Thanh Do created HDFS-7870: -- Summary: remove libuuid dependency Key: HDFS-7870 URL: https://issues.apache.org/jira/browse/HDFS-7870 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Thanh Do Instead of using the platform-dependent libuuid, we should have our own 128-bit random generator and use it across all platforms we support. This will avoid the headache of platform-dependent code. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7188) support build libhdfs3 on windows
[ https://issues.apache.org/jira/browse/HDFS-7188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14343727#comment-14343727 ] Thanh Do commented on HDFS-7188: bq. If this is too much work, then we could move it to a follow-on. But it sounds like you've already got the code to make it work without libuuid, so what else do we need? Unfortunately, the code on the Windows side is platform-specific and won't work with POSIX. I have created a follow-up JIRA for this in HDFS-7870. support build libhdfs3 on windows - Key: HDFS-7188 URL: https://issues.apache.org/jira/browse/HDFS-7188 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs-client Environment: Windows System, Visual Studio 2010 Reporter: Zhanwei Wang Assignee: Thanh Do Attachments: HDFS-7188-branch-HDFS-6994-0.patch, HDFS-7188-branch-HDFS-6994-1.patch, HDFS-7188-branch-HDFS-6994-2.patch, HDFS-7188-branch-HDFS-6994-3.patch libhdfs3 should work on windows -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HDFS-7860) Get HA NameNode information from config file
Thanh Do created HDFS-7860: -- Summary: Get HA NameNode information from config file Key: HDFS-7860 URL: https://issues.apache.org/jira/browse/HDFS-7860 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Thanh Do In the current code, the client uses files under /tmp to determine NameNode HA information. We should follow a cleaner approach that gets this information from the configuration file (similar to the Java client). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
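For reference, the Java client reads HA topology from standard hdfs-site.xml keys like the following (a minimal sketch; the nameservice name `mycluster` and the host names are placeholders):

```xml
<configuration>
  <property>
    <name>dfs.nameservices</name>
    <value>mycluster</value>
  </property>
  <property>
    <name>dfs.ha.namenodes.mycluster</name>
    <value>nn1,nn2</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.mycluster.nn1</name>
    <value>namenode1.example.com:8020</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.mycluster.nn2</name>
    <value>namenode2.example.com:8020</value>
  </property>
</configuration>
```

Parsing the same keys in libhdfs3 would keep the C++ client's HA behavior consistent with the Java client's.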
[jira] [Created] (HDFS-7861) Revisit Windows socket API compatibility
Thanh Do created HDFS-7861: -- Summary: Revisit Windows socket API compatibility Key: HDFS-7861 URL: https://issues.apache.org/jira/browse/HDFS-7861 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Thanh Do The Windows socket API is somewhat different from its POSIX counterpart (as described here: http://tangentsoft.net/wskfaq/articles/bsd-compatibility.html). We should address the compatibility issues in this JIRA. For instance, on Windows, {{WSAStartup}} must be called before any other socket API for those APIs to work correctly. Moreover, as the Winsock API does not return error codes in the {{errno}} variable, {{perror}} does not work as it does on POSIX systems. We should use {{WSAGetLastErrorMessage}} instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HDFS-7862) Revisit the use of long data type
Thanh Do created HDFS-7862: -- Summary: Revisit the use of long data type Key: HDFS-7862 URL: https://issues.apache.org/jira/browse/HDFS-7862 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Thanh Do We should revisit the places where the {{long}} data type is used. On POSIX, {{long}} takes 4 bytes on 32-bit architectures and 8 bytes on 64-bit. However, on Windows, {{long}} takes 4 bytes no matter what. Because of this, compilation on Windows can finish successfully, but some tests might fail. Additionally, compilation on Windows will generate many warnings such as "conversion from 'uint64_t' to 'unsigned long', possible loss of data". We should use {{int64_t}} or {{uint64_t}} instead whenever we expect a variable to be a signed or unsigned 8-byte integer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7188) support build libhdfs3 on windows
[ https://issues.apache.org/jira/browse/HDFS-7188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thanh Do updated HDFS-7188: --- Attachment: HDFS-7188-branch-HDFS-6994-3.patch Hi [~cmccabe]. Attached is my next patch, which addresses your comments. For the {{uuid}} problem, I redefine {{uuid_t}} to match its libuuid counterpart and use {{reinterpret_cast}} to convert {{uuid_t}} to the native Windows type. This is OK because a UUID on Windows is also 16 bytes long. However, I agree that we should write our own UUID generator to avoid yet another dependency and cross-platform headache. I've also created some follow-up JIRAs (e.g., HDFS-7860, HDFS-7861, and HDFS-7862). Let me know what you think and whether I need to make other changes to get this patch in. Thanks. support build libhdfs3 on windows - Key: HDFS-7188 URL: https://issues.apache.org/jira/browse/HDFS-7188 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs-client Environment: Windows System, Visual Studio 2010 Reporter: Zhanwei Wang Assignee: Thanh Do Attachments: HDFS-7188-branch-HDFS-6994-0.patch, HDFS-7188-branch-HDFS-6994-1.patch, HDFS-7188-branch-HDFS-6994-2.patch, HDFS-7188-branch-HDFS-6994-3.patch libhdfs3 should work on windows -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7817) libhdfs3: fix strerror_r detection
[ https://issues.apache.org/jira/browse/HDFS-7817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14339781#comment-14339781 ] Thanh Do commented on HDFS-7817: +1 for this. Will submit a patch soon libhdfs3: fix strerror_r detection -- Key: HDFS-7817 URL: https://issues.apache.org/jira/browse/HDFS-7817 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs-client Reporter: Colin Patrick McCabe The signature of strerror_r is not quite detected correctly in libhdfs3. The code assumes that {{int foo = strerror_r}} will fail to compile with the GNU type signature, but this is not the case (C\+\+ will coerce the char* to an int in this case). Instead, we should do what the libhdfs {{terror}} (threaded error) function does here. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (HDFS-7817) libhdfs3: fix strerror_r detection
[ https://issues.apache.org/jira/browse/HDFS-7817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thanh Do reassigned HDFS-7817: -- Assignee: Thanh Do libhdfs3: fix strerror_r detection -- Key: HDFS-7817 URL: https://issues.apache.org/jira/browse/HDFS-7817 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs-client Reporter: Colin Patrick McCabe Assignee: Thanh Do The signature of strerror_r is not quite detected correctly in libhdfs3. The code assumes that {{int foo = strerror_r}} will fail to compile with the GNU type signature, but this is not the case (C\+\+ will coerce the char* to an int in this case). Instead, we should do what the libhdfs {{terror}} (threaded error) function does here. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7817) libhdfs3: fix strerror_r detection
[ https://issues.apache.org/jira/browse/HDFS-7817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14335048#comment-14335048 ] Thanh Do commented on HDFS-7817: Hi [~cmccabe]. Thanks for pointing out the code. I was grepping the {{hadoop-hdfs}} folder but not {{hadoop-common}}. So this JIRA is about using {{sys_errlist}} instead of {{strerror_r}} for libhdfs3, right? libhdfs3: fix strerror_r detection -- Key: HDFS-7817 URL: https://issues.apache.org/jira/browse/HDFS-7817 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs-client Reporter: Colin Patrick McCabe The signature of strerror_r is not quite detected correctly in libhdfs3. The code assumes that {{int foo = strerror_r}} will fail to compile with the GNU type signature, but this is not the case (C\+\+ will coerce the char* to an int in this case). Instead, we should do what the libhdfs {{terror}} (threaded error) function does here. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7188) support build libhdfs3 on windows
[ https://issues.apache.org/jira/browse/HDFS-7188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14329528#comment-14329528 ] Thanh Do commented on HDFS-7188: Hi [~cmccabe]. Do you plan to take a closer look at my current patch, or will you wait for the next one? I prefer the former, because then I can integrate as many comments as possible into the next iteration. support build libhdfs3 on windows - Key: HDFS-7188 URL: https://issues.apache.org/jira/browse/HDFS-7188 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs-client Environment: Windows System, Visual Studio 2010 Reporter: Zhanwei Wang Assignee: Thanh Do Attachments: HDFS-7188-branch-HDFS-6994-0.patch, HDFS-7188-branch-HDFS-6994-1.patch, HDFS-7188-branch-HDFS-6994-2.patch libhdfs3 should work on windows -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7817) libhdfs3: fix strerror_r detection
[ https://issues.apache.org/jira/browse/HDFS-7817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14329873#comment-14329873 ] Thanh Do commented on HDFS-7817: Hey [~cmccabe], can you point me to {{terror}} in libhdfs? I grepped for this name but couldn't find it. I would like to take a crack at this if you don't mind. Thanks! libhdfs3: fix strerror_r detection -- Key: HDFS-7817 URL: https://issues.apache.org/jira/browse/HDFS-7817 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs-client Reporter: Colin Patrick McCabe The signature of strerror_r is not quite detected correctly in libhdfs3. The code assumes that {{int foo = strerror_r}} will fail to compile with the GNU type signature, but this is not the case (C\+\+ will coerce the char* to an int in this case). Instead, we should do what the libhdfs {{terror}} (threaded error) function does here. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7019) Add unit test for libhdfs3
[ https://issues.apache.org/jira/browse/HDFS-7019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14327928#comment-14327928 ] Thanh Do commented on HDFS-7019: Oh, my bad. I didn't realize that there was a patch for this. Anyway, some of the existing tests will fail with the Windows build support, hence they need to be changed accordingly. I'll open another JIRA for this. Add unit test for libhdfs3 -- Key: HDFS-7019 URL: https://issues.apache.org/jira/browse/HDFS-7019 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs-client Reporter: Zhanwei Wang Attachments: HDFS-7019.patch Add unit test for libhdfs3 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (HDFS-7768) Separate platform-specific functions
[ https://issues.apache.org/jira/browse/HDFS-7768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thanh Do resolved HDFS-7768. Resolution: Invalid Overlaps with HDFS-7188 Separate platform-specific functions --- Key: HDFS-7768 URL: https://issues.apache.org/jira/browse/HDFS-7768 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs-client Reporter: Thanh Do Assignee: Thanh Do Current code has several platform-specific parts (e.g., getting environment variables, getting local addresses, printing the stack). We should separate these parts into platform folders. This issue will do just that. POSIX systems will be able to compile successfully; Windows will fail to compile due to unimplemented parts. The implementation of the Windows parts will be handled in HDFS-7188 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7188) support build libhdfs3 on windows
[ https://issues.apache.org/jira/browse/HDFS-7188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326634#comment-14326634 ] Thanh Do commented on HDFS-7188: {code} if (syscalls::getpeername(sock, peer, reinterpret_cast<int*>(len))) { {code} Good catch. I will use {{socklen_t}} for {{len}} in the next patch. {code} #ifdef _WIN32 memcpy(clientId[0], id, sizeof(uuid_t)); #else memcpy(clientId[0], id, sizeof(uuid_t)); #endif {code} The reason for this is that on Windows, {{uuid_t}} is defined differently than on Linux. In particular, on Windows {{uuid_t}} is a real struct: {code} typedef struct _GUID { unsigned long Data1; unsigned short Data2; unsigned short Data3; unsigned char Data4[8]; } GUID; typedef GUID uuid_t; {code} while on Linux it is defined as a char array: {code} typedef unsigned char uuid_t[16]; {code} Fortunately, the size of {{uuid_t}} on both platforms is 16 bytes. {{GetInitNamenodeIndex}}: my patch did have this function defined in {{os/windows/platform.cc}}. I agree that getting it from the configuration is a cleaner and preferable way, but for the scope of this JIRA I just want to get the Windows build in ASAP. We should definitely open follow-on JIRAs to address the dangling issues. There are a few I can think of: 1) get HA info from the configuration, 2) revisit socket error handling for Windows, because socket error codes differ somewhat between Windows and POSIX (e.g., perror() exists but does not work as expected), 3) bootstrap the Windows socket API with WSAStartup, and so on. What do you think?
support build libhdfs3 on windows - Key: HDFS-7188 URL: https://issues.apache.org/jira/browse/HDFS-7188 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs-client Environment: Windows System, Visual Studio 2010 Reporter: Zhanwei Wang Assignee: Thanh Do Attachments: HDFS-7188-branch-HDFS-6994-0.patch, HDFS-7188-branch-HDFS-6994-1.patch, HDFS-7188-branch-HDFS-6994-2.patch libhdfs3 should work on windows -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7019) Add unit test for libhdfs3
[ https://issues.apache.org/jira/browse/HDFS-7019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14325098#comment-14325098 ] Thanh Do commented on HDFS-7019: Hi [~wangzw], Is there a specific reason that we cannot use the existing unit tests that you already wrote? Add unit test for libhdfs3 -- Key: HDFS-7019 URL: https://issues.apache.org/jira/browse/HDFS-7019 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs-client Reporter: Zhanwei Wang Attachments: HDFS-7019.patch Add unit test for libhdfs3 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7188) support build libhdfs3 on windows
[ https://issues.apache.org/jira/browse/HDFS-7188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thanh Do updated HDFS-7188: --- Attachment: HDFS-7188-branch-HDFS-6994-2.patch Hi folks, Attached is the next patch, which allows libhdfs3 to build successfully on Windows. At a high level, this patch does the following: 1. Separates platform-specific code (parts that need rewriting instead of {{#define}} tricks) and puts it in os/platform/ folders. 2. Always enables boost on Windows, because Visual Studio 2010 does not support some features such as {{atomic}} and {{chrono}}. Please give me some feedback so that I can work on the next iteration ASAP. support build libhdfs3 on windows - Key: HDFS-7188 URL: https://issues.apache.org/jira/browse/HDFS-7188 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs-client Environment: Windows System, Visual Studio 2010 Reporter: Zhanwei Wang Assignee: Thanh Do Attachments: HDFS-7188-branch-HDFS-6994-0.patch, HDFS-7188-branch-HDFS-6994-1.patch, HDFS-7188-branch-HDFS-6994-2.patch libhdfs3 should work on windows -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-6994) libhdfs3 - A native C/C++ HDFS client
[ https://issues.apache.org/jira/browse/HDFS-6994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14315481#comment-14315481 ] Thanh Do commented on HDFS-6994: Following up: VC 2010 (and later) does not support nested exceptions yet, so it won't be able to understand methods such as {{std::throw_with_nested}} or {{std::rethrow_if_nested}}. libhdfs3 - A native C/C++ HDFS client - Key: HDFS-6994 URL: https://issues.apache.org/jira/browse/HDFS-6994 Project: Hadoop HDFS Issue Type: New Feature Components: hdfs-client Reporter: Zhanwei Wang Assignee: Zhanwei Wang Attachments: HDFS-6994-rpc-8.patch, HDFS-6994.patch Hi All I just got the permission to open source libhdfs3, which is a native C/C++ HDFS client based on the Hadoop RPC protocol and HDFS Data Transfer Protocol. libhdfs3 provides the libhdfs-style C interface and a C++ interface, supports both Hadoop RPC versions 8 and 9, and supports Namenode HA and Kerberos authentication. libhdfs3 is currently used by HAWQ of Pivotal. I'd like to integrate libhdfs3 into the HDFS source code to benefit others. You can find the libhdfs3 code on github: https://github.com/PivotalRD/libhdfs3 http://pivotalrd.github.io/libhdfs3/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HDFS-7768) Separate platform-specific functions
Thanh Do created HDFS-7768: -- Summary: Separate platform-specific functions Key: HDFS-7768 URL: https://issues.apache.org/jira/browse/HDFS-7768 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Thanh Do Assignee: Thanh Do Current code has several platform-specific parts (e.g., getting environment variables, getting local addresses, printing the stack). We should separate these parts into platform folders. This issue will do just that. POSIX systems will be able to compile successfully; Windows will fail to compile due to unimplemented parts. The implementation of the Windows parts will be handled in HDFS-7188 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Work started] (HDFS-7768) Separate platform-specific functions
[ https://issues.apache.org/jira/browse/HDFS-7768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on HDFS-7768 started by Thanh Do. -- Separate platform-specific functions --- Key: HDFS-7768 URL: https://issues.apache.org/jira/browse/HDFS-7768 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs-client Reporter: Thanh Do Assignee: Thanh Do Current code has several platform-specific parts (e.g., getting environment variables, getting local addresses, printing the stack). We should separate these parts into platform folders. This issue will do just that. POSIX systems will be able to compile successfully; Windows will fail to compile due to unimplemented parts. The implementation of the Windows parts will be handled in HDFS-7188 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-6994) libhdfs3 - A native C/C++ HDFS client
[ https://issues.apache.org/jira/browse/HDFS-6994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14315238#comment-14315238 ] Thanh Do commented on HDFS-6994: Hi folks, As the existing native code in Hadoop targets Visual Studio 2010, I would like to know which C\+\+ standard we support: C\+\+11 or earlier? While porting this library to Windows (HDFS-7188), I found that the Windows build must use the Windows SDK or Visual Studio 2010 Professional (not VS 2012), and Visual Studio 2010 does not support C\+\+11. Any thoughts on this? It seems some of the code in the branch follows the C\+\+11 standard already. libhdfs3 - A native C/C++ HDFS client - Key: HDFS-6994 URL: https://issues.apache.org/jira/browse/HDFS-6994 Project: Hadoop HDFS Issue Type: New Feature Components: hdfs-client Reporter: Zhanwei Wang Assignee: Zhanwei Wang Attachments: HDFS-6994-rpc-8.patch, HDFS-6994.patch Hi All I just got the permission to open source libhdfs3, which is a native C/C++ HDFS client based on the Hadoop RPC protocol and HDFS Data Transfer Protocol. libhdfs3 provides the libhdfs-style C interface and a C++ interface, supports both Hadoop RPC versions 8 and 9, and supports Namenode HA and Kerberos authentication. libhdfs3 is currently used by HAWQ of Pivotal. I'd like to integrate libhdfs3 into the HDFS source code to benefit others. You can find the libhdfs3 code on github: https://github.com/PivotalRD/libhdfs3 http://pivotalrd.github.io/libhdfs3/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7577) Add additional headers needed by Windows
[ https://issues.apache.org/jira/browse/HDFS-7577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14297336#comment-14297336 ] Thanh Do commented on HDFS-7577: Thanks Colin. I'll work on the next patch soon. Best! Add additional headers needed by Windows Key: HDFS-7577 URL: https://issues.apache.org/jira/browse/HDFS-7577 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs-client Reporter: Thanh Do Assignee: Thanh Do Fix For: HDFS-6994 Attachments: HDFS-7577-branch-HDFS-6994-0.patch, HDFS-7577-branch-HDFS-6994-1.patch, HDFS-7577-branch-HDFS-6994-2.patch This JIRA involves adding a list of (mostly dummy) headers that are available in POSIX systems, but not in Windows. One step towards making libhdfs3 build on Windows. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7577) Add additional headers needed by Windows
[ https://issues.apache.org/jira/browse/HDFS-7577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14291908#comment-14291908 ] Thanh Do commented on HDFS-7577: Hi [~cmccabe]. Could you please take a look at the new patch? I'd really like to get this in so that I can start the next patch, which depends on this one. Thank you. Add additional headers needed by Windows Key: HDFS-7577 URL: https://issues.apache.org/jira/browse/HDFS-7577 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs-client Reporter: Thanh Do Assignee: Thanh Do Attachments: HDFS-7577-branch-HDFS-6994-0.patch, HDFS-7577-branch-HDFS-6994-1.patch This JIRA involves adding a list of (mostly dummy) headers that are available in POSIX systems, but not in Windows. One step towards making libhdfs3 build on Windows. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7577) Add additional headers needed by Windows
[ https://issues.apache.org/jira/browse/HDFS-7577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thanh Do updated HDFS-7577: --- Attachment: HDFS-7577-branch-HDFS-6994-2.patch Hi [~cmccabe]. Attached is another patch, which throws an error if libhdfs3 is compiled on Windows for a non-x86 processor. Add additional headers needed by Windows Key: HDFS-7577 URL: https://issues.apache.org/jira/browse/HDFS-7577 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs-client Reporter: Thanh Do Assignee: Thanh Do Attachments: HDFS-7577-branch-HDFS-6994-0.patch, HDFS-7577-branch-HDFS-6994-1.patch, HDFS-7577-branch-HDFS-6994-2.patch This JIRA involves adding a list of (mostly dummy) headers that are available in POSIX systems, but not in Windows. One step towards making libhdfs3 build on Windows. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7577) Add additional headers needed by Windows
[ https://issues.apache.org/jira/browse/HDFS-7577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thanh Do updated HDFS-7577: --- Attachment: HDFS-7577-branch-HDFS-6994-1.patch Good catch! {{os/windows/cpuid.h}} is x86-specific. Attached is another patch, which addresses this. Add additional headers needed by Windows Key: HDFS-7577 URL: https://issues.apache.org/jira/browse/HDFS-7577 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs-client Reporter: Thanh Do Assignee: Thanh Do Attachments: HDFS-7577-branch-HDFS-6994-0.patch, HDFS-7577-branch-HDFS-6994-1.patch This JIRA involves adding a list of (mostly dummy) headers that are available in POSIX systems, but not in Windows. One step towards making libhdfs3 build on Windows. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7188) support build libhdfs3 on windows
[ https://issues.apache.org/jira/browse/HDFS-7188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14281873#comment-14281873 ] Thanh Do commented on HDFS-7188: Hi folks. I've submitted a patch in HDFS-7577 that adds the headers needed by Windows. Can somebody take a look and comment? Thanks. support build libhdfs3 on windows - Key: HDFS-7188 URL: https://issues.apache.org/jira/browse/HDFS-7188 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs-client Environment: Windows System, Visual Studio 2010 Reporter: Zhanwei Wang Assignee: Thanh Do Attachments: HDFS-7188-branch-HDFS-6994-0.patch, HDFS-7188-branch-HDFS-6994-1.patch libhdfs3 should work on windows -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7577) Add additional headers needed by Windows
[ https://issues.apache.org/jira/browse/HDFS-7577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thanh Do updated HDFS-7577: --- Attachment: HDFS-7577-branch-HDFS-6994-0.patch Hi [~cmccabe]! Attached is the first patch for this issue. It contains the following changes: 1. puts all CMake settings in {{build.h.in}}; 2. separates platform-specific stuff into either {{posix/platform.h}} or {{windows/platform.h}}; 3. changes {{src/CMakeLists.txt}} to include the correct {{platform.h}}; 4. adds a bunch of mostly dummy headers that are not available in Windows, which avoids scattering {{#ifdef}}s across the source files. This patch should significantly reduce the number of compilation errors (e.g., missing headers, undefined functions) for the Windows build. Once this patch gets checked in, I will work on rewriting some POSIX-dependent parts that cannot be fixed by simply redefining function signatures. Add additional headers needed by Windows Key: HDFS-7577 URL: https://issues.apache.org/jira/browse/HDFS-7577 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs-client Reporter: Thanh Do Assignee: Thanh Do Attachments: HDFS-7577-branch-HDFS-6994-0.patch This JIRA involves adding a list of (mostly dummy) headers that are available in POSIX systems, but not in Windows. One step towards making libhdfs3 build on Windows. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
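The platform selection described in step 3 can be sketched in CMake like this (the `os/posix` and `os/windows` directory names are assumed from the patch description, not taken from the actual CMakeLists.txt):

```cmake
# Pick the platform-specific implementation directory so that the right
# platform.h is found first on the include path.
if(WIN32)
    set(PLATFORM_DIR ${CMAKE_CURRENT_SOURCE_DIR}/os/windows)
else()
    set(PLATFORM_DIR ${CMAKE_CURRENT_SOURCE_DIR}/os/posix)
endif()
include_directories(BEFORE ${PLATFORM_DIR})
```

Keeping the choice in one place means individual source files can simply `#include "platform.h"` without any `#ifdef`.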
[jira] [Work started] (HDFS-7577) Add additional headers that includes need by Windows
[ https://issues.apache.org/jira/browse/HDFS-7577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on HDFS-7577 started by Thanh Do. -- Add additional headers that includes need by Windows Key: HDFS-7577 URL: https://issues.apache.org/jira/browse/HDFS-7577 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs-client Reporter: Thanh Do Assignee: Thanh Do This jira involves adding a list of (mostly dummy) headers that available in POSIX systems, but not in Windows. One step towards making libhdfs3 built in Windows. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7574) Make cmake work in Windows Visual Studio 2010
[ https://issues.apache.org/jira/browse/HDFS-7574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thanh Do updated HDFS-7574: --- Attachment: HDFS-7574-branch-HDFS-6994-2.patch Attached is another patch that addresses the strerror test. I simply put the ifdef in the test file to make it actually compile on Windows. [~cmccabe], please give your comment. Make cmake work in Windows Visual Studio 2010 - Key: HDFS-7574 URL: https://issues.apache.org/jira/browse/HDFS-7574 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs-client Environment: Windows Visual Studio 2010 Reporter: Thanh Do Assignee: Thanh Do Attachments: HDFS-7574-branch-HDFS-6994-1.patch, HDFS-7574-branch-HDFS-6994-2.patch Cmake should be able to generate a solution file in Windows Visual Studio 2010. This is the first step in a series of steps making libhdfs3 built successfully in Windows. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7574) Make cmake work in Windows Visual Studio 2010
[ https://issues.apache.org/jira/browse/HDFS-7574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14266247#comment-14266247 ] Thanh Do commented on HDFS-7574: Hi [~cmccabe]. In Windows, the existing test (in {{CMakeTestCompileStrerror.cpp}}) won't work because {{strerror_r}} has a different signature. Specifically, Windows does not have {{strerror_r(errnum, buf, len)}}. The equivalent is {{strerror_s(buf, len, errnum)}}, with a different parameter order. This makes the test fail, so {{STRERROR_R_RETURN_INT}} always equals {{NO}}. A cleaner fix may be to put a few lines in {{CMakeTestCompileStrerror}}: {code} #ifdef _WIN32 #define strerror_r(errnum, buf, buflen) strerror_s((buf), (buflen), (errnum)) #endif {code} Thoughts? Make cmake work in Windows Visual Studio 2010 - Key: HDFS-7574 URL: https://issues.apache.org/jira/browse/HDFS-7574 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs-client Environment: Windows Visual Studio 2010 Reporter: Thanh Do Assignee: Thanh Do Attachments: HDFS-7574-branch-HDFS-6994-1.patch Cmake should be able to generate a solution file in Windows Visual Studio 2010. This is the first step in a series of steps making libhdfs3 build successfully in Windows. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7574) Make cmake work in Windows Visual Studio 2010
[ https://issues.apache.org/jira/browse/HDFS-7574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thanh Do updated HDFS-7574: --- Attachment: HDFS-7574-branch-HDFS-6994-0.patch Attached is a simple patch that allows cmake to generate a solution file in Visual Studio 2010. Make cmake work in Windows Visual Studio 2010 - Key: HDFS-7574 URL: https://issues.apache.org/jira/browse/HDFS-7574 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs-client Environment: Windows Visual Studio 2010 Reporter: Thanh Do Assignee: Thanh Do Attachments: HDFS-7574-branch-HDFS-6994-0.patch Cmake should be able to generate a solution file in Windows Visual Studio 2010. This is the first step in a series of steps making libhdfs3 built successfully in Windows. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HDFS-7577) Add additional headers that includes need by Windows
Thanh Do created HDFS-7577: -- Summary: Add additional headers that includes need by Windows Key: HDFS-7577 URL: https://issues.apache.org/jira/browse/HDFS-7577 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Thanh Do Assignee: Thanh Do This jira involves adding a list of (mostly dummy) headers that are available in POSIX systems, but not in Windows. One step towards making libhdfs3 build in Windows. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7574) Make cmake work in Windows Visual Studio 2010
[ https://issues.apache.org/jira/browse/HDFS-7574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14261233#comment-14261233 ] Thanh Do commented on HDFS-7574: Could somebody please review this patch? Once it gets in, I can start submitting subsequent patches for issues such as HDFS-7577. Thanks! Make cmake work in Windows Visual Studio 2010 - Key: HDFS-7574 URL: https://issues.apache.org/jira/browse/HDFS-7574 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs-client Environment: Windows Visual Studio 2010 Reporter: Thanh Do Assignee: Thanh Do Attachments: HDFS-7574-branch-HDFS-6994-0.patch Cmake should be able to generate a solution file in Windows Visual Studio 2010. This is the first step in a series of steps making libhdfs3 built successfully in Windows. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7574) Make cmake work in Windows Visual Studio 2010
[ https://issues.apache.org/jira/browse/HDFS-7574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thanh Do updated HDFS-7574: --- Attachment: HDFS-7574-branch-HDFS-6994-1.patch Make cmake work in Windows Visual Studio 2010 - Key: HDFS-7574 URL: https://issues.apache.org/jira/browse/HDFS-7574 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs-client Environment: Windows Visual Studio 2010 Reporter: Thanh Do Assignee: Thanh Do Attachments: HDFS-7574-branch-HDFS-6994-1.patch Cmake should be able to generate a solution file in Windows Visual Studio 2010. This is the first step in a series of steps making libhdfs3 built successfully in Windows. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7574) Make cmake work in Windows Visual Studio 2010
[ https://issues.apache.org/jira/browse/HDFS-7574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thanh Do updated HDFS-7574: --- Attachment: (was: HDFS-7574-branch-HDFS-6994-0.patch) Make cmake work in Windows Visual Studio 2010 - Key: HDFS-7574 URL: https://issues.apache.org/jira/browse/HDFS-7574 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs-client Environment: Windows Visual Studio 2010 Reporter: Thanh Do Assignee: Thanh Do Attachments: HDFS-7574-branch-HDFS-6994-1.patch Cmake should be able to generate a solution file in Windows Visual Studio 2010. This is the first step in a series of steps making libhdfs3 built successfully in Windows. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7188) support build libhdfs3 on windows
[ https://issues.apache.org/jira/browse/HDFS-7188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14261555#comment-14261555 ] Thanh Do commented on HDFS-7188: Hi [~cmccabe]. Let me clarify the small jiras that I've thought of (their order, what they do, and what the outcome is). The overall goal is that once all of these JIRAs get in, libhdfs3 can be built and run successfully in Windows Visual Studio. However, no individual JIRA will guarantee this big goal; rather, each serves as a step toward it. One requirement for each JIRA is that it should not break the build on Linux or Mac. With this overview, here is the list of the proposed JIRAs. # make cmake generate a solution file for VS 2010 (HDFS-7574). This JIRA only contains changes to CMakeLists files. _Outcome_: running cmake -G Visual Studio 10 2010 will generate a solution file loadable by VS 2010. Of course, the build in Windows will not yet succeed. # add additional headers needed by Windows (HDFS-7577). This JIRA contains two sets of changes: (a) dummy headers missing in Windows, and (b) cmake changes to add the header dirs. _Outcome_: the build in Windows will still fail (with a smaller number of errors though, because now the missing headers are there). # restructure the platform-specific functions. The goal here is to make POSIX-specific code (e.g., in logging, the stack printer, and getting the local network address) platform aware. Some examples would be {{platform_vsnprintf}} and {{GetAdaptersAddresses}}, as you mentioned above. _Outcome_: the build will succeed in Windows, but the library will not function correctly, because the Windows counterparts are only placeholders. # Implement the platform-specific functions in Windows. This JIRA simply fills in those placeholders from #3 with the large chunks of Windows-specific code I already have. _Outcome_: libhdfs3 can now be built successfully _and_ run correctly. Please let me know your thoughts. 
support build libhdfs3 on windows - Key: HDFS-7188 URL: https://issues.apache.org/jira/browse/HDFS-7188 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs-client Environment: Windows System, Visual Studio 2010 Reporter: Zhanwei Wang Assignee: Thanh Do Attachments: HDFS-7188-branch-HDFS-6994-0.patch, HDFS-7188-branch-HDFS-6994-1.patch libhdfs3 should work on windows -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7188) support build libhdfs3 on windows
[ https://issues.apache.org/jira/browse/HDFS-7188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14261569#comment-14261569 ] Thanh Do commented on HDFS-7188: Hi [~cmccabe]. Glad that we are on the same pages :). Could you review HDFS-7574? It is the JIRA #1. Once that gets in, I can submit patches for subsequent JIRAs. support build libhdfs3 on windows - Key: HDFS-7188 URL: https://issues.apache.org/jira/browse/HDFS-7188 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs-client Environment: Windows System, Visual Studio 2010 Reporter: Zhanwei Wang Assignee: Thanh Do Attachments: HDFS-7188-branch-HDFS-6994-0.patch, HDFS-7188-branch-HDFS-6994-1.patch libhdfs3 should work on windows -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7188) support build libhdfs3 on windows
[ https://issues.apache.org/jira/browse/HDFS-7188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14260493#comment-14260493 ] Thanh Do commented on HDFS-7188: Thanks for your comment, [~cmccabe]. README.txt is for my own notes; it was included by mistake. Same thing for krb5_32. These changes are not supposed to be in this patch. Sorry about that. Regarding the mman library, I think the code is MIT-licensed, but it doesn't hurt to rewrite it. Now, I am convinced that we should break this into small jiras. A few I could think of: 1. add additional header includes needed by Windows. 2. make cmake work on Windows Visual Studio 2010. 3. restructure platform-specific functions. 4. implement platform-specific functions for Windows. Thoughts? support build libhdfs3 on windows - Key: HDFS-7188 URL: https://issues.apache.org/jira/browse/HDFS-7188 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs-client Environment: Windows System, Visual Studio 2010 Reporter: Zhanwei Wang Assignee: Thanh Do Attachments: HDFS-7188-branch-HDFS-6994-0.patch, HDFS-7188-branch-HDFS-6994-1.patch libhdfs3 should work on windows -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HDFS-7574) Make cmake work in Windows Visual Studio 2010
Thanh Do created HDFS-7574: -- Summary: Make cmake work in Windows Visual Studio 2010 Key: HDFS-7574 URL: https://issues.apache.org/jira/browse/HDFS-7574 Project: Hadoop HDFS Issue Type: Sub-task Environment: Windows Visual Studio 2010 Reporter: Thanh Do Assignee: Thanh Do Cmake should be able to generate a solution file in Windows Visual Studio 2010. This is the first step in a series of steps making libhdfs3 built successfully in Windows. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Work started] (HDFS-7574) Make cmake work in Windows Visual Studio 2010
[ https://issues.apache.org/jira/browse/HDFS-7574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on HDFS-7574 started by Thanh Do. -- Make cmake work in Windows Visual Studio 2010 - Key: HDFS-7574 URL: https://issues.apache.org/jira/browse/HDFS-7574 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs-client Environment: Windows Visual Studio 2010 Reporter: Thanh Do Assignee: Thanh Do Cmake should be able to generate a solution file in Windows Visual Studio 2010. This is the first step in a series of steps making libhdfs3 built successfully in Windows. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7188) support build libhdfs3 on windows
[ https://issues.apache.org/jira/browse/HDFS-7188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thanh Do updated HDFS-7188: --- Attachment: HDFS-7188-branch-HDFS-6994-1.patch Hi folks. Attached is the next patch, with a significant code restructure compared with the previous one. In this iteration, I have removed as many ifdefs as possible. Most of the Windows changes are moved to a single folder (os/windows), which contains mostly dummy headers and some equivalent implementations for Windows environments (e.g., getcpuid, mman). Nevertheless, there are some parts where I still need to use ifdefs and rewrite a significant amount of code. 1. TCP part (e.g., TCP channel, InputStreamImpl, syscall). There are some hairy differences between the Windows socket API and its POSIX counterpart, such as different APIs and error codes. To make matters worse, it seems that Winsock does not propagate its error codes through the errno variable. 2. Parsing the Kerberos principal. It is hard to find a one-to-one mapping for regular expression processing in Windows; same thing for getting user info. Thus, I decided to rewrite the entire logic. 3. Logger code. Since there are no gettimeofday() and dprintf equivalents in Windows, I have to rewrite this part. 4. Stack printer code. 5. Thread and signal (Thread.cc). There are some differences in Windows signal handling. 6. NameNodeProxy. The current implementation requires an index file in /tmp, which is not available in Windows. Therefore, I rewrote that part entirely. Folks, please give some feedback so that I can work on the next iteration. 
support build libhdfs3 on windows - Key: HDFS-7188 URL: https://issues.apache.org/jira/browse/HDFS-7188 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs-client Environment: Windows System, Visual Studio 2010 Reporter: Zhanwei Wang Assignee: Thanh Do Attachments: HDFS-7188-branch-HDFS-6994-0.patch, HDFS-7188-branch-HDFS-6994-1.patch libhdfs3 should work on windows -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7188) support build libhdfs3 on windows
[ https://issues.apache.org/jira/browse/HDFS-7188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14235607#comment-14235607 ] Thanh Do commented on HDFS-7188: Hi Chris, I'll take a closer look in HDFS-573 and produce another (hopefully) cleaner patch for this. support build libhdfs3 on windows - Key: HDFS-7188 URL: https://issues.apache.org/jira/browse/HDFS-7188 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs-client Environment: Windows System, Visual Studio 2010 Reporter: Zhanwei Wang Assignee: Thanh Do Attachments: HDFS-7188-branch-HDFS-6994-0.patch libhdfs3 should work on windows -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7188) support build libhdfs3 on windows
[ https://issues.apache.org/jira/browse/HDFS-7188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thanh Do updated HDFS-7188: --- Attachment: HDFS-7188-branch-HDFS-6994-0.patch Attached is a preliminary patch that supports building libhdfs3 in Windows with Visual Studio 2010. At a high level, it contains: - changes to cmake files to generate a solution file for VS 2010 - a bunch of ifdefs to make VS compile the solution support build libhdfs3 on windows - Key: HDFS-7188 URL: https://issues.apache.org/jira/browse/HDFS-7188 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs-client Environment: Windows System, Visual Studio 2010 Reporter: Zhanwei Wang Assignee: Thanh Do Attachments: HDFS-7188-branch-HDFS-6994-0.patch libhdfs3 should work on windows -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7188) support build libhdfs3 on windows
[ https://issues.apache.org/jira/browse/HDFS-7188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14232090#comment-14232090 ] Thanh Do commented on HDFS-7188: Hi Colin, thanks for your feedback; I agree with all your points. I'll work on a next patch that removes as many #ifdefs as possible and moves them all to platform.h. However, not all of the ifdefs can be fixed by redefining the function name, for various reasons such as different parameter orders, different semantics (e.g., vsnprintf), and unavailability in Windows (e.g., fcntl, posix_madvise). We will see how far we can get away without using ifdefs. Regarding the real user ID, I agree that we should get rid of it for both Windows and Linux. support build libhdfs3 on windows - Key: HDFS-7188 URL: https://issues.apache.org/jira/browse/HDFS-7188 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs-client Environment: Windows System, Visual Studio 2010 Reporter: Zhanwei Wang Assignee: Thanh Do Attachments: HDFS-7188-branch-HDFS-6994-0.patch libhdfs3 should work on windows -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7188) support build libhdfs3 on windows
[ https://issues.apache.org/jira/browse/HDFS-7188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thanh Do updated HDFS-7188: --- Environment: Windows System, Visual Studio 2010 support build libhdfs3 on windows - Key: HDFS-7188 URL: https://issues.apache.org/jira/browse/HDFS-7188 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs-client Environment: Windows System, Visual Studio 2010 Reporter: Zhanwei Wang Assignee: Thanh Do libhdfs3 should work on windows -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Work started] (HDFS-7188) support build libhdfs3 on windows
[ https://issues.apache.org/jira/browse/HDFS-7188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on HDFS-7188 started by Thanh Do. -- support build libhdfs3 on windows - Key: HDFS-7188 URL: https://issues.apache.org/jira/browse/HDFS-7188 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs-client Environment: Windows System, Visual Studio 2010 Reporter: Zhanwei Wang Assignee: Thanh Do libhdfs3 should work on windows -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7188) support build libhdfs3 on windows
[ https://issues.apache.org/jira/browse/HDFS-7188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14223164#comment-14223164 ] Thanh Do commented on HDFS-7188: Hi guys. I have a question about resolving/getting dependencies. As you know, libhdfs3 depends on several external components (e.g., libxml2, kerberos, and boost). My current assumption when porting this is that the Windows machine where the compilation runs already has all of these dependencies installed. Does this sound right? Or should we create a script to download and install all the dependencies if need be? Personally, I would go with my current assumption first, because it will allow me to proceed with the code changes. Thoughts? support build libhdfs3 on windows - Key: HDFS-7188 URL: https://issues.apache.org/jira/browse/HDFS-7188 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs-client Reporter: Zhanwei Wang Assignee: Thanh Do libhdfs3 should work on windows -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-6994) libhdfs3 - A native C/C++ HDFS client
[ https://issues.apache.org/jira/browse/HDFS-6994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14190059#comment-14190059 ] Thanh Do commented on HDFS-6994: Hey guys, thanks for all the helpful pointers. My work is definitely similar to HDFS-573. The changes so far include: 1) add a bunch of #ifdefs across many files; 2) rewrite several functions that use POSIX APIs not available in Windows, making them use equivalent Windows system calls; 3) hack CMake to create a Visual Studio solution in Windows; 4) add small changes to unit tests (as I've replaced some POSIX APIs with their Windows counterparts). As in HDFS-573, the changes are scattered, but I think it would be reasonable to bundle them all in a single patch. Thoughts? And one more thing: since I started my changes back when there were no subtasks, only comments on this JIRA, I've been modifying the github version at http://pivotalrd.github.io/libhdfs3/, and I guess I will have to propagate my changes to the code in trunk now; is that correct? libhdfs3 - A native C/C++ HDFS client - Key: HDFS-6994 URL: https://issues.apache.org/jira/browse/HDFS-6994 Project: Hadoop HDFS Issue Type: New Feature Components: hdfs-client Reporter: Zhanwei Wang Assignee: Zhanwei Wang Attachments: HDFS-6994-rpc-8.patch, HDFS-6994.patch Hi All I just got the permission to open source libhdfs3, which is a native C/C++ HDFS client based on Hadoop RPC protocol and HDFS Data Transfer Protocol. libhdfs3 provide the libhdfs style C interface and a C++ interface. Support both HADOOP RPC version 8 and 9. Support Namenode HA and Kerberos authentication. libhdfs3 is currently used by HAWQ of Pivotal I'd like to integrate libhdfs3 into HDFS source code to benefit others. You can find libhdfs3 code from github https://github.com/PivotalRD/libhdfs3 http://pivotalrd.github.io/libhdfs3/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-6994) libhdfs3 - A native C/C++ HDFS client
[ https://issues.apache.org/jira/browse/HDFS-6994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14190621#comment-14190621 ] Thanh Do commented on HDFS-6994: [~cnauroth]. Thanks for your quick reply. Some additional newbie questions before I start working :). I see that currently there are many concurrent subtasks. Is there any dependency among them? In other words, should I wait until certain subtasks get resolved, or can I start checking out the code, making modifications, and resolving any conflicts later? Finally, should I submit my patch under HDFS-7188 once I am done? libhdfs3 - A native C/C++ HDFS client - Key: HDFS-6994 URL: https://issues.apache.org/jira/browse/HDFS-6994 Project: Hadoop HDFS Issue Type: New Feature Components: hdfs-client Reporter: Zhanwei Wang Assignee: Zhanwei Wang Attachments: HDFS-6994-rpc-8.patch, HDFS-6994.patch Hi All I just got the permission to open source libhdfs3, which is a native C/C++ HDFS client based on Hadoop RPC protocol and HDFS Data Transfer Protocol. libhdfs3 provide the libhdfs style C interface and a C++ interface. Support both HADOOP RPC version 8 and 9. Support Namenode HA and Kerberos authentication. libhdfs3 is currently used by HAWQ of Pivotal I'd like to integrate libhdfs3 into HDFS source code to benefit others. You can find libhdfs3 code from github https://github.com/PivotalRD/libhdfs3 http://pivotalrd.github.io/libhdfs3/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (HDFS-7188) support build libhdfs3 on windows
[ https://issues.apache.org/jira/browse/HDFS-7188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thanh Do reassigned HDFS-7188: -- Assignee: Thanh Do support build libhdfs3 on windows - Key: HDFS-7188 URL: https://issues.apache.org/jira/browse/HDFS-7188 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs-client Reporter: Zhanwei Wang Assignee: Thanh Do libhdfs3 should work on windows -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7188) support build libhdfs3 on windows
[ https://issues.apache.org/jira/browse/HDFS-7188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14190874#comment-14190874 ] Thanh Do commented on HDFS-7188: [~cnauroth], cool! I'll keep this in mind and start rolling support build libhdfs3 on windows - Key: HDFS-7188 URL: https://issues.apache.org/jira/browse/HDFS-7188 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs-client Reporter: Zhanwei Wang Assignee: Thanh Do libhdfs3 should work on windows -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-6994) libhdfs3 - A native C/C++ HDFS client
[ https://issues.apache.org/jira/browse/HDFS-6994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14188513#comment-14188513 ] Thanh Do commented on HDFS-6994: Guys! I have been following JIRA for several years but have never contributed a single line of code. This is my first time so there will be lots of newbie questions, sorry for that! Is there an instruction somewhere I could follow to get started? Thanks! libhdfs3 - A native C/C++ HDFS client - Key: HDFS-6994 URL: https://issues.apache.org/jira/browse/HDFS-6994 Project: Hadoop HDFS Issue Type: New Feature Components: hdfs-client Reporter: Zhanwei Wang Assignee: Zhanwei Wang Attachments: HDFS-6994-rpc-8.patch, HDFS-6994.patch Hi All I just got the permission to open source libhdfs3, which is a native C/C++ HDFS client based on Hadoop RPC protocol and HDFS Data Transfer Protocol. libhdfs3 provide the libhdfs style C interface and a C++ interface. Support both HADOOP RPC version 8 and 9. Support Namenode HA and Kerberos authentication. libhdfs3 is currently used by HAWQ of Pivotal I'd like to integrate libhdfs3 into HDFS source code to benefit others. You can find libhdfs3 code from github https://github.com/PivotalRD/libhdfs3 http://pivotalrd.github.io/libhdfs3/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7188) support build libhdfs3 on windows
[ https://issues.apache.org/jira/browse/HDFS-7188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14187407#comment-14187407 ] Thanh Do commented on HDFS-7188: I've been working on porting libhdfs3 to Windows (Visual Studio 2013) and am close to finished. I am happy to share this version with the community if somebody is interested. support build libhdfs3 on windows - Key: HDFS-7188 URL: https://issues.apache.org/jira/browse/HDFS-7188 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs-client Reporter: Zhanwei Wang libhdfs3 should work on windows -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (HDFS-7188) support build libhdfs3 on windows
[ https://issues.apache.org/jira/browse/HDFS-7188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thanh Do reassigned HDFS-7188: -- Assignee: Thanh Do support build libhdfs3 on windows - Key: HDFS-7188 URL: https://issues.apache.org/jira/browse/HDFS-7188 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs-client Reporter: Zhanwei Wang Assignee: Thanh Do libhdfs3 should work on windows -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7188) support build libhdfs3 on windows
[ https://issues.apache.org/jira/browse/HDFS-7188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thanh Do updated HDFS-7188: --- Assignee: (was: Thanh Do) support build libhdfs3 on windows - Key: HDFS-7188 URL: https://issues.apache.org/jira/browse/HDFS-7188 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs-client Reporter: Zhanwei Wang libhdfs3 should work on windows -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-6994) libhdfs3 - A native C/C++ HDFS client
[ https://issues.apache.org/jira/browse/HDFS-6994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14187473#comment-14187473 ] Thanh Do commented on HDFS-6994: Hi, I've been porting libhdfs3 to Windows Visual Studio 2013 and would like to contribute my effort back to the community. Should this be under HDFS-7188? libhdfs3 - A native C/C++ HDFS client - Key: HDFS-6994 URL: https://issues.apache.org/jira/browse/HDFS-6994 Project: Hadoop HDFS Issue Type: New Feature Components: hdfs-client Reporter: Zhanwei Wang Assignee: Zhanwei Wang Attachments: HDFS-6994-rpc-8.patch, HDFS-6994.patch Hi All I just got the permission to open source libhdfs3, which is a native C/C++ HDFS client based on Hadoop RPC protocol and HDFS Data Transfer Protocol. libhdfs3 provide the libhdfs style C interface and a C++ interface. Support both HADOOP RPC version 8 and 9. Support Namenode HA and Kerberos authentication. libhdfs3 is currently used by HAWQ of Pivotal I'd like to integrate libhdfs3 into HDFS source code to benefit others. You can find libhdfs3 code from github https://github.com/PivotalRD/libhdfs3 http://pivotalrd.github.io/libhdfs3/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-6009) Tools based on favored node feature for isolation
[ https://issues.apache.org/jira/browse/HDFS-6009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13935194#comment-13935194 ] Thanh Do commented on HDFS-6009: Yu Li, thanks for your detailed comment! Your use case is a great example of isolation. We are currently working on some similar problems but at a lower level on the software stack, thus your use case is a great motivation. Tools based on favored node feature for isolation - Key: HDFS-6009 URL: https://issues.apache.org/jira/browse/HDFS-6009 Project: Hadoop HDFS Issue Type: Task Affects Versions: 2.3.0 Reporter: Yu Li Assignee: Yu Li Priority: Minor There're scenarios like mentioned in HBASE-6721 and HBASE-4210 that in multi-tenant deployments of HBase we prefer to specify several groups of regionservers to serve different applications, to achieve some kind of isolation or resource allocation. However, although the regionservers are grouped, the datanodes which store the data are not, which leads to the case that one datanode failure affects multiple applications, as we already observed in our product environment. To relieve the above issue, we could take usage of the favored node feature (HDFS-2576) to make regionserver able to locate data within its group, or say make datanodes also grouped (passively), to form some level of isolation. In this case, or any other case that needs datanodes to group, we would need a bunch of tools to maintain the group, including: 1. Making balancer able to balance data among specified servers, rather than the whole set 2. Set balance bandwidth for specified servers, rather than the whole set 3. Some tool to check whether the block is cross-group placed, and move it back if so This JIRA is an umbrella for the above tools. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6009) Tools based on favored node feature for isolation
[ https://issues.apache.org/jira/browse/HDFS-6009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13933475#comment-13933475 ] Thanh Do commented on HDFS-6009: Hi Yu Li, I want to follow up on this issue. Could you please elaborate on the datanode failure? In particular, what caused the failure in your case: a disk error, a network failure, or a buggy application? If it is a disk error or network failure, I think isolation using datanode groups is reasonable. Tools based on favored node feature for isolation - Key: HDFS-6009 URL: https://issues.apache.org/jira/browse/HDFS-6009 Project: Hadoop HDFS Issue Type: Task Affects Versions: 2.3.0 Reporter: Yu Li Assignee: Yu Li Priority: Minor There're scenarios like mentioned in HBASE-6721 and HBASE-4210 that in multi-tenant deployments of HBase we prefer to specify several groups of regionservers to serve different applications, to achieve some kind of isolation or resource allocation. However, although the regionservers are grouped, the datanodes which store the data are not, which leads to the case that one datanode failure affects multiple applications, as we already observed in our product environment. To relieve the above issue, we could take usage of the favored node feature (HDFS-2576) to make regionserver able to locate data within its group, or say make datanodes also grouped (passively), to form some level of isolation. In this case, or any other case that needs datanodes to group, we would need a bunch of tools to maintain the group, including: 1. Making balancer able to balance data among specified servers, rather than the whole set 2. Set balance bandwidth for specified servers, rather than the whole set 3. Some tool to check whether the block is cross-group placed, and move it back if so This JIRA is an umbrella for the above tools. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6009) Tools based on favored node feature for isolation
[ https://issues.apache.org/jira/browse/HDFS-6009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13932037#comment-13932037 ] Thanh Do commented on HDFS-6009: Thank you! Tools based on favored node feature for isolation - Key: HDFS-6009 URL: https://issues.apache.org/jira/browse/HDFS-6009 Project: Hadoop HDFS Issue Type: Task Affects Versions: 2.3.0 Reporter: Yu Li Assignee: Yu Li Priority: Minor -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6009) Tools based on favored node feature for isolation
[ https://issues.apache.org/jira/browse/HDFS-6009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13931338#comment-13931338 ] Thanh Do commented on HDFS-6009: Hi Yu, You mentioned that although the regionservers are grouped, the datanodes which store the data are not, which leads to the case that one datanode failure affects multiple applications, as already observed in your production environment. Can you elaborate on that scenario? I thought a datanode failure would be OK, since the data are replicated. Best, Tools based on favored node feature for isolation - Key: HDFS-6009 URL: https://issues.apache.org/jira/browse/HDFS-6009 Project: Hadoop HDFS Issue Type: Task Affects Versions: 2.3.0 Reporter: Yu Li Assignee: Yu Li Priority: Minor -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] Commented: (HDFS-1058) reading from file under construction fails if the reader beats the writer to the DN for a new block
[ https://issues.apache.org/jira/browse/HDFS-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12965545#action_12965545 ] Thanh Do commented on HDFS-1058: I think this is a design decision (stated in the append design document at HDFS-265); here we trade performance for consistency. reading from file under construction fails if the reader beats the writer to the DN for a new block Key: HDFS-1058 URL: https://issues.apache.org/jira/browse/HDFS-1058 Project: Hadoop HDFS Issue Type: Sub-task Components: data-node, hdfs client Affects Versions: 0.21.0, 0.22.0 Reporter: Todd Lipcon If there is a writer and a concurrent reader, the following can occur: - The writer allocates a new block from the NN - The reader calls getBlockLocations - The reader connects to the DN and calls getReplicaVisibleLength - The writer still has not talked to the DN, so the DN doesn't know about the block and throws an error -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
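One way a reader can tolerate this race is to retry until the writer's block creation reaches the DN. This is a minimal sketch with invented names (the real DFSClient retry logic differs); fetchVisibleLength stands in for the DN call that throws while the DN does not yet know the block:

```java
/**
 * Sketch (not HDFS code) of a reader-side retry for the race in HDFS-1058:
 * the reader reaches the DN before the writer has registered the new block.
 * LengthFetcher stands in for the DN's getReplicaVisibleLength call.
 */
public class ReaderRetry {
    interface LengthFetcher { long fetch() throws Exception; }

    static long fetchWithRetry(LengthFetcher dn, int attempts, long sleepMs)
            throws Exception {
        Exception last = null;
        for (int i = 0; i < attempts; i++) {
            try {
                return dn.fetch();      // succeeds once the DN knows the block
            } catch (Exception e) {
                last = e;               // DN hasn't heard from the writer yet
                Thread.sleep(sleepMs);  // back off and retry
            }
        }
        throw last;                     // give up after the last attempt
    }

    public static void main(String[] args) throws Exception {
        // Simulate a DN that learns about the block on the third call.
        int[] calls = {0};
        long len = fetchWithRetry(() -> {
            if (++calls[0] < 3) throw new Exception("replica not found");
            return 1024L;
        }, 5, 1);
        System.out.println(len); // 1024
    }
}
```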
[jira] Commented: (HDFS-1350) make datanodes do graceful shutdown
[ https://issues.apache.org/jira/browse/HDFS-1350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12936133#action_12936133 ] Thanh Do commented on HDFS-1350: Sam, can you explain what you mean by a graceful shutdown, and by clean shutdowns of DataXceiver threads? If we don't do a clean shutdown, the client will trigger a pipeline recovery and exclude that datanode; isn't this still fine? Thanks make datanodes do graceful shutdown --- Key: HDFS-1350 URL: https://issues.apache.org/jira/browse/HDFS-1350 Project: Hadoop HDFS Issue Type: Improvement Components: data-node Reporter: sam rash Assignee: sam rash We found that the Datanode doesn't do a graceful shutdown, and a block can be corrupted (data and checksum amounts off). We can make the DN do a graceful shutdown in case there are open files. If this presents a problem for a timely shutdown, we can make it a parameter for how long to wait for the full graceful shutdown before just exiting. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
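One reading of "graceful shutdown with a time bound" is the standard drain-then-force pattern. Below is a sketch using an ExecutorService as a stand-in for the DN's DataXceiver threads; the grace period plays the role of the proposed wait parameter (none of this is actual Datanode code):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/**
 * Sketch of the bounded graceful shutdown proposed in HDFS-1350: give
 * in-flight writer tasks a grace period, then force exit. The executor is
 * a stand-in for the DN's DataXceiver threads.
 */
public class GracefulShutdown {
    static boolean shutdownGracefully(ExecutorService xceivers, long graceMs)
            throws InterruptedException {
        xceivers.shutdown();                    // stop accepting new work
        if (xceivers.awaitTermination(graceMs, TimeUnit.MILLISECONDS)) {
            return true;                        // all writers finished cleanly
        }
        xceivers.shutdownNow();                 // grace period expired: force
        return false;
    }

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        pool.submit(() -> { /* short in-flight write */ });
        System.out.println(shutdownGracefully(pool, 1000)); // true
    }
}
```

A clean drain lets the last packet's data and checksum both reach disk, avoiding the "data + checksum amounts off" corruption in the report.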
[jira] Commented: (HDFS-900) Corrupt replicas are not tracked correctly through block report from DN
[ https://issues.apache.org/jira/browse/HDFS-900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12935390#action_12935390 ] Thanh Do commented on HDFS-900: --- Todd, do you see this on 0.21.0? Is this a bug in the handling of corrupt replicas at the NN? Corrupt replicas are not tracked correctly through block report from DN --- Key: HDFS-900 URL: https://issues.apache.org/jira/browse/HDFS-900 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 0.22.0 Reporter: Todd Lipcon Priority: Critical Attachments: log-commented, to-reproduce.patch This one is tough to describe, but essentially the following order of events is seen to occur: # A client marks one replica of a block as corrupt by telling the NN about it # Replication is then scheduled to make a new replica of this block # The replication completes, such that there are now 3 good replicas and 1 corrupt replica # The DN holding the corrupt replica sends a block report. Rather than telling this DN to delete the replica, the NN instead marks this as a new *good* replica of the block, and schedules deletion on one of the good replicas. I don't know if this is a dataloss bug in the case of 1 corrupt replica with dfs.replication=2, but it seems feasible. I will attach a debug log with some commentary marked by '', plus a unit test patch with which I can reproduce this behavior reliably. (It's not a proper unit test, just some edits to an existing one to show it.) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-1103) Replica recovery doesn't distinguish between flushed-but-corrupted last chunk and unflushed last chunk
[ https://issues.apache.org/jira/browse/HDFS-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12935560#action_12935560 ] Thanh Do commented on HDFS-1103: I do not think that this is the case in 0.21 or the trunk. In the lease recovery algorithm in 0.21, if there are 2 RBWs and 1 RWR, the RWR is excluded from the lease recovery. In the scenario that you described, RBW B's and RBW C's GS are bumped and the length of the two recovered replicas is truncated to MIN(len(B), len(C)). Hairong, can you explain to me why RBW B's and RBW C's GS are bumped up? Is that because of the lease recovery protocol? But from my understanding of Todd's description, NN lease recovery is triggered after Machine A reports... Replica recovery doesn't distinguish between flushed-but-corrupted last chunk and unflushed last chunk -- Key: HDFS-1103 URL: https://issues.apache.org/jira/browse/HDFS-1103 Project: Hadoop HDFS Issue Type: Bug Components: data-node Affects Versions: 0.21.0, 0.22.0 Reporter: Todd Lipcon Priority: Blocker Attachments: hdfs-1103-test.txt When the DN creates a replica under recovery, it calls validateIntegrity, which truncates the last checksum chunk off of a replica if it is found to be invalid. Then when the block recovery process happens, this shortened block wins over a longer replica from another node where there was no corruption. Thus, if just one of the DNs has an invalid last checksum chunk, data that has been sync()ed to other datanodes can be lost. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
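The truncate-to-MIN rule referenced in the comment can be stated in a few lines. The sketch below is illustrative only, and it also shows the hazard in this issue: a replica shortened by dropping a corrupt last chunk contributes the smallest length, so it "wins" the recovery:

```java
import java.util.List;

/**
 * Sketch of the recovery-length rule discussed above: among the replicas
 * participating in lease recovery (the RBWs; the lone RWR is excluded),
 * the recovered length is the minimum replica length. Illustrative names.
 */
public class RecoveryLength {
    static long recoveredLength(List<Long> rbwLengths) {
        // MIN(len(B), len(C), ...): every participant truncates to this.
        return rbwLengths.stream().mapToLong(Long::longValue).min()
                .orElseThrow(() -> new IllegalArgumentException("no replicas"));
    }

    public static void main(String[] args) {
        // If one replica dropped a corrupt last chunk (1024 -> 512), its
        // smaller length wins -- sync()ed data on the other node is lost.
        System.out.println(recoveredLength(List.of(1024L, 512L))); // 512
    }
}
```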
[jira] Commented: (HDFS-89) Datanode should verify block sizes vs metadata on startup
[ https://issues.apache.org/jira/browse/HDFS-89?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12934648#action_12934648 ] Thanh Do commented on HDFS-89: -- This is true for the data file: if the length of a data block gets truncated by ext3 fsck, the NN will detect this because the NN knows the length of the block. But we saw a case where a meta file got truncated, and when the DN booted up and sent a block report to the NN, the NN didn't detect the problem. This kind of corruption can only be detected by a reader or by the data block scanner. Datanode should verify block sizes vs metadata on startup - Key: HDFS-89 URL: https://issues.apache.org/jira/browse/HDFS-89 Project: Hadoop HDFS Issue Type: Bug Reporter: Brian Bockelman I could have sworn this bug had been reported by someone else already, but I can't find it on JIRA after searching; apologies if this is a duplicate. The datanode, upon starting up, should check and make sure that all block sizes as reported via `stat` are the same as the block sizes as reported via the block's metadata. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
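The proposed startup check, cross-validating a block file against its meta file, reduces to an arithmetic consistency test. The sketch below assumes the common layout of a 7-byte meta header plus a 4-byte CRC32 per 512-byte data chunk; real code should read the checksum type and chunk size from the meta header rather than hard-coding them:

```java
/**
 * Sketch of a startup cross-check between a block file and its meta file
 * (HDFS-89). Assumes a 7-byte meta header and one 4-byte CRC32 per
 * 512-byte data chunk -- assumptions for illustration, not read from disk.
 */
public class BlockMetaCheck {
    static final int HEADER = 7, CRC = 4, CHUNK = 512;

    static long expectedMetaLen(long dataLen) {
        long chunks = (dataLen + CHUNK - 1) / CHUNK;  // ceil(dataLen / 512)
        return HEADER + chunks * CRC;
    }

    /** True when the meta file length is consistent with the data length. */
    static boolean consistent(long dataLen, long metaLen) {
        return metaLen == expectedMetaLen(dataLen);
    }

    public static void main(String[] args) {
        // 1024 bytes = 2 chunks: meta should be 7 + 2*4 = 15 bytes.
        System.out.println(consistent(1024, 15)); // true
        // A truncated meta file (one CRC missing) no longer matches:
        System.out.println(consistent(1024, 11)); // false
    }
}
```

Running such a check at DN startup would catch exactly the truncated-meta-file case described in the comment, without waiting for a reader or the block scanner.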
[jira] Commented: (HDFS-1479) Massive file deletion causes some timeouts in writers
[ https://issues.apache.org/jira/browse/HDFS-1479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12927633#action_12927633 ] Thanh Do commented on HDFS-1479: Thanks, Zheng, for the explanation. The reason I couldn't find AsyncDiskService is that I was looking at 0.20.2, where deletion at the datanode is done synchronously. Now I find it in 0.21.0. In general, how do you plan to fix this? Massive file deletion causes some timeouts in writers - Key: HDFS-1479 URL: https://issues.apache.org/jira/browse/HDFS-1479 Project: Hadoop HDFS Issue Type: Improvement Affects Versions: 0.20.2 Reporter: Zheng Shao Assignee: Zheng Shao Priority: Minor When we do a massive deletion of files, we saw some timeouts in writers who are writing to HDFS. This does not happen on all DataNodes, but it's happening regularly enough that we would like to fix it. {code} yyy.xxx.com: 10/10/25 00:55:32 WARN hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_-5459995953259765112_37619608java.net.SocketTimeoutException: 69000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.10.10.10:56319 remote=/10.10.10.10:50010] {code} This is caused by the default setting of AsyncDiskService, which starts 4 threads per volume to delete files. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
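The AsyncDiskService mechanism under discussion amounts to per-volume background deletion threads; the 4-threads-per-volume default is what saturates the disk and starves the writers. A hypothetical sketch of the pattern (class and method names invented, not the real AsyncDiskService API), where shrinking the pool size throttles deletions:

```java
import java.io.File;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/**
 * Sketch of the AsyncDiskService idea from HDFS-1479: delete block files on
 * a bounded per-volume thread pool so service threads are not blocked.
 * Too many threads saturate the disk; the pool size is the throttle.
 */
public class AsyncDeleter {
    private final ExecutorService pool;

    AsyncDeleter(int threadsPerVolume) {
        this.pool = Executors.newFixedThreadPool(threadsPerVolume);
    }

    void deleteAsync(File blockFile) {
        // Deletion happens on a background thread, so heartbeats and
        // writers are not blocked waiting on disk I/O.
        pool.execute(blockFile::delete);
    }

    void shutdown() { pool.shutdown(); }

    boolean awaitDone(long ms) throws InterruptedException {
        return pool.awaitTermination(ms, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws Exception {
        File f = File.createTempFile("blk_", ".tmp");
        AsyncDeleter d = new AsyncDeleter(1);  // throttled: a single thread
        d.deleteAsync(f);
        d.shutdown();
        d.awaitDone(5000);
        System.out.println(f.exists()); // false
    }
}
```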
[jira] Commented: (HDFS-1479) Massive file deletion causes some timeouts in writers
[ https://issues.apache.org/jira/browse/HDFS-1479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12926159#action_12926159 ] Thanh Do commented on HDFS-1479: Thanks, Dhruba. I tried to grep for AsyncDiskService but cannot find it anywhere in the source tree. Did I misspell it? Massive file deletion causes some timeouts in writers - Key: HDFS-1479 URL: https://issues.apache.org/jira/browse/HDFS-1479 Project: Hadoop HDFS Issue Type: Improvement Affects Versions: 0.20.2 Reporter: Zheng Shao Assignee: Zheng Shao Priority: Minor -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-1479) Massive file deletion causes some timeouts in writers
[ https://issues.apache.org/jira/browse/HDFS-1479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12925530#action_12925530 ] Thanh Do commented on HDFS-1479: Can you give the detailed scenario? Massive file deletion causes some timeouts in writers - Key: HDFS-1479 URL: https://issues.apache.org/jira/browse/HDFS-1479 Project: Hadoop HDFS Issue Type: Improvement Affects Versions: 0.20.2 Reporter: Zheng Shao Assignee: Zheng Shao Priority: Minor -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (HDFS-1380) The append pipeline does not follow the TSP principle
The append pipeline does not follow the TSP principle --- Key: HDFS-1380 URL: https://issues.apache.org/jira/browse/HDFS-1380 Project: Hadoop HDFS Issue Type: Bug Components: hdfs client Affects Versions: 0.20-append Reporter: Thanh Do 1. Say we have 2 racks: rack-0 and rack-1. Rack-0 has dn1, dn2, dn3. Rack-1 has dn4, dn5, dn6. 2. Suppose the client is in rack-0, and the write pipeline is: client -- local node -- other rack -- other rack In this example we have the pipeline client-dn1-dn4-dn6, that is, rack0-rack0-rack1-rack1. So far so good. 3. Now another client comes and appends to the file. This client is also in rack-0. Interestingly, the append pipeline is client-dn6-dn4-dn1. That is, the new client (from rack0) sends packets to the first node in the pipeline (dn6), which belongs to rack1. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
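The expected behavior is that the client orders the existing replica locations so nodes in its own rack come first, which step 3 violates. A sketch of such an ordering (illustrative only, not the actual DFSClient code):

```java
import java.util.Comparator;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

/**
 * Sketch of rack-aware pipeline ordering for append (HDFS-1380): sort the
 * replica locations so that nodes in the client's rack come first. The
 * class and method names are invented for illustration.
 */
public class PipelineOrder {
    static List<String> orderForClient(List<String> nodes, String clientRack,
                                       Function<String, String> rackOf) {
        return nodes.stream()
                // local-rack nodes sort before remote-rack nodes (stable sort)
                .sorted(Comparator.comparing((String dn) ->
                        rackOf.apply(dn).equals(clientRack) ? 0 : 1))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // dn1 is in rack0 (the client's rack); dn4 and dn6 are in rack1.
        List<String> ordered = orderForClient(
                List.of("dn6", "dn4", "dn1"), "rack0",
                dn -> dn.equals("dn1") ? "rack0" : "rack1");
        System.out.println(ordered); // [dn1, dn6, dn4]
    }
}
```

With this ordering the second appender in the report would send packets to dn1 (its own rack) first instead of crossing to dn6 in rack1.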
[jira] Updated: (HDFS-1380) The append pipeline does not follow the TSP principle
[ https://issues.apache.org/jira/browse/HDFS-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thanh Do updated HDFS-1380: --- Description: 1. Say we have 2 racks: rack-0 and rack-1. Rack-0 has dn1, dn2, dn3. Rack-1 has dn4, dn5, dn6. 2. Suppose the client is in rack-0, and the write pipeline is: client -- local node -- other rack -- other rack In this example we have the pipeline client-dn1-dn4-dn6, that is, rack0-rack0-rack1-rack1. So far so good. 3. Now another client comes and appends to the file. This client is also in rack-0. Interestingly, the append pipeline is client-dn6-dn4-dn1. That is, the new client (from rack0) sends packets to the first node in the pipeline (dn6), which belongs to rack1. This bug was found by our Failure Testing Service framework: http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html For questions, please email us: Thanh Do (than...@cs.wisc.edu) and Haryadi Gunawi (hary...@eecs.berkeley.edu) The append pipeline does not follow the TSP principle --- Key: HDFS-1380 URL: https://issues.apache.org/jira/browse/HDFS-1380 Project: Hadoop HDFS Issue Type: Bug Components: hdfs client Affects Versions: 0.20-append Reporter: Thanh Do -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (HDFS-1382) A transient failure with an edits log and a corrupted fstime together could lead to data loss
A transient failure with an edits log and a corrupted fstime together could lead to data loss Key: HDFS-1382 URL: https://issues.apache.org/jira/browse/HDFS-1382 Project: Hadoop HDFS Issue Type: Bug Components: name-node Reporter: Thanh Do We experienced a data loss situation due to double failures. One is a transient disk failure with the edits log and the other is a corrupted fstime. Here is the detail: 1. The NameNode has 2 edits directories (say edit0 and edit1) 2. During an update to edit0, there is a transient disk failure, making the NameNode bump the fstime, mark edit0 as stale, and continue working with edit1. 3. The NameNode is shut down. Now, unluckily, the fstime in edit0 is corrupted. Hence during NameNode startup, the stale log in edit0 is replayed, hence data loss. This bug was found by our Failure Testing Service framework: http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html For questions, please email us: Thanh Do (than...@cs.wisc.edu) and Haryadi Gunawi (hary...@eecs.berkeley.edu) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
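The loss path hinges on the NameNode trusting fstime to pick the freshest edits directory. The sketch below (invented names, not NameNode code) shows the decision and how a corrupted, too-large fstime lets the stale directory win:

```java
import java.util.Map;

/**
 * Sketch of the recovery decision in the HDFS-1382 report: pick the
 * storage directory with the newest fstime and replay its edits. A
 * corrupted fstime in the stale directory can win this comparison --
 * exactly the data-loss path described above.
 */
public class FstimePick {
    static String newestDir(Map<String, Long> fstimeByDir) {
        return fstimeByDir.entrySet().stream()
                .max(Map.Entry.comparingByValue())  // largest fstime wins
                .orElseThrow()
                .getKey();
    }

    public static void main(String[] args) {
        // Normal case: edit1's fstime was bumped after edit0 went stale.
        System.out.println(newestDir(Map.of("edit0", 100L, "edit1", 200L))); // edit1
        // Corruption case: garbage fstime makes the stale log win.
        System.out.println(newestDir(Map.of("edit0", 999999L, "edit1", 200L))); // edit0
    }
}
```

A plain timestamp cannot distinguish "newest" from "corrupted"; guarding fstime with a checksum (or cross-checking transaction ids across directories) would close this window.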
[jira] Commented: (HDFS-86) Corrupted blocks get deleted but not replicated
[ https://issues.apache.org/jira/browse/HDFS-86?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12900342#action_12900342 ] Thanh Do commented on HDFS-86: -- I have a cluster of two nodes. Say a block has 2 replicas, and one of them gets corrupted. The corrupted block is reported to the NN, but it is never deleted or replicated, even after the NN restarts. I'm not sure whether this is a bug or just a policy. I am playing with the append trunk. Corrupted blocks get deleted but not replicated --- Key: HDFS-86 URL: https://issues.apache.org/jira/browse/HDFS-86 Project: Hadoop HDFS Issue Type: Bug Reporter: Hairong Kuang Assignee: Hairong Kuang Attachments: blockInvalidate.patch When I test the patch to HADOOP-1345 on a two-node dfs cluster, I see that dfs correctly deletes the corrupted replica and successfully retries reading from the other correct replica, but the block does not get replicated. The block remains with only 1 replica until the next block report comes in. In my testcase, since the dfs cluster has only 2 datanodes, the target of replication is the same as the target of block invalidation. After poking through the logs, I found out that the namenode sent the replication request before the block invalidation request. This is because the namenode does not invalidate a block well. In FSNamesystem.invalidateBlock, it first puts the invalidate request in a queue and then immediately removes the replica from its state, which triggers choosing a target for the block. When requests are sent back to the target datanode as a reply to a heartbeat message, the replication requests have higher priority than the invalidate requests. This problem could be solved if a namenode removed an invalidated replica from its state only after the invalidate request is sent to the datanode. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HDFS-1337) Unmatched file length makes append fail. Should we retry if a startBlockRecovery() fails?
[ https://issues.apache.org/jira/browse/HDFS-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thanh Do updated HDFS-1337: --- Description: - Component: data node - Version: 0.20-append - Setup: 1) # disks / datanode = 3 2) # failures = 2 3) failure type = crash 4) When/where failure happens = (see below) - Details: The client writes to dn1-dn2-dn3. The write succeeds. We have blk_X_1001 in all dns. Now the client tries to append. It first calls dn1.recoverBlock(). This recoverBlock succeeds. We have blk_X_1002 in all dns. Suppose the pipeline is dn3-dn2-dn1. The client sends a packet to dn3. dn3 forwards the packet to dn2 and writes it to its disk (i.e. dn3's disk). Now, *dn2 crashes*, so dn1 has not received this packet yet. The client calls dn1.recoverBlock() again, this time with dn3-dn1 in the pipeline. dn1 then calls dn3.startBlockRecovery(), which terminates the writer thread in dn3, gets the *in-memory* metadata info (i.e. a 512-byte length), and verifies that info against the real file on disk (i.e. a 1024-byte length), hence the Exception. (In this case, the block at dn3 is not finalized yet, and FSDataset.setVisibleLength has not been called, hence its visible in-memory length is 512 bytes, although its on-disk length is 1024.) Therefore, from dn1's view, dn3 has some problem. Now dn1 calls its own startBlockRecovery() successfully (because the on-disk file length and in-memory file length match, both are 512 bytes). Now, + at dn1: blk_X_1003 (length 512) + at dn2: blk_X_1002 (length 512) + at dn3: blk_X_1002 (length 1024) dn1 also calls NN.commitSync(blk_X_1003, [dn1]), i.e. only dn1 has a good replica. After all: - From the NN's point of view: dn1 is the candidate for leaseRecovery - From the client's view, dn1 is the only healthy node in the pipeline (it knows that by the result returned from recoverBlock). The client starts sending a packet to dn1; now *dn1 crashes*, hence append fails. - RE-READ: FAIL Why? After all, dn1 and dn2 crashed.
Only dn3 contains the block with GS 1002. But the NN sees blk_X_1003, because dn1 successfully called commitBlockSync(blk_X_1003). Hence, when a reader asks to read the file, the NN gives blk_X_1003, and no alive dn contains that block with GS 1003. - RE-APPEND with a different client: FAIL + The file is under construction, and its holder is A1. - NN.leaseRecovery(): FAIL + no alive target (i.e. dn1, not dn3) + hence, as long as dn1 is not alive and the lease is not recovered, the file cannot be appended + worse, even if dn3 sends a blockReport to the NN and becomes a target for lease recovery, lease recovery fails because: 1) dn3 has block blk_X_1002, which has a smaller GS than the block the NN asks for, 2) dn3 cannot contact dn1, which crashed This bug was found by our Failure Testing Service framework: http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html For questions, please email us: Thanh Do (than...@cs.wisc.edu) and Haryadi Gunawi (hary...@eecs.berkeley.edu)
[jira] Created: (HDFS-1337) Unmatched file length makes append fail. Should we retry if a startBlockRecovery() fails?
Unmatched file length makes append fail. Should we retry if a startBlockRecovery() fails? - Key: HDFS-1337 URL: https://issues.apache.org/jira/browse/HDFS-1337 Project: Hadoop HDFS Issue Type: Bug Components: data-node Affects Versions: 0.20-append Reporter: Thanh Do - Component: data node - Version: 0.20-append - Setup: 1) # disks / datanode = 3 2) # failures = 2 3) failure type = crash 4) When/where failure happens = (see below) - Details: The client writes to dn1-dn2-dn3. The write succeeds. We have blk_X_1001 in all dns. Now the client tries to append. It first calls dn1.recoverBlock(). This recoverBlock succeeds. We have blk_X_1002 in all dns. Suppose the pipeline is dn3-dn2-dn1. The client sends a packet to dn3. dn3 forwards the packet to dn2 and writes it to its disk (i.e. dn3's disk). Now, *dn2 crashes*, so dn1 has not received this packet yet. The client calls dn1.recoverBlock() again, this time with dn3-dn1 in the pipeline. dn1 then calls dn3.startBlockRecovery(), which terminates the writer thread in dn3, gets the *in-memory* metadata info (i.e. a 512-byte length), and verifies that info against the real file on disk (i.e. a 1024-byte length), hence the Exception. (In this case, the block at dn3 is not finalized yet, and FSDataset.setVisibleLength has not been called, hence its visible in-memory length is 512 bytes, although its on-disk length is 1024.) Therefore, from dn1's view, dn3 has some problem. Now dn1 calls its own startBlockRecovery() successfully (because the on-disk file length and in-memory file length match, both are 512 bytes). Now, + at dn1: blk_X_1003 (length 512) + at dn2: blk_X_1002 (length 512) + at dn3: blk_X_1002 (length 1024) dn1 also calls NN.commitSync(blk_X_1003, [dn1]), i.e. only dn1 has a good replica. After all: - From the NN's point of view: dn1 is the candidate for leaseRecovery - From the client's view, dn1 is the only healthy node in the pipeline (it knows that by the result returned from recoverBlock).
The client starts sending a packet to dn1; now *dn1 crashes*, hence append fails. - RE-READ: FAIL Why? After all, dn1 and dn2 crashed. Only dn3 contains the block with GS 1002. But the NN sees blk_X_1003, because dn1 successfully called commitBlockSync(blk_X_1003). Hence, when a reader asks to read the file, the NN gives blk_X_1003, and no alive dn contains that block with GS 1003. - RE-APPEND with a different client: FAIL + The file is under construction, and its holder is A1. - NN.leaseRecovery(): FAIL + no alive target (i.e. dn1, not dn3) + hence, as long as dn1 is not alive and the lease is not recovered, the file cannot be appended + worse, even if dn3 sends a blockReport to the NN and becomes a target for lease recovery, lease recovery fails because: 1) dn3 has block blk_X_1002, which has a smaller GS than the block the NN asks for, 2) dn3 cannot contact dn1, which crashed -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-1227) UpdateBlock fails due to unmatched file length
[ https://issues.apache.org/jira/browse/HDFS-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12889302#action_12889302 ] Thanh Do commented on HDFS-1227: When startBlockRecovery is called, the writer thread is interrupted. But the effects/changes that this writer made to disk (if any) are still there, right? Hence this exception can still happen (even after HDFS-1186 is committed). UpdateBlock fails due to unmatched file length -- Key: HDFS-1227 URL: https://issues.apache.org/jira/browse/HDFS-1227 Project: Hadoop HDFS Issue Type: Bug Components: data-node Affects Versions: 0.20-append Reporter: Thanh Do - Summary: client append is not atomic; hence, it is possible that when retrying during append, there is an exception in updateBlock indicating an unmatched file length, making the append fail. - Setup: + # available datanodes = 3 + # disks / datanode = 1 + # failures = 2 + failure type = bad disk + When/where failure happens = (see below) + This bug is non-deterministic; to reproduce it, add a sufficient sleep before out.write() in BlockReceiver.receivePacket() in dn1 and dn2 but not dn3 - Details: Suppose the client appends 16 bytes to block X, which has a length of 16 bytes, at dn1, dn2, dn3. dn1 is primary. The pipeline is dn3-dn2-dn1. recoverBlock succeeds. The client starts sending data to dn3 - the first datanode in the pipeline. dn3 forwards the packet to the downstream datanodes and starts writing data to its disk. Suppose there is an exception in dn3 when writing to disk. The client gets the exception and starts the recovery code by calling dn1.recoverBlock() again. dn1 in turn calls dn2.getMetadataInfo() and dn1.getMetaDataInfo() to build the syncList. Suppose at the time getMetadataInfo() is called at both datanodes (dn1 and dn2), the previous packet (which was sent from dn3) has not come to disk yet. Hence, the block info given by getMetaDataInfo reports a length of 16 bytes.
But after that, the packet reaches disk, making the block file 32 bytes long. Using the syncList (which contains block info with a length of 16 bytes), dn1 calls updateBlock at dn2 and dn1, which fails because the length in the new block info (given to updateBlock, 16 bytes) does not match the actual length on disk (32 bytes). Note that this bug is non-deterministic; it depends on the thread interleaving at the datanodes. This bug was found by our Failure Testing Service framework: http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html For questions, please email us: Thanh Do (than...@cs.wisc.edu) and Haryadi Gunawi (hary...@eecs.berkeley.edu) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
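The failing comparison above can be reduced to a tiny sketch. This is not the actual DataNode code; the method and message are hypothetical stand-ins for the length check that updateBlock performs when the syncList length was captured before the in-flight packet reached disk:

```java
public class UpdateBlockLengthCheck {
    // Throws when the on-disk length differs from the length recorded in
    // the syncList, mirroring the unmatched-file-length failure mode.
    static void updateBlock(long onDiskLen, long syncListLen) throws Exception {
        if (onDiskLen != syncListLen) {
            throw new Exception("Unmatched file length: on disk " + onDiskLen
                    + " bytes, recovery expects " + syncListLen + " bytes");
        }
    }

    public static void main(String[] args) {
        long syncListLen = 16; // captured by getMetadataInfo() before the packet landed
        long onDiskLen = 32;   // the in-flight packet reached disk afterwards
        try {
            updateBlock(onDiskLen, syncListLen);
        } catch (Exception e) {
            System.out.println(e.getMessage());
        }
    }
}
```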
[jira] Updated: (HDFS-1227) UpdateBlock fails due to unmatched file length
[ https://issues.apache.org/jira/browse/HDFS-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thanh Do updated HDFS-1227: --- Description: - Summary: client append is not atomic; hence, it is possible that when retrying during an append, an exception in updateBlock indicating an unmatched file length causes the append to fail. - Setup: + # available datanodes = 3 + # disks / datanode = 1 + # failures = 1 + failure type = bad disk + When/where failure happens = (see below) + This bug is non-deterministic; to reproduce it, add a sufficient sleep before out.write() in BlockReceiver.receivePacket() in dn1 and dn2 but not dn3 - Details: Suppose the client appends 16 bytes to block X, which has a length of 16 bytes at dn1, dn2, and dn3. Dn1 is the primary. The pipeline is dn3-dn2-dn1. recoverBlock succeeds. The client starts sending data to dn3, the first datanode in the pipeline. dn3 forwards the packet to the downstream datanodes and starts writing the data to its disk. Suppose there is an exception in dn3 when writing to disk. The client gets the exception and starts the recovery code by calling dn1.recoverBlock() again. dn1 in turn calls dn2.getMetadataInfo() and dn1.getMetaDataInfo() to build the syncList. Suppose that at the time getMetadataInfo() is called at both datanodes (dn1 and dn2), the previous packet (which was sent from dn3) has not reached disk yet. Hence, the block info given by getMetaDataInfo() reports a length of 16 bytes. But after that, the packet reaches disk, making the block file 32 bytes long. Using the syncList (which contains block info with a length of 16 bytes), dn1 calls updateBlock at dn2 and dn1, which fails because the length in the new block info (given to updateBlock, 16 bytes) does not match the actual length on disk (32 bytes). Note that this bug is non-deterministic; it depends on the thread interleaving at the datanodes.
This bug was found by our Failure Testing Service framework: http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html For questions, please email us: Thanh Do (than...@cs.wisc.edu) and Haryadi Gunawi (hary...@eecs.berkeley.edu) was: - Summary: client append is not atomic, hence, it is possible that when retrying during append, there is an exception in updateBlock indicating unmatched file length, making append failed. - Setup: + # available datanodes = 3 + # disks / datanode = 1 + # failures = 2 + failure type = bad disk + When/where failure happens = (see below) + This bug is non-deterministic, to reproduce it, add a sufficient sleep before out.write() in BlockReceiver.receivePacket() in dn1 and dn2 but not dn3 - Details: Suppose client appends 16 bytes to block X which has length 16 bytes at dn1, dn2, dn3. Dn1 is primary. The pipeline is dn3-dn2-dn1. recoverBlock succeeds. Client starts sending data to the dn3 - the first datanode in pipeline. dn3 forwards the packet to downstream datanodes, and starts writing data to its disk. Suppose there is an exception in dn3 when writing to disk. Client gets the exception, it starts the recovery code by calling dn1.recoverBlock() again. dn1 in turn calls dn2.getMetadataInfo() and dn1.getMetaDataInfo() to build the syncList. Suppose at the time getMetadataInfo() is called at both datanodes (dn1 and dn2), the previous packet (which is sent from dn3) has not come to disk yet. Hence, the block Info given by getMetaDataInfo contains the length of 16 bytes. But after that, the packet comes to disk, making the block file length now becomes 32 bytes. Using the syncList (with contains block info with length 16 byte), dn1 calls updateBlock at dn2 and dn1, which will failed, because the length of new block info (given by updateBlock, which is 16 byte) does not match with its actual length on disk (which is 32 byte) Note that this bug is non-deterministic. Its depends on the thread interleaving at datanodes. 
This bug was found by our Failure Testing Service framework: http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html For questions, please email us: Thanh Do (than...@cs.wisc.edu) and Haryadi Gunawi (hary...@eecs.berkeley.edu) UpdateBlock fails due to unmatched file length -- Key: HDFS-1227 URL: https://issues.apache.org/jira/browse/HDFS-1227 Project: Hadoop HDFS Issue Type: Bug Components: data-node Affects Versions: 0.20-append Reporter: Thanh Do
[jira] Commented: (HDFS-1227) UpdateBlock fails due to unmatched file length
[ https://issues.apache.org/jira/browse/HDFS-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1293#action_1293 ] Thanh Do commented on HDFS-1227: In the append branch, I saw the unmatched file length exception happen, but then the client retries recoverBlock and hence tolerates this. UpdateBlock fails due to unmatched file length -- Key: HDFS-1227 URL: https://issues.apache.org/jira/browse/HDFS-1227 Project: Hadoop HDFS Issue Type: Bug Components: data-node Affects Versions: 0.20-append Reporter: Thanh Do -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-1229) DFSClient incorrectly asks for new block if primary crashes during first recoverBlock
[ https://issues.apache.org/jira/browse/HDFS-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12889001#action_12889001 ] Thanh Do commented on HDFS-1229: this does not happen in the append+320 trunk. DFSClient incorrectly asks for new block if primary crashes during first recoverBlock - Key: HDFS-1229 URL: https://issues.apache.org/jira/browse/HDFS-1229 Project: Hadoop HDFS Issue Type: Bug Components: hdfs client Affects Versions: 0.20-append Reporter: Thanh Do Setup: + # available datanodes = 2 + # disks / datanode = 1 + # failures = 1 + failure type = crash + When/where failure happens = during primary's recoverBlock Details: -- Say the client is appending to block X1 on 2 datanodes: dn1 and dn2. First it needs to make sure both dn1 and dn2 agree on the new GS of the block. 1) The client first creates a DFSOutputStream by calling OutputStream result = new DFSOutputStream(src, buffersize, progress, lastBlock, stat, conf.getInt("io.bytes.per.checksum", 512)); in DFSClient.append() 2) The above DFSOutputStream constructor in turn calls processDatanodeError(true, true) (i.e., hasError = true, isAppend = true), and starts the DataStreamer processDatanodeError(true, true); /* let's call this PDNE 1 */ streamer.start(); Note that DataStreamer.run() also calls processDatanodeError() while (!closed && clientRunning) { ... boolean doSleep = processDatanodeError(hasError, false); /* let's call this PDNE 2 */ 3) Now in PDNE 1, we have the following code: blockStream = null; blockReplyStream = null; ... while (!success && clientRunning) { ... try { primary = createClientDatanodeProtocolProxy(primaryNode, conf); newBlock = primary.recoverBlock(block, isAppend, newnodes); /* exception here */ ... catch (IOException e) { ... if (recoveryErrorCount > maxRecoveryErrorCount) { // this condition is false } ...
return true; } // end catch finally {...} this.hasError = false; lastException = null; errorIndex = 0; success = createBlockOutputStream(nodes, clientName, true); } ... Because dn1 crashes during the client's call to recoverBlock, we get an exception. Hence, we go to the catch block, in which processDatanodeError returns true before setting hasError to false. Also, because createBlockOutputStream() is not called (due to the early return), blockStream is still null. 4) Now PDNE 1 has finished, and we come to streamer.start(), which calls PDNE 2. Because hasError = false, PDNE 2 returns false immediately without doing anything: if (!hasError) { return false; } 5) Still in DataStreamer.run(), after returning false from PDNE 2, we still have blockStream = null, hence the following code is executed: if (blockStream == null) { nodes = nextBlockOutputStream(src); this.setName("DataStreamer for file " + src + " block " + block); response = new ResponseProcessor(nodes); response.start(); } nextBlockOutputStream(), which asks the namenode to allocate a new block, is called. (This is not good, because we are appending, not writing.) The namenode gives it a new block ID and a set of datanodes, including the crashed dn1. This leads to createBlockOutputStream() failing, because it tries to contact dn1 (which has crashed) first. The client retries 5 times without any success, because every time it asks the namenode for a new block! Again we see that the retry logic at the client is odd! *This bug was found by our Failure Testing Service framework: http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html For questions, please email us: Thanh Do (than...@cs.wisc.edu) and Haryadi Gunawi (hary...@eecs.berkeley.edu)* -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
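The end state described in steps 3-5 can be condensed into a sketch. The class and field names below are hypothetical stand-ins, not the real DFSClient internals; the point is which branch the streamer takes once hasError is false and blockStream is null after the failed recovery:

```java
public class StreamerPathSketch {
    boolean hasError = false;  // already cleared, even though recoverBlock threw
    Object blockStream = null; // never set: createBlockOutputStream was skipped

    String nextStep() {
        if (hasError) {
            return "recover existing block"; // the path an append retry should take
        }
        if (blockStream == null) {
            // nextBlockOutputStream() asks the namenode for a brand-new block:
            // wrong during an append, and the returned pipeline may include
            // the crashed primary again
            return "allocate new block";
        }
        return "stream packets";
    }

    public static void main(String[] args) {
        System.out.println(new StreamerPathSketch().nextStep());
    }
}
```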
[jira] Commented: (HDFS-1220) Namenode unable to start due to truncated fstime
[ https://issues.apache.org/jira/browse/HDFS-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12881526#action_12881526 ] Thanh Do commented on HDFS-1220: It is not exactly the same as HDFS-1221, although fstime suffered from corruption too (which may lead to data loss). In this case, I think the update to fstime should be atomic, or the NameNode should somehow anticipate reading an empty fstime. Namenode unable to start due to truncated fstime Key: HDFS-1220 URL: https://issues.apache.org/jira/browse/HDFS-1220 Project: Hadoop HDFS Issue Type: Bug Components: name-node Affects Versions: 0.20.1 Reporter: Thanh Do - Summary: updating the fstime file on disk is not atomic, so it is possible that if a crash happens in the middle, the next time the NameNode reboots it will read a stale fstime and hence be unable to start successfully. - Details: Basically, this involves 3 steps: 1) delete the fstime file (timeFile.delete()) 2) truncate the fstime file (new FileOutputStream(timeFile)) 3) write the new time to the fstime file (out.writeLong(checkpointTime)) If a crash happens after step 2 and before step 3, on the next reboot the NameNode gets an exception when reading the time (8 bytes) from an empty fstime file. This bug was found by our Failure Testing Service framework: http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html For questions, please email us: Thanh Do (than...@cs.wisc.edu) and Haryadi Gunawi (hary...@eecs.berkeley.edu) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-1222) NameNode fail stop in spite of multiple metadata directories
[ https://issues.apache.org/jira/browse/HDFS-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12881527#action_12881527 ] Thanh Do commented on HDFS-1222: Konstantin, this is the namenode start-up workload. When the namenode gets an exception, it fails rather than tolerating it, i.e., it does not retry with another image if there is any. (This may be due to a design choice that has already been made.) NameNode fail stop in spite of multiple metadata directories Key: HDFS-1222 URL: https://issues.apache.org/jira/browse/HDFS-1222 Project: Hadoop HDFS Issue Type: Bug Components: name-node Affects Versions: 0.20.1 Reporter: Thanh Do Despite the ability to configure multiple name directories (to store fsimage) and edits directories, the NameNode will fail-stop most of the time it faces an exception when accessing these directories. The NameNode fail-stops if an exception happens when loading fsimage, reading fstime, loading the edits log, writing fsimage.ckpt, etc., although there are still good replicas. The NameNode could have tried to work with the other replicas and marked the faulty one. This bug was found by our Failure Testing Service framework: http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html For questions, please email us: Thanh Do (than...@cs.wisc.edu) and Haryadi Gunawi (hary...@eecs.berkeley.edu) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-1224) Stale connection makes node miss append
[ https://issues.apache.org/jira/browse/HDFS-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12881539#action_12881539 ] Thanh Do commented on HDFS-1224: Even so, does this cause any actual problems aside from a shorter pipeline? I'm not sure, but based on the description, it sounds like dn2 thinks it has a block (but it is incomplete), so a client might end up trying to get a block from that node and get an incomplete block. I think this does not create any problem aside from the shorter pipeline. dn2 has a block with an old timestamp because it misses updateBlock; hence the block at dn2 is eventually deleted. (But the append semantics are not guaranteed, right? Because there are 3 live datanodes, and the write to all 3 was successful, but the append only happened successfully at 2 datanodes.) Stale connection makes node miss append --- Key: HDFS-1224 URL: https://issues.apache.org/jira/browse/HDFS-1224 Project: Hadoop HDFS Issue Type: Bug Components: data-node Affects Versions: 0.20-append Reporter: Thanh Do - Summary: if a datanode crashes and restarts, it may miss an append. - Setup: + # available datanodes = 3 + # replica = 3 + # disks / datanode = 1 + # failures = 1 + failure type = crash + When/where failure happens = after the first append succeeds - Details: Since each datanode maintains a pool of IPC connections, whenever it wants to make an IPC call, it first looks into the pool. If the connection is not there, it is created and put into the pool; otherwise the existing connection is used. Suppose that the append pipeline contains dn1, dn2, and dn3. Dn1 is the primary. After the client appends to block X successfully, dn2 crashes and restarts. Now the client writes a new block Y to dn1, dn2, and dn3. The write is successful. The client starts appending to block Y. It first calls dn1.recoverBlock(). Dn1 will first create a proxy corresponding to each of the datanodes in the pipeline (in order to make RPC calls like getMetadataInfo() or updateBlock()). However, because dn2 has just crashed and restarted, its connection in dn1's pool has become stale. Dn1 uses this connection to make a call to dn2, hence an exception. Therefore, the append is made only to dn1 and dn3, although dn2 is alive and the write of block Y to dn2 was successful. This bug was found by our Failure Testing Service framework: http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html For questions, please email us: Thanh Do (than...@cs.wisc.edu) and Haryadi Gunawi (hary...@eecs.berkeley.edu) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
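One possible fix shape, sketched with hypothetical names (the real DataNode RPC proxy API differs): on failure, evict the pooled connection and retry once over a fresh one, instead of dropping the still-alive datanode from the pipeline:

```java
import java.util.HashMap;
import java.util.Map;

public class StaleConnectionRetry {
    interface Proxy { String getMetadataInfo() throws Exception; }

    // Simulates the cached connection to a peer that has restarted.
    static class StaleProxy implements Proxy {
        public String getMetadataInfo() throws Exception {
            throw new Exception("connection reset: peer restarted");
        }
    }
    // Simulates a freshly dialed connection to the same (alive) peer.
    static class FreshProxy implements Proxy {
        public String getMetadataInfo() { return "blockInfo(len=16)"; }
    }

    static Map<String, Proxy> pool = new HashMap<>();

    // Stand-in for re-dialing the node; returns a working proxy.
    static Proxy createProxy(String node) { return new FreshProxy(); }

    static String callWithRetry(String node) throws Exception {
        Proxy p = pool.get(node);
        if (p == null) { p = createProxy(node); pool.put(node, p); }
        try {
            return p.getMetadataInfo();
        } catch (Exception e) {
            pool.remove(node);               // evict the stale entry
            Proxy fresh = createProxy(node); // one retry on a fresh socket
            pool.put(node, fresh);
            return fresh.getMetadataInfo();
        }
    }

    public static void main(String[] args) throws Exception {
        pool.put("dn2", new StaleProxy()); // dn2 restarted; cached conn is stale
        System.out.println(callWithRetry("dn2"));
    }
}
```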
[jira] Commented: (HDFS-1239) All datanodes are bad in 2nd phase
[ https://issues.apache.org/jira/browse/HDFS-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12880648#action_12880648 ] Thanh Do commented on HDFS-1239: Overall we see that the client-namenode protocol does not allow the client to say to the namenode something like: hey, I tried to write to the datanodes you've given me, but it failed; could you give me other datanodes, please? The reason is that the cloud should have more machines, and maybe it makes more sense if the client could be given another set of datanodes. All datanodes are bad in 2nd phase -- Key: HDFS-1239 URL: https://issues.apache.org/jira/browse/HDFS-1239 Project: Hadoop HDFS Issue Type: Bug Components: hdfs client Affects Versions: 0.20.1 Reporter: Thanh Do -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-1222) NameNode fail stop in spite of multiple metadata directories
[ https://issues.apache.org/jira/browse/HDFS-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12880649#action_12880649 ] Thanh Do commented on HDFS-1222: Triggering the rare cases is the goal of our project. We have read papers saying that rare failures do happen, and when they happen, the system does not behave as expected. Thus, our view is that we should expect the unexpected. NameNode fail stop in spite of multiple metadata directories Key: HDFS-1222 URL: https://issues.apache.org/jira/browse/HDFS-1222 Project: Hadoop HDFS Issue Type: Bug Components: name-node Affects Versions: 0.20.1 Reporter: Thanh Do -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-1224) Stale connection makes node miss append
[ https://issues.apache.org/jira/browse/HDFS-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12880650#action_12880650 ] Thanh Do commented on HDFS-1224: Todd, you are right. This is a rare case, and as long as you have a long enough pipeline, this is not a problem. But again, triggering rare cases is the goal of our project. Stale connection makes node miss append --- Key: HDFS-1224 URL: https://issues.apache.org/jira/browse/HDFS-1224 Project: Hadoop HDFS Issue Type: Bug Components: data-node Affects Versions: 0.20.1 Reporter: Thanh Do -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (HDFS-1239) All datanodes are bad in 2nd phase
All datanodes are bad in 2nd phase -- Key: HDFS-1239 URL: https://issues.apache.org/jira/browse/HDFS-1239 Project: Hadoop HDFS Issue Type: Bug Components: hdfs client Affects Versions: 0.20.1 Reporter: Thanh Do - Setups: number of datanodes = 2 replication factor = 2 Type of failure: transient fault (a Java I/O call throws an exception or returns false) Number of failures = 2 when/where failures happen = during the 2nd phase of the pipeline, each happens at each datanode when trying to perform I/O (e.g. dataoutputstream.flush()) - Details: This is similar to HDFS-1237. In this case, node1 throws an exception that makes the client create a pipeline only with node2; the client then tries to redo the whole thing, which hits another failure. At this point, the client considers all datanodes bad and never retries the whole thing again (i.e., it never asks the namenode for a new set of datanodes). In HDFS-1237, the bug is due to a permanent disk fault; in this case, it is a transient error. This bug was found by our Failure Testing Service framework: http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html For questions, please email us: Thanh Do (than...@cs.wisc.edu) and Haryadi Gunawi (hary...@eecs.berkeley.edu) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
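The retry shape the report asks for can be sketched as follows. All names here are hypothetical, and real HDFS would additionally need to exclude the failed nodes from the next allocation; the point is bounded re-allocation instead of giving up after the first bad pipeline:

```java
import java.util.Arrays;
import java.util.List;

public class PipelineRetrySketch {
    interface NameNode { List<String> allocatePipeline(); }

    // Stand-in for the 2nd-phase write; here the first pipeline fails.
    static int attempts = 0;
    static boolean tryWrite(List<String> nodes) { return ++attempts > 1; }

    // Returns the pipeline that finally worked, or throws after maxRounds.
    static List<String> writeWithRetry(NameNode nn, int maxRounds) throws Exception {
        for (int round = 0; round < maxRounds; round++) {
            List<String> nodes = nn.allocatePipeline();
            if (tryWrite(nodes)) {
                return nodes;
            }
            // every node in this pipeline failed: do NOT declare all
            // datanodes bad; ask the namenode for a different set instead
        }
        throw new Exception("out of pipeline retries");
    }

    public static void main(String[] args) throws Exception {
        NameNode nn = () -> Arrays.asList("dnA", "dnB");
        System.out.println(writeWithRetry(nn, 3));
    }
}
```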
[jira] Created: (HDFS-1219) Data Loss due to edits log truncation
Data Loss due to edits log truncation - Key: HDFS-1219 URL: https://issues.apache.org/jira/browse/HDFS-1219 Project: Hadoop HDFS Issue Type: Bug Components: name-node Affects Versions: 0.20.2 Reporter: Thanh Do We found this problem at almost the same time as the HDFS developers. Basically, the edits log is truncated before fsimage.ckpt is renamed to fsimage. Hence, any crash that happens after the truncation but before the renaming will lead to data loss. A detailed description can be found here: https://issues.apache.org/jira/browse/HDFS-955 This bug was found by our Failure Testing Service framework: http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html For questions, please email us: Thanh Do (than...@cs.wisc.edu) and Haryadi Gunawi (hary...@eecs.berkeley.edu) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
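A minimal sketch of the safe ordering, with hypothetical file handling (the real checkpoint code manages storage directories and streams differently): publish the new image first, and only then truncate the edits log, so a crash between the two steps leaves a valid image plus intact edits instead of truncated edits and no renamed image:

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;

public class CheckpointOrdering {
    static void finishCheckpoint(File ckpt, File image, File edits)
            throws IOException {
        // step 1: make the new checkpoint durable under its final name
        if (image.exists() && !image.delete()) {
            throw new IOException("could not remove old " + image);
        }
        if (!ckpt.renameTo(image)) {
            throw new IOException("rename of " + ckpt + " failed");
        }
        // step 2: only now is it safe to drop the edits the image subsumes
        new FileOutputStream(edits).close(); // truncate to zero length
    }

    public static void main(String[] args) throws IOException {
        File dir = Files.createTempDirectory("nn").toFile();
        File ckpt = new File(dir, "fsimage.ckpt");
        File image = new File(dir, "fsimage");
        File edits = new File(dir, "edits");
        new FileOutputStream(ckpt).close();      // pretend a checkpoint exists
        FileOutputStream e = new FileOutputStream(edits);
        e.write(1);                              // pretend edits has one byte
        e.close();
        finishCheckpoint(ckpt, image, edits);
        System.out.println(image.exists() + " " + edits.length());
    }
}
```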
[jira] Created: (HDFS-1220) Namenode unable to start due to truncated fstime
Namenode unable to start due to truncated fstime Key: HDFS-1220 URL: https://issues.apache.org/jira/browse/HDFS-1220 Project: Hadoop HDFS Issue Type: Bug Components: name-node Affects Versions: 0.20.1 Reporter: Thanh Do - Summary: updating the fstime file on disk is not atomic, so it is possible that if a crash happens in the middle, the next time the NameNode reboots it will read a stale fstime and hence be unable to start successfully. - Details: Below is the code for updating the fstime file on disk: void writeCheckpointTime(StorageDirectory sd) throws IOException { if (checkpointTime < 0L) return; // do not write negative time File timeFile = getImageFile(sd, NameNodeFile.TIME); if (timeFile.exists()) { timeFile.delete(); } DataOutputStream out = new DataOutputStream( new FileOutputStream(timeFile)); try { out.writeLong(checkpointTime); } finally { out.close(); } } Basically, this involves 3 steps: 1) delete the fstime file (timeFile.delete()) 2) truncate the fstime file (new FileOutputStream(timeFile)) 3) write the new time to the fstime file (out.writeLong(checkpointTime)) If a crash happens after step 2 and before step 3, on the next reboot the NameNode gets an exception when reading the time (8 bytes) from an empty fstime file. This bug was found by our Failure Testing Service framework: http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html For questions, please email us: Thanh Do (than...@cs.wisc.edu) and Haryadi Gunawi (hary...@eecs.berkeley.edu) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
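A crash-safe variant of the code above can be sketched with a write-to-temp-then-rename pattern. This is a hypothetical reworking, not the committed fix; writeCheckpointTime here takes the target File directly instead of a StorageDirectory:

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

public class AtomicTimeFileWrite {
    // Write the new time to a temp file, then atomically move it over
    // fstime, so a crash can never leave an empty fstime behind.
    static void writeCheckpointTime(File timeFile, long checkpointTime)
            throws IOException {
        if (checkpointTime < 0L) {
            return; // do not write negative time
        }
        File tmp = new File(timeFile.getParent(), timeFile.getName() + ".tmp");
        DataOutputStream out = new DataOutputStream(new FileOutputStream(tmp));
        try {
            out.writeLong(checkpointTime);
        } finally {
            out.close();
        }
        // the atomic move replaces fstime in one step: a reader sees either
        // the old 8-byte file or the new one, never a truncated file
        Files.move(tmp.toPath(), timeFile.toPath(),
                StandardCopyOption.ATOMIC_MOVE,
                StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("fstime", null);
        writeCheckpointTime(f, 42L);
        DataInputStream in = new DataInputStream(new FileInputStream(f));
        System.out.println(in.readLong());
        in.close();
    }
}
```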
[jira] Updated: (HDFS-1220) Namenode unable to start due to truncated fstime
[ https://issues.apache.org/jira/browse/HDFS-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thanh Do updated HDFS-1220: --- Description: - Summary: updating the fstime file on disk is not atomic, so it is possible that if a crash happens in the middle, the next time the NameNode reboots it will read a stale fstime and hence be unable to start successfully. - Details: Basically, this involves 3 steps: 1) delete the fstime file (timeFile.delete()) 2) truncate the fstime file (new FileOutputStream(timeFile)) 3) write the new time to the fstime file (out.writeLong(checkpointTime)) If a crash happens after step 2 and before step 3, on the next reboot the NameNode gets an exception when reading the time (8 bytes) from an empty fstime file. This bug was found by our Failure Testing Service framework: http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html For questions, please email us: Thanh Do (than...@cs.wisc.edu) and Haryadi Gunawi (hary...@eecs.berkeley.edu) was: - Summary: updating fstime file on disk is not atomic, so it is possible that if a crash happens in the middle, next time when NameNode reboots, it will read stale fstime, hence unable to start successfully. - Details: Below is the code for updating fstime file on disk void writeCheckpointTime(StorageDirectory sd) throws IOException { if (checkpointTime < 0L) return; // do not write negative time File timeFile = getImageFile(sd, NameNodeFile.TIME); if (timeFile.exists()) { timeFile.delete(); } DataOutputStream out = new DataOutputStream( new FileOutputStream(timeFile)); try { out.writeLong(checkpointTime); } finally { out.close(); } } Basically, this involve 3 steps: 1) delete fstime file (timeFile.delete()) 2) truncate fstime file (new FileOutputStream(timeFile)) 3) write new time to fstime file (out.writeLong(checkpointTime)) If a crash happens after step 2 and before step 3, in the next reboot, NameNode got an exception when reading the time (8 byte) from an empty fstime file.
This bug was found by our Failure Testing Service framework: http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html For questions, please email us: Thanh Do (than...@cs.wisc.edu) and Haryadi Gunawi (hary...@eecs.berkeley.edu) Namenode unable to start due to truncated fstime Key: HDFS-1220 URL: https://issues.apache.org/jira/browse/HDFS-1220 Project: Hadoop HDFS Issue Type: Bug Components: name-node Affects Versions: 0.20.1 Reporter: Thanh Do -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HDFS-1220) Namenode unable to start due to truncated fstime
[ https://issues.apache.org/jira/browse/HDFS-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thanh Do updated HDFS-1220: ---

Description:

- Summary: Updating the fstime file on disk is not atomic, so if a crash happens in the middle, the NameNode will read a stale or truncated fstime on the next reboot and fail to start.

- Details: Below is the code for updating the fstime file on disk:

    void writeCheckpointTime(StorageDirectory sd) throws IOException {
      if (checkpointTime < 0L)
        return; // do not write negative time
      File timeFile = getImageFile(sd, NameNodeFile.TIME);
      if (timeFile.exists()) {
        timeFile.delete();
      }
      DataOutputStream out = new DataOutputStream(
          new FileOutputStream(timeFile));
      try {
        out.writeLong(checkpointTime);
      } finally {
        out.close();
      }
    }

Basically, this involves three steps:
1) delete the fstime file (timeFile.delete())
2) truncate the fstime file (new FileOutputStream(timeFile))
3) write the new time to the fstime file (out.writeLong(checkpointTime))

If a crash happens after step 2 and before step 3, then on the next reboot the NameNode gets an exception when reading the time (8 bytes) from an empty fstime file.

This bug was found by our Failure Testing Service framework: http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
For questions, please email us: Thanh Do (than...@cs.wisc.edu) and Haryadi Gunawi (hary...@eecs.berkeley.edu)

Namenode unable to start due to truncated fstime
Key: HDFS-1220 URL: https://issues.apache.org/jira/browse/HDFS-1220 Project: Hadoop HDFS Issue Type: Bug Components: name-node Affects Versions: 0.20.1 Reporter: Thanh Do

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
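The non-atomic delete/truncate/write sequence above can be avoided with the classic write-to-temp-then-rename pattern: a crash at any point then leaves either the old complete file or the new complete file on disk. The sketch below is a hypothetical illustration of that pattern (the file names and class are ours, not the actual HDFS fix):

```java
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

// Hypothetical sketch: write the checkpoint time to a temporary file,
// sync it to disk, then rename it over fstime. The rename replaces the
// target atomically on POSIX file systems, so a reader never observes
// an empty or half-written fstime.
public class AtomicTimeWriter {
    public static void writeCheckpointTime(File dir, long checkpointTime)
            throws IOException {
        if (checkpointTime < 0L) {
            return; // do not write negative time
        }
        File tmp = new File(dir, "fstime.tmp");
        File timeFile = new File(dir, "fstime");
        FileOutputStream fos = new FileOutputStream(tmp);
        DataOutputStream out = new DataOutputStream(fos);
        try {
            out.writeLong(checkpointTime);
            out.flush();
            fos.getFD().sync(); // force the 8 bytes to disk before renaming
        } finally {
            out.close();
        }
        // Note: File.renameTo may fail to replace an existing target on
        // some platforms (e.g. Windows); production code would use
        // java.nio.file.Files.move with ATOMIC_MOVE instead.
        if (!tmp.renameTo(timeFile)) {
            throw new IOException("failed to rename " + tmp + " to " + timeFile);
        }
    }

    public static void main(String[] args) throws IOException {
        File dir = new File(System.getProperty("java.io.tmpdir"));
        writeCheckpointTime(dir, System.currentTimeMillis());
        System.out.println("wrote fstime in " + dir);
    }
}
```

With this scheme, the "empty fstime" window between truncation and write disappears entirely, because the destination file is never opened for writing.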
[jira] Updated: (HDFS-1221) NameNode unable to start due to stale edits log after a crash
[ https://issues.apache.org/jira/browse/HDFS-1221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thanh Do updated HDFS-1221: ---

Description:

- Summary: If a crash happens during FSEditLog.createEditLogFile(), the edits log file on disk may be stale. On the next reboot, the NameNode gets an exception when parsing the edits file because of the stale data, and fails to start. Note: this is just one example. Since the edits log (and the fsimage) carries no checksum, both are vulnerable to corruption as well.

- Details: The steps to create a new edits log (which we infer from the HDFS code) are:
1) truncate the file to zero size
2) write FSConstants.LAYOUT_VERSION to the buffer
3) append the end-of-file marker OP_INVALID to the buffer
4) preallocate 1 MB of data, filled with zeros
5) flush the buffer to disk

Note that only steps 1, 4, and 5 actually change the data on disk. Now suppose a crash happens after step 4 but before step 5. On the next reboot, the NameNode fetches this edits log file, which contains all zeros. The first thing parsed is the LAYOUT_VERSION, which is 0. This is OK, because the NameNode has code to handle that case (though we expect LAYOUT_VERSION to be -18, don't we). Next it parses the operation code, which also happens to be 0. Unfortunately, since 0 is the value of OP_ADD, the NameNode expects the parameters for that operation: it calls readString to read the path, which throws an exception, and the reboot fails.

We found a related problem at almost the same time as the HDFS developers: the edits log is truncated before fsimage.ckpt is renamed to fsimage, so any crash that happens after the truncation but before the renaming leads to data loss. A detailed description can be found here: https://issues.apache.org/jira/browse/HDFS-955

This bug was found by our Failure Testing Service framework: http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
For questions, please email us: Thanh Do (than...@cs.wisc.edu) and Haryadi Gunawi (hary...@eecs.berkeley.edu)

Component/s: name-node

NameNode unable to start due to stale edits log after a crash
Key: HDFS-1221 URL: https://issues.apache.org/jira/browse/HDFS-1221 Project: Hadoop HDFS Issue Type: Bug Components: name-node Affects Versions: 0.20.1 Reporter: Thanh Do
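The misparse described above is easy to reproduce in miniature. The sketch below is a hypothetical illustration, with constants mirroring the 0.20-era values cited in the report (LAYOUT_VERSION = -18, OP_ADD = 0, OP_INVALID = -1); it shows how an all-zero edits file yields version 0 and op code 0, which the parser mistakes for a real OP_ADD record:

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

// Sketch of the failure mode: an edits file left as all zeros after a
// crash between preallocation and flush. Constants mirror the
// HDFS 0.20-era FSEditLog values mentioned in the report.
public class StaleEditsDemo {
    static final int LAYOUT_VERSION = -18;
    static final byte OP_ADD = 0;
    static final byte OP_INVALID = -1;

    public static String parse(byte[] editsFile) throws IOException {
        DataInputStream in =
            new DataInputStream(new ByteArrayInputStream(editsFile));
        int version = in.readInt();  // reads 0, not the expected -18
        byte opCode = in.readByte(); // reads 0, which equals OP_ADD
        if (opCode == OP_INVALID) {
            return "clean end of log";
        }
        if (opCode == OP_ADD) {
            // The real parser now expects OP_ADD's arguments and calls
            // readString on zero bytes, which throws and aborts the reboot.
            return "misread as OP_ADD, version=" + version;
        }
        return "op=" + opCode;
    }

    public static void main(String[] args) throws IOException {
        byte[] zeros = new byte[1024 * 1024]; // the preallocated, unflushed file
        System.out.println(parse(zeros));
    }
}
```

A leading magic number or per-record checksum would let the parser reject such a file immediately instead of misinterpreting zeros as a valid operation.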
[jira] Created: (HDFS-1222) NameNode fail stop in spite of multiple metadata directories
NameNode fail stop in spite of multiple metadata directories
Key: HDFS-1222 URL: https://issues.apache.org/jira/browse/HDFS-1222 Project: Hadoop HDFS Issue Type: Bug Components: name-node Affects Versions: 0.20.1 Reporter: Thanh Do

Despite the ability to configure multiple name directories (to store the fsimage) and multiple edits directories, the NameNode fail-stops most of the time it faces an exception while accessing these directories. The NameNode fail-stops if an exception happens when loading the fsimage, reading fstime, loading the edits log, writing fsimage.ckpt, and so on, even though good replicas still exist. The NameNode could instead try to work with the other replicas and mark the faulty one.

This bug was found by our Failure Testing Service framework: http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
For questions, please email us: Thanh Do (than...@cs.wisc.edu) and Haryadi Gunawi (hary...@eecs.berkeley.edu)
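The suggested behavior, trying each configured replica in turn and remembering which ones failed instead of fail-stopping on the first error, can be sketched as follows. This is a hypothetical illustration (the class, interface, and method names are ours, not HDFS code):

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of failing over across replicated metadata
// directories: try each configured directory in turn, mark the ones
// that fail, and only give up when every replica is bad.
public class MetadataLoader {
    public interface Reader {
        byte[] read(File dir) throws IOException;
    }

    private final List<File> failedDirs = new ArrayList<File>();

    public byte[] loadFromAnyGoodDir(List<File> dirs, Reader reader)
            throws IOException {
        for (File dir : dirs) {
            try {
                return reader.read(dir); // e.g. load fsimage or fstime here
            } catch (IOException e) {
                failedDirs.add(dir);     // mark the faulty replica, keep going
            }
        }
        throw new IOException("all " + dirs.size() + " metadata dirs failed");
    }

    public List<File> getFailedDirs() {
        return failedDirs;
    }
}
```

The key design point is that an exception on one replica is recorded rather than propagated, so a single bad disk no longer takes down a NameNode that still has a good copy of its metadata.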
[jira] Created: (HDFS-1223) DataNode fails stop due to a bad disk (or storage directory)
DataNode fails stop due to a bad disk (or storage directory)
Key: HDFS-1223 URL: https://issues.apache.org/jira/browse/HDFS-1223 Project: Hadoop HDFS Issue Type: Bug Components: data-node Affects Versions: 0.20.1 Reporter: Thanh Do

A DataNode can store block files in multiple volumes. If the DataNode sees a bad volume during startup (i.e., it faces an exception when accessing that volume), it simply fail-stops, making all block files stored in the other, healthy volumes inaccessible. Consequently, these lost replicas are regenerated later on other DataNodes. If the DataNode were able to mark the bad disk and continue working with the healthy ones, this would increase availability and avoid unnecessary regeneration. As an extreme example, consider one DataNode with two volumes V1 and V2, each containing one 64 MB block file. During startup, the DataNode gets an exception when accessing V1 and fail-stops, so both block files must be regenerated later. If the DataNode instead masked V1 as bad and continued working with V2, the number of replicas to regenerate would be cut in half.

This bug was found by our Failure Testing Service framework: http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
For questions, please email us: Thanh Do (than...@cs.wisc.edu) and Haryadi Gunawi (hary...@eecs.berkeley.edu)
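The proposed startup behavior, probe each configured volume and keep serving from the healthy ones, can be sketched as below. This is a hypothetical illustration (the class and method are ours; a real check would also attempt a probe read/write on each volume):

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: at startup, probe each configured volume and
// collect the healthy ones instead of fail-stopping on the first bad
// disk. Blocks on a bad volume must still be re-replicated elsewhere,
// but blocks on the remaining volumes stay available.
public class VolumeScanner {
    public static List<File> healthyVolumes(List<File> volumes) {
        List<File> healthy = new ArrayList<File>();
        for (File vol : volumes) {
            // Cheap accessibility check; production code would also
            // exercise actual I/O on the volume before trusting it.
            if (vol.isDirectory() && vol.canRead() && vol.canWrite()) {
                healthy.add(vol);
            }
            // A bad volume is simply skipped (and would be logged and
            // reported to the NameNode) rather than aborting startup.
        }
        return healthy;
    }
}
```

In the two-volume example from the report, this turns a loss of both block files into a loss of only the one on the bad volume, halving the re-replication work.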