Re: symlink support in Hadoop 2 GA
Colin posted a summary of our phone call yesterday (attendees: myself, Colin, Daryn, Nathan, Jason, Chris, Suresh, Sanjay) on HADOOP-9984:
https://issues.apache.org/jira/browse/HADOOP-9984?focusedCommentId=13785701&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13785701

Pasted here:

- We discussed alternatives to HADOOP-9984 (https://issues.apache.org/jira/browse/HADOOP-9984), but concluded that they weren't workable.
- We agreed that doing the symlink resolution in each FileSystem subclass is what we ought to do in 9984, in order to keep compatibility with out-of-tree filesystems.
- We agreed to disable symlink resolution in Hadoop 2 GA. We will spend a few weeks ironing out all the bugs and enable it in Hadoop 2.3. However, we would like to make all backwards-incompatible API changes prior to Hadoop 2 GA.
- We agreed that HADOOP-9972 (https://issues.apache.org/jira/browse/HADOOP-9972) (new symlink-aware API for globStatus) should get into Hadoop 2 GA.
- We discussed the issue of returning resolved paths versus unresolved paths, but were unable to come to any conclusion. Everyone agreed that there would be serious performance problems if we returned unresolved paths, but some claimed that programs would break when encountering resolved paths.

There's also a new umbrella issue at HADOOP-10019 tracking ongoing symlinks changes.

Best,
Andrew

On Thu, Oct 3, 2013 at 2:08 PM, Daryn Sharp da...@yahoo-inc.com wrote:

I reluctantly agree that we should disable symlinks in 2.2 until we can sort out the compatibility issues. I'm reluctant in the sense that it's a feature users have long wanted, and it's something we'd like to use from an administrative view. However, I don't see all the issues being sorted out in the very near future.

I filed some jiras today that have led me to believe that the current implementation of fs symlinks is irreparably flawed. Adding optional primitives to filesystems to make them symlink capable is ok. However, adding symlink resolution to individual filesystems is fundamentally broken. It doesn't work for stacked filesystems (viewfs, chroots, filters, etc.) because the resolution must occur at the highest level, not within an individual filesystem itself. Otherwise the abstraction of the top-level filesystem is violated and all kinds of unexpected behavior, like walking out of chroots, become possible.

Daryn

On Oct 3, 2013, at 1:39 PM, sanjay Radia wrote:

There are a number of issues (some minor, some more than minor). GA is close and we are still in discussion on some of them; while I believe we will close on these very shortly, a code change like this so close to GA is dangerous. I suggest we do the following:

1) Disable symlinks in 2.2 GA: throw an unsupported exception on createSymlink in both FileSystem and FileContext.
2) Deal with isDir() in 2.2 GA in preparation for item 3 coming after GA:
   a) Deprecate isDir().
   b) Add a new API that returns an enum (see FileContext).
3) Fix symlinks in a future release, hopefully the very next one after 2.2 GA:
   a) Change the stack to use the new API replacing isDir().
   b) Fix isDir() to do something smarter (we can detail this later, but there is a solution that has been discussed). This helps customer applications that call isDir().
   c) Remove isDir() in a future release when customers have had sufficient time to migrate.

sanjay

PS. J. Rottinghuis expressed a similar sentiment in a previous email in this thread:

On Sep 18, 2013, at 5:11 PM, J. Rottinghuis wrote:

I like symlink functionality, but in our migration to Hadoop 2.x this is a total distraction. If the APIs stay in 2.2 GA we'll have to choose to:
a) Not uprev until symlink support is figured out up and down the stack, and we've been able to migrate all our 1.x (equivalent) clusters to 2.x (equivalent). Or
b) rip out the API altogether. Or
c) change the implementation to throw an UnsupportedOperationException.

I'm not sure yet which of these I like least.
-- CONFIDENTIALITY NOTICE NOTICE: This message is intended for the use of the individual or entity to which it is addressed and may contain information that is confidential, privileged and exempt from disclosure under applicable law. If the reader of this message is not the intended recipient, you are hereby notified that any printing, copying, dissemination, distribution, disclosure or forwarding of this communication is strictly prohibited. If you have received this communication in error, please contact the sender immediately and delete it from your system. Thank You.
Re: symlink support in Hadoop 2 GA
There are a number of issues (some minor, some more than minor). GA is close and we are still in discussion on some of them; while I believe we will close on these very shortly, a code change like this so close to GA is dangerous. I suggest we do the following:

1) Disable symlinks in 2.2 GA: throw an unsupported exception on createSymlink in both FileSystem and FileContext.
2) Deal with isDir() in 2.2 GA in preparation for item 3 coming after GA:
   a) Deprecate isDir().
   b) Add a new API that returns an enum (see FileContext).
3) Fix symlinks in a future release, hopefully the very next one after 2.2 GA:
   a) Change the stack to use the new API replacing isDir().
   b) Fix isDir() to do something smarter (we can detail this later, but there is a solution that has been discussed). This helps customer applications that call isDir().
   c) Remove isDir() in a future release when customers have had sufficient time to migrate.

sanjay

PS. J. Rottinghuis expressed a similar sentiment in a previous email in this thread:

On Sep 18, 2013, at 5:11 PM, J. Rottinghuis wrote:

I like symlink functionality, but in our migration to Hadoop 2.x this is a total distraction. If the APIs stay in 2.2 GA we'll have to choose to:
a) Not uprev until symlink support is figured out up and down the stack, and we've been able to migrate all our 1.x (equivalent) clusters to 2.x (equivalent). Or
b) rip out the API altogether. Or
c) change the implementation to throw an UnsupportedOperationException.

I'm not sure yet which of these I like least.
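Items 1 and 2 of Sanjay's proposal can be illustrated with a small, self-contained sketch. This is not Hadoop code: ToyFileSystem, FileType and getFileType are hypothetical names invented for the example, and the real FileSystem and FileContext signatures are not reproduced here. It only shows the shape of the idea: reject symlink creation outright, deprecate the boolean check, and offer a call that returns an enum.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only; every name here is hypothetical and not part of Hadoop.
public class SymlinkPlanSketch {

    // Stands in for the proposed "new API that returns an enum".
    enum FileType { FILE, DIRECTORY, SYMLINK }

    static class ToyFileSystem {
        private final Map<String, FileType> entries = new HashMap<>();

        void put(String path, FileType type) {
            entries.put(path, type);
        }

        // Item 1: symlink creation is disabled by always throwing.
        void createSymlink(String target, String link) {
            throw new UnsupportedOperationException("Symlinks not supported");
        }

        // Item 2b: a three-state answer instead of a boolean.
        FileType getFileType(String path) {
            FileType type = entries.get(path);
            if (type == null) {
                throw new IllegalArgumentException("No such path: " + path);
            }
            return type;
        }

        // Item 2a: the old boolean check stays for now but is deprecated.
        @Deprecated
        boolean isDir(String path) {
            return getFileType(path) == FileType.DIRECTORY;
        }
    }

    public static void main(String[] args) {
        ToyFileSystem fs = new ToyFileSystem();
        fs.put("/data", FileType.DIRECTORY);
        System.out.println(fs.getFileType("/data"));
        try {
            fs.createSymlink("/data", "/link");
        } catch (UnsupportedOperationException e) {
            System.out.println("createSymlink rejected: " + e.getMessage());
        }
    }
}
```

Because no symlink can be created while item 1 is in force, callers of the deprecated boolean never see the third enum value until the feature is re-enabled.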
Re: symlink support in Hadoop 2 GA
I reluctantly agree that we should disable symlinks in 2.2 until we can sort out the compatibility issues. I'm reluctant in the sense that it's a feature users have long wanted, and it's something we'd like to use from an administrative view. However, I don't see all the issues being sorted out in the very near future.

I filed some jiras today that have led me to believe that the current implementation of fs symlinks is irreparably flawed. Adding optional primitives to filesystems to make them symlink capable is ok. However, adding symlink resolution to individual filesystems is fundamentally broken. It doesn't work for stacked filesystems (viewfs, chroots, filters, etc.) because the resolution must occur at the highest level, not within an individual filesystem itself. Otherwise the abstraction of the top-level filesystem is violated and all kinds of unexpected behavior, like walking out of chroots, become possible.

Daryn

On Oct 3, 2013, at 1:39 PM, sanjay Radia wrote:

There are a number of issues (some minor, some more than minor). GA is close and we are still in discussion on some of them; while I believe we will close on these very shortly, a code change like this so close to GA is dangerous. I suggest we do the following:

1) Disable symlinks in 2.2 GA: throw an unsupported exception on createSymlink in both FileSystem and FileContext.
2) Deal with isDir() in 2.2 GA in preparation for item 3 coming after GA:
   a) Deprecate isDir().
   b) Add a new API that returns an enum (see FileContext).
3) Fix symlinks in a future release, hopefully the very next one after 2.2 GA:
   a) Change the stack to use the new API replacing isDir().
   b) Fix isDir() to do something smarter (we can detail this later, but there is a solution that has been discussed). This helps customer applications that call isDir().
   c) Remove isDir() in a future release when customers have had sufficient time to migrate.

sanjay

PS. J. Rottinghuis expressed a similar sentiment in a previous email in this thread:

On Sep 18, 2013, at 5:11 PM, J. Rottinghuis wrote:

I like symlink functionality, but in our migration to Hadoop 2.x this is a total distraction. If the APIs stay in 2.2 GA we'll have to choose to:
a) Not uprev until symlink support is figured out up and down the stack, and we've been able to migrate all our 1.x (equivalent) clusters to 2.x (equivalent). Or
b) rip out the API altogether. Or
c) change the implementation to throw an UnsupportedOperationException.

I'm not sure yet which of these I like least.
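Daryn's objection to resolving symlinks inside an individual filesystem can be illustrated with a toy model. This is one reading of the argument, not a reproduction of any Hadoop class: the link table, the "/jail" chroot root and all method names below are assumptions made up for the example. A chroot wrapper prefixes paths and delegates; if the wrapped filesystem follows a link itself, the absolute target is interpreted in the inner namespace and escapes the chroot, whereas resolving at the top level keeps the target inside it.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only; all names and paths are hypothetical, not Hadoop APIs.
public class ChrootSymlinkSketch {

    // Inner filesystem that resolves symlinks itself: link path -> absolute target.
    static String resolveInBase(Map<String, String> links, String path) {
        String target = links.get(path);
        return target != null ? target : path;
    }

    // Chroot wrapper that only prefixes the path and delegates resolution downwards.
    static String openViaChroot(Map<String, String> links, String root, String path) {
        return resolveInBase(links, root + path);
    }

    // Resolution at the top level: the link target is treated as a path in the
    // chroot's own namespace, so it is prefixed with the root again.
    static String openWithTopLevelResolution(Map<String, String> links, String root, String path) {
        String target = links.get(root + path);
        String logical = target != null ? target : path;
        return root + logical;
    }

    public static void main(String[] args) {
        Map<String, String> links = new HashMap<>();
        links.put("/jail/link", "/etc/secret");

        // Inner resolution hands back a path outside the chroot root.
        System.out.println(openViaChroot(links, "/jail", "/link"));
        // Top-level resolution stays under the chroot root.
        System.out.println(openWithTopLevelResolution(links, "/jail", "/link"));
    }
}
```

The same reasoning would apply to the other stacked filesystems Daryn lists, since each of them rewrites paths before delegating.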
Re: symlink support in Hadoop 2 GA
A side note on the protobuf versions: you can have a client and a server using different versions of protobuf; that works and it works well. What you cannot do is compile with protoc version X and run using the JAR from version Y.

On Thu, Sep 19, 2013 at 2:11 AM, J. Rottinghuis jrottingh...@gmail.com wrote:

However painful protobuf version changes are at build time for Hadoop developers, at runtime with multiple clusters and many Hadoop users this is a total nightmare. Even upgrading clusters from one protobuf version to the next is going to be very difficult. The same users will run jobs on, and/or read/write to, multiple clusters. That means that they will have to fork their code, run multiple instances? Or at the very least they have to do an update to their applications. All in sync with Hadoop cluster changes. And these are not doable in a rolling fashion. All Hadoop and HBase clusters will all upgrade at the same time, or we'll have to have our users fork / roll multiple versions? My point is that these things are much harder than "just fix the (Jenkins) build and we're done". These changes are massively disruptive.

There is a similar situation with symlinks. Having an API that lets users create symlinks is very problematic. Some users create symlinks and, as Eli pointed out, somebody else (or an automated process) tries to copy to / from another (Hadoop 1.x?) cluster over hftp. What will happen?

Having an API that people should not use is also a nightmare. We experienced this with append. For a while it was there, but users were not allowed to use it (or else there were large numbers of corrupt blocks). If there is an API to create a symlink, then some of our users are going to use it and others are going to trip over those symlinks. We already know that Pig does not work with symlinks yet, and as Steve pointed out, there is tons of other code out there that assumes that !isDir() means isFile().

I like symlink functionality, but in our migration to Hadoop 2.x this is a total distraction. If the APIs stay in 2.2 GA we'll have to choose to:
a) Not uprev until symlink support is figured out up and down the stack, and we've been able to migrate all our 1.x (equivalent) clusters to 2.x (equivalent). Or
b) rip out the API altogether. Or
c) change the implementation to throw an UnsupportedOperationException.

I'm not sure yet which of these I like least.

Thanks,
Joep

On Wed, Sep 18, 2013 at 9:48 AM, Arun C Murthy a...@hortonworks.com wrote:

On Sep 16, 2013, at 6:49 PM, Andrew Wang andrew.w...@cloudera.com wrote:

Hi all,

I wanted to broadcast plans for putting the FileSystem symlinks work (HADOOP-8040) into branch-2.1 for the pending Hadoop 2 GA release. I think it's pretty important we get it in since it's not a compatible change; if it misses the GA train, we're not going to have symlinks until the next major release.

Just catching up, is this an incompatible change, or not? The above reads 'not an incompatible change'.

Arun

However, we're still dealing with ongoing issues revealed via testing. There's user code out there that only handles files and directories and will barf when given a symlink (perhaps a dangling one!). See HADOOP-9912 for a nice example where globStatus returning symlinks broke Pig; some of us had a conference call to talk it through, and one definite conclusion was that this wasn't solvable in a generally compatible manner.

There are also still some gaps in symlink support right now. For example, the more esoteric FileSystems like WebHDFS, HttpFS, and HFTP need symlink resolution, and tooling like the FsShell and Distcp still need to be updated as well.

So, there's definitely work to be done, but there are a lot of users interested in the feature, and symlinks really should be in GA. Would appreciate any thoughts/input on the matter.

Thanks,
Andrew

-- Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/

-- Alejandro
Re: symlink support in Hadoop 2 GA
What we're trying to get to here is a consensus on whether FileSystem#listStatus and FileSystem#globStatus should return symlinks __as_symlinks__. If 2.1-beta goes out with these semantics, I think we are not going to be able to change them later. That is what will happen in the "do nothing" scenario.

Also see Jason Lowe's comment here:
https://issues.apache.org/jira/browse/HADOOP-9912?focusedCommentId=13772002&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13772002

Colin

On Wed, Sep 18, 2013 at 5:11 PM, J. Rottinghuis jrottingh...@gmail.com wrote:

However painful protobuf version changes are at build time for Hadoop developers, at runtime with multiple clusters and many Hadoop users this is a total nightmare. Even upgrading clusters from one protobuf version to the next is going to be very difficult. The same users will run jobs on, and/or read/write to, multiple clusters. That means that they will have to fork their code, run multiple instances? Or at the very least they have to do an update to their applications. All in sync with Hadoop cluster changes. And these are not doable in a rolling fashion. All Hadoop and HBase clusters will all upgrade at the same time, or we'll have to have our users fork / roll multiple versions? My point is that these things are much harder than "just fix the (Jenkins) build and we're done". These changes are massively disruptive.

There is a similar situation with symlinks. Having an API that lets users create symlinks is very problematic. Some users create symlinks and, as Eli pointed out, somebody else (or an automated process) tries to copy to / from another (Hadoop 1.x?) cluster over hftp. What will happen?

Having an API that people should not use is also a nightmare. We experienced this with append. For a while it was there, but users were not allowed to use it (or else there were large numbers of corrupt blocks). If there is an API to create a symlink, then some of our users are going to use it and others are going to trip over those symlinks. We already know that Pig does not work with symlinks yet, and as Steve pointed out, there is tons of other code out there that assumes that !isDir() means isFile().

I like symlink functionality, but in our migration to Hadoop 2.x this is a total distraction. If the APIs stay in 2.2 GA we'll have to choose to:
a) Not uprev until symlink support is figured out up and down the stack, and we've been able to migrate all our 1.x (equivalent) clusters to 2.x (equivalent). Or
b) rip out the API altogether. Or
c) change the implementation to throw an UnsupportedOperationException.

I'm not sure yet which of these I like least.

Thanks,
Joep

On Wed, Sep 18, 2013 at 9:48 AM, Arun C Murthy a...@hortonworks.com wrote:

On Sep 16, 2013, at 6:49 PM, Andrew Wang andrew.w...@cloudera.com wrote:

Hi all,

I wanted to broadcast plans for putting the FileSystem symlinks work (HADOOP-8040) into branch-2.1 for the pending Hadoop 2 GA release. I think it's pretty important we get it in since it's not a compatible change; if it misses the GA train, we're not going to have symlinks until the next major release.

Just catching up, is this an incompatible change, or not? The above reads 'not an incompatible change'.

Arun

However, we're still dealing with ongoing issues revealed via testing. There's user code out there that only handles files and directories and will barf when given a symlink (perhaps a dangling one!). See HADOOP-9912 for a nice example where globStatus returning symlinks broke Pig; some of us had a conference call to talk it through, and one definite conclusion was that this wasn't solvable in a generally compatible manner.

There are also still some gaps in symlink support right now. For example, the more esoteric FileSystems like WebHDFS, HttpFS, and HFTP need symlink resolution, and tooling like the FsShell and Distcp still need to be updated as well.

So, there's definitely work to be done, but there are a lot of users interested in the feature, and symlinks really should be in GA. Would appreciate any thoughts/input on the matter.

Thanks,
Andrew

-- Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/
Re: symlink support in Hadoop 2 GA
On 17 September 2013 23:05, Eli Collins e...@cloudera.com wrote:

(Looping in Arun since this impacts 2.x releases)

I updated the versions on HADOOP-8040 and sub-tasks to reflect where the changes have landed. All of these changes (modulo HADOOP-9417) were merged to branch-2.1 and are in the 2.1.0 release.

While symlinks are in 2.1.0 I don't think we can really claim they're ready until issues like HADOOP-9912 are resolved, and they are supported in the shell, distcp and WebHDFS/HttpFS/Hftp (these are not esoteric!). Someone can create a symlink with FileSystem, causing someone else's distcp job to fail. Unlikely given they're not exposed outside the Java API, but still not great. Ideally this work would have been done on a feature branch and then merged when complete, but that's water under the bridge.

I see the following options:

1. Fixup the current symlink support so that symlinks are ready for 2.2 (GA), or at least the public APIs. This means the APIs will be in GA from the get go so while the functionality might be fully baked we don't have to worry about incompatible changes like FileStatus#isDir() changing behavior in 2.3 or a later update. The downside is this will take at least a couple weeks (to resolve HADOOP-9912 and potentially implement the remaining pieces) and so may impact the 2.2 release timing. This option means 2.2 won't remove the new APIs introduced in 2.1. We'd want to spin a 2.1.2 beta with the new API changes so we don't introduce new APIs in the beta to GA transition.

I'm reluctant about this because, as well as delaying the release, we are going to find problems all the way up the stack, which will require a choreographed set of changes. Given the grief of the protobuf update, I don't want to go near that just before the final release.

We already have lots of 1.x era code that assumes !isDir() == isFile(). I know that from spending lots of time in the FS specification layer. That's something which is going to break with symlinks, irrespective of when the feature is rolled out.

The other thing we have to do is push back the API changes into 1.x, at least at the FileSystem interface layer, so that code which uses isDirectory, isSymlink, etc. does not need to be edited to compile and run against both versions. I know Chris Nauroth has been doing this, but I think we need to make sure it is all there. This will let things like Pig compile against all versions with symlink-ready code.

The other issue is that it goes on to increase the pressure to get other features in there: "hey, we've got 2 more weeks! let's add X!" (where for me, X:={HADOOP-8545, some restrictions on valid names of app types and instance names for YARN, ...}).

My vote then: freeze and ship. We're happy with the wire formats, the API has added knowledge of symlinks, and filesystem features can evolve afterwards, with layers above handling the changes.

2. Revert symlinks from branch-2.1-beta and branch-2. Finish up the work in trunk (or a feature branch) and merge for a subsequent 2.x update. While this helps get us to GA faster it would be preferable to get an API change like this in for 2.2 GA since they may be disruptive to introduce in an update (e.g. see example in #1). And of course our users would like symlinks functionality in the GA release. This option would mean 2.2 is incompatible with 2.1 because it's dropping the new APIs, not ideal for a beta to GA transition.

Why not just ship as is, with a note: "symlinks not live yet, leave alone"? That's what's been in the betas to date.

3. Revert and punt symlinks to 3.x. IMO should be the last resort.

I'd prefer it in 2.3, which is where I'm targeting all my feature creep. IMO 2.1 is frozen except for bug fixes.
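The "!isDir() == isFile()" assumption raised in this message can be shown with a short, self-contained sketch. It does not use Hadoop classes; the Kind enum and both counting methods are hypothetical, written only to show how two-state logic miscounts once a third entry type exists.

```java
import java.util.Arrays;
import java.util.List;

// Toy model only; Kind and both methods are hypothetical, not Hadoop APIs.
public class IsDirAssumptionSketch {

    enum Kind { FILE, DIRECTORY, SYMLINK }

    // Mimics a two-state check: the only question asked is "is it a directory?".
    static boolean isDir(Kind kind) {
        return kind == Kind.DIRECTORY;
    }

    // 1.x-era style: whatever is not a directory is assumed to be a file.
    static int countFilesLegacy(List<Kind> listing) {
        int files = 0;
        for (Kind kind : listing) {
            if (!isDir(kind)) {
                files++;
            }
        }
        return files;
    }

    // Symlink-aware style: only entries that really are files are counted.
    static int countFilesSymlinkAware(List<Kind> listing) {
        int files = 0;
        for (Kind kind : listing) {
            if (kind == Kind.FILE) {
                files++;
            }
        }
        return files;
    }

    public static void main(String[] args) {
        List<Kind> listing = Arrays.asList(Kind.FILE, Kind.DIRECTORY, Kind.SYMLINK);
        System.out.println("legacy count: " + countFilesLegacy(listing));
        System.out.println("symlink-aware count: " + countFilesSymlinkAware(listing));
    }
}
```

With a listing that contains one symlink, the legacy count includes the link as if it were a file, which is the kind of latent breakage that does not show up at compile time.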
Re: symlink support in Hadoop 2 GA
On Wed, Sep 18, 2013 at 11:29 AM, Steve Loughran ste...@hortonworks.com wrote:

I'm reluctant about this because, as well as delaying the release, we are going to find problems all the way up the stack, which will require a choreographed set of changes. Given the grief of the protobuf update, I don't want to go near that just before the final release.

Well, I would use the exact same argument used for protobuf (whose only complication was getting protoc 2.5.0 onto the Jenkins boxes and telling developers to do the same; other than that we didn't hit any other issue AFAIK) ...

IMO, it makes more sense to do this change during the beta rather than at GA. That gives us more flexibility to iron out things if necessary.

thx

-- Alejandro
Re: symlink support in Hadoop 2 GA
On 18 September 2013 12:53, Alejandro Abdelnur t...@cloudera.com wrote:

On Wed, Sep 18, 2013 at 11:29 AM, Steve Loughran ste...@hortonworks.com wrote:

I'm reluctant about this because, as well as delaying the release, we are going to find problems all the way up the stack, which will require a choreographed set of changes. Given the grief of the protobuf update, I don't want to go near that just before the final release.

Well, I would use the exact same argument used for protobuf (whose only complication was getting protoc 2.5.0 onto the Jenkins boxes and telling developers to do the same; other than that we didn't hit any other issue AFAIK) ...

protobuf was traumatic at build time, as I recall, because it was neither forwards nor backwards compatible. Those of us trying to build different branches had to choose which version to have on the path, or set up scripts to do the switching. HBase needed rebuilding, so did other things. And I still have the pain of downloading and installing protoc on all Linux VMs I build up going forward, until apt-get and yum have protoc 2.5 artifacts.

This means it was very painful and added a lot of late-breaking pain for developers, but it had one key feature that gave it an edge: it was immediately obvious where you had a problem, as things didn't compile or classload without linkage problems. No latent bugs, unless protobuf 2.5 has them internally, for which we have to rely on Google's release testing to have found. That is a lot simpler to regression test than adding any new feature to HDFS and seeing what breaks, as that is something that only surfaces out in the field.

Which is why I think it's too late in the 2.1 release timetable to add symlinks. We've had a 2.1-beta out there, we've got feedback. Fix those problems that are show stoppers, but don't add more stuff.

Which is precisely why I have not been pushing in any of my recent changes. I may seem ruthless arguing against symlinks, but I'm not being inconsistent with my own commit history. The only two things I've put in branch-2.1 since beta-1 were a separate log for the Configuration deprecation warnings and a patch to the POM for a java7 build on OSX: and they weren't even my patches.

-Steve

(One of these days I should volunteer to be the release manager and it'll be obvious that Arun is being quite amenable to all the other developers)

IMO, it makes more sense to do this change during the beta rather than at GA. That gives us more flexibility to iron out things if necessary.

I'm arguing this change can go into the beta of the successor to 2.1, not GA.
Re: symlink support in Hadoop 2 GA
On Wed, Sep 18, 2013 at 5:45 AM, Steve Loughran ste...@hortonworks.com wrote:

On 18 September 2013 12:53, Alejandro Abdelnur t...@cloudera.com wrote:

On Wed, Sep 18, 2013 at 11:29 AM, Steve Loughran ste...@hortonworks.com wrote:

I'm reluctant about this because, as well as delaying the release, we are going to find problems all the way up the stack, which will require a choreographed set of changes. Given the grief of the protobuf update, I don't want to go near that just before the final release.

Well, I would use the exact same argument used for protobuf (whose only complication was getting protoc 2.5.0 onto the Jenkins boxes and telling developers to do the same; other than that we didn't hit any other issue AFAIK) ...

protobuf was traumatic at build time, as I recall, because it was neither forwards nor backwards compatible. Those of us trying to build different branches had to choose which version to have on the path, or set up scripts to do the switching. HBase needed rebuilding, so did other things. And I still have the pain of downloading and installing protoc on all Linux VMs I build up going forward, until apt-get and yum have protoc 2.5 artifacts.

This means it was very painful and added a lot of late-breaking pain for developers, but it had one key feature that gave it an edge: it was immediately obvious where you had a problem, as things didn't compile or classload without linkage problems. No latent bugs, unless protobuf 2.5 has them internally, for which we have to rely on Google's release testing to have found. That is a lot simpler to regression test than adding any new feature to HDFS and seeing what breaks, as that is something that only surfaces out in the field.

Which is why I think it's too late in the 2.1 release timetable to add symlinks. We've had a 2.1-beta out there, we've got feedback. Fix those problems that are show stoppers, but don't add more stuff.

Which is precisely why I have not been pushing in any of my recent changes. I may seem ruthless arguing against symlinks, but I'm not being inconsistent with my own commit history. The only two things I've put in branch-2.1 since beta-1 were a separate log for the Configuration deprecation warnings and a patch to the POM for a java7 build on OSX: and they weren't even my patches.

-Steve

(One of these days I should volunteer to be the release manager and it'll be obvious that Arun is being quite amenable to all the other developers)

IMO, it makes more sense to do this change during the beta rather than at GA. That gives us more flexibility to iron out things if necessary.

I'm arguing this change can go into the beta of the successor to 2.1, not GA.

What does "this change" refer to? Symlinks are already in 2.1, and the existing semantics create problems for programs (e.g. see the Pig example in HADOOP-9912) that we need to resolve. I don't think "do nothing" is an option for 2.2 GA.

Thanks,
Eli
Re: symlink support in Hadoop 2 GA
On Sep 16, 2013, at 6:49 PM, Andrew Wang andrew.w...@cloudera.com wrote:

Hi all,

I wanted to broadcast plans for putting the FileSystem symlinks work (HADOOP-8040) into branch-2.1 for the pending Hadoop 2 GA release. I think it's pretty important we get it in since it's not a compatible change; if it misses the GA train, we're not going to have symlinks until the next major release.

Just catching up, is this an incompatible change, or not? The above reads 'not an incompatible change'.

Arun

However, we're still dealing with ongoing issues revealed via testing. There's user code out there that only handles files and directories and will barf when given a symlink (perhaps a dangling one!). See HADOOP-9912 for a nice example where globStatus returning symlinks broke Pig; some of us had a conference call to talk it through, and one definite conclusion was that this wasn't solvable in a generally compatible manner.

There are also still some gaps in symlink support right now. For example, the more esoteric FileSystems like WebHDFS, HttpFS, and HFTP need symlink resolution, and tooling like the FsShell and Distcp still need to be updated as well.

So, there's definitely work to be done, but there are a lot of users interested in the feature, and symlinks really should be in GA. Would appreciate any thoughts/input on the matter.

Thanks,
Andrew

-- Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/
Re: symlink support in Hadoop 2 GA
However painful protobuf version changes are at build time for Hadoop developers, at runtime with multiple clusters and many Hadoop users this is a total nightmare. Even upgrading clusters from one protobuf version to the next is going to be very difficult. The same users will run jobs on, and/or read/write to, multiple clusters. Does that mean they will have to fork their code and run multiple instances? At the very least they will have to update their applications, all in sync with Hadoop cluster changes. And these are not doable in a rolling fashion. Will all Hadoop and HBase clusters upgrade at the same time, or will we have to have our users fork / roll multiple versions? My point is that these things are much harder than "just fix the (Jenkins) build and we're done". These changes are massively disruptive.

There is a similar situation with symlinks. Having an API that lets users create symlinks is very problematic. Some users create symlinks and, as Eli pointed out, somebody else (or an automated process) tries to copy to / from another (Hadoop 1.x?) cluster over hftp. What will happen?

Having an API that people should not use is also a nightmare. We experienced this with append. For a while it was there, but users were not allowed to use it (or else there were large numbers of corrupt blocks). If there is an API to create a symlink, then some of our users are going to use it and others are going to trip over those symlinks. We already know that Pig does not work with symlinks yet, and as Steve pointed out, there is tons of other code out there that assumes that !isDir() means isFile().

I like symlink functionality, but in our migration to Hadoop 2.x this is a total distraction. If the APIs stay in 2.2 GA we'll have to choose to:

a) not uprev until symlink support is figured out up and down the stack, and we've been able to migrate all our 1.x (equivalent) clusters to 2.x (equivalent); or
b) rip out the API altogether; or
c) change the implementation to throw an UnsupportedOperationException.

I'm not sure yet which of these I like least.

Thanks, Joep
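To make the "!isDir() means isFile()" point concrete, here is a minimal sketch of how such code misclassifies a symlink. It deliberately uses a hypothetical Entry class instead of Hadoop's FileStatus so that it runs standalone; the class and method names (EntryKindDemo, classifyNaive, classifyAware) are illustrative assumptions, not anything from the Hadoop code base.

```java
/** Illustrative only: Entry is a hypothetical stand-in, not Hadoop's FileStatus. */
final class EntryKindDemo {
    enum Kind { FILE, DIRECTORY, SYMLINK }

    /** Minimal model of a listed entry that can answer the three type questions. */
    static final class Entry {
        final String path;
        final Kind kind;

        Entry(String path, Kind kind) {
            this.path = path;
            this.kind = kind;
        }

        boolean isFile() { return kind == Kind.FILE; }
        boolean isDirectory() { return kind == Kind.DIRECTORY; }
        boolean isSymlink() { return kind == Kind.SYMLINK; }
    }

    /** The pre-symlink assumption: whatever is not a directory must be a file. */
    static String classifyNaive(Entry e) {
        return e.isDirectory() ? "directory" : "file";
    }

    /** A three-way check that treats symlinks as their own kind of entry. */
    static String classifyAware(Entry e) {
        if (e.isSymlink()) {
            return "symlink";
        }
        if (e.isDirectory()) {
            return "directory";
        }
        if (e.isFile()) {
            return "file";
        }
        throw new IllegalStateException("unknown entry kind: " + e.path);
    }

    public static void main(String[] args) {
        Entry link = new Entry("/data/current", Kind.SYMLINK);
        System.out.println(classifyNaive(link)); // prints "file"
        System.out.println(classifyAware(link)); // prints "symlink"
    }
}
```

The naive version never fails loudly; it just goes on to treat the link as a regular file, which is why this kind of breakage tends to show up far away from the code that created the symlink.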
Re: symlink support in Hadoop 2 GA
I think it makes sense to finish symlinks support in the Hadoop 2 GA release.

Colin
Re: symlink support in Hadoop 2 GA
I agree that this is an important change. However, 2.2.0 GA is getting ready to roll out in weeks. I am concerned that these changes will add not only incompatible changes late in the game, but also possibly instability. Java API incompatibility is something we have avoided for the most part, and I am concerned that this is adding such incompatibility in FileSystem APIs.

We should find workarounds, possibly by adding newer APIs and leaving existing APIs as is. If this can be done, my vote is to enable this feature in 2.3. Even if it cannot be done, I am concerned that this is coming quite late, and we should see if we could allow some incompatible changes into 2.3 for this feature.
Re: symlink support in Hadoop 2 GA
The issue is not modifying existing APIs. The issue is that code has been written that makes assumptions that are incompatible with the existence of things that are not files or directories. For example, there is a lot of code out there that looks at FileStatus#isFile, and if it returns false, assumes that what it is looking at is a directory. In the case of a symlink, this assumption is incorrect.

Faced with this, we have considered making the default behavior of listStatus and globStatus fully resolve symlinks and simply not list dangling symlinks. Code which is prepared to deal with symlinks can use newer versions of the listStatus and globStatus functions which do return symlinks as symlinks. We might consider having FileSystem#listStatus and FileSystem#globStatus fully resolve symlinks by default, and having FileContext#listStatus and FileContext#Util#globStatus do the opposite. This seems like the maximally compatible solution that we're going to get. I think this makes sense.

The alternative is kicking the can down the road to Hadoop 3, and letting vendors of alternative (including some proprietary alternative) systems continue to claim that Hadoop doesn't support symlinks yet (with some justice).

P.S. I would be fine with putting this in 2.2 or 2.3 if that seems more appropriate.

sincerely, Colin
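As a rough illustration of the two listing behaviors described above (a default that resolves links and omits dangling ones, next to a symlink-aware variant that returns links as they are), here is a self-contained sketch over an in-memory namespace. Everything in it is a hypothetical model: ResolvingListingDemo, listResolved and listRaw are made-up names, not the FileSystem or FileContext APIs, and the question of whether to report the resolved path or the link's own path is left open (the sketch reports the link's own path).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative in-memory model; not the Hadoop FileSystem or FileContext API. */
final class ResolvingListingDemo {
    /** Maps each path to its symlink target, or to null for a plain file or directory. */
    private final Map<String, String> entries = new LinkedHashMap<>();

    void addPlain(String path) {
        entries.put(path, null);
    }

    void addSymlink(String path, String target) {
        entries.put(path, target);
    }

    /** Follows links until a plain entry is reached; returns null if dangling or looping. */
    String resolve(String path) {
        String current = path;
        for (int hops = 0; hops < 32; hops++) {
            if (!entries.containsKey(current)) {
                return null; // the chain points at something that does not exist
            }
            String target = entries.get(current);
            if (target == null) {
                return current; // reached a plain file or directory
            }
            current = target;
        }
        return null; // too many hops, treat as unresolvable
    }

    /** Legacy-style listing: only entries that resolve; dangling links are silently omitted. */
    List<String> listResolved() {
        List<String> out = new ArrayList<>();
        for (String path : entries.keySet()) {
            if (resolve(path) != null) {
                out.add(path);
            }
        }
        return out;
    }

    /** Symlink-aware listing: every entry as stored, including dangling links. */
    List<String> listRaw() {
        return new ArrayList<>(entries.keySet());
    }

    public static void main(String[] args) {
        ResolvingListingDemo fs = new ResolvingListingDemo();
        fs.addPlain("/a");
        fs.addSymlink("/good", "/a");
        fs.addSymlink("/bad", "/missing");
        System.out.println(fs.listResolved()); // prints "[/a, /good]"
        System.out.println(fs.listRaw());      // prints "[/a, /good, /bad]"
    }
}
```

Old callers that only ever see the first listing never encounter an entry they cannot open, at the cost of one resolution per link on every listing.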
Re: symlink support in Hadoop 2 GA
I encourage interested parties to read through HADOOP-9912 to get a feel for the issues. There really is no way to add symlink support without changing the behavior of existing APIs. Ultimately, anything that returns a FileStatus is going to be different. Even if we default to resolving symlinks, resolving can lead to FileNotFound or permission errors. Thus, we have to choose whether to prune the bad links, show the bad links as dangling, or throw an exception. None of these options are compatible.

I'm really concerned about putting this in a minor release like 2.3 since it has the potential to break a lot of user code. HADOOP-9912 is an example from within our own ecosystem, but think of all the custom user code out there written against FileSystem. 2.2 GA is basically our last chance to make this kind of change before Hadoop 3.

Thanks, Andrew
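The three choices for links that fail to resolve (prune, show as dangling, or throw) can be sketched side by side. This is only a toy model under stated assumptions: DanglingPolicyDemo, Policy and list are hypothetical names, links are a plain map from link path to target, and the failure case is modeled as an unchecked wrapper around FileNotFoundException so the sketch stays short.

```java
import java.io.FileNotFoundException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Illustrative only: a toy model of the three options, not Hadoop code. */
final class DanglingPolicyDemo {
    enum Policy { PRUNE, SHOW_DANGLING, THROW }

    /**
     * Lists link paths. A link resolves if its target is in the existing set;
     * what happens to a link that does not resolve depends on the policy.
     */
    static List<String> list(Map<String, String> links, Set<String> existing, Policy policy) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, String> link : links.entrySet()) {
            if (existing.contains(link.getValue())) {
                out.add(link.getKey());
                continue;
            }
            switch (policy) {
                case PRUNE:
                    break; // drop the bad link from the listing
                case SHOW_DANGLING:
                    out.add(link.getKey() + " (dangling)");
                    break;
                case THROW:
                    throw new UncheckedIOException(
                        new FileNotFoundException(link.getValue()));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> links = new LinkedHashMap<>();
        links.put("/good", "/a");
        links.put("/bad", "/missing");
        Set<String> existing = new TreeSet<>();
        existing.add("/a");
        System.out.println(list(links, existing, Policy.PRUNE));         // prints "[/good]"
        System.out.println(list(links, existing, Policy.SHOW_DANGLING)); // prints "[/good, /bad (dangling)]"
    }
}
```

Each policy changes what an existing caller observes: a shorter listing, an entry it cannot open, or an exception it never had to handle before.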
Re: symlink support in Hadoop 2 GA
(Looping in Arun since this impacts 2.x releases)

I updated the versions on HADOOP-8040 and sub-tasks to reflect where the changes have landed. All of these changes (modulo HADOOP-9417) were merged to branch-2.1 and are in the 2.1.0 release.

While symlinks are in 2.1.0 I don't think we can really claim they're ready until issues like HADOOP-9912 are resolved, and they are supported in the shell, distcp and WebHDFS/HttpFS/Hftp (these are not esoteric!). Someone can create a symlink with FileSystem, causing someone else's distcp job to fail. That is unlikely given symlinks are not exposed outside the Java API, but still not great. Ideally this work would have been done on a feature branch and then merged when complete, but that's water under the bridge.

I see the following options:

1. Fix up the current symlink support so that symlinks, or at least the public APIs, are ready for 2.2 (GA). This means the APIs will be in GA from the get-go, so while the functionality might not be fully baked we don't have to worry about incompatible changes like FileStatus#isDir() changing behavior in 2.3 or a later update. The downside is this will take at least a couple of weeks (to resolve HADOOP-9912 and potentially implement the remaining pieces) and so may impact the 2.2 release timing. This option means 2.2 won't remove the new APIs introduced in 2.1. We'd want to spin a 2.1.2 beta with the new API changes so we don't introduce new APIs in the beta to GA transition.

2. Revert symlinks from branch-2.1-beta and branch-2. Finish up the work in trunk (or a feature branch) and merge for a subsequent 2.x update. While this helps get us to GA faster, it would be preferable to get an API change like this in for 2.2 GA since it may be disruptive to introduce in an update (e.g. see the example in #1). And of course our users would like symlinks functionality in the GA release. This option would mean 2.2 is incompatible with 2.1 because it's dropping the new APIs, which is not ideal for a beta to GA transition.

3. Revert and punt symlinks to 3.x. IMO this should be the last resort.

If we have sufficient time I think option #1 would be best. What do others think?

Thanks, Eli
symlink support in Hadoop 2 GA
Hi all,

I wanted to broadcast plans for putting the FileSystem symlinks work (HADOOP-8040) into branch-2.1 for the pending Hadoop 2 GA release. I think it's pretty important we get it in since it's not a compatible change; if it misses the GA train, we're not going to have symlinks until the next major release.

However, we're still dealing with ongoing issues revealed via testing. There's user-code out there that only handles files and directories and will barf when given a symlink (perhaps a dangling one!). See HADOOP-9912 for a nice example where globStatus returning symlinks broke Pig; some of us had a conference call to talk it through, and one definite conclusion was that this wasn't solvable in a generally compatible manner.

There are also still some gaps in symlink support right now. For example, the more esoteric FileSystems like WebHDFS, HttpFS, and HFTP need symlink resolution, and tooling like the FsShell and Distcp still need to be updated as well.

So, there's definitely work to be done, but there are a lot of users interested in the feature, and symlinks really should be in GA. Would appreciate any thoughts/input on the matter.

Thanks, Andrew