Re: symlink support in Hadoop 2 GA

2013-10-04 Thread Andrew Wang
Colin posted a summary of our phone call yesterday (attendees: myself,
Colin, Daryn, Nathan, Jason, Chris, Suresh, Sanjay) on HADOOP-9984:

https://issues.apache.org/jira/browse/HADOOP-9984?focusedCommentId=13785701&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13785701

Pasted here:


   - We discussed alternatives to HADOOP-9984
   (https://issues.apache.org/jira/browse/HADOOP-9984),
   but concluded that they weren't workable.
   - We agreed that doing the symlink resolution in each FileSystem
   subclass is what we ought to do in 9984, in order to keep compatibility
   with out-of-tree filesystems.
   - We agreed to disable symlink resolution in Hadoop 2 GA. We will spend
   a few weeks ironing out all the bugs and enable it in Hadoop 2.3. However,
   we would like to make all backwards-incompatible API changes prior to
   Hadoop 2 GA.
   - We agreed that HADOOP-9972
   (https://issues.apache.org/jira/browse/HADOOP-9972), a new
   symlink-aware API for globStatus, should get into Hadoop 2 GA.
   - We discussed the issue of returning resolved paths versus unresolved
   paths, but were unable to come to any conclusion. Everyone agreed that
   there would be serious performance problems if we returned unresolved
   paths, but some claimed that programs would break when encountering
   resolved paths.


There's also a new umbrella issue at HADOOP-10019 tracking on-going
symlinks changes.

Best,
Andrew


On Thu, Oct 3, 2013 at 2:08 PM, Daryn Sharp da...@yahoo-inc.com wrote:

 I reluctantly agree that we should disable symlinks in 2.2 until we can
 sort out the compatibility issues.  I'm reluctant in the sense that it's a
 feature users have long wanted, and it's something we'd like to use from an
 administrative view.  However, I don't see all the issues being sorted out
 in the very near future.

 I filed some jiras today that have led me to believe that the current
 implementation of fs symlinks is irreparably flawed.  Adding optional
 primitives to filesystems to make them symlink capable is ok.  However,
 adding symlink resolution to individual filesystems is fundamentally
 broken.  It doesn't work for stacked filesystems (viewfs, chroots, filters,
 etc) because the resolution must occur at the highest level, not within an
 individual filesystem itself.  Otherwise the abstraction of the top-level
 filesystem is violated and all kinds of unexpected behavior like walking
 out of chroots becomes possible.

 Daryn

 On Oct 3, 2013, at 1:39 PM, sanjay Radia wrote:

  There are a number of issues (some minor, some more than minor).
  GA is close and we are still in discussion on some of them; while I
  believe we will close on these very shortly, a code change like this so
  close to GA is dangerous.

  I suggest we do the following:
  1) Disable symlinks in 2.2 GA: throw an unsupported-operation exception on
     createSymlink in both FileSystem and FileContext.
  2) Deal with isDir() in 2.2 GA in preparation for item 3 coming after GA:
     a) Deprecate isDir().
     b) Add a new API that returns an enum (see FileContext).
  3) Fix symlinks in a future release, hopefully the very next one after
     2.2 GA:
     a) Change the stack to use the new API replacing isDir().
     b) Fix isDir() to do something smarter (we can detail this later, but
        there is a solution that has been discussed). This helps customer
        applications that call isDir().
     c) Remove isDir() in a future release when customers have had
        sufficient time to migrate.
 
  sanjay
 
  PS. J Rottinghuis expressed a similar sentiment in a previous email in
 this thread:
 
 
 
  On Sep 18, 2013, at 5:11 PM, J. Rottinghuis wrote:
 
  I like symlink functionality, but in our migration to Hadoop 2.x this is a
  total distraction. If the APIs stay in 2.2 GA we'll have to choose to:
  a) Not uprev until symlink support is figured out up and down the stack,
  and we've been able to migrate all our 1.x (equivalent) clusters to 2.x
  (equivalent). Or
  b) rip out the API altogether. Or
  c) change the implementation to throw an UnsupportedOperationException
  I'm not sure yet which of these I like least.
 
 
  --
  CONFIDENTIALITY NOTICE
  NOTICE: This message is intended for the use of the individual or entity to
  which it is addressed and may contain information that is confidential,
  privileged and exempt from disclosure under applicable law. If the reader
  of this message is not the intended recipient, you are hereby notified that
  any printing, copying, dissemination, distribution, disclosure or
  forwarding of this communication is strictly prohibited. If you have
  received this communication in error, please contact the sender immediately
  and delete it from your system. Thank You.




Re: symlink support in Hadoop 2 GA

2013-10-03 Thread sanjay Radia
There are a number of issues (some minor, some more than minor).
GA is close and we are still in discussion on some of them; while I
believe we will close on these very shortly, a code change like this so
close to GA is dangerous.

I suggest we do the following:
1) Disable symlinks in 2.2 GA: throw an unsupported-operation exception on
   createSymlink in both FileSystem and FileContext.
2) Deal with isDir() in 2.2 GA in preparation for item 3 coming after GA:
   a) Deprecate isDir().
   b) Add a new API that returns an enum (see FileContext).
3) Fix symlinks in a future release, hopefully the very next one after
   2.2 GA:
   a) Change the stack to use the new API replacing isDir().
   b) Fix isDir() to do something smarter (we can detail this later, but
      there is a solution that has been discussed). This helps customer
      applications that call isDir().
   c) Remove isDir() in a future release when customers have had
      sufficient time to migrate.
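
A rough illustration of steps 1 through 3a, as a self-contained Java sketch.
Nothing here is the real org.apache.hadoop.fs API: SymlinkGateSketch,
EntryType, Status, and describe() are hypothetical names made up for the
example, and the real enum and method names may well differ.

```java
// Hypothetical sketch only: these are stand-ins, not Hadoop classes.
class SymlinkGateSketch {

    /** Three-valued entry type that a boolean isDir() cannot express. */
    enum EntryType { FILE, DIRECTORY, SYMLINK }

    /** Minimal stand-in for a file status object. */
    static final class Status {
        private final String path;
        private final EntryType type;

        Status(String path, EntryType type) {
            this.path = path;
            this.type = type;
        }

        String getPath() {
            return path;
        }

        /** Step 2b: the new API returns an enum instead of a boolean. */
        EntryType getType() {
            return type;
        }

        /** Step 2a: the old check stays for now but is marked deprecated. */
        @Deprecated
        boolean isDir() {
            return type == EntryType.DIRECTORY;
        }
    }

    /** Step 1: symlink creation is rejected until the feature is ready. */
    static void createSymlink(String target, String link) {
        throw new UnsupportedOperationException(
            "Symlinks are not supported in this release");
    }

    /** Step 3a: callers switch on the enum, so a symlink is never taken for a file. */
    static String describe(Status s) {
        switch (s.getType()) {
            case DIRECTORY: return "directory";
            case SYMLINK:   return "symlink";
            default:        return "file";
        }
    }

    public static void main(String[] args) {
        System.out.println(describe(new Status("/data/link", EntryType.SYMLINK)));
        try {
            createSymlink("/data/real", "/data/link");
        } catch (UnsupportedOperationException e) {
            System.out.println("createSymlink rejected: " + e.getMessage());
        }
    }
}
```

The point of the enum is that existing callers of isDir() keep compiling
(with a deprecation warning) while new code gets an explicit third case.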

sanjay

PS. J Rottinghuis expressed a similar sentiment in a previous email in this 
thread:



On Sep 18, 2013, at 5:11 PM, J. Rottinghuis wrote:

 I like symlink functionality, but in our migration to Hadoop 2.x this is a
 total distraction. If the APIs stay in 2.2 GA we'll have to choose to:
 a) Not uprev until symlink support is figured out up and down the stack,
 and we've been able to migrate all our 1.x (equivalent) clusters to 2.x
 (equivalent). Or
 b) rip out the API altogether. Or
 c) change the implementation to throw an UnsupportedOperationException
 I'm not sure yet which of these I like least.




Re: symlink support in Hadoop 2 GA

2013-10-03 Thread Daryn Sharp
I reluctantly agree that we should disable symlinks in 2.2 until we can sort
out the compatibility issues.  I'm reluctant in the sense that it's a feature
users have long wanted, and it's something we'd like to use from an
administrative view.  However, I don't see all the issues being sorted out in
the very near future.

I filed some jiras today that have led me to believe that the current 
implementation of fs symlinks is irreparably flawed.  Adding optional 
primitives to filesystems to make them symlink capable is ok.  However, adding 
symlink resolution to individual filesystems is fundamentally broken.  It 
doesn't work for stacked filesystems (viewfs, chroots, filters, etc) because 
the resolution must occur at the highest level, not within an individual 
filesystem itself.  Otherwise the abstraction of the top-level filesystem is 
violated and all kinds of unexpected behavior like walking out of chroots 
becomes possible.
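
To make the chroot concern concrete, here is a toy Java sketch. It is not
Hadoop code: BaseFs, ChrootFs, and the sample paths are hypothetical, and
only absolute link targets are modeled. It contrasts resolving a link inside
the wrapped filesystem with resolving it at the top (chroot) level.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical toy model; none of these types exist in Hadoop.
class ChrootSymlinkSketch {

    /** Toy backing store: path -> contents for files, path -> target for links. */
    static final class BaseFs {
        final Map<String, String> files = new HashMap<>();
        final Map<String, String> links = new HashMap<>();

        /** Inner-level resolution: follows links in the base namespace. */
        String readResolving(String path) {
            String p = path;
            // Bounded loop so a symlink cycle cannot spin forever.
            for (int hops = 0; hops < 8 && links.containsKey(p); hops++) {
                p = links.get(p);
            }
            return files.get(p);
        }

        /** Primitive read that never follows links. */
        String readRaw(String path) {
            return files.get(path);
        }
    }

    /** Chroot-style wrapper that prefixes every path with a root directory. */
    static final class ChrootFs {
        final BaseFs base;
        final String root;

        ChrootFs(BaseFs base, String root) {
            this.base = base;
            this.root = root;
        }

        /** Problem case: resolution is delegated to the wrapped filesystem. */
        String readInnerResolution(String path) {
            return base.readResolving(root + path);
        }

        /** Top-level case: each link target is re-interpreted inside the chroot. */
        String readTopLevelResolution(String path) {
            String p = path;
            for (int hops = 0; hops < 8 && base.links.containsKey(root + p); hops++) {
                p = base.links.get(root + p);
            }
            return base.readRaw(root + p);
        }
    }

    /** Builds a chroot at /jail containing a link whose target is "/secret". */
    static ChrootFs sample() {
        BaseFs base = new BaseFs();
        base.files.put("/secret", "outside");
        base.files.put("/jail/secret", "inside");
        base.links.put("/jail/link", "/secret");
        return new ChrootFs(base, "/jail");
    }

    public static void main(String[] args) {
        ChrootFs fs = sample();
        System.out.println(fs.readInnerResolution("/link"));    // prints "outside"
        System.out.println(fs.readTopLevelResolution("/link")); // prints "inside"
    }
}
```

With inner resolution the link target is looked up in the base namespace, so
the read lands on /secret outside the jail; with top-level resolution the
same target is mapped back through the chroot to /jail/secret.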

Daryn

On Oct 3, 2013, at 1:39 PM, sanjay Radia wrote:

 There are a number of issues (some minor, some more than minor).
 GA is close and we are still in discussion on some of them; while I
 believe we will close on these very shortly, a code change like this so
 close to GA is dangerous.

 I suggest we do the following:
 1) Disable symlinks in 2.2 GA: throw an unsupported-operation exception on
    createSymlink in both FileSystem and FileContext.
 2) Deal with isDir() in 2.2 GA in preparation for item 3 coming after GA:
    a) Deprecate isDir().
    b) Add a new API that returns an enum (see FileContext).
 3) Fix symlinks in a future release, hopefully the very next one after
    2.2 GA:
    a) Change the stack to use the new API replacing isDir().
    b) Fix isDir() to do something smarter (we can detail this later, but
       there is a solution that has been discussed). This helps customer
       applications that call isDir().
    c) Remove isDir() in a future release when customers have had
       sufficient time to migrate.
 
 sanjay
 
 PS. J Rottinghuis expressed a similar sentiment in a previous email in this 
 thread:
 
 
 
 On Sep 18, 2013, at 5:11 PM, J. Rottinghuis wrote:
 
 I like symlink functionality, but in our migration to Hadoop 2.x this is a
 total distraction. If the APIs stay in 2.2 GA we'll have to choose to:
 a) Not uprev until symlink support is figured out up and down the stack,
 and we've been able to migrate all our 1.x (equivalent) clusters to 2.x
 (equivalent). Or
 b) rip out the API altogether. Or
 c) change the implementation to throw an UnsupportedOperationException
 I'm not sure yet which of these I like least.
 
 



Re: symlink support in Hadoop 2 GA

2013-09-19 Thread Alejandro Abdelnur
A side note on the protobuf versions: you can have a client and a server
using different versions of protobuf; that works, and it works well. What
you cannot do is compile with protoc version X and run using the JAR from
version Y.


On Thu, Sep 19, 2013 at 2:11 AM, J. Rottinghuis jrottingh...@gmail.com wrote:

 However painful protobuf version changes are at build time for Hadoop
 developers, at runtime with multiple clusters and many Hadoop users this is
 a total nightmare.
 Even upgrading clusters from one protobuf version to the next is going to
 be very difficult. The same users will run jobs on, and/or read/write to,
 multiple clusters. Does that mean they will have to fork their code and run
 multiple instances? Or at the very least they will have to update their
 applications, all in sync with Hadoop cluster changes. And these are not
 doable in a rolling fashion.
 Will all Hadoop and HBase clusters upgrade at the same time, or will we
 have to have our users fork / roll multiple versions?
 My point is that these things are much harder than "just fix the (Jenkins)
 build and we're done". These changes are massively disruptive.

 There is a similar situation with symlinks. Having an API that lets users
 create symlinks is very problematic. Some users create symlinks and as Eli
 pointed out, somebody else (or automated process) tries to copy to / from
 another (Hadoop 1.x?) cluster over hftp. What will happen ?
 Having an API that people should not use is also a nightmare. We
 experienced this with append. For a while it was there, but users were not
 allowed to use it (or else there were large #'s of corrupt blocks). If
 there is an API to create a symlink, then some of our users are going to
 use it and others are going to trip over those symlinks. We already know
 that Pig does not work with symlinks yet, and as Steve pointed out, there
 is tons of other code out there that assumes that !isDir() means isFile().

 I like symlink functionality, but in our migration to Hadoop 2.x this is a
 total distraction. If the APIs stay in 2.2 GA we'll have to choose to:
 a) Not uprev until symlink support is figured out up and down the stack,
 and we've been able to migrate all our 1.x (equivalent) clusters to 2.x
 (equivalent). Or
 b) rip out the API altogether. Or
 c) change the implementation to throw an UnsupportedOperationException
 I'm not sure yet which of these I like least.

 Thanks,

 Joep




 On Wed, Sep 18, 2013 at 9:48 AM, Arun C Murthy a...@hortonworks.com
 wrote:

 
  On Sep 16, 2013, at 6:49 PM, Andrew Wang andrew.w...@cloudera.com
 wrote:
 
   Hi all,
  
   I wanted to broadcast plans for putting the FileSystem symlinks work
   (HADOOP-8040) into branch-2.1 for the pending Hadoop 2 GA release. I think
   it's pretty important we get it in since it's not a compatible change; if
   it misses the GA train, we're not going to have symlinks until the next
   major release.
 
  Just catching up, is this an incompatible change, or not? The above reads
  'not an incompatible change'.
 
  Arun
 
  
   However, we're still dealing with ongoing issues revealed via testing.
   There's user-code out there that only handles files and directories and
   will barf when given a symlink (perhaps a dangling one!). See HADOOP-9912
   for a nice example where globStatus returning symlinks broke Pig; some of
   us had a conference call to talk it through, and one definite conclusion
   was that this wasn't solvable in a generally compatible manner.

   There are also still some gaps in symlink support right now. For example,
   the more esoteric FileSystems like WebHDFS, HttpFS, and HFTP need symlink
   resolution, and tooling like the FsShell and Distcp still need to be
   updated as well.
  
   So, there's definitely work to be done, but there are a lot of users
   interested in the feature, and symlinks really should be in GA. Would
   appreciate any thoughts/input on the matter.
  
   Thanks,
   Andrew
 
  --
  Arun C. Murthy
  Hortonworks Inc.
  http://hortonworks.com/
 
 
 
 




-- 
Alejandro


Re: symlink support in Hadoop 2 GA

2013-09-19 Thread Colin McCabe
What we're trying to get to here is a consensus on whether
FileSystem#listStatus and FileSystem#globStatus should return symlinks
__as_symlinks__.  If 2.1-beta goes out with these semantics, I think
we are not going to be able to change them later.  That is what will
happen in the "do nothing" scenario.

Also see Jason Lowe's comment here:
https://issues.apache.org/jira/browse/HADOOP-9912?focusedCommentId=13772002&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13772002

Colin


On Wed, Sep 18, 2013 at 5:11 PM, J. Rottinghuis jrottingh...@gmail.com wrote:
 However painful protobuf version changes are at build time for Hadoop
 developers, at runtime with multiple clusters and many Hadoop users this is
 a total nightmare.
 Even upgrading clusters from one protobuf version to the next is going to
 be very difficult. The same users will run jobs on, and/or read/write to,
 multiple clusters. Does that mean they will have to fork their code and run
 multiple instances? Or at the very least they will have to update their
 applications, all in sync with Hadoop cluster changes. And these are not
 doable in a rolling fashion.
 Will all Hadoop and HBase clusters upgrade at the same time, or will we
 have to have our users fork / roll multiple versions?
 My point is that these things are much harder than "just fix the (Jenkins)
 build and we're done". These changes are massively disruptive.

 There is a similar situation with symlinks. Having an API that lets users
 create symlinks is very problematic. Some users create symlinks and as Eli
 pointed out, somebody else (or automated process) tries to copy to / from
 another (Hadoop 1.x?) cluster over hftp. What will happen ?
 Having an API that people should not use is also a nightmare. We
 experienced this with append. For a while it was there, but users were not
 allowed to use it (or else there were large #'s of corrupt blocks). If
 there is an API to create a symlink, then some of our users are going to
 use it and others are going to trip over those symlinks. We already know
 that Pig does not work with symlinks yet, and as Steve pointed out, there
 is tons of other code out there that assumes that !isDir() means isFile().

 I like symlink functionality, but in our migration to Hadoop 2.x this is a
 total distraction. If the APIs stay in 2.2 GA we'll have to choose to:
 a) Not uprev until symlink support is figured out up and down the stack,
 and we've been able to migrate all our 1.x (equivalent) clusters to 2.x
 (equivalent). Or
 b) rip out the API altogether. Or
 c) change the implementation to throw an UnsupportedOperationException
 I'm not sure yet which of these I like least.

 Thanks,

 Joep




 On Wed, Sep 18, 2013 at 9:48 AM, Arun C Murthy a...@hortonworks.com wrote:


 On Sep 16, 2013, at 6:49 PM, Andrew Wang andrew.w...@cloudera.com wrote:

  Hi all,
 
  I wanted to broadcast plans for putting the FileSystem symlinks work
  (HADOOP-8040) into branch-2.1 for the pending Hadoop 2 GA release. I think
  it's pretty important we get it in since it's not a compatible change; if
  it misses the GA train, we're not going to have symlinks until the next
  major release.

 Just catching up, is this an incompatible change, or not? The above reads
 'not an incompatible change'.

 Arun

 
  However, we're still dealing with ongoing issues revealed via testing.
  There's user-code out there that only handles files and directories and
  will barf when given a symlink (perhaps a dangling one!). See HADOOP-9912
  for a nice example where globStatus returning symlinks broke Pig; some of
  us had a conference call to talk it through, and one definite conclusion
  was that this wasn't solvable in a generally compatible manner.
 
  There are also still some gaps in symlink support right now. For example,
  the more esoteric FileSystems like WebHDFS, HttpFS, and HFTP need symlink
  resolution, and tooling like the FsShell and Distcp still need to be
  updated as well.
 
  So, there's definitely work to be done, but there are a lot of users
  interested in the feature, and symlinks really should be in GA. Would
  appreciate any thoughts/input on the matter.
 
  Thanks,
  Andrew

 --
 Arun C. Murthy
 Hortonworks Inc.
 http://hortonworks.com/






Re: symlink support in Hadoop 2 GA

2013-09-18 Thread Steve Loughran
On 17 September 2013 23:05, Eli Collins e...@cloudera.com wrote:

 (Looping in Arun since this impacts 2.x releases)

 I updated the versions on HADOOP-8040 and sub-tasks to reflect where
 the changes have landed. All of these changes (modulo HADOOP-9417)
 were merged to branch-2.1 and are in the 2.1.0 release.

 While symlinks are in 2.1.0 I don't think we can really claim they're
 ready until issues like HADOOP-9912 are resolved, and they are
 supported in the shell, distcp and WebHDFS/HttpFS/Hftp (these are not
 esoteric!).  Someone can create a symlink with FileSystem causing
 someone else's distcp job to fail. Unlikely given they're not exposed
 outside the Java API but still not great.   Ideally this work would
 have been done on a feature branch and then merged when complete, but
 that's water under the bridge.

 I see the following options:

 1. Fixup the current symlink support so that symlinks are ready for
 2.2 (GA), or at least the public APIs. This means the APIs will be in
 GA from the get go so while the functionality might be fully baked we
 don't have to worry about incompatible changes like FileStatus#isDir()
 changing behavior in 2.3 or a later update.  The downside is this will
 take at least a couple weeks (to resolve HADOOP-9912 and potentially
 implement the remaining pieces) and so may impact the 2.2 release
 timing. This option means 2.2 won't remove the new APIs introduced in
 2.1.  We'd want to spin a 2.1.2 beta with the new API changes so we
 don't introduce new APIs in the beta to GA transition.


I'm reluctant to do this: besides delaying the release, we are going
to find problems all the way up the stack, which will require a
choreographed set of changes. Given the grief of the protobuf update, I
don't want to go near that just before the final release.


We already have lots of 1.x-era code that assumes !isDir() == isFile(); I
know that from spending lots of time in the FS specification layer. That's
something which is going to break with symlinks, irrespective of when the
feature is rolled out.
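
A small self-contained Java sketch of that assumption breaking. The Entry and
Kind types and both counting helpers are hypothetical, written only for this
example; they are not the Hadoop FileStatus API.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; Entry and Kind are not Hadoop types.
class IsDirAssumptionSketch {

    enum Kind { FILE, DIRECTORY, SYMLINK }

    static final class Entry {
        final String name;
        final Kind kind;

        Entry(String name, Kind kind) {
            this.name = name;
            this.kind = kind;
        }

        boolean isDir() {
            return kind == Kind.DIRECTORY;
        }

        boolean isFile() {
            return kind == Kind.FILE;
        }
    }

    /** Legacy pattern: anything that is not a directory is treated as a file. */
    static int countFilesLegacy(List<Entry> entries) {
        int n = 0;
        for (Entry e : entries) {
            if (!e.isDir()) {
                n++;
            }
        }
        return n;
    }

    /** Symlink-aware pattern: only entries that really are files are counted. */
    static int countFilesExplicit(List<Entry> entries) {
        int n = 0;
        for (Entry e : entries) {
            if (e.isFile()) {
                n++;
            }
        }
        return n;
    }

    /** A listing with one file, one directory, and one symlink. */
    static List<Entry> sampleListing() {
        return Arrays.asList(
            new Entry("a.txt", Kind.FILE),
            new Entry("logs", Kind.DIRECTORY),
            new Entry("latest", Kind.SYMLINK));
    }

    public static void main(String[] args) {
        List<Entry> listing = sampleListing();
        System.out.println(countFilesLegacy(listing));   // prints 2
        System.out.println(countFilesExplicit(listing)); // prints 1
    }
}
```

The legacy count silently includes the symlink, which is the kind of latent
behavior change that would only surface once listings start returning links.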

The other thing we have to do is push the API changes back into 1.x, at
least at the FileSystem interface layer, so that code which uses
isDirectory, isSymlink, etc. does not need to be edited to compile and run
against both versions. I know Chris Nauroth has been doing this, but I think
we need to make sure it is all there. This will let things like Pig compile
against all versions with symlink-ready code.

The other issue is that it goes on to increase the pressure to get other
features in there: "hey, we've got 2 more weeks! let's add X!" (where for me,
X := {HADOOP-8545, some restrictions on valid names of app types and instance
names for YARN, ...}).

My vote then: freeze and ship. We're happy with the wire formats, the API
has added knowledge of symlinks, and filesystem features can evolve
afterwards, with layers above handling the changes.




 2. Revert symlinks from branch-2.1-beta and branch-2. Finish up the
 work in trunk (or a feature branch) and merge for a subsequent 2.x
 update.  While this helps get us to GA faster it would be preferable
 to get an API change like this in for 2.2 GA since they may be
 disruptive to introduce in an update (eg see example in #1). And of
 course our users would like symlinks functionality in the GA release.
 This option would mean 2.2 is incompatible with 2.1 because it's
 dropping the new APIs, not ideal for a beta to GA transition.



Why not just ship as is, with a note: "symlinks not live yet, leave alone"?
That's what's been in the betas to date.



 3. Revert and punt symlinks to 3.x.  IMO should be the last resort.


I'd prefer it in 2.3 -which is where I'm targeting all my feature creep.

IMO 2.1 is frozen except for bug fixes



Re: symlink support in Hadoop 2 GA

2013-09-18 Thread Alejandro Abdelnur
On Wed, Sep 18, 2013 at 11:29 AM, Steve Loughran ste...@hortonworks.com wrote:

 I'm reluctant to do this: besides delaying the release, we are going
 to find problems all the way up the stack, which will require a
 choreographed set of changes. Given the grief of the protobuf update, I
 don't want to go near that just before the final release.


Well, I would use the exact same argument used for protobuf (whose only
complication was getting protoc 2.5.0 onto the Jenkins boxes and telling
developers to do the same; other than that we didn't hit any other issue,
AFAIK) ...

IMO, it makes more sense to do this change during the beta rather than at
GA. That gives us more flexibility to iron out things if necessary.

thx

-- 
Alejandro


Re: symlink support in Hadoop 2 GA

2013-09-18 Thread Steve Loughran
On 18 September 2013 12:53, Alejandro Abdelnur t...@cloudera.com wrote:

 On Wed, Sep 18, 2013 at 11:29 AM, Steve Loughran ste...@hortonworks.com
 wrote:

  I'm reluctant to do this: besides delaying the release, we are going
  to find problems all the way up the stack, which will require a
  choreographed set of changes. Given the grief of the protobuf update, I
  don't want to go near that just before the final release.
 

 Well, I would use the exact same argument used for protobuf (whose only
 complication was getting protoc 2.5.0 onto the Jenkins boxes and telling
 developers to do the same; other than that we didn't hit any other issue,
 AFAIK) ...


protobuf was traumatic at build time, as I recall, because it was neither
forwards nor backwards compatible. Those of us trying to build different
branches had to choose which version to have on the path, or set up scripts
to do the switching. HBase needed rebuilding, so did other things. And I
still have the pain of downloading and installing protoc on all Linux VMs I
build up going forward, until apt-get and yum have protoc 2.5 artifacts.

This means it added a lot of late-breaking pain for developers, but it had
one key feature that gave it an edge: it was immediately obvious where you
had a problem, as things didn't compile or classload without linkage
problems. No latent bugs, unless protobuf 2.5 has them internally, for
which we have to rely on Google's release testing to have found them.

That is a lot simpler to regression test than adding any new feature to
HDFS and seeing what breaks -as that is something that only surfaces out in
the field. Which is why I think it's too late in the 2.1 release timetable
to add symlinks. We've had a 2.1-beta out there, we've got feedback. Fix
those problems that are show stoppers, but don't add more stuff. Which is
precisely why I have not been pushing in any of my recent changes. I may
seem ruthless arguing against symlinks -but I'm not being inconsistent with
my own commit history. The only two things I've put in branch-2.1 since
beta-1 were a separate log for the Configuration deprecation warnings and a
patch to the POM for a java7 build on OSX: and they weren't even my patches.


-Steve

(One of these days I should volunteer to be the release manager and it'll
be obvious that Arun is being quite amenable to all the other developers)




 IMO, it makes more sense to do this change during the beta rather than when
 GA. That gives us more flexibility to iron out things if necessary.


I'm arguing this change can go into the beta of the successor to 2.1 -not
GA.



Re: symlink support in Hadoop 2 GA

2013-09-18 Thread Eli Collins
On Wed, Sep 18, 2013 at 5:45 AM, Steve Loughran ste...@hortonworks.com wrote:

 On 18 September 2013 12:53, Alejandro Abdelnur t...@cloudera.com wrote:

  On Wed, Sep 18, 2013 at 11:29 AM, Steve Loughran ste...@hortonworks.com
  wrote:
 
   I'm reluctant to do this: besides delaying the release, we are going
   to find problems all the way up the stack, which will require a
   choreographed set of changes. Given the grief of the protobuf update, I
   don't want to go near that just before the final release.
  
 
  Well, I would use the exact same argument used for protobuf (whose only
  complication was getting protoc 2.5.0 onto the Jenkins boxes and telling
  developers to do the same; other than that we didn't hit any other issue,
  AFAIK) ...
 

 protobuf was traumatic at build time, as I recall, because it was neither
 forwards nor backwards compatible. Those of us trying to build different
 branches had to choose which version to have on the path, or set up scripts
 to do the switching. HBase needed rebuilding, so did other things. And I
 still have the pain of downloading and installing protoc on all Linux VMs I
 build up going forward, until apt-get and yum have protoc 2.5 artifacts.

 This means it added a lot of late-breaking pain for developers, but it had
 one key feature that gave it an edge: it was immediately obvious where you
 had a problem, as things didn't compile or classload without linkage
 problems. No latent bugs, unless protobuf 2.5 has them internally, for
 which we have to rely on Google's release testing to have found them.

 That is a lot simpler to regression test than adding any new feature to
 HDFS and seeing what breaks -as that is something that only surfaces out in
 the field. Which is why I think it's too late in the 2.1 release timetable
 to add symlinks. We've had a 2.1-beta out there, we've got feedback. Fix
 those problems that are show stoppers, but don't add more stuff. Which is
 precisely why I have not been pushing in any of my recent changes. I may
 seem ruthless arguing against symlinks -but I'm not being inconsistent with
 my own commit history. The only two things I've put in branch-2.1 since
 beta-1 were a separate log for the Configuration deprecation warnings and a
 patch to the POM for a java7 build on OSX: and they weren't even my
 patches.


 -Steve

 (One of these days I should volunteer to be the release manager and it'll
 be obvious that Arun is being quite amenable to all the other developers)



 
  IMO, it makes more sense to do this change during the beta rather than
 when
  GA. That gives us more flexibility to iron out things if necessary.
 
 
 I'm arguing this change can go into the beta of the successor to 2.1 -not
 GA.


What does "this change" refer to?  Symlinks are already in 2.1, and the
existing semantics create problems for programs (e.g. see the Pig
example in HADOOP-9912) that we need to resolve.  I don't think "do
nothing" is an option for 2.2 GA.

Thanks,
Eli










Re: symlink support in Hadoop 2 GA

2013-09-18 Thread Arun C Murthy

On Sep 16, 2013, at 6:49 PM, Andrew Wang andrew.w...@cloudera.com wrote:

 Hi all,
 
 I wanted to broadcast plans for putting the FileSystem symlinks work
 (HADOOP-8040) into branch-2.1 for the pending Hadoop 2 GA release. I think
 it's pretty important we get it in since it's not a compatible change; if
 it misses the GA train, we're not going to have symlinks until the next
 major release.

Just catching up, is this an incompatible change, or not? The above reads 'not 
an incompatible change'.

Arun

 
 However, we're still dealing with ongoing issues revealed via testing.
 There's user-code out there that only handles files and directories and
 will barf when given a symlink (perhaps a dangling one!). See HADOOP-9912
 for a nice example where globStatus returning symlinks broke Pig; some of
 us had a conference call to talk it through, and one definite conclusion
 was that this wasn't solvable in a generally compatible manner.
 
 There are also still some gaps in symlink support right now. For example,
 the more esoteric FileSystems like WebHDFS, HttpFS, and HFTP need symlink
 resolution, and tooling like the FsShell and Distcp still need to be
 updated as well.
 
 So, there's definitely work to be done, but there are a lot of users
 interested in the feature, and symlinks really should be in GA. Would
 appreciate any thoughts/input on the matter.
 
 Thanks,
 Andrew

--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/





Re: symlink support in Hadoop 2 GA

2013-09-18 Thread J. Rottinghuis
However painful protobuf version changes are at build time for Hadoop
developers, at runtime with multiple clusters and many Hadoop users this is
a total nightmare.
Even upgrading clusters from one protobuf version to the next is going to
be very difficult. The same users will run jobs on, and/or read/write to,
multiple clusters. That means that they will have to fork their code, run
multiple instances? Or at the very least they have to do an update to their
applications. All in sync with Hadoop cluster changes. And these are not
doable in a rolling fashion.
All Hadoop and HBase clusters will upgrade at the same time, or we'll
have to have our users fork / roll multiple versions?
My point is that these things are much harder than just "fix the (Jenkins)
build and we're done". These changes are massively disruptive.

There is a similar situation with symlinks. Having an API that lets users
create symlinks is very problematic. Some users create symlinks and, as Eli
pointed out, somebody else (or an automated process) tries to copy to / from
another (Hadoop 1.x?) cluster over hftp. What will happen?
Having an API that people should not use is also a nightmare. We
experienced this with append. For a while it was there, but users were not
allowed to use it (or else there were large #'s of corrupt blocks). If
there is an API to create a symlink, then some of our users are going to
use it and others are going to trip over those symlinks. We already know
that Pig does not work with symlinks yet, and as Steve pointed out, there
is tons of other code out there that assumes that !isDir() means isFile().

I like symlink functionality, but in our migration to Hadoop 2.x this is a
total distraction. If the APIs stay in 2.2 GA we'll have to choose to:
a) Not uprev until symlink support is figured out up and down the stack,
and we've been able to migrate all our 1.x (equivalent) clusters to 2.x
(equivalent). Or
b) rip out the API altogether. Or
c) change the implementation to throw an UnsupportedOperationException
I'm not sure yet which of these I like least.
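
A minimal sketch of what option (c) could look like, assuming a simple
on/off switch. SymlinkGate, symlinksEnabled and targetOf are hypothetical
names made up for illustration; this is not Hadoop's FileSystem code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical gate around symlink creation; not Hadoop's FileSystem class.
class SymlinkGate {
    private final boolean symlinksEnabled;
    private final Map<String, String> links = new HashMap<>(); // link -> target

    SymlinkGate(boolean symlinksEnabled) {
        this.symlinksEnabled = symlinksEnabled;
    }

    // Rejects the call outright while the feature is switched off.
    void createSymlink(String target, String link) {
        if (!symlinksEnabled) {
            throw new UnsupportedOperationException("Symlinks not supported");
        }
        links.put(link, target);
    }

    // Returns the recorded target, or null if no such link was created.
    String targetOf(String link) {
        return links.get(link);
    }
}
```

With the switch off, callers fail fast at creation time, so no symlink
exists for other users' jobs to trip over.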

Thanks,

Joep




On Wed, Sep 18, 2013 at 9:48 AM, Arun C Murthy a...@hortonworks.com wrote:


 On Sep 16, 2013, at 6:49 PM, Andrew Wang andrew.w...@cloudera.com wrote:

  Hi all,
 
  I wanted to broadcast plans for putting the FileSystem symlinks work
  (HADOOP-8040) into branch-2.1 for the pending Hadoop 2 GA release. I
 think
  it's pretty important we get it in since it's not a compatible change; if
  it misses the GA train, we're not going to have symlinks until the next
  major release.

 Just catching up, is this an incompatible change, or not? The above reads
 'not an incompatible change'.

 Arun

 
  However, we're still dealing with ongoing issues revealed via testing.
  There's user-code out there that only handles files and directories and
  will barf when given a symlink (perhaps a dangling one!). See HADOOP-9912
  for a nice example where globStatus returning symlinks broke Pig; some of
  us had a conference call to talk it through, and one definite conclusion
  was that this wasn't solvable in a generally compatible manner.
 
  There are also still some gaps in symlink support right now. For example,
  the more esoteric FileSystems like WebHDFS, HttpFS, and HFTP need symlink
  resolution, and tooling like the FsShell and Distcp still need to be
  updated as well.
 
  So, there's definitely work to be done, but there are a lot of users
  interested in the feature, and symlinks really should be in GA. Would
  appreciate any thoughts/input on the matter.
 
  Thanks,
  Andrew

 --
 Arun C. Murthy
 Hortonworks Inc.
 http://hortonworks.com/






Re: symlink support in Hadoop 2 GA

2013-09-17 Thread Colin McCabe
I think it makes sense to finish symlinks support in the Hadoop 2 GA release.

Colin

On Mon, Sep 16, 2013 at 6:49 PM, Andrew Wang andrew.w...@cloudera.com wrote:
 Hi all,

 I wanted to broadcast plans for putting the FileSystem symlinks work
 (HADOOP-8040) into branch-2.1 for the pending Hadoop 2 GA release. I think
 it's pretty important we get it in since it's not a compatible change; if
 it misses the GA train, we're not going to have symlinks until the next
 major release.

 However, we're still dealing with ongoing issues revealed via testing.
 There's user-code out there that only handles files and directories and
 will barf when given a symlink (perhaps a dangling one!). See HADOOP-9912
 for a nice example where globStatus returning symlinks broke Pig; some of
 us had a conference call to talk it through, and one definite conclusion
 was that this wasn't solvable in a generally compatible manner.

 There are also still some gaps in symlink support right now. For example,
 the more esoteric FileSystems like WebHDFS, HttpFS, and HFTP need symlink
 resolution, and tooling like the FsShell and Distcp still need to be
 updated as well.

 So, there's definitely work to be done, but there are a lot of users
 interested in the feature, and symlinks really should be in GA. Would
 appreciate any thoughts/input on the matter.

 Thanks,
 Andrew


Re: symlink support in Hadoop 2 GA

2013-09-17 Thread Suresh Srinivas
I agree that this is an important change. However, 2.2.0 GA is getting
ready to roll out in weeks. I am concerned that these changes will add not
only incompatible changes late in the game, but also possibly instability.
Java API incompatibility is something we have avoided for the most part
and I am concerned that this is adding such incompatibility in FileSystem
APIs. We should find workarounds by adding possibly newer APIs and leaving
existing APIs as is. If this can be done, my vote is to enable this feature
in 2.3. Even if it cannot be done, I am concerned that this is coming quite
late and we should see if we could allow some incompatible changes into 2.3
for this feature.


On Mon, Sep 16, 2013 at 6:49 PM, Andrew Wang andrew.w...@cloudera.com wrote:

 Hi all,

 I wanted to broadcast plans for putting the FileSystem symlinks work
 (HADOOP-8040) into branch-2.1 for the pending Hadoop 2 GA release. I think
 it's pretty important we get it in since it's not a compatible change; if
 it misses the GA train, we're not going to have symlinks until the next
 major release.

 However, we're still dealing with ongoing issues revealed via testing.
 There's user-code out there that only handles files and directories and
 will barf when given a symlink (perhaps a dangling one!). See HADOOP-9912
 for a nice example where globStatus returning symlinks broke Pig; some of
 us had a conference call to talk it through, and one definite conclusion
 was that this wasn't solvable in a generally compatible manner.

 There are also still some gaps in symlink support right now. For example,
 the more esoteric FileSystems like WebHDFS, HttpFS, and HFTP need symlink
 resolution, and tooling like the FsShell and Distcp still need to be
 updated as well.

 So, there's definitely work to be done, but there are a lot of users
 interested in the feature, and symlinks really should be in GA. Would
 appreciate any thoughts/input on the matter.

 Thanks,
 Andrew




-- 
http://hortonworks.com/download/



Re: symlink support in Hadoop 2 GA

2013-09-17 Thread Colin McCabe
The issue is not modifying existing APIs.  The issue is that code has
been written that makes assumptions that are incompatible with the
existence of things that are not files or directories.  For example,
there is a lot of code out there that looks at FileStatus#isFile, and
if it returns false, assumes that what it is looking at is a
directory.  In the case of a symlink, this assumption is incorrect.
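
A minimal sketch of that assumption and of a symlink-aware alternative.
Hadoop's FileStatus exposes isFile(), isDirectory() and isSymlink(); the
EntryKind enum and EntryClassifier class below are hypothetical stand-ins
so that the sketch runs without Hadoop on the classpath.

```java
// Hypothetical stand-in for the three kinds of entry a listing can return.
enum EntryKind { FILE, DIRECTORY, SYMLINK }

class EntryClassifier {
    // Legacy pattern: anything that is not a file is treated as a directory.
    static String legacyClassify(EntryKind kind) {
        return kind == EntryKind.FILE ? "file" : "directory";
    }

    // Symlink-aware pattern: each kind is tested explicitly.
    static String symlinkAwareClassify(EntryKind kind) {
        switch (kind) {
            case FILE:
                return "file";
            case DIRECTORY:
                return "directory";
            case SYMLINK:
                return "symlink";
            default:
                throw new IllegalArgumentException("unknown kind: " + kind);
        }
    }

    public static void main(String[] args) {
        // Print both classifications side by side for every kind.
        for (EntryKind kind : EntryKind.values()) {
            System.out.println(kind + ": legacy=" + legacyClassify(kind)
                + ", aware=" + symlinkAwareClassify(kind));
        }
    }
}
```

The legacy method reports a symlink as a directory, which is the kind of
silent misclassification existing callers would hit.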

Faced with this, we have considered making the default behavior of
listStatus and globStatus to be fully resolving symlinks, and simply
not listing dangling symlinks. Code which is prepared to deal with
symlinks can use newer versions of the listStatus and globStatus
functions which do return symlinks as symlinks.

We might consider having FileSystem#listStatus and
FileSystem#globStatus fully resolve symlinks by default and
having FileContext#listStatus and FileContext#Util#globStatus do
the opposite.  This seems like the maximally compatible solution that
we're going to get.  I think this makes sense.
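
A rough sketch of the two listing behaviors, using a toy in-memory
namespace rather than Hadoop's FileSystem or FileContext. ToyNamespace and
its methods are hypothetical, and reporting the resolved target path (not
the link's own path) in the resolving listing is an assumption of this
sketch, not a settled design.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy namespace: maps a path to its link target, or to null for a plain file.
class ToyNamespace {
    private final Map<String, String> entries = new LinkedHashMap<>();

    void addFile(String path) { entries.put(path, null); }

    void addSymlink(String path, String target) { entries.put(path, target); }

    // Follows links until a plain file is reached; null if dangling or cyclic.
    String resolve(String path) {
        String current = path;
        for (int hops = 0; hops <= entries.size(); hops++) {
            if (!entries.containsKey(current)) {
                return null; // target does not exist: dangling link
            }
            String target = entries.get(current);
            if (target == null) {
                return current; // reached a plain file
            }
            current = target;
        }
        return null; // hop limit exceeded: treat a cycle like a dangling link
    }

    // Resolving listing: links are replaced by their targets and
    // dangling links are left out.
    List<String> listResolved() {
        List<String> out = new ArrayList<>();
        for (String path : entries.keySet()) {
            String resolved = resolve(path);
            if (resolved != null) {
                out.add(resolved);
            }
        }
        return out;
    }

    // Raw listing: every entry as stored, symlinks included.
    List<String> listRaw() {
        return new ArrayList<>(entries.keySet());
    }
}
```

Legacy callers would see only the resolving listing, while symlink-aware
callers could opt into the raw one.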

The alternative is kicking the can down the road to Hadoop 3, and
letting vendors of alternative (including some proprietary
alternative) systems continue to claim that Hadoop doesn't support
symlinks yet (with some justice).

P.S.  I would be fine with putting this in 2.2 or 2.3 if that seems
more appropriate.

sincerely,
Colin

On Tue, Sep 17, 2013 at 8:23 AM, Suresh Srinivas sur...@hortonworks.com wrote:
 I agree that this is an important change. However, 2.2.0 GA is getting
 ready to rollout in weeks. I am concerned that these changes will add not
 only incompatible changes late in the game, but also possibly instability.
 Java API incompatibility is some thing we have avoided for the most part
 and I am concerned that this is adding such incompatibility in FileSystem
 APIs. We should find work arounds by adding possibly newer APIs and leaving
 existing APIs as is. If this can be done, my vote is to enable this feature
 in 2.3. Even if it cannot be done, I am concerned that this is coming quite
 late and we should see if could allow some incompatible changes into 2.3
 for this feature.


 On Mon, Sep 16, 2013 at 6:49 PM, Andrew Wang andrew.w...@cloudera.com wrote:

 Hi all,

 I wanted to broadcast plans for putting the FileSystem symlinks work
 (HADOOP-8040) into branch-2.1 for the pending Hadoop 2 GA release. I think
 it's pretty important we get it in since it's not a compatible change; if
 it misses the GA train, we're not going to have symlinks until the next
 major release.

 However, we're still dealing with ongoing issues revealed via testing.
 There's user-code out there that only handles files and directories and
 will barf when given a symlink (perhaps a dangling one!). See HADOOP-9912
 for a nice example where globStatus returning symlinks broke Pig; some of
 us had a conference call to talk it through, and one definite conclusion
 was that this wasn't solvable in a generally compatible manner.

 There are also still some gaps in symlink support right now. For example,
 the more esoteric FileSystems like WebHDFS, HttpFS, and HFTP need symlink
 resolution, and tooling like the FsShell and Distcp still need to be
 updated as well.

 So, there's definitely work to be done, but there are a lot of users
 interested in the feature, and symlinks really should be in GA. Would
 appreciate any thoughts/input on the matter.

 Thanks,
 Andrew




 --
 http://hortonworks.com/download/



Re: symlink support in Hadoop 2 GA

2013-09-17 Thread Andrew Wang
I encourage interested parties to read through HADOOP-9912 to get a feel
for the issues. There really is no way to add symlink support without
changing the behavior of existing APIs. Ultimately, anything that returns a
FileStatus is going to be different. Even if we default to resolving
symlinks, resolving can lead to FileNotFound or permission errors. Thus, we
have to choose whether to prune the bad links, show the bad links as
dangling, or throw an exception. None of these options are compatible.
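
The three choices can be sketched as a policy switch. This is an
illustration only: DanglingPolicy and DanglingLinkListing are hypothetical
names, and the maps and sets stand in for a real namespace.

```java
import java.io.FileNotFoundException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// The three ways a listing could treat a link whose target is missing.
enum DanglingPolicy { PRUNE, SHOW_AS_DANGLING, THROW }

class DanglingLinkListing {
    // links maps each link path to its target; existingTargets holds the
    // paths that actually exist. Both are stand-ins for a real namespace.
    static List<String> list(Map<String, String> links,
                             Set<String> existingTargets,
                             DanglingPolicy policy) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, String> link : links.entrySet()) {
            if (existingTargets.contains(link.getValue())) {
                out.add(link.getKey()); // healthy link: always listed
                continue;
            }
            switch (policy) {
                case PRUNE:
                    break; // silently drop the entry
                case SHOW_AS_DANGLING:
                    out.add(link.getKey() + " (dangling)");
                    break;
                case THROW:
                    // Unchecked wrapper keeps the method signature simple.
                    throw new UncheckedIOException(
                        new FileNotFoundException(link.getValue()));
            }
        }
        return out;
    }
}
```

Each policy changes what an existing caller observes: fewer entries, an
entry of a kind it has never seen, or an exception it did not expect.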

I'm really concerned about putting this in a minor release like 2.3 since
it has the potential to break a lot of user code. HADOOP-9912 is an example
from within our own ecosystem, but think of all the custom user code out
there written against FileSystem. 2.2 GA is basically our last chance to
make this kind of change before Hadoop 3.

Thanks,
Andrew


On Tue, Sep 17, 2013 at 9:10 AM, Colin McCabe cmcc...@alumni.cmu.edu wrote:

 The issue is not modifying existing APIs.  The issue is that code has
 been written that makes assumptions that are incompatible with the
 existence of things that are not files or directories.  For example,
 there is a lot of code out there that looks at FileStatus#isFile, and
 if it returns false, assumes that what it is looking at is a
 directory.  In the case of a symlink, this assumption is incorrect.

 Faced with this, we have considered making the default behavior of
 listStatus and globStatus to be fully resolving symlinks, and simply
 not listing dangling symlinks. Code which is prepared to deal with symlinks
 can use newer versions of the listStatus and globStatus functions
 which do return symlinks as symlinks.

 We might consider defaulting FileSystem#listStatus and
 FileSystem#globStatus to fully resolving symlinks by default and
 defaulting FileContext#listStatus and FileContext#Util#globStatus to
 the opposite.  This seems like the maximally compatible solution that
 we're going to get.  I think this makes sense.

 The alternative is kicking the can down the road to Hadoop 3, and
 letting vendors of alternative (including some proprietary
 alternative) systems continue to claim that Hadoop doesn't support
 symlinks yet (with some justice).

 P.S.  I would be fine with putting this in 2.2 or 2.3 if that seems
 more appropriate.

 sincerely,
 Colin

 On Tue, Sep 17, 2013 at 8:23 AM, Suresh Srinivas sur...@hortonworks.com
 wrote:
  I agree that this is an important change. However, 2.2.0 GA is getting
  ready to rollout in weeks. I am concerned that these changes will add not
  only incompatible changes late in the game, but also possibly
 instability.
  Java API incompatibility is some thing we have avoided for the most part
  and I am concerned that this is adding such incompatibility in FileSystem
  APIs. We should find work arounds by adding possibly newer APIs and
 leaving
  existing APIs as is. If this can be done, my vote is to enable this
 feature
  in 2.3. Even if it cannot be done, I am concerned that this is coming
 quite
  late and we should see if could allow some incompatible changes into 2.3
  for this feature.
 
 
  On Mon, Sep 16, 2013 at 6:49 PM, Andrew Wang andrew.w...@cloudera.com
 wrote:
 
  Hi all,
 
  I wanted to broadcast plans for putting the FileSystem symlinks work
  (HADOOP-8040) into branch-2.1 for the pending Hadoop 2 GA release. I
 think
  it's pretty important we get it in since it's not a compatible change;
 if
  it misses the GA train, we're not going to have symlinks until the next
  major release.
 
  However, we're still dealing with ongoing issues revealed via testing.
  There's user-code out there that only handles files and directories and
  will barf when given a symlink (perhaps a dangling one!). See
 HADOOP-9912
  for a nice example where globStatus returning symlinks broke Pig; some
 of
  us had a conference call to talk it through, and one definite conclusion
  was that this wasn't solvable in a generally compatible manner.
 
  There are also still some gaps in symlink support right now. For
 example,
  the more esoteric FileSystems like WebHDFS, HttpFS, and HFTP need
 symlink
  resolution, and tooling like the FsShell and Distcp still need to be
  updated as well.
 
  So, there's definitely work to be done, but there are a lot of users
  interested in the feature, and symlinks really should be in GA. Would
  appreciate any thoughts/input on the matter.
 
  Thanks,
  Andrew
 
 
 
 
  --
  http://hortonworks.com/download/
 

Re: symlink support in Hadoop 2 GA

2013-09-17 Thread Eli Collins
(Looping in Arun since this impacts 2.x releases)

I updated the versions on HADOOP-8040 and sub-tasks to reflect where
the changes have landed. All of these changes (modulo HADOOP-9417)
were merged to branch-2.1 and are in the 2.1.0 release.

While symlinks are in 2.1.0 I don't think we can really claim they're
ready until issues like HADOOP-9912 are resolved, and they are
supported in the shell, distcp and WebHDFS/HttpFS/Hftp (these are not
esoteric!).  Someone can create a symlink with FileSystem causing
someone else's distcp job to fail. Unlikely given they're not exposed
outside the Java API but still not great.   Ideally this work would
have been done on a feature branch and then merged when complete, but
that's water under the bridge.

I see the following options:

1. Fix up the current symlink support so that symlinks are ready for
2.2 (GA), or at least the public APIs. This means the APIs will be in
GA from the get go so while the functionality might not be fully baked we
don't have to worry about incompatible changes like FileStatus#isDir()
changing behavior in 2.3 or a later update.  The downside is this will
take at least a couple weeks (to resolve HADOOP-9912 and potentially
implement the remaining pieces) and so may impact the 2.2 release
timing. This option means 2.2 won't remove the new APIs introduced in
2.1.  We'd want to spin a 2.1.2 beta with the new API changes so we
don't introduce new APIs in the beta to GA transition.

2. Revert symlinks from branch-2.1-beta and branch-2. Finish up the
work in trunk (or a feature branch) and merge for a subsequent 2.x
update.  While this helps get us to GA faster it would be preferable
to get an API change like this in for 2.2 GA since they may be
disruptive to introduce in an update (eg see example in #1). And of
course our users would like symlinks functionality in the GA release.
This option would mean 2.2 is incompatible with 2.1 because it's
dropping the new APIs, not ideal for a beta to GA transition.

3. Revert and punt symlinks to 3.x.  IMO should be the last resort.

If we have sufficient time I think option #1 would be best.  What do
others think?

Thanks,
Eli


On Mon, Sep 16, 2013 at 6:49 PM, Andrew Wang andrew.w...@cloudera.com wrote:
 Hi all,

 I wanted to broadcast plans for putting the FileSystem symlinks work
 (HADOOP-8040) into branch-2.1 for the pending Hadoop 2 GA release. I think
 it's pretty important we get it in since it's not a compatible change; if
 it misses the GA train, we're not going to have symlinks until the next
 major release.

 However, we're still dealing with ongoing issues revealed via testing.
 There's user-code out there that only handles files and directories and
 will barf when given a symlink (perhaps a dangling one!). See HADOOP-9912
 for a nice example where globStatus returning symlinks broke Pig; some of
 us had a conference call to talk it through, and one definite conclusion
 was that this wasn't solvable in a generally compatible manner.

 There are also still some gaps in symlink support right now. For example,
 the more esoteric FileSystems like WebHDFS, HttpFS, and HFTP need symlink
 resolution, and tooling like the FsShell and Distcp still need to be
 updated as well.

 So, there's definitely work to be done, but there are a lot of users
 interested in the feature, and symlinks really should be in GA. Would
 appreciate any thoughts/input on the matter.

 Thanks,
 Andrew


symlink support in Hadoop 2 GA

2013-09-16 Thread Andrew Wang
Hi all,

I wanted to broadcast plans for putting the FileSystem symlinks work
(HADOOP-8040) into branch-2.1 for the pending Hadoop 2 GA release. I think
it's pretty important we get it in since it's not a compatible change; if
it misses the GA train, we're not going to have symlinks until the next
major release.

However, we're still dealing with ongoing issues revealed via testing.
There's user-code out there that only handles files and directories and
will barf when given a symlink (perhaps a dangling one!). See HADOOP-9912
for a nice example where globStatus returning symlinks broke Pig; some of
us had a conference call to talk it through, and one definite conclusion
was that this wasn't solvable in a generally compatible manner.

There are also still some gaps in symlink support right now. For example,
the more esoteric FileSystems like WebHDFS, HttpFS, and HFTP need symlink
resolution, and tooling like the FsShell and Distcp still need to be
updated as well.

So, there's definitely work to be done, but there are a lot of users
interested in the feature, and symlinks really should be in GA. Would
appreciate any thoughts/input on the matter.

Thanks,
Andrew