ShuffleHandler can't access results when configured in a secure mode
--------------------------------------------------------------------
Key: MAPREDUCE-3728
URL: https://issues.apache.org/jira/browse/MAPREDUCE-3728
Project: Hadoop Map/Reduce
Issue Type: Bug
Components: mrv2, nodemanager
Affects Versions: 0.23.0
Reporter: Roman Shaposhnik
Fix For: 0.23.1
While running the simplest of jobs (Pi) on MR2 in a fully secure configuration,
I noticed that the job failed on the reduce side with the following
messages littering the nodemanager logs:
{noformat}
2012-01-19 08:35:32,544 ERROR org.apache.hadoop.mapred.ShuffleHandler: Shuffle
error
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
usercache/rvs/appcache/application_1326928483038_0001/output/attempt_1326928483038_0001_m_000003_0/file.out.index
in any of the configured local directories
{noformat}
While digging further I found out that the permissions on the files/dirs were
preventing the nodemanager (running as user yarn) from accessing these files:
{noformat}
$ ls -l
/data/3/yarn/usercache/testuser/appcache/application_1327102703969_0001/output/attempt_1327102703969_0001_m_000001_0
-rw-r----- 1 testuser testuser 28 Jan 20 15:41 file.out
-rw-r----- 1 testuser testuser 32 Jan 20 15:41 file.out.index
{noformat}
Digging even further revealed that the setgid ("group-sticky") bit that was
faithfully set on all the subdirectories between testuser and
application_1327102703969_0001 was gone from output and
attempt_1327102703969_0001_m_000001_0.
Looking into how these subdirectories are created
(org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.initDirs())
{noformat}
// $x/usercache/$user/appcache/$appId/filecache
Path appFileCacheDir = new Path(appBase, FILECACHE);
appsFileCacheDirs[i] = appFileCacheDir.toString();
lfs.mkdir(appFileCacheDir, null, false);
// $x/usercache/$user/appcache/$appId/output
lfs.mkdir(new Path(appBase, OUTPUTDIR), null, false);
{noformat}
reveals that lfs.mkdir ends up manipulating permissions and thus clears the
setgid bit from output and filecache.
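The mechanism is easy to reproduce outside Hadoop: a plain mkdir(2) under a setgid directory inherits the setgid bit from its parent, but an explicit chmod afterwards strips it. The sketch below (Linux only; the SetgidDemo class and the directory names are illustrative, not Hadoop code) demonstrates this with plain Java and stat(1):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class SetgidDemo {

    // Read the octal mode of a path via stat(1).
    static String mode(Path p) throws IOException, InterruptedException {
        Process stat = new ProcessBuilder("stat", "-c", "%a", p.toString()).start();
        stat.waitFor();
        return new String(stat.getInputStream().readAllBytes()).trim();
    }

    public static void main(String[] args) throws Exception {
        Path appDir = Files.createTempDirectory("appcache");
        // Mark the parent setgid, as the localizer does for the $appId dirs.
        new ProcessBuilder("chmod", "2770", appDir.toString()).start().waitFor();
        System.out.println("parent:      " + mode(appDir));

        // A plain mkdir(2) inherits the setgid bit from its parent, so the
        // octal mode starts with 2 (the exact value depends on the umask).
        Path output = Files.createDirectory(appDir.resolve("output"));
        System.out.println("plain mkdir: " + mode(output));

        // An explicit chmod afterwards -- effectively what lfs.mkdir does
        // when it applies its permission argument -- strips the inherited
        // setgid bit.
        new ProcessBuilder("chmod", "755", output.toString()).start().waitFor();
        System.out.println("after chmod: " + mode(output));
    }
}
```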
At this point I'm at a loss as to how this is supposed to work. My
understanding was that the whole sequence of events here was predicated on the
setgid bit being set, so that daemons running as user yarn (default group
yarn) have access to the resulting files and subdirectories at output and
below. Please let me know if I'm missing something or whether this is just a
bug that needs to be fixed.
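If the setgid bit really is the intended access mechanism, one conceivable workaround (a hypothetical sketch only, not necessarily how this should be fixed in the localizer; mkdirPreservingSetgid is an invented helper) would be to re-apply the bit after each mkdir. Since java.nio's PosixFilePermission cannot express setgid, the sketch shells out to chmod(1), so it is Linux only:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class SetgidPreservingMkdir {

    // Hypothetical helper: create a directory and then re-apply the setgid
    // bit so that group members (e.g. the yarn group) keep access to
    // everything created underneath it.
    public static void mkdirPreservingSetgid(Path dir)
            throws IOException, InterruptedException {
        Files.createDirectories(dir);
        // PosixFilePermission has no setgid flag, so shell out to chmod.
        new ProcessBuilder("chmod", "g+s", dir.toString()).start().waitFor();
    }

    public static void main(String[] args) throws Exception {
        Path out = Files.createTempDirectory("appcache").resolve("output");
        mkdirPreservingSetgid(out);
        Process stat = new ProcessBuilder("stat", "-c", "%a", out.toString()).start();
        stat.waitFor();
        // The octal mode now starts with 2, i.e. setgid is set.
        System.out.println("mode: "
                + new String(stat.getInputStream().readAllBytes()).trim());
    }
}
```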
On a related note, when the shuffle side of the Pi job failed, the job itself
didn't. It went into an endless loop and only exited once it had exhausted all
the local storage for the log files (at which point the nodemanager died and
thus the job ended). Perhaps this is an even more serious side effect of this
issue that needs to be investigated separately.