[ https://issues.apache.org/jira/browse/HADOOP-930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12601991#action_12601991 ]

cwensel edited comment on HADOOP-930 at 6/3/08 9:25 AM:
---------------------------------------------------------------

-- Any reason you didn't use the MIME type to denote directory files (as jets3t does)?
{code:java}
  public static boolean isDirectory( S3Object object )
    {
    return object.getContentType() != null
      && object.getContentType().equalsIgnoreCase( MIME_DIRECTORY );
    }
{code}
where
{code:java} 
  public static final String MIME_DIRECTORY = "application/x-directory";
{code}
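
For reference, a sketch of how a directory marker could be written under the same convention (the jets3t calls and the {{s3Service}}/{{bucket}}/{{key}} names are assumptions for illustration, not code from the patch):
{code:java}
// Hypothetical sketch: store a zero-length marker object whose content type
// identifies it as a directory, matching the isDirectory() check above.
S3Object dirMarker = new S3Object( key + "/" );   // key name purely illustrative
dirMarker.setContentLength( 0 );
dirMarker.setContentType( MIME_DIRECTORY );
s3Service.putObject( bucket, dirMarker );
{code}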

-- I believe an MD5 checksum should be set on S3 put (via header) and verified on S3 get. I see plenty of read failures because of checksum failures (though, in retrospect, they could be side effects of stream-reading timeouts). This is especially useful if non-Hadoop applications are dealing with the S3 data shared with Hadoop.
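
Roughly what I have in mind, using the jets3t utilities (a sketch only; the calls and the {{s3Service}}/{{bucket}}/{{key}}/{{file}} names are assumptions, not taken from the patch):
{code:java}
// Hypothetical sketch: attach a locally computed MD5 on put so S3 can reject a
// corrupted upload, then compare the ETag against a local digest on get.
byte[] md5 = ServiceUtils.computeMD5Hash( new FileInputStream( file ) );

S3Object object = new S3Object( key );
object.setDataInputFile( file );
object.setMd5Hash( md5 );   // sent as the Content-MD5 header on put
s3Service.putObject( bucket, object );

S3Object fetched = s3Service.getObject( bucket, key );
byte[] data = ServiceUtils.readInputStreamToBytes( fetched.getDataInputStream() );
String localMd5 = ServiceUtils.toHex( ServiceUtils.computeMD5Hash( data ) );

if( !localMd5.equals( fetched.getETag() ) )
  throw new IOException( "MD5 mismatch reading " + key );
{code}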

-- Sometimes 'legacy' buckets have underscores in their names; might consider trying to survive them. See the note after the two snippets below.

{code:java}
    String userInfo = uri.getUserInfo();

    // special handling for underscores in bucket names
    if( userInfo == null )
      {
      String authority = uri.getAuthority();
      String[] split = authority.split( "[:@]" );

      if( split.length >= 2 )
        userInfo = split[ 0 ] + ":" + split[ 1 ];
      }
{code}
and
{code:java}
    String bucketName = uri.getAuthority();

    // handling for underscore in bucket name
    if( bucketName.contains( "@" ) )
      bucketName = bucketName.split( "@" )[ 1 ];
{code}
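
(For reference: with a URI like s3://ID:SECRET@some_bucket/path, the underscore makes the authority an invalid hostname, so java.net.URI leaves getUserInfo() and getHost() null and only getAuthority() returns the full string; the two fallbacks above recover the credentials and the bucket name by splitting the raw authority themselves.)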


> Add support for reading regular (non-block-based) files from S3 in 
> S3FileSystem
> -------------------------------------------------------------------------------
>
>                 Key: HADOOP-930
>                 URL: https://issues.apache.org/jira/browse/HADOOP-930
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: fs
>    Affects Versions: 0.10.1
>            Reporter: Tom White
>            Assignee: Tom White
>             Fix For: 0.18.0
>
>         Attachments: hadoop-930-v2.patch, hadoop-930-v3.patch, 
> hadoop-930-v4.patch, hadoop-930.patch, jets3t-0.6.0.jar
>
>
> People often have input data on S3 that they want to use for a Map Reduce job 
> and the current S3FileSystem implementation cannot read it since it assumes a 
> block-based format.
> We would add the following metadata to files written by S3FileSystem: an 
> indication that it is block oriented ("S3FileSystem.type=block") and a 
> filesystem version number ("S3FileSystem.version=1.0"). Regular S3 files 
> would not have the type metadata so S3FileSystem would not try to interpret 
> them as inodes.
> An extension to write regular files to S3 would not be covered by this change 
> - we could do this as a separate piece of work (we still need to decide 
> whether to introduce another scheme - e.g. rename block-based S3 to "s3fs" 
> and call regular S3 "s3" - or whether to just use a configuration property to 
> control block-based vs. regular writes).

