Yes, the s3 and s3n implementations work just fine with Drill ... there's an
excellent blog post on the Apache Drill site about enabling S3 at
http://drill.apache.org/blog/2014/12/09/running-sql-queries-on-amazon-s3/ .
To Steven's point, there is a property defining the local file system to which
objects from S3 may be temporarily staged. By default, that is /tmp. You can
change the fs.s3.buffer.dir property, along with your S3 credentials, should
you need additional space for large object transfers.
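For what it's worth, those settings live in Hadoop's core-site.xml. A minimal
sketch, assuming the classic s3/s3n property names (the values are
placeholders; double-check the names against your Hadoop version):

<configuration>
  <!-- Credentials for the s3n:// scheme -->
  <property>
    <name>fs.s3n.awsAccessKeyId</name>
    <value>YOUR_ACCESS_KEY</value>
  </property>
  <property>
    <name>fs.s3n.awsSecretAccessKey</name>
    <value>YOUR_SECRET_KEY</value>
  </property>
  <!-- Local staging directory for S3 transfers; defaults to /tmp -->
  <property>
    <name>fs.s3.buffer.dir</name>
    <value>/data/s3-buffer</value>
  </property>
</configuration>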
Because the s3 storage plugin is so similar to the basic file-system plugin,
I usually add a bucket to my Drill deployments by default. The JSON example
below can be used to create a plugin specification for the AMPLab benchmark
data (though I do default it to disabled, just in case):
{
  "name" : "s3amplab",
  "config" : {
    "type" : "file",
    "enabled" : false,
    "connection" : "s3n://big-data-benchmark",
    "workspaces" : {
      "root" : {
        "location" : "/",
        "writable" : false,
        "defaultInputFormat" : null
      }
    },
    "formats" : {
      "psv" : {
        "type" : "text",
        "extensions" : [ "tbl" ],
        "delimiter" : "|"
      },
      "csv" : {
        "type" : "text",
        "extensions" : [ "csv" ],
        "delimiter" : ","
      },
      "tsv" : {
        "type" : "text",
        "extensions" : [ "tsv" ],
        "delimiter" : "\t"
      },
      "parquet" : {
        "type" : "parquet"
      },
      "json" : {
        "type" : "json"
      }
    }
  }
}
Just load it into your Drill storage configuration with:

curl -X POST -H "Content-Type: application/json" --upload-file ${S3_JSON_PLUGIN_FILE} \
  http://${DRILL_SERVER}:8047/storage/s3amplab.json
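Once it's loaded, set "enabled" to true (in the Web UI or by re-posting the
JSON) and you can query the bucket directly. A hypothetical sqlline example
(the path under the bucket is illustrative; point it at wherever the
benchmark files actually live):

SELECT * FROM s3amplab.root.`pavlo/text/tiny/rankings` LIMIT 10;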
Happy drilling!
-- David
On Feb 6, 2015, at 1:58 PM, Steven Phillips <[email protected]> wrote:
> I don't really know anything about Hadoop encryption, so I will not address
> question 1.
>
> 2) The "filesystem" storage in Drill uses the Hadoop FileSystem API. The
> filesystem type is configured as part of the storage plugin configuration,
> in the "connection" field.
>
> When executing a query against any "filesystem" storage, Drill uses the
> getBlockLocations() method of the FileSystem API to get a list of blocks
> along with the locations of each block. It uses this information to assign
> fragments to the drillbits. Within each fragment, the FileSystem API is
> used to read the data from the filesystem.
>
> I'm not sure how the getBlockLocations() method is implemented for the s3
> filesystem, but I believe it splits the file based on some configuration
> property for blocksize. I am not sure what locations are returned for the
> block locations.
>
> 3) I haven't tried this, but if there is a filesystem implementation for s3
> and s3n, then they should both work with drill.
>
> On Thu, Feb 5, 2015 at 10:42 PM, Derek Rabindran <[email protected]> wrote:
>
>> Hi,
>>
>> My use case involves using Drill in combination with S3. I have a few
>> questions:
>>
>> 1) Is it possible to decrypt the files before processing? My files are
>> client-side encrypted. I'm able to provide the master key, however, I'm
>> not sure at which level this should be configured.
>>
>> 2) What is Hadoop's role when using Drill with S3? Can you outline the
>> details of what's actually happening when we execute a Drill request on
>> files residing in S3?
>>
>> 3) Will this work for both S3 and S3n?
>>
>> Thanks
>>
>
>
>
> --
> Steven Phillips
> Software Engineer
>
> mapr.com