Yes, the s3 and s3n implementations work just fine with Drill ... there's an
excellent blog post on the Apache Drill site about enabling S3 at
http://drill.apache.org/blog/2014/12/09/running-sql-queries-on-amazon-s3/ .
To Steven's point, there is a property defining the local file system to which
objects from S3 may be temporarily staged. By default, that is /tmp. You can
change the fs.s3.buffer.dir property, along with your S3 credentials, should
you need additional space for large object transfers.
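For what it's worth, those settings live in Hadoop's core-site.xml. A minimal
sketch, assuming the classic s3/s3n property names (the values are
placeholders; double-check the names against your Hadoop version):

<configuration>
  <!-- Credentials for the s3n:// scheme -->
  <property>
    <name>fs.s3n.awsAccessKeyId</name>
    <value>YOUR_ACCESS_KEY</value>
  </property>
  <property>
    <name>fs.s3n.awsSecretAccessKey</name>
    <value>YOUR_SECRET_KEY</value>
  </property>
  <!-- Local staging directory for S3 transfers; defaults to /tmp -->
  <property>
    <name>fs.s3.buffer.dir</name>
    <value>/data/s3-buffer</value>
  </property>
</configuration>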
Because the s3 storage plugin is so similar to the basic file-system plugin,
I usually add a bucket to my Drill deployments by default. The JSON example
below can be used to create a plugin specification for the AMPLab benchmark
data (though I do default it to disabled, just in case):
{
  "name" : "s3amplab",
  "config" : {
    "type" : "file",
    "enabled" : false,
    "connection" : "s3n://big-data-benchmark",
    "workspaces" : {
      "root" : {
        "location" : "/",
        "writable" : false,
        "defaultInputFormat" : null
      }
    },
    "formats" : {
      "psv" : {
        "type" : "text",
        "extensions" : [ "tbl" ],
        "delimiter" : "|"
      },
      "csv" : {
        "type" : "text",
        "extensions" : [ "csv" ],
        "delimiter" : ","
      },
      "tsv" : {
        "type" : "text",
        "extensions" : [ "tsv" ],
        "delimiter" : "\t"
      },
      "parquet" : {
        "type" : "parquet"
      },
      "json" : {
        "type" : "json"
      }
    }
  }
}
Just load it into your Drill storage configuration with:

curl -X POST -H "Content-Type: application/json" --upload-file ${S3_JSON_PLUGIN_FILE} \
  http://${DRILL_SERVER}:8047/storage/s3amplab.json
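Once it's loaded, set "enabled" to true (in the Web UI or by re-posting the
JSON) and you can query the bucket directly. A hypothetical sqlline example
(the path under the bucket is illustrative; point it at wherever the
benchmark files actually live):

SELECT * FROM s3amplab.root.`pavlo/text/tiny/rankings` LIMIT 10;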
Happy drilling!
-- David
On Feb 6, 2015, at 1:58 PM, Steven Phillips <[email protected]> wrote:
> I don't really know anything about Hadoop encryption, so I will not address
> question 1.
>
> 2) The "filesystem" storage in Drill uses the Hadoop FileSystem API. The
> filesystem type is configured as part of the storage plugin configuration,
> in the "connection" field.
>
> When executing a query against any "filesystem" storage, Drill uses the
> getBlockLocations() method of the FileSystem API to get a list of blocks
> along with the locations of each block. It uses this information to assign
> fragments to the drillbits. Within each fragment, the FileSystem API is
> used to read the data from the filesystem.
>
> I'm not sure how the getBlockLocations() method is implemented for the s3
> filesystem, but I believe it splits the file based on some configuration
> property for blocksize. I am not sure what locations are returned for the
> block locations.
>
> 3) I haven't tried this, but if there is a filesystem implementation for s3
> and s3n, then they should both work with drill.
>
> On Thu, Feb 5, 2015 at 10:42 PM, Derek Rabindran <[email protected]> wrote:
>
>> Hi,
>>
>> My use case involves using Drill in combination with S3. I have a few
>> questions:
>>
>> 1) Is it possible to decrypt the files before processing? My files are
>> client-side encrypted. I'm able to provide the master key, however, I'm
>> not sure at which level this should be configured.
>>
>> 2) What is Hadoop's role when using Drill with S3? Can you outline the
>> details of what's actually happening when we execute a Drill request on
>> files residing in S3?
>>
>> 3) Will this work for both S3 and S3n?
>>
>> Thanks
>>
>
>
>
> --
> Steven Phillips
> Software Engineer
>
> mapr.com