Using Apache Drill - Sample Data and Sample Queries - Examples

Jim Scott Thu, 14 Aug 2014 12:32:55 -0700

I recently gave a presentation on how to use Apache Drill with some
demonstrations.


The video for the presentation is available here:
http://vimeo.com/chug/using-apache-drill
The slides are available here:
http://www.slideshare.net/ChicagoHUG/using-apache-drill-chug-august-2014-jim-scott

*The data used in this presentation can be found and downloaded (1.3GB CSV
file and a 100KB CSV file) *
*here:*
https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2
https://data.cityofchicago.org/Public-Safety/Sex-Offenders/vc9r-bqvy

Once you start your drill session you can connect to it via:
http://localhost:8047/


*Queries (pay attention to back ticks) *
To see a full stack trace when errors occur:
alter session set `exec.errors.verbose`=true;


*Looking through the information schema:*SELECT * from
INFORMATION_SCHEMA.SCHEMATA;
SELECT * from INFORMATION_SCHEMA.`TABLES`;
SELECT * from INFORMATION_SCHEMA.`COLUMNS`;
SELECT COLUMN_NAME, DATA_TYPE FROM INFORMATION_SCHEMA.COLUMNS WHERE
TABLE_SCHEMA = 'sys' and TABLE_NAME = 'options';


*How many of each type of incident:*select count(*) as incidents,
columns[5] as category from
dfs.`/home/jscott/Downloads/chicago/Crimes_-_2001_to_present.csv` group by
columns[5] order by incidents desc limit 10;


*How many of each type of incident and subtype:*select count(*) as
incidents, columns[5] as type, columns[6] as subtype from
dfs.`/home/jscott/Downloads/chicago/Crimes_-_2001_to_present.csv` group by
columns[6], columns[5] order by incidents desc limit 10;


*CREATE A TABLE in PARQUET format:*

*Please note that I created a workspace "db" at my path of /opt/drill/db
and I gave it a default storage format of parquet. *

create table dfs.db.`crime_data` as select columns[0] as REFID, columns[1]
as CaseNumber, columns[2] as DateTime, columns[3] as Block, columns[4] as
IUCR, columns[5] as PrimaryType, columns[6] as Description, columns[7] as
LocationDescription, columns[8] as Arrest, columns[9] as Domestic,
columns[10] as Beat, columns[11] as District, columns[12] as Ward,
columns[13] as CommunityArea, columns[14] as FBICode, columns[15] as
XCoordinate, columns[16] as YCoordinate, columns[17] as YEAR_DATE,
columns[18] as UpdatedOn, columns[19] as Latitude, columns[20] as
Longitude, columns[21] as Location from
dfs.`/home/jscott/Downloads/chicago/Crimes_-_2001_to_present.csv`;
create table dfs.db.`sex_offenders` as select columns[0] as LAST,columns[1]
as FIRST,columns[2] as BLOCK,columns[3] as GENDER,columns[4] as
RACE,columns[5] as BIRTHDATE,columns[6] as AGE,columns[7] as
HEIGHT,columns[8] as WEIGHT,columns[9] as VICTIM_MINOR from
dfs.`/home/jscott/Downloads/chicago/Sex_Offenders.csv`;


*Run the previous query on the newly created table:*select count(*) as
incidents, primarytype, description from dfs.db.`crime_data` group by
description, primarytype order by incidents desc limit 10;


*How many crimes occurred on a block with a registered sex offender?*SELECT
count(*) as incidents FROM dfs.db.`crime_data` crimes JOIN
dfs.db.`sex_offenders` offenders ON crimes.block = offenders.block ORDER BY
incidents DESC;


*What types of incidents occurred?*SELECT count(*) as incidents,
crimes.primarytype as type, crimes.description as subtype FROM
dfs.db.`crime_data` crimes JOIN dfs.db.`sex_offenders` offenders ON
crimes.block = offenders.block GROUP BY crimes.description,
crimes.primarytype ORDER BY incidents DESC LIMIT 20;


*What all is in the BATTERY category?*SELECT count(*) as incidents,
crimes.description as subtype FROM dfs.db.`crime_data` crimes JOIN
dfs.db.`sex_offenders` offenders ON crimes.block = offenders.block WHERE
crimes.primarytype = 'BATTERY' GROUP BY crimes.description ORDER BY
incidents DESC;


*How many total BATTERY incidents?*SELECT count(*) as incidents FROM
dfs.db.`crime_data` crimes JOIN dfs.db.`sex_offenders` offenders ON
crimes.block = offenders.block WHERE crimes.primarytype = 'BATTERY';


*How many "BATTERY" incidents occurred on those blocks by year (has it
gotten better or worse over time)?*SELECT count(*) as incidents,
extract(year from to_date(crimes.datetime, 'MM/DD/YYYY hh:mm:ss a')) as
`year` FROM dfs.db.`crime_data` crimes JOIN dfs.db.`sex_offenders`
offenders ON crimes.block = offenders.block WHERE crimes.primarytype =
'BATTERY' and crimes.datetime not like '%Date%' GROUP BY extract(year from
to_date(crimes.datetime, 'MM/DD/YYYY hh:mm:ss a')) ORDER BY `year` ASC;


Do you have queries against this data set that you have written that you
would like to share with others? Reply to the thread... 


-- 
*Jim Scott*
Director, Enterprise Strategy & Architecture

 <http://www.mapr.com/>
[image: MapR Technologies] <http://www.mapr.com>

Using Apache Drill - Sample Data and Sample Queries - Examples

Reply via email to