I recently gave a presentation on how to use Apache Drill with some demonstrations.
The video for the presentation is available here: http://vimeo.com/chug/using-apache-drill The slides are available here: http://www.slideshare.net/ChicagoHUG/using-apache-drill-chug-august-2014-jim-scott *The data used in this presentation can be found and downloaded (1.3GB CSV file and a 100KB CSV file) * *here:* https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2 https://data.cityofchicago.org/Public-Safety/Sex-Offenders/vc9r-bqvy Once you start your drill session you can connect to it via: http://localhost:8047/ *Queries (pay attention to back ticks) * To see a full stack trace when errors occur: alter session set `exec.errors.verbose`=true; *Looking through the information schema:*SELECT * from INFORMATION_SCHEMA.SCHEMATA; SELECT * from INFORMATION_SCHEMA.`TABLES`; SELECT * from INFORMATION_SCHEMA.`COLUMNS`; SELECT COLUMN_NAME, DATA_TYPE FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_SCHEMA = 'sys' and TABLE_NAME = 'options'; *How many of each type of incident:*select count(*) as incidents, columns[5] as category from dfs.`/home/jscott/Downloads/chicago/Crimes_-_2001_to_present.csv` group by columns[5] order by incidents desc limit 10; *How many of each type of incident and subtype:*select count(*) as incidents, columns[5] as type, columns[6] as subtype from dfs.`/home/jscott/Downloads/chicago/Crimes_-_2001_to_present.csv` group by columns[6], columns[5] order by incidents desc limit 10; *CREATE A TABLE in PARQUET format:* *Please note that I created a workspace "db" at my path of /opt/drill/db and I gave it a default storage format of parquet. * create table dfs.db.`crime_data` as select columns[0] as REFID, columns[1] as CaseNumber, columns[2] as DateTime, columns[3] as Block, columns[4] as IUCR, columns[5] as PrimaryType, columns[6] as Description, columns[7] as LocationDescription, columns[8] as Arrest, columns[9] as Domestic, columns[10] as Beat, columns[11] as District, columns[12] as Ward, columns[13] as CommunityArea, columns[14] as FBICode, columns[15] as XCoordinate, columns[16] as YCoordinate, columns[17] as YEAR_DATE, columns[18] as UpdatedOn, columns[19] as Latitude, columns[20] as Longitude, columns[21] as Location from dfs.`/home/jscott/Downloads/chicago/Crimes_-_2001_to_present.csv`; create table dfs.db.`sex_offenders` as select columns[0] as LAST,columns[1] as FIRST,columns[2] as BLOCK,columns[3] as GENDER,columns[4] as RACE,columns[5] as BIRTHDATE,columns[6] as AGE,columns[7] as HEIGHT,columns[8] as WEIGHT,columns[9] as VICTIM_MINOR from dfs.`/home/jscott/Downloads/chicago/Sex_Offenders.csv`; ​ *Run the previous query on the newly created table:*select count(*) as incidents, primarytype, description from dfs.db.`crime_data` group by description, primarytype order by incidents desc limit 10; *How many crimes occurred on a block with a registered sex offender?*SELECT count(*) as incidents FROM dfs.db.`crime_data` crimes JOIN dfs.db.`sex_offenders` offenders ON crimes.block = offenders.block ORDER BY incidents DESC; *What types of incidents occurred?*SELECT count(*) as incidents, crimes.primarytype as type, crimes.description as subtype FROM dfs.db.`crime_data` crimes JOIN dfs.db.`sex_offenders` offenders ON crimes.block = offenders.block GROUP BY crimes.description, crimes.primarytype ORDER BY incidents DESC LIMIT 20; *What all is in the BATTERY category?*SELECT count(*) as incidents, crimes.description as subtype FROM dfs.db.`crime_data` crimes JOIN dfs.db.`sex_offenders` offenders ON crimes.block = offenders.block WHERE crimes.primarytype = 'BATTERY' GROUP BY crimes.description ORDER BY incidents DESC; *How many total BATTERY incidents?*SELECT count(*) as incidents FROM dfs.db.`crime_data` crimes JOIN dfs.db.`sex_offenders` offenders ON crimes.block = offenders.block WHERE crimes.primarytype = 'BATTERY'; *How many "BATTERY" incidents occurred on those blocks by year (has it gotten better or worse over time)?*SELECT count(*) as incidents, extract(year from to_date(crimes.datetime, 'MM/DD/YYYY hh:mm:ss a')) as `year` FROM dfs.db.`crime_data` crimes JOIN dfs.db.`sex_offenders` offenders ON crimes.block = offenders.block WHERE crimes.primarytype = 'BATTERY' and crimes.datetime not like '%Date%' GROUP BY extract(year from to_date(crimes.datetime, 'MM/DD/YYYY hh:mm:ss a')) ORDER BY `year` ASC; ​ ​Do you have queries against this data set that you have written that you would like to share with others? Reply to the thread... ​ -- *Jim Scott* Director, Enterprise Strategy & Architecture <http://www.mapr.com/> [image: MapR Technologies] <http://www.mapr.com>
