Tridib,

If you are getting started with Drill, you can also refer to a tutorial that walks through Drill's various capabilities:
https://cwiki.apache.org/confluence/display/DRILL/Apache+Drill+Tutorial

You are spot on about the metadata part. Discovering metadata dynamically and providing the ability to work with complex data types such as JSON without transformation is a key difference between Drill and Spark SQL or other SQL options.

-Neeraja

On Wed, Oct 29, 2014 at 11:12 AM, Tridib Samanta <[email protected]> wrote:

> Hi Adam,
> Thanks for sharing this! Apache Drill is very easy to get started with. I
> liked that Drill manages the metadata by itself and does not require Hive,
> as Spark does.
>
> Thanks
> Tridib
>
> > Date: Wed, 29 Oct 2014 10:50:37 -0700
> > Subject: Re: Apache Drill Vs Spark SQL
> > From: [email protected]
> > To: [email protected]
> >
> > Hi Tridib,
> >
> > I just completed a simple evaluation of Drill 0.6.0 and Spark SQL 1.1.0. I
> > ran a few queries over 14 GB of Snappy compressed Parquet files on a four
> > server MapR cluster (96 cores, 256 GB). Here are the results.
> >
> > Spark SQL requires some very minor setup, where Drill doesn't.
> > val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> > val testData = sqlContext.parquetFile("/user/ahunt/test/2014/10/28/")
> > testData.registerTempTable("testData")
> >
> > In Drill, a simple count query took 19s the first time and 0.9s the
> > second time.
> > SELECT count(*) FROM dfs.`/user/ahunt/test/2014/10/28/part-*`;
> >
> > In Spark SQL, it took 17s the first time and 1.7s the second.
> > sqlContext.sql("SELECT count(*) FROM testData").collect().foreach(println)
> >
> > In Drill, a simple group by query printed the results, but would not
> > return to the prompt without hitting ctrl-c (after 6s).
> > SELECT httpResponseCode, count(*) FROM dfs.`/user/ahunt/test/2014/10/28/part-*` GROUP BY httpResponseCode;
> >
> > In Spark SQL, it finished in 3.6s.
> > sqlContext.sql("SELECT httpResponseCode,count(*) FROM testData GROUP BY httpResponseCode").collect().foreach(println)
> >
> > In Drill, this query never finished (probably due to the issue described
> > above).
> > SELECT httpResponseCode, count(*) FROM dfs.`/user/ahunt/test/2014/10/28/` GROUP BY httpResponseCode ORDER BY httpResponseCode DESC;
> >
> > In Spark SQL, the same query finished in 5s.
> > sqlContext.sql("SELECT httpResponseCode,count(*) FROM testData GROUP BY httpResponseCode ORDER BY httpResponseCode DESC").collect().foreach(println)
> >
> > Although Drill seems very promising, it has a few issues to work out, and
> > since I already use Spark I'm going to stick with Spark SQL for now.
> >
> > Adam
> >
> > On Wed, Oct 29, 2014 at 10:00 AM, Tridib Samanta <[email protected]> wrote:
> >
> > > Hello Experts,
> > > I am new to Apache Drill. To me it seems very similar to Spark SQL. I
> > > was wondering how it differs from Spark SQL. What are the use cases
> > > where Apache Drill thrives compared to Spark SQL?
> > >
> > > Thanks & Regards
> > > Tridib
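
For anyone repeating the comparison quoted above, the first-run and second-run timings can be captured with a small helper instead of being read off the shell. What follows is only a minimal sketch in Scala: `TimeQuery`, `timed` and `query` are hypothetical names introduced for illustration, and the workload is a stand-in rather than a real Spark SQL call, so the numbers it prints say nothing about Drill or Spark.

```scala
object TimeQuery {
  // Evaluates `body` once and returns its result together with the
  // elapsed wall-clock time in seconds.
  def timed[A](body: => A): (A, Double) = {
    val start = System.nanoTime()
    val result = body
    (result, (System.nanoTime() - start) / 1e9)
  }

  def main(args: Array[String]): Unit = {
    // Stand-in workload (assumption); it only gives the timer something to measure.
    def query(): Long = (1L to 1000000L).sum

    val (_, first) = timed(query())   // first run
    val (_, second) = timed(query())  // second run
    println(f"first run: $first%.3fs, second run: $second%.3fs")
  }
}
```

In a Spark shell, the body passed to `timed` would presumably be one of the `sqlContext.sql(...).collect()` calls from the quoted message, run twice to separate the first run from the second.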
