Re: Drill with Hadoop cluster

Charles Givre Wed, 17 Nov 2021 05:15:16 -0800

Hi Sidd, 
Welcome to Drill.  See inline:

> On Nov 17, 2021, at 4:21 AM, Siddharth Jain 
> <[email protected]> wrote:
> 
> Hi,
> 
> I am evaluating Drill for a requirement to query an HDFS cluster where the 
> data is stored in Parquet file format.
> I was able to set up a Drill cluster of 3 nodes with ZooKeeper after following 
> some links.
> In the storage plugin I set up HDFS with a connection to my HDFS URL, and I can 
> successfully run SQL queries in the Drill web UI and get results, but this only 
> gets data from 1 node.
> 
> I now have some basic questions:
> 1. Does the storage plugin need to point to the master node of the HDFS cluster?

Yes. The storage plugin's connection needs to point to the name node in your 
Hadoop cluster.  See 
https://drill.apache.org/docs/file-system-storage-plugin/

> 2. Once a SQL query is fired will it fetch data from all nodes in the cluster 
> or just one node?

Drill will fetch data from all nodes in the cluster if properly configured.

>     OR do I have to set up Drill on YARN 
> (https://drill.apache.org/docs/drill-on-yarn-introduction/) to get results 
> from all nodes?

You do not need Drill on YARN to get all your data.

> 3. My requirement is to use JDBC to query the HDFS cluster (the search data 
> can grow large) in real time and display the results in a web UI. Do let me 
> know if Drill will be a good fit for this use case.

That seems like a reasonable use case for Drill.
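
A minimal sketch of what the JDBC side could look like, assuming the Drill 
JDBC driver jar is on the classpath and that the cluster is reached through 
ZooKeeper. The host names (zk1, zk2, zk3) are placeholders, and "drill" and 
"drillbits1" are only the default ZooKeeper root and cluster ID, so check 
drill-override.conf for the values your cluster actually uses. The class and 
method names are made up for illustration.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.List;

public class DrillJdbcSketch {

    // Build a Drill JDBC URL of the form
    //   jdbc:drill:zk=<host:port>[,<host:port>...]/<zk root>/<cluster id>
    // Connecting through ZooKeeper lets the driver pick one of the registered Drillbits.
    public static String zkUrl(List<String> zkHosts, String zkRoot, String clusterId) {
        return "jdbc:drill:zk=" + String.join(",", zkHosts) + "/" + zkRoot + "/" + clusterId;
    }

    // Run one query and print the first column of every row.
    public static void printFirstColumn(String url, String sql) throws SQLException {
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }

    public static void main(String[] args) throws SQLException {
        // Placeholder hosts; "drill" and "drillbits1" are assumed defaults.
        String url = zkUrl(List.of("zk1:2181", "zk2:2181", "zk3:2181"), "drill", "drillbits1");
        System.out.println(url);
        // Only connect when a query is passed on the command line.
        if (args.length > 0) {
            printFirstColumn(url, args[0]);
        }
    }
}
```

A web application would normally keep such connections in a pool rather than 
opening one per request.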

> 4. Do we have any performance benchmarks of Drill against Presto and Impala?

I do not believe that we have current benchmarks, but let me also say that 
Drill's performance (like Presto's and Impala's) depends on so many factors 
that I don't really think the benchmarks mean very much.  For example, Impala 
was optimized for the TPC-H queries.  If you compare other products to Impala 
on those benchmarks, you'll find that it will likely perform better than most.  
What you'll also find is that in real life, your users aren't querying TPC-H 
datasets, and that Impala's performance drops off significantly for other use 
cases.

Performance depends heavily on configuring Drill to match how your data is 
structured.

Also, you'll want to make use of the metastore:  
https://drill.apache.org/docs/using-drill-metastore/. This is not mandatory 
but will improve performance on large datasets.
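
As a rough sketch of how that might look over JDBC: the Drill documentation 
describes enabling the metastore with the `metastore.enabled` option and 
collecting table metadata with ANALYZE TABLE ... REFRESH METADATA. The class 
name, the method names, and the table name passed in are hypothetical, and the 
exact options available depend on your Drill version, so treat this as an 
outline rather than a recipe.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.List;

public class DrillMetastoreSketch {

    // Statements that turn the metastore on for the session and then
    // collect metadata for one table. The table name is supplied by the caller.
    public static List<String> metastoreStatements(String table) {
        return List.of(
            "SET `metastore.enabled` = true",
            "ANALYZE TABLE " + table + " REFRESH METADATA");
    }

    // Execute the statements in order on a single connection, so the
    // session option set by the first statement applies to the second.
    public static void refreshMetadata(String jdbcUrl, String table) throws SQLException {
        try (Connection conn = DriverManager.getConnection(jdbcUrl);
             Statement stmt = conn.createStatement()) {
            for (String sql : metastoreStatements(table)) {
                stmt.execute(sql);
            }
        }
    }

    public static void main(String[] args) throws SQLException {
        // Usage: java DrillMetastoreSketch <jdbc-url> <table>
        if (args.length == 2) {
            refreshMetadata(args[0], args[1]);
        } else {
            // Without arguments, just show what would be run for a made-up table.
            metastoreStatements("dfs.tmp.`events`").forEach(System.out::println);
        }
    }
}
```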

I hope this helps. If you have any questions, please email the list or join 
our Slack channel.
Best,
-- C

> 
> Thanks in advance,
> Sidd
