Drill with No-SQL [was: Cannot Build Drill "exec/Java Execution Engine"]

Paul Rogers Sun, 03 May 2020 14:43:22 -0700

Hi Tug,

Glad to hear from you again. Ted's summary is pretty good; here's a bit more 
detail.

Presto is another alternative which seems to have gained the most traction 
outside of the Cloud ecosystem on the one hand, and the Cloudera/HortonWorks 
ecosystem on the other. Presto does, however, demand that you have a schema, 
which is often an obstacle for many applications.

Most folks I've talked to who tried to use Spark for this use case came away 
disappointed. Unlike Drill (or Presto or Impala), Spark wants to start new Java 
processes for each query. Makes great sense for large, complex map/reduce jobs, 
but is a non-starter for small, interactive queries.

Hive also is trying to be an "uber query layer" and has integrations with 
multiple systems. But, Hive's complexity makes Drill look downright simple by 
comparison. Hive also needs an up-front schema.

I've had the opportunity to integrate Drill with two different noSQL engines. 
Getting started is easy, especially if a REST or similar API is available. 
Filter push-down is the next step, as otherwise Drill will simply suck all data 
from your DB as if it were a file. We've added some structure in the new HTTP 
reader to make it a bit easier than it used to be to create this kind of filter 
push-down. (The other kind of filter push-down is for partition pruning used 
for files, which you probably won't need.)
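
To make the filter push-down idea concrete, here is a minimal sketch in Python. It is not Drill's plugin API: the predicate format, the function names, and the notion of "pushable columns" are all assumptions made up for illustration. It shows only the core decision, which is to hand the backend the predicates it can evaluate itself and keep the rest to apply after the scan.

```python
import operator
from typing import Any, Dict, Iterable, Iterator, List, Set, Tuple

# Hypothetical predicate shape: (column, operator, value), e.g. ("status", "=", "open").
Predicate = Tuple[str, str, Any]

# Assumption: this imaginary REST API can only evaluate equality filters.
PUSHABLE_OPS = {"="}

# Operators the query engine can still evaluate locally after the scan.
_LOCAL_OPS = {"=": operator.eq, "<": operator.lt, ">": operator.gt}


def split_predicates(
    predicates: List[Predicate], pushable_columns: Set[str]
) -> Tuple[List[Predicate], List[Predicate]]:
    """Separate predicates the backend can evaluate from those it cannot."""
    pushed, residual = [], []
    for col, op, value in predicates:
        if op in PUSHABLE_OPS and col in pushable_columns:
            pushed.append((col, op, value))
        else:
            residual.append((col, op, value))
    return pushed, residual


def to_query_params(pushed: List[Predicate]) -> Dict[str, str]:
    """Turn pushed equality predicates into REST query parameters."""
    return {col: str(value) for col, _op, value in pushed}


def apply_residual(
    rows: Iterable[Dict[str, Any]], residual: List[Predicate]
) -> Iterator[Dict[str, Any]]:
    """Apply the predicates that were not pushed down to the returned rows."""
    for row in rows:
        if all(
            col in row and _LOCAL_OPS[op](row[col], value)
            for col, op, value in residual
        ):
            yield row
```

With predicates `status = 'open'` and `amount > 100`, and only `status` pushable, the backend receives `status=open` as a query parameter and the engine filters on `amount` itself. Without any push-down, every row comes back and all filtering happens locally.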

Aside from the current MapR repo issues, Drill tends to be much easier to build 
than other systems. Pretty much set up Java and the correct Maven and you're 
good to go. If you run unit tests, there is one additional library to install, 
but the tests themselves tell you exactly what is needed when they fail the 
first time (which is how I learned about it).

After that, performance will point the way. For example, does your DB have 
indexes? If so, then you can leverage the work originally done for MapR-DB to 
convey index information to Calcite so it can pick the best execution plan. 
There are specialized operators for index key lookup as well.

All this will get you the basic one-table scan which is often all that no-SQL 
DBs ever need. (Any structure usually appears within each document, rather than 
as joined tables as in the RDBMS world.) However, if your DB does need joins, 
you will need something like Calcite to work out the tradeoffs of the various 
join+filter-push plans possible, especially if your DB supports multiple 
indexes. There is no escaping the plan-time complexity of these cases. Calcite 
is big and complex, but it does give you the tools needed to solve these 
problems.

If your DB is to be used to power dashboards (summaries of logs, time series, 
click streams, sales or whatever), you'll soon find you need to provide a 
caching/aggregation layer to avoid banging on your DB each time the dashboard 
refreshes. (Imagine a 1-week dashboard, updated every minute, where only the 
last hour has new data.) Drill becomes very handy as a way of combining data 
from a mostly-static caching layer (data for the last 6 days, say) with your 
live DB (for the last day, say).
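
As a rough sketch of that split (plain Python, not Drill code; `split_window` and its parameters are hypothetical names), the dashboard's time window can be divided into a range served from the cache and a range served from the live DB. The query would then read each range from the matching source and combine the results.

```python
from datetime import datetime, timedelta
from typing import Optional, Tuple

TimeRange = Tuple[datetime, datetime]


def split_window(
    now: datetime, window: timedelta, live_span: timedelta
) -> Tuple[Optional[TimeRange], TimeRange]:
    """Split [now - window, now] into a cached range and a live range.

    Assumption: everything older than `live_span` is already in the cache.
    Returns (cached_range or None, live_range).
    """
    start = now - window
    # The live range never starts before the dashboard window does.
    boundary = max(start, now - live_span)
    cached = (start, boundary) if boundary > start else None
    live = (boundary, now)
    return cached, live
```

For a 7-day dashboard with a 1-day live span, this gives six days from the cache and one day from the live DB, matching the example above. A window shorter than the live span is served entirely from the live DB.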

If you provide a "writer" as well as a "reader", you can use Drill to load your 
DB as well as query it.

Happy to share whatever else I might have learned if you can describe your 
goals in a bit more detail.

Thanks,
- Paul

    On Sunday, May 3, 2020, 11:25:11 AM PDT, Ted Dunning 
<[email protected]> wrote:  

 The compile problem is a problem with the MapR repo (I think). I have
reported it to the folks who can fix it.

Regarding the generic question, I think that Drill is very much a good
choice for putting a SQL layer on a noSQL database.

It is definitely the case that the community is much broader than it used
to be. A number of companies now use Drill in their products, which is
one of the best ways to build long-term community.

There are alternatives, of course. All have trade-offs (because we live in
the world):

- Calcite itself (what Drill uses as a SQL parser and optimizer) can be
used, but you have to provide an execution framework and you wind up with
something that only works for your engine and is unlikely to support
parallel operations. Calcite is used by lots of projects, though, so it
has a very broad base of support.

- Spark SQL is fairly easy to extend (from what I hear from friends) but
the optimizer doesn't deal well with complicated tradeoffs (precisely
because it is fairly simple). You also wind up with the baggage of spark
which could be good or bad. You would get some parallelism, though. I don't
think that Spark SQL handles complex objects, however.

- Postgres has a long history of having odd things grafted onto it. I know
little about this other than seeing the results. Extending Postgres would
not likely give you any parallelism, but there might be a way to support
complex objects through Postgres JSON object support.

On Sun, May 3, 2020 at 11:09 AM Tugdual Grall <[email protected]> wrote:

> Hello
>
> It has been a long time since I used Drill!
>
> I wanted to build it to start work on a new datasource.
>
> But when I run "mvn clean install", I hit the exception below.
>
> => Can somebody help?
>
> => This brings me to a generic question: if I want to expose a NoSQL
> database using SQL/JDBC/ODBC for Analytics purposes, is Drill the best
> option, or should I look at something else?
>
>
> Thanks!
>
> ====
> [INFO] exec/Java Execution Engine ......................... FAILURE [ 0.676 s]
>
> [ERROR] Failed to execute goal on project drill-java-exec: Could not
> resolve dependencies
> for project org.apache.drill.exec:drill-java-exec:jar:1.18.0-SNAPSHOT:
> Failed to collect dependencies at org.kohsuke:libpam4j:jar:1.8-rev2: Failed
> to read artifact descriptor for org.kohsuke:libpam4j:jar:1.8-rev2: Could
> not transfer artifact org.kohsuke:libpam4j:pom:1.8-rev2 from/to
> mapr-releases (http://repository.mapr.com/maven/): Transfer failed for
>
> http://repository.mapr.com/maven/org/kohsuke/libpam4j/1.8-rev2/libpam4j-1.8-rev2.pom
> 500 Proxy Error -> [Help 1]
> [ERROR]
> [ERROR] To see the full stack trace of the errors, re-run Maven with the -e
> switch.
> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
> [ERROR]
> [ERROR] For more information about the errors and possible solutions,
> please read the following articles:
> [ERROR] [Help 1]
>
> http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException
> [ERROR]
> [ERROR] After correcting the problems, you can resume the build with the
> command
> [ERROR]  mvn <args> -rf :drill-java-exec
>
