Re: Next Version

James Turton Sun, 31 Dec 2023 21:11:41 -0800

Happy New Year!

Here's another two cents. Make that five now that I scan this email again!

Excluding our Docker Hub images (which are popular), Drill is downloaded ~1000 times a month [1] (order of magnitude, it's hard to count genuinely new installations from web server downloads).

What roles are these folks in? I'm a data engineer by day and I don't think that we count for a large share of those downloads. The DEs I work with are risk averse sorts that tend to favour setups with rigid schemas early on and no surprises for their users at query time. Add to that a second stat from the download data: the biggest single download user OS is Windows, at about 50% [1]. Some of these users may go on to copy that download to a server environment but I have a theory that many of them go on to run embedded Drill right there on beefy Windows laptops.

I conjecture that most of the people reaching for Drill are analysts or developers working _away_ from an established, shared data infrastructure. There may not be any shared data engineering where they are, or they may find themselves in a fashionable "Data Mesh" environment [2]. I'm probably abusing Data Mesh a bit here in that I'm told that it mainly proposes a federation of distinct data _teams_, rather than of data _systems_ but, if you entertain my cynical formulation of "Data Mesh guys! Silos aren't uncool any more!" just a bit, then you can well imagine why a user in a Data Mesh might look for something like Drill to combine data from different silos on their own machine. Tangentially, this suggests to me that we should keep putting effort into: embedded Drill, Windows support, rapid installation and setup, low "time to insight".

MongoDB questions still come up frequently, giving a reason beyond the JSON files questions to think that the JSON data model is still very important. Wherever we decide to bound the current EVF v2 data model implementation, maybe we can sketch out a design of whatever is unimplemented in some updates to the Drill wiki pages? This would give other devs a head start if we decide that some unsupported complex data type is worth implementing down the road.


1. https://infra-reports.apache.org/#downloads&project=drill
2. https://martinfowler.com/articles/data-mesh-principles.html

Regards
James

On 2024/01/01 03:16, Charles Givre wrote:

I'll throw my .02 here...  As a user of Drill, I've only had the occasion to 
use the Union once.  However, when I used it, it consumed so much memory, we 
ended up finding a workaround anyway and stopped using it.  Honestly, since we 
improved the implicit casting rules, I think Drill is a lot smarter about how 
it reads data anyway.  Bottom line, I do think we could drop the union and 
repeated union.

The repeated lists and maps however are unfortunately something that does come 
up a bit.   Honestly, I'm not sure what work is remaining here but TBH Drill 
works pretty well at the moment with most of the data I'm using it for.  This 
would include some really nasty nested JSON objects.

-- C

On Dec 31, 2023, at 01:38, Paul Rogers <[email protected]> wrote:

Hi Luoc,

Thanks for reminding me about the EVF V2 work. I got mostly done adding
projection for complex types, then got busy on other projects. I've yet to
tackle the hard cases: unions, repeated unions and repeated lists (which
are, in fact, repeated repeated unions).

The code to handle unprojected fields in these areas is getting awfully
complicated. In doing that work, and then seeing a trick that Druid uses,
I'm tempted to rework the projection bits of the code to use a cleaner
approach. However, it might be better to commit the work done thus far so
folks can use it before I wander off to take another approach.

Then, I wondered if anyone actually still uses this stuff. Do you still
need the code to handle non-projection of complex types?

Of course, perhaps no one will ever need the hard cases: I've never been
convinced that unions, repeated lists, or arrays of repeated lists are
things that any sane data engineer will want to use -- or use more than
once.

Thanks,

- Paul


On Sat, Dec 30, 2023 at 10:26 PM James Turton <[email protected]> wrote:

Hi Luoc and Drill devs!

It's best to email Paul directly since he doesn't follow these lists
closely. In the meantime I've prepared a PR of backported fixes for
1.21.2 to the 1.21 branch [1]. I think we can try to get the Netty
upgrade that Maksym is working on, and which looks close to done,
included? There's at least one CVE applicable to our current version of
Netty...

Regards
James


1. https://github.com/apache/drill/pull/2860

On 2023/12/11 04:41, luoc wrote:

Hello all,
1.22 will be a more stable version. This is a digression: Is Paul
still interested in participating in the EVF V2 refactoring in the
framework? I would like to offer time to assist him.

luoc

On Dec 9, 2023, at 01:01, Charles Givre <[email protected]> wrote:

Hello all,
Happy Friday everyone! I wanted to raise the topic of getting a Drill
minor release out the door before the end of the year. My opinion is that
I'd really like to release Drill 1.22 once the integration with Apache
Daffodil is complete, but it sounds like that is still a few weeks away.

What does everyone think about issuing a maintenance release before the
end of the year? There are a number of significant fixes, including some
security updates and a fix for a major bug in the ES plugin that basically
makes it unusable.

Best,
-- C
