Happy New Year!
Here's another two cents. Make that five now that I scan this email again!
Excluding our Docker Hub images (which are popular), Drill is downloaded
~1000 times a month [1] (order of magnitude, it's hard to count
genuinely new installations from web server downloads).
What roles are these folks in? I'm a data engineer by day and I don't
think that we count for a large share of those downloads. The DEs I work
with are risk-averse sorts that tend to favour setups with rigid schemas
early on and no surprises for their users at query time. Add to that a
second stat from the download data: the biggest single download user OS
is Windows, at about 50% [1]. Some of these users may go on to copy that
download to a server environment but I have a theory that many of them
go on to run embedded Drill right there on beefy Windows laptops.
I conjecture that most of the people reaching for Drill are analysts or
developers working _away_ from an established, shared data
infrastructure. There may not be any shared data engineering where they
are, or they may find themselves in a fashionable "Data Mesh"
environment [2]. I'm probably abusing Data Mesh a bit here in that I'm
told that it mainly proposes a federation of distinct data _teams_,
rather than of data _systems_ but, if you entertain my cynical
formulation of "Data Mesh guys! Silos aren't uncool any more!" just a
bit, then you can well imagine why a user in a Data Mesh might look for
something like Drill to combine data from different silos on their own
machine. Tangentially, this suggests to me that we should keep putting
effort into: embedded Drill, Windows support, rapid installation and
setup, low "time to insight".
MongoDB questions still come up frequently, giving a reason beyond the
JSON file questions to think that the JSON data model is still very
important. Wherever we decide to bound the current EVF v2 data model
implementation, maybe we can sketch out a design of whatever is
unimplemented in some updates to the Drill wiki pages? This would give
other devs a head start if we decide that some unsupported complex data
type is worth implementing down the road.
1. https://infra-reports.apache.org/#downloads&project=drill
2. https://martinfowler.com/articles/data-mesh-principles.html
Regards
James
On 2024/01/01 03:16, Charles Givre wrote:
I'll throw my .02 here... As a user of Drill, I've only had the occasion to
use the Union once. However, when I used it, it consumed so much memory, we
ended up finding a workaround anyway and stopped using it. Honestly, since we
improved the implicit casting rules, I think Drill is a lot smarter about how
it reads data anyway. Bottom line, I do think we could drop the union and
repeated union.
The repeated lists and maps however are unfortunately something that does come
up a bit. Honestly, I'm not sure what work is remaining here but TBH Drill
works pretty well at the moment with most of the data I'm using it for. This
would include some really nasty nested JSON objects.
-- C
On Dec 31, 2023, at 01:38, Paul Rogers <par0...@gmail.com> wrote:
Hi Luoc,
Thanks for reminding me about the EVF V2 work. I got mostly done adding
projection for complex types, then got busy on other projects. I've yet to
tackle the hard cases: unions, repeated unions and repeated lists (which
are, in fact, repeated repeated unions).
The code to handle unprojected fields in these areas is getting awfully
complicated. In doing that work, and then seeing a trick that Druid uses,
I'm tempted to rework the projection bits of the code to use a cleaner
approach. However, it might be better to commit the work done thus far so
folks can use it before I wander off to take another approach.
Then, I wondered if anyone actually still uses this stuff. Do you still
need the code to handle non-projection of complex types?
Of course, perhaps no one will ever need the hard cases: I've never been
convinced that unions, repeated lists, or arrays of repeated lists are
things that any sane data engineer will want to use -- or use more than
once.
Thanks,
- Paul
On Sat, Dec 30, 2023 at 10:26 PM James Turton <dz...@apache.org> wrote:
Hi Luoc and Drill devs!
It's best to email Paul directly since he doesn't follow these lists
closely. In the meantime I've prepared a PR of backported fixes for
1.21.2 to the 1.21 branch [1]. I think we can try to get the Netty
upgrade that Maksym is working on, and which looks close to done,
included? There's at least one CVE applicable to our current version of
Netty...
Regards
James
1. https://github.com/apache/drill/pull/2860
On 2023/12/11 04:41, luoc wrote:
Hello all,
1.22 will be a more stable version. This is a digression: Is Paul
still interested in participating in the EVF V2 refactoring in the
framework? I would like to offer time to assist him.
luoc
On Dec 9, 2023, at 01:01, Charles Givre <cgi...@gmail.com> wrote:
Hello all,
Happy Friday everyone! I wanted to raise the topic of getting a Drill
minor release out the door before the end of the year. My opinion is that
I'd really like to release Drill 1.22 once the integration with Apache
Daffodil is complete, but it sounds like that is still a few weeks away.
What does everyone think about issuing a maintenance release before the
end of the year? There are a number of significant fixes including some
security updates and a major bug in the ES plugin that basically makes it
unusable.
Best,
-- C