Re: Next Version

luoc Tue, 02 Jan 2024 00:58:09 -0800

Hello Paul,

Drillbits support distributed Mongo deployments, whether a replica set or a 
sharded cluster. This approach is friendly to the data nodes, just like 
drill-on-hbase.
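
As a rough illustration of that kind of distribution, here is a toy Python
sketch. It is not the Mongo storage plugin's actual code: `assign_shards` and
the node names are made up for the example. It spreads shards over drillbits
round-robin, using the 10 Drill nodes and 20 data nodes from Paul's Druid
example quoted below:

```python
from typing import Dict, List


def assign_shards(drillbits: List[str], shards: List[str]) -> Dict[str, List[str]]:
    """Spread shards over drillbits round-robin (hypothetical helper)."""
    if not drillbits:
        raise ValueError("at least one drillbit is required")
    # Every drillbit gets an entry, even if it ends up with no shard to read.
    assignment: Dict[str, List[str]] = {bit: [] for bit in drillbits}
    for i, shard in enumerate(shards):
        assignment[drillbits[i % len(drillbits)]].append(shard)
    return assignment


# Assumed node names, for illustration only.
drillbits = [f"drillbit-{i}" for i in range(10)]
shards = [f"shard-{i}" for i in range(20)]
plan = assign_shards(drillbits, shards)
```

With these inputs each drillbit reads from two shards directly, so no single
node has to funnel all of the data.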


[Image: Replica set]

[Image: Sharded]

In addition, it is a good idea to drop support for unions and repeated lists 
when handling unprojected columns, because the goal of Drill has always been to 
drill into the raw data and analyze it, not to become an ETL tool.

So we don't have to devote a lot of time to developing features that are 
rarely used.
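
For comparison, James's suggestion quoted further down (diverting invalid
values for column FOO into a generated _FOO_EXCEPTIONS column instead of
relying on UNION) might look roughly like this. It is only a toy Python
sketch: `load_int_column` and the dict-of-lists batch are invented for
illustration and are not Drill APIs.

```python
from typing import Any, Dict, List, Optional


def load_int_column(name: str, raw_values: List[Any]) -> Dict[str, List[Optional[Any]]]:
    """Load values as integers; route unparseable ones to _<NAME>_EXCEPTIONS."""
    good: List[Optional[int]] = []
    bad: List[Optional[str]] = []
    for value in raw_values:
        try:
            good.append(int(value))
            bad.append(None)
        except (TypeError, ValueError):
            # Keep the row: null in the typed column, raw text in the side column.
            good.append(None)
            bad.append(str(value))
    return {name: good, f"_{name}_EXCEPTIONS": bad}


batch = load_int_column("FOO", [1, "2", "n/a", 4])
```

The query still completes over dirty data, and the bad value stays visible in
the side column rather than being swallowed.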

> On Jan 2, 2024, at 04:33, Paul Rogers <par0...@gmail.com> wrote:
> 
> Hi All,
> 
> My two cents on Charles' other points: about Drill's use with Mongo or
> Druid. If this is common, we might want to put more effort into the
> integrations above the level of the reader. I'm most familiar with Druid,
> so let's use that as an example.
> 
> Druid provides a SQL interface, so it is convenient to forward Drill
> queries to Druid as SQL. But, Druid has a very limited distribution
> architecture: it is two-level: the coordinator and the data nodes. This
> means we've got, say, 10 Drill nodes, that pick one Drill node to be the
> reader that talks to the one Druid coordinator, that then talks to, say, 20
> data nodes. This is clearly a bottleneck, and will never perform anywhere
> near what Druid's native UI can do.
> 
> So, a better approach is to bypass Druid SQL and use Druid native queries.
> Bypass the coordinator and talk directly to the data nodes. Now, we have
> our 10 Drill nodes each talking to two Druid data nodes, providing a
> parallelism far better than Druid itself provides. Drill's distributed
> sort, join and windowing functionality is far more scalable than Druid's
> single-node-only functionality.
> 
> Druid is optimized for small, simple queries that power dashboards. Druid
> frowns on "BI" use cases that touch large chunks of data. In Druid, the
> coordinator is the bottleneck: BI queries against the coordinator kill
> dashboard SLAs. With the above setup, Drill would provide a wonderful,
> scalable BI solution for Druid that does not degrade the system because
> Drill would no longer put load on Druid's weak link: the coordinator node.
> 
> Mongo is also distributed. Does it have the same potential to use Drill to
> distribute work to avoid a similar bottleneck?
> 
> To give MapR some credit, MapR-DB had a client that allowed distributed
> queries. The Drill integration with MapR-DB was supposed to use an approach
> similar to the one outlined above for Druid.
> 
> Alas, the above trick won't work for a traditional DBMS using JDBC.
> However, if the DB is sharded, then, with the right metadata, Drill could
> distribute queries to the shards so the DB's own query system doesn't have
> to.
> 
> So there you have it, a fun weekend project for someone familiar with the
> details of a particular distributed DB.
> 
> Thanks,
> 
> - Paul
> 
> 
> On Mon, Jan 1, 2024 at 7:17 AM Charles Givre <cgi...@gmail.com> wrote:
> 
>> To continue the thread hijacking....
>> 
>> I'd agree with what James is saying.  What if we were to create a docker
>> container (or some sort of package) that included Drill, Superset and all
>> associated configuration stuff so that a user could just run a docker
>> command and have a fully functional Drill instance set up with Superset?
>> 
>> Regarding the JSON, for a while we were working on updating all the
>> plugins to use EVF2.  From my recollection, we got all the formats
>> converted except for parquet (major project) and HDF5 (PR pending:
>> https://github.com/apache/drill/pull/2515).  We had also started working
>> on removing the old JSON reader, however, there were a few places it reared
>> its head:
>> 1.  The Druid plugin.  I wrote a draft PR that is pending to swap it out
>> for the EVF JSON reader but haven't touched it in a really long time. (
>> https://github.com/apache/drill/pull/2657)
>> 2.  The Mongo plugin:  No work there...
>> 3.  The conversion UDFs.   Work started.  (
>> https://github.com/apache/drill/pull/2567)
>> 
>> In any event, given the interest in Mongo/Drill, it might be worthwhile to
>> take a look at the Mongo plugin to see what it would take to swap out the
>> old JSON reader for the EVF one.
>> Regarding unprojected columns, if that's the holdup, I'd say scrap that
>> feature for complex data types.
>> 
>> What do you think?
>> 
>> 
>>> On Jan 1, 2024, at 07:57, James Turton <dz...@apache.org> wrote:
>>> 
>>> P.P.S. since I'm spamming this thread today. With
>>> 
>>>> this suggests to me that we should keep putting effort into: embedded
>>>> Drill, Windows support, rapid installation and setup, low "time to insight".
>>> 
>>> I'm not going so far as to suggest that Drill be thought of as desktop
>>> software, rather that ad hoc Drill deployments working on small (GB) to big
>>> (TB) data may be as, or more, important than long lived, heavily
>>> integrated, professionally managed deployments working on really Big data
>>> (PB). Perhaps the last category belongs almost entirely to BigQuery,
>>> Athena, Snowflake and the like nowadays anyway.
>>> 
>>> I still think a cluster is often the most effective way to deploy
>>> Drill so the question contemplated is really "Can we make it faster and
>>> easier to spin up a cluster (and embedded Drill), connect to data sources
>>> and start running (successful) queries"?
>>> 
>>> On 2024/01/01 07:33, James Turton wrote:
>>>> P.S. I also have an admittedly vague idea about deprecating the UNION
>>>> data type, which still breaks things in many operators, in favour of a
>>>> different approach where we kick any invalid data encountered while loading
>>>> column FOO out to a generated _FOO_EXCEPTIONS VARCHAR (or VARBINARY, though
>>>> binary data formats tend not to be malformed?) column. This would let a
>>>> query over dirty data complete without invisible data swallowing, and would
>>>> mean we could cut further effort on UNION support.
>>>> 
>>>> On 2024/01/01 07:11, James Turton wrote:
>>>>> Happy New Year!
>>>>> 
>>>>> Here's another two cents. Make that five now that I scan this email
>>>>> again!
>>>>> 
>>>>> Excluding our Docker Hub images (which are popular), Drill is
>>>>> downloaded ~1000 times a month [1] (order of magnitude, it's hard to count
>>>>> genuinely new installations from web server downloads).
>>>>> 
>>>>> What roles are these folks in? I'm a data engineer by day and I don't
>>>>> think that we account for a large share of those downloads. The DEs I work
>>>>> with are risk-averse sorts that tend to favour setups with rigid schemas
>>>>> early on and no surprises for their users at query time. Add to that a
>>>>> second stat from the download data: the biggest single download user OS is
>>>>> Windows, at about 50% [1]. Some of these users may go on to copy that
>>>>> download to a server environment but I have a theory that many of them go
>>>>> on to run embedded Drill right there on beefy Windows laptops.
>>>>> 
>>>>> I conjecture that most of the people reaching for Drill are analysts
>>>>> or developers working _away_ from an established, shared data
>>>>> infrastructure. There may not be any shared data engineering where they
>>>>> are, or they may find themselves in a fashionable "Data Mesh" environment
>>>>> [2]. I'm probably abusing Data Mesh a bit here in that I'm told that it
>>>>> mainly proposes a federation of distinct data _teams_, rather than of data
>>>>> _systems_ but, if you entertain my cynical formulation of "Data Mesh guys!
>>>>> Silos aren't uncool any more!" just a bit, then you can well imagine why a
>>>>> user in a Data Mesh might look for something like Drill to combine data
>>>>> from different silos on their own machine. Tangentially this suggests to me
>>>>> that we should keep putting effort into: embedded Drill, Windows support,
>>>>> rapid installation and setup, low "time to insight".
>>>>> 
>>>>> MongoDB questions still come up frequently, giving a reason beyond the
>>>>> JSON files questions to think that the JSON data model is still very
>>>>> important. Wherever we decide to bound the current EVF v2 data model
>>>>> implementation, maybe we can sketch out a design of whatever is
>>>>> unimplemented in some updates to the Drill wiki pages? This would give
>>>>> other devs a head start if we decide that some unsupported complex data
>>>>> type is worth implementing down the road?
>>>>> 
>>>>> 1. https://infra-reports.apache.org/#downloads&project=drill
>>>>> 2. https://martinfowler.com/articles/data-mesh-principles.html
>>>>> 
>>>>> Regards
>>>>> James
>>>>> 
>>>>> On 2024/01/01 03:16, Charles Givre wrote:
>>>>>> I'll throw my .02 here... As a user of Drill, I've only had the
>>>>>> occasion to use the Union once. However, when I used it, it consumed so
>>>>>> much memory, we ended up finding a workaround anyway and stopped using it.
>>>>>> Honestly, since we improved the implicit casting rules, I think Drill is a
>>>>>> lot smarter about how it reads data anyway. Bottom line, I do think we
>>>>>> could drop the union and repeated union.
>>>>>> 
>>>>>> The repeated lists and maps however are unfortunately something that
>>>>>> does come up a bit. Honestly, I'm not sure what work is remaining here
>>>>>> but TBH Drill works pretty well at the moment with most of the data I'm
>>>>>> using it for. This would include some really nasty nested JSON objects.
>>>>>> 
>>>>>> -- C
>>>>>> 
>>>>>> 
>>>>>>> On Dec 31, 2023, at 01:38, Paul Rogers <par0...@gmail.com> wrote:
>>>>>>> 
>>>>>>> Hi Luoc,
>>>>>>> 
>>>>>>> Thanks for reminding me about the EVF V2 work. I got mostly done adding
>>>>>>> projection for complex types, then got busy on other projects. I've yet to
>>>>>>> tackle the hard cases: unions, repeated unions and repeated lists (which
>>>>>>> are, in fact, repeated repeated unions).
>>>>>>> 
>>>>>>> The code to handle unprojected fields in these areas is getting awfully
>>>>>>> complicated. In doing that work, and then seeing a trick that Druid uses,
>>>>>>> I'm tempted to rework the projection bits of the code to use a cleaner
>>>>>>> approach. However, it might be better to commit the work done thus far so
>>>>>>> folks can use it before I wander off to take another approach.
>>>>>>> 
>>>>>>> Then, I wondered if anyone actually still uses this stuff. Do you still
>>>>>>> need the code to handle non-projection of complex types?
>>>>>>> 
>>>>>>> Of course, perhaps no one will ever need the hard cases: I've never been
>>>>>>> convinced that unions, repeated lists, or arrays of repeated lists are
>>>>>>> things that any sane data engineer will want to use -- or use more than
>>>>>>> once.
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> 
>>>>>>> - Paul
>>>>>>> 
>>>>>>> 
>>>>>>> On Sat, Dec 30, 2023 at 10:26 PM James Turton <dz...@apache.org> wrote:
>>>>>>> 
>>>>>>>> Hi Luoc and Drill devs!
>>>>>>>> 
>>>>>>>> It's best to email Paul directly since he doesn't follow these lists
>>>>>>>> closely. In the meantime I've prepared a PR of backported fixes for
>>>>>>>> 1.21.2 to the 1.21 branch [1]. I think we can try to get the Netty
>>>>>>>> upgrade that Maksym is working on, and which looks close to done,
>>>>>>>> included? There's at least one CVE applicable to our current version of
>>>>>>>> Netty...
>>>>>>>> 
>>>>>>>> Regards
>>>>>>>> James
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 1. https://github.com/apache/drill/pull/2860
>>>>>>>> 
>>>>>>>> On 2023/12/11 04:41, luoc wrote:
>>>>>>>>> Hello all,
>>>>>>>>> 1.22 will be a more stable version. As a digression: is Paul still
>>>>>>>>> interested in participating in the EVF V2 refactoring in the
>>>>>>>>> framework? I would like to offer time to assist him.
>>>>>>>>> luoc
>>>>>>>>> 
>>>>>>>>>> On Dec 9, 2023, at 01:01, Charles Givre <cgi...@gmail.com> wrote:
>>>>>>>>>> 
>>>>>>>>>> Hello all,
>>>>>>>>>> Happy Friday everyone! I wanted to raise the topic of getting a Drill
>>>>>>>>>> minor release out the door before the end of the year. My opinion is that
>>>>>>>>>> I'd really like to release Drill 1.22 once the integration with Apache
>>>>>>>>>> Daffodil is complete, but it sounds like that is still a few weeks away.
>>>>>>>>>> What does everyone think about issuing a maintenance release before the
>>>>>>>>>> end of the year? There are a number of significant fixes including some
>>>>>>>>>> security updates and a fix for a major bug in the ES plugin that
>>>>>>>>>> basically makes it unusable.
>>>>>>>>>> Best,
>>>>>>>>>> -- C
>>>>>>>> 
>>>>> 
>>>> 
>>> 
>> 
>>
