Hello Paul, Drillbits already support distributed MongoDB deployments, whether the cluster runs as a replica set or as a sharded cluster. This approach is friendly to the data nodes, just like Drill on HBase.
Replica set
Sharded
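As a rough illustration of why spreading scan work over a sharded cluster helps, the sketch below assigns shards to Drillbits round-robin so that no single node has to talk to every shard. It is only a sketch under stated assumptions: `assign_data_nodes` and the node names are hypothetical, and this is not the Mongo plugin's actual assignment logic.

```python
def assign_data_nodes(drill_nodes, data_nodes):
    """Spread data nodes (e.g. shards) across Drill nodes round-robin.

    Hypothetical helper for illustration only; not Drill's real planner code.
    """
    if not drill_nodes:
        raise ValueError("need at least one Drill node")
    assignment = {drill_node: [] for drill_node in drill_nodes}
    for i, data_node in enumerate(data_nodes):
        # Node i goes to Drill node i modulo the number of Drill nodes.
        target = drill_nodes[i % len(drill_nodes)]
        assignment[target].append(data_node)
    return assignment


# Made-up cluster sizes: 10 Drillbits reading from 20 shards.
drillbits = [f"drillbit-{i}" for i in range(10)]
shards = [f"shard-{i}" for i in range(20)]
plan = assign_data_nodes(drillbits, shards)
```

With these made-up sizes each Drillbit ends up reading from two shards, so the reads run in parallel instead of funnelling through one entry point.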
In addition, it is a good idea to drop support for unions and repeated lists when handling unprojected columns, because Drill's goal has always been to query and analyze raw data in place rather than to become an ETL tool. We don't need to devote a lot of time to features that see very little use.

> On Jan 2, 2024, at 04:33, Paul Rogers <par0...@gmail.com> wrote:
>
> Hi All,
>
> My two cents on Charles' other points: about Drill's use with Mongo or
> Druid. If this is common, we might want to put more effort into the
> integrations above the level of the reader. I'm most familiar with Druid,
> so let's use that as an example.
>
> Druid provides a SQL interface, so it is convenient to forward Drill
> queries to Druid as SQL. But, Druid has a very limited distribution
> architecture: it is two-level: the coordinator and the data nodes. This
> means we've got, say, 10 Drill nodes, that pick one Drill node to be the
> reader that talks to the one Druid coordinator, that then talks to, say, 20
> data nodes. This is clearly a bottleneck, and will never perform anywhere
> near what Druid's native UI can do.
>
> So, a better approach is to bypass Druid SQL and use Druid native queries.
> Bypass the coordinator and talk directly to the data nodes. Now, we have
> our 10 Drill nodes each talking to two Druid data nodes, providing a
> parallelism far better than Druid itself provides. Drill's distributed
> sort, join and windowing functionality is far more scalable than Druid's
> single-node-only functionality.
>
> Druid is optimized for small, simple queries that power dashboards. Druid
> frowns on "BI" use cases that touch large chunks of data. In Druid, the
> coordinator is the bottleneck: BI queries against the coordinator kill
> dashboard SLAs. With the above setup, Drill would provide a wonderful,
> scalable BI solution for Druid that does not degrade the system because
> Drill would no longer put load on Druid's weak link: the coordinator node.
>
> Mongo is also distributed. Does it have the same potential to use Drill to
> distribute work to avoid a similar bottleneck?
>
> To give MapR some credit, MapR-DB had a client that allowed distributed
> queries. The Drill integration with MapR-DB was supposed to use an approach
> similar to the one outlined above for Druid.
>
> Alas, the above trick won't work for a traditional DBMS using JDBC.
> However, if the DB is sharded, then, with the right metadata, Drill could
> distribute queries to the shards so the DB's own query system doesn't have
> to.
>
> So there you have it, a fun weekend project for someone familiar with the
> details of a particular distributed DB.
>
> Thanks,
>
> - Paul
>
>
> On Mon, Jan 1, 2024 at 7:17 AM Charles Givre <cgi...@gmail.com> wrote:
>
>> To continue the thread hijacking....
>>
>> I'd agree with what James is saying. What if we were to create a docker
>> container (or some sort of package) that included Drill, Superset and all
>> associated configuration stuff so that a user could just run a docker
>> command and have a fully functional Drill instance set up with Superset?
>>
>> Regarding the JSON, for a while we were working on updating all the
>> plugins to use EVF2. From my recollection, we got all the formats
>> converted except for parquet (major project) and HDF5 (PR pending:
>> https://github.com/apache/drill/pull/2515). We had also started working
>> on removing the old JSON reader, however, there were a few places it reared
>> its head:
>> 1. The Druid plugin. I wrote a draft PR that is pending to swap it out
>> for the EVF JSON reader but haven't touched it in a really long time.
>> (https://github.com/apache/drill/pull/2657)
>> 2. The Mongo plugin: No work there...
>> 3. The conversion UDFs. Work started.
>> (https://github.com/apache/drill/pull/2567)
>>
>> In any event, given the interest in Mongo/Drill, it might be worthwhile to
>> take a look at the Mongo plugin to see what it would take to swap out the
>> old JSON reader for the EVF one.
>> Regarding unprojected columns, if that's the holdup, I'd say scrap that
>> feature for complex data types.
>>
>> What do you think?
>>
>>
>>> On Jan 1, 2024, at 07:57, James Turton <dz...@apache.org> wrote:
>>>
>>> P.P.S. since I'm spamming this thread today. With
>>>
>>>> this suggests to me that we should keep putting effort into: embedded Drill, Windows support, rapid installation and setup, low "time to insight".
>>>
>>> I'm not going so far as to suggest that Drill be thought of as desktop software, rather that ad hoc Drill deployments working on small (GB) to big (TB) data may be as, or more, important than long lived, heavily integrated, professionally managed deployments working on really Big data (PB). Perhaps the last category belongs almost entirely to BigQuery, Athena, Snowflake and the like nowadays anyway.
>>>
>>> I still think a cluster is often the most effective way to deploy Drill so the question contemplated is really "Can we make it faster and easier to spin up a cluster (and embedded Drill), connect to data sources and start running (successful) queries"?
>>>
>>> On 2024/01/01 07:33, James Turton wrote:
>>>> P.S. I also have an admittedly vague idea about deprecating the UNION data type, which still breaks things in many operators, in favour of a different approach where we kick any invalid data encountered while loading column FOO out to a generated _FOO_EXCEPTIONS VARCHAR (or VARBINARY, though binary data formats tend not to be malformed?) column. This would let a query over dirty data complete without invisible data swallowing, and would mean we could cut further effort on UNION support.
>>>>
>>>> On 2024/01/01 07:11, James Turton wrote:
>>>>> Happy New Year!
>>>>>
>>>>> Here's another two cents. Make that five now that I scan this email again!
>>>>>
>>>>> Excluding our Docker Hub images (which are popular), Drill is downloaded ~1000 times a month [1] (order of magnitude, it's hard to count genuinely new installations from web server downloads).
>>>>>
>>>>> What roles are these folks in? I'm a data engineer by day and I don't think that we account for a large share of those downloads. The DEs I work with are risk averse sorts that tend to favour setups with rigid schemas early on and no surprises for their users at query time. Add to that a second stat from the download data: the biggest single download user OS is Windows, at about 50% [1]. Some of these users may go on to copy that download to a server environment but I have a theory that many of them go on to run embedded Drill right there on beefy Windows laptops.
>>>>>
>>>>> I conjecture that most of the people reaching for Drill are analysts or developers working _away_ from an established, shared data infrastructure. There may not be any shared data engineering where they are, or they may find themselves in a fashionable "Data Mesh" environment [2]. I'm probably abusing Data Mesh a bit here in that I'm told that it mainly proposes a federation of distinct data _teams_, rather than of data _systems_ but, if you entertain my cynical formulation of "Data Mesh guys! Silos aren't uncool any more!" just a bit, then you can well imagine why a user in a Data Mesh might look for something like Drill to combine data from different silos on their own machine. Tangentially this suggests to me that we should keep putting effort into: embedded Drill, Windows support, rapid installation and setup, low "time to insight".
>>>>>
>>>>> MongoDB questions still come up frequently giving a reason beyond the JSON files questions to think that the JSON data model is still very important. Wherever we decide to bound the current EVF v2 data model implementation, maybe we can sketch out a design of whatever is unimplemented in some updates to the Drill wiki pages? This would give other devs a head start if we decide that some unsupported complex data type is worth implementing down the road.
>>>>>
>>>>> 1. https://infra-reports.apache.org/#downloads&project=drill
>>>>> 2. https://martinfowler.com/articles/data-mesh-principles.html
>>>>>
>>>>> Regards
>>>>> James
>>>>>
>>>>> On 2024/01/01 03:16, Charles Givre wrote:
>>>>>> I'll throw my .02 here... As a user of Drill, I've only had the occasion to use the Union once. However, when I used it, it consumed so much memory, we ended up finding a workaround anyway and stopped using it. Honestly, since we improved the implicit casting rules, I think Drill is a lot smarter about how it reads data anyway. Bottom line, I do think we could drop the union and repeated union.
>>>>>>
>>>>>> The repeated lists and maps however are unfortunately something that does come up a bit. Honestly, I'm not sure what work is remaining here but TBH Drill works pretty well at the moment with most of the data I'm using it for. This would include some really nasty nested JSON objects.
>>>>>>
>>>>>> -- C
>>>>>>
>>>>>>
>>>>>>> On Dec 31, 2023, at 01:38, Paul Rogers <par0...@gmail.com> wrote:
>>>>>>>
>>>>>>> Hi Luoc,
>>>>>>>
>>>>>>> Thanks for reminding me about the EVF V2 work. I got mostly done adding
>>>>>>> projection for complex types, then got busy on other projects. I've yet to
>>>>>>> tackle the hard cases: unions, repeated unions and repeated lists (which
>>>>>>> are, in fact, repeated repeated unions).
>>>>>>>
>>>>>>> The code to handle unprojected fields in these areas is getting awfully
>>>>>>> complicated. In doing that work, and then seeing a trick that Druid uses,
>>>>>>> I'm tempted to rework the projection bits of the code to use a cleaner
>>>>>>> approach. However, it might be better to commit the work done thus far so
>>>>>>> folks can use it before I wander off to take another approach.
>>>>>>>
>>>>>>> Then, I wondered if anyone actually still uses this stuff. Do you still
>>>>>>> need the code to handle non-projection of complex types?
>>>>>>>
>>>>>>> Of course, perhaps no one will ever need the hard cases: I've never been
>>>>>>> convinced that unions, repeated lists, or arrays of repeated lists are
>>>>>>> things that any sane data engineer will want to use -- or use more than
>>>>>>> once.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> - Paul
>>>>>>>
>>>>>>>
>>>>>>> On Sat, Dec 30, 2023 at 10:26 PM James Turton <dz...@apache.org> wrote:
>>>>>>>
>>>>>>>> Hi Luoc and Drill devs!
>>>>>>>>
>>>>>>>> It's best to email Paul directly since he doesn't follow these lists
>>>>>>>> closely. In the meantime I've prepared a PR of backported fixes for
>>>>>>>> 1.21.2 to the 1.21 branch [1]. I think we can try to get the Netty
>>>>>>>> upgrade that Maksym is working on, and which looks close to done,
>>>>>>>> included? There's at least one CVE applicable to our current version of
>>>>>>>> Netty...
>>>>>>>>
>>>>>>>> Regards
>>>>>>>> James
>>>>>>>>
>>>>>>>>
>>>>>>>> 1. https://github.com/apache/drill/pull/2860
>>>>>>>>
>>>>>>>> On 2023/12/11 04:41, luoc wrote:
>>>>>>>>> Hello all,
>>>>>>>>> 1.22 will be a more stable version. This is a digression: Is Paul still interested in participating in the EVF V2 refactoring in the framework? I would like to offer time to assist him.
>>>>>>>>> luoc
>>>>>>>>>
>>>>>>>>>> On Dec 9, 2023, at 01:01, Charles Givre <cgi...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Hello all,
>>>>>>>>>> Happy Friday everyone! I wanted to raise the topic of getting a Drill minor release out the door before the end of the year. My opinion is that I'd really like to release Drill 1.22 once the integration with Apache Daffodil is complete, but it sounds like that is still a few weeks away.
>>>>>>>>>> What does everyone think about issuing a maintenance release before the end of the year? There are a number of significant fixes including some security updates and a major bug in the ES plugin that basically makes it unusable.
>>>>>>>>>> Best,
>>>>>>>>>> -- C
>>>>>>>>
>>>>>
>>>>
>>>
>>
>>
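James's suggestion earlier in this thread, routing invalid values for a column FOO into a generated _FOO_EXCEPTIONS column instead of falling back to UNION, can be sketched roughly as follows. This is a minimal illustration assuming an integer target type; `load_int_column` is a hypothetical helper, not a Drill or EVF API.

```python
def load_int_column(rows, name):
    """Load column `name` as integers, diverting bad values to a side column.

    Hypothetical sketch of the _FOO_EXCEPTIONS idea; not Drill code.
    """
    values, exceptions = [], []
    for row in rows:
        raw = row.get(name)
        try:
            # Missing values stay NULL; everything else must convert to int.
            values.append(None if raw is None else int(raw))
            exceptions.append(None)
        except (TypeError, ValueError):
            # Keep the row, null out the typed column, and preserve the
            # offending value as text so nothing is silently swallowed.
            values.append(None)
            exceptions.append(str(raw))
    return {name: values, f"_{name}_EXCEPTIONS": exceptions}


rows = [{"FOO": 1}, {"FOO": "2"}, {"FOO": "oops"}, {"FOO": [3]}, {}]
cols = load_int_column(rows, "FOO")
```

The query still completes over dirty data, and the malformed inputs stay visible in the side column rather than forcing the whole column into a UNION type.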