Hi Ted,

Yes, sure - below are the two reasons for it:
1) If I run 5 Drill machines in a cluster, all connected to a single S3 endpoint, I'd have to use those machines to create the parquet files. There are two sub-questions here:
- I'm not sure whether a single Drill endpoint is exposed for me to query, or whether I need to know the IP address/hostname of a specific machine to run the query from. For instance, if the 5 machines have 5 IPs - A, B, C, D & E - I need to know which IP to use to reach a Drill node. Isn't there a unique cluster ID I can use so that all requests are load-balanced?
- What if a node goes down? For instance, on a single node (say A in the above example), what if one user is running a read query and at the same time I run a create-table query? That would block and congest the node.

2) This is a minor one - and I could be wrong - but I'm not sure Drill can write to an S3 bucket. I think you can only put/upload files there; you cannot write to it.

Best,
Akshay

From: ted.dunn...@gmail.com
At: 05/27/21 12:57:07 UTC-4:00
To: Akshay Bhasin (BLOOMBERG/ 731 LEX)
Cc: dev@drill.apache.org
Subject: Re: Feature/Question

Akshay,

I don't understand why you can't use Drill to create the parquet files. Can you say more? Is there a language constraint? A process constraint? As I hear it, you are asking "I don't want to use Drill to create parquet, I want to use something else". The problem is that there are tons of other ways. I start with not understanding your needs (coz I think Drill is the easiest way for me to create parquet files) and then have no idea which direction you are headed. Just a little more definition could help me (and others) help you.

On Thu, May 27, 2021 at 8:18 AM Akshay Bhasin (BLOOMBERG/ 731 LEX) <abhasi...@bloomberg.net> wrote:

Hi Drill Team,

I've another question - is there a Python parquet module you provide/support which I can leverage to create the .parquet & .parquet.crc files which Drill creates? I currently have a Drill cluster and I want to use it for reading the data, but not for creating the parquet files.
I'm aware of other modules, but I want to preserve the speed and optimization of Drill - so I'm particularly looking for the module which Drill uses to convert files to parquet & parquet.crc. My end goal here is to have a Drill cluster reading data from S3, and a separate process converting data to parquet & parquet.crc files and uploading them to S3.

Best,
Akshay

From: ted.dunn...@gmail.com
At: 04/27/21 17:37:43 UTC-4:00
To: Akshay Bhasin (BLOOMBERG/ 731 LEX)
Cc: dev@drill.apache.org
Subject: Re: Feature/Question

Akshay,

That's great news!

On Tue, Apr 27, 2021 at 1:10 PM Akshay Bhasin (BLOOMBERG/ 731 LEX) <abhasi...@bloomberg.net> wrote:

Hi Ted,

Thanks for reaching out. Yes - the below worked successfully. I was able to create different objects in S3 like 'XXX/YYY/filename' and 'XXX/ZZZ/filename', and was able to query them like SELECT * FROM XXX. Thanks!

Best,
Akshay

From: ted.dunn...@gmail.com
At: 04/21/21 17:21:42
To: Akshay Bhasin (BLOOMBERG/ 731 LEX), dev@drill.apache.org
Subject: Re: Feature/Question

Akshay,

Yes. It is possible to do what you want from a few different angles. As you have noted, S3 doesn't have directories. Not really. On the other hand, people simulate this using naming schemes, and S3 has some support for this.

One of the simplest ways to deal with this is to create a view that explicitly mentions every S3 object that you have in your table. The contents of this view can get a bit cumbersome, but that shouldn't be a problem since users never need to know. You will need to set up a scheduled action to update this view occasionally, but that is pretty simple.

The other way is to use a naming scheme with a delimiter such as /. This is described at https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-prefixes.html

If you do that and have files named (for instance) foo/a.json, foo/b.json, foo/c.json and you query

select * from s3.`foo`

you should see the contents of a.json, b.json and c.json.
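[A small illustration of why this delimiter scheme works: S3 stores only flat keys, but listing with a Delimiter parameter groups them into "common prefixes" that behave like directories. The sketch below mimics that grouping in plain Python; the key names are the ones from the foo/a.json example above.]

```python
# Sketch: how a '/'-delimited naming scheme simulates directories on a
# flat S3 keyspace. S3's ListObjects with Delimiter='/' performs this
# grouping server-side; we reproduce it here for illustration only.
from collections import defaultdict

keys = ["foo/a.json", "foo/b.json", "foo/c.json", "bar/d.json"]

def common_prefixes(keys, delimiter="/"):
    """Group flat keys by their first path segment, the way S3's
    ListObjects does when given a Delimiter parameter."""
    groups = defaultdict(list)
    for key in keys:
        prefix, _, rest = key.partition(delimiter)
        groups[prefix + delimiter].append(rest)
    return dict(groups)

# 'foo/' behaves like a directory containing a.json, b.json, c.json,
# which is what lets a query over s3.`foo` pick up all three objects.
assert common_prefixes(keys)["foo/"] == ["a.json", "b.json", "c.json"]
```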
See here for commentary. I haven't tried this, however, so I am simply going on the reports of others. If this works for you, please report your success back here.

On Wed, Apr 21, 2021 at 11:34 AM Akshay Bhasin (BLOOMBERG/ 731 LEX) <abhasi...@bloomberg.net> wrote:

Hi Drill Community,

I'm Akshay and I'm using Drill for a project I'm working on. There is a particular use case I want to implement, and I want to know if it's possible.

1) Currently, we have a partitioned file system and we create a view on top of it. For example, we have the directory structure below:

/home/product/product_name/year/month/day/*parquet
/home/product/product_name_2/year/month/day/*parquet
/home/product/product_name_3/year/month/day/*parquet

Now, we create a view over it:

CREATE VIEW temp AS SELECT `dir0` AS prod, `dir1` AS year, `dir2` AS month, `dir3` AS day, * FROM dfs.`/home/product`;

Then, we can query all the data dynamically:

SELECT * FROM temp LIMIT 5;

2) Now I want to replicate this behavior via S3, and I want to ask if it's possible. I was able to create a logical directory, but S3 inherently does not support directories, only objects. Therefore, I was curious to know whether this is supported, or whether there is a way to do it. I was unable to find any documentation on your website related to partitioning data on S3.

Thanks for your help.

Best,
Akshay
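[For reference, a sketch of how Drill's implicit dir0..dirN columns map onto the partition layout quoted in the last email, since that mapping is what the `dir0 AS prod, dir1 AS year, ...` view relies on. The paths are the ones from the example above; the helper function is hypothetical, written only to make the mapping explicit.]

```python
# Sketch: deriving the dir0, dir1, ... values Drill exposes for files
# under a workspace root. Paths follow the /home/product example from
# the thread; implicit_dirs is an illustrative helper, not a Drill API.
import os

paths = [
    "/home/product/product_name/2021/05/27/part.parquet",
    "/home/product/product_name_2/2021/05/28/part.parquet",
]

def implicit_dirs(path, workspace_root="/home/product"):
    """Return the subdirectory names, outermost first, between the
    workspace root and the file - i.e. dir0, dir1, dir2, dir3."""
    rel = os.path.relpath(os.path.dirname(path), workspace_root)
    return rel.split(os.sep)

# dir0 = product, dir1 = year, dir2 = month, dir3 = day, matching
# `SELECT dir0 AS prod, dir1 AS year, ...` in the view above.
assert implicit_dirs(paths[0]) == ["product_name", "2021", "05", "27"]
```

[Replicating the same `prefix/year/month/day/file.parquet` key scheme on S3 is what the delimiter approach described earlier in the thread would give you.]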