Hi Drill Team,

I have another question - is there a Python Parquet module you provide/support that I can leverage to create the .parquet & .parquet.crc files that Drill creates?

I currently have a Drill cluster & I want to use it for reading the data, but not for creating the Parquet files. I'm aware of other modules, but I want to preserve the speed & optimization of Drill - so I'm particularly looking for the module which Drill uses to convert files to .parquet & .parquet.crc. My end goal here is to have a Drill cluster reading data from S3 & a separate process to convert data to .parquet & .parquet.crc files & upload them to S3.

Best,
Akshay

From: ted.dunn...@gmail.com
At: 04/27/21 17:37:43 UTC-4:00
To: Akshay Bhasin (BLOOMBERG/ 731 LEX )
Cc: dev@drill.apache.org
Subject: Re: Feature/Question

Akshay,

That's great news!

On Tue, Apr 27, 2021 at 1:10 PM Akshay Bhasin (BLOOMBERG/ 731 LEX) <abhasi...@bloomberg.net> wrote:

Hi Ted,

Thanks for reaching out. Yes - the below worked successfully. I was able to create different objects in S3 like 'XXX/YYY/filename', 'XXX/ZZZ/filename' and was able to query them with SELECT * FROM XXX.

Thanks!

Best,
Akshay

From: ted.dunn...@gmail.com
At: 04/21/21 17:21:42
To: Akshay Bhasin (BLOOMBERG/ 731 LEX ), dev@drill.apache.org
Subject: Re: Feature/Question

Akshay,

Yes. It is possible to do what you want from a few different angles.

As you have noted, S3 doesn't have directories. Not really. On the other hand, people simulate this using naming schemes and S3 has some support for this.

One of the simplest ways to deal with this is to create a view that explicitly mentions every S3 object that you have in your table. The contents of this view can get a bit cumbersome, but that shouldn't be a problem since users never need to know. You will need to set up a scheduled action to update this view occasionally, but that is pretty simple.

The other way is to use a naming scheme with a delimiter such as /.
This is described at https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-prefixes.html

If you do that and have files named (for instance) foo/a.json, foo/b.json, foo/c.json and you query

    select * from s3.`foo`

you should see the contents of a.json, b.json and c.json. See here for commentary.

I haven't tried this, however, so I am simply going on the reports of others. If this works for you, please report your success back here.

On Wed, Apr 21, 2021 at 11:34 AM Akshay Bhasin (BLOOMBERG/ 731 LEX) <abhasi...@bloomberg.net> wrote:

Hi Drill Community,

I'm Akshay and I'm using Drill for a project I'm working on. There is a particular use case I want to implement - I want to know if it's possible.

1) Currently, we have a partitioned file system and we create a view on top of it. For example, we have the directory structure below -

    /home/product/product_name/year/month/day/*parquet
    /home/product/product_name_2/year/month/day/*parquet
    /home/product/product_name_3/year/month/day/*parquet

Now, we create a view over it -

    CREATE VIEW temp AS SELECT `dir0` AS prod, `dir1` AS year, `dir2` AS month, `dir3` AS day, * FROM dfs.`/home/product`;

Then, we can query all the data dynamically -

    SELECT * FROM temp LIMIT 5;

2) Now I want to replicate this behavior on S3. I want to ask if it's possible - I was able to create a logical directory, but S3 inherently does not support directories, only objects. Therefore, I was curious to know whether this is supported or whether there is a way to do it. I was unable to find any documentation on your website related to partitioning data on S3.

Thanks for your help.

Best,
Akshay
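
Editor's sketch of the naming scheme discussed in this thread: the product/year/month/day layout from the original message can be carried over to S3 purely through object keys that use / as a delimiter. The helper names below (partition_key, list_common_prefixes, dir_columns) and the root name "product" are hypothetical and used for illustration only. The listing function emulates S3's prefix/delimiter grouping locally, without calling AWS, and the thread does not confirm whether Drill exposes dir0, dir1, ... columns for S3 paths, so dir_columns only shows the analogy.

```python
from datetime import date


def partition_key(product: str, day: date, filename: str, root: str = "product") -> str:
    """Build an object key that mirrors root/product/year/month/day/file."""
    return f"{root}/{product}/{day.year:04d}/{day.month:02d}/{day.day:02d}/{filename}"


def list_common_prefixes(keys, prefix: str, delimiter: str = "/"):
    """Emulate S3 prefix/delimiter listing on a local list of keys.

    Returns the distinct 'subdirectories' found directly under prefix.
    """
    found = set()
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        head, sep, _ = rest.partition(delimiter)
        if sep:  # a delimiter follows, so head acts like a directory name
            found.add(prefix + head + delimiter)
    return sorted(found)


def dir_columns(key: str, root: str = "product"):
    """Return the path segments between root and the file name.

    This is only an analogy to the dir0, dir1, ... columns used in the view above.
    """
    if not key.startswith(root + "/"):
        raise ValueError(f"key {key!r} is not under root {root!r}")
    relative = key[len(root) + 1:]
    return relative.split("/")[:-1]


if __name__ == "__main__":
    # Hypothetical object keys for two products.
    keys = [
        partition_key("product_name", date(2021, 4, 21), "part-0.parquet"),
        partition_key("product_name_2", date(2021, 4, 27), "part-0.parquet"),
    ]
    print(list_common_prefixes(keys, "product/"))
    print(dir_columns(keys[0]))
```

Zero-padding the month and day keeps keys in chronological order when listed lexicographically, which is a design choice of this sketch rather than something the thread requires.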