Hi Akshay,

Regarding point 2, Drill should be able to write to S3 buckets.

-- C
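To make the point concrete, writing Parquet to S3 from Drill is normally done with CREATE TABLE AS (CTAS). A minimal sketch, assuming an S3 storage plugin named `s3` has been configured with a writable workspace called `tmp` (both names are assumptions, not from the thread):

```sql
-- Sketch only: assumes an `s3` storage plugin with a writable `tmp` workspace.
-- Drill writes Parquet when store.format is 'parquet' (the default).
ALTER SESSION SET `store.format` = 'parquet';

CREATE TABLE s3.tmp.`product_snapshot` AS
SELECT * FROM dfs.`/home/product/product_name`;
```

With this, Drill's own writer produces the Parquet files directly in the bucket, so no separate convert-and-upload process is needed.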
> On May 27, 2021, at 1:09 PM, Akshay Bhasin (BLOOMBERG/ 731 LEX) <abhasi...@bloomberg.net> wrote:
>
> Hi Ted,
>
> Yes, sure. Below are the two reasons for it:
>
> 1) If I run 5 Drill machines in a cluster, all connected to a single endpoint at S3, I'll have to use the machines to create the Parquet files. There are two sub-questions here:
>
> - I'm not sure whether a single Drill endpoint is exposed for me to query, or whether I need to know the IP address/hostname of a machine to run the query from. For instance, if the 5 machines have 5 IPs (A, B, C, D, and E), I need to know the IP to access the Drill node. Isn't there a unique cluster ID I can use, where all requests will be load-balanced?
>
> - What if a node goes down? For instance, on a single node (say A in the example above), one user is running a read query and at the same time I run a CREATE TABLE query. That would block and congest the node.
>
> 2) This is a minor one, and I could be wrong: I'm not sure Drill can write to an S3 bucket. I think you can only put/upload files there; you cannot write to it.
>
> Best,
> Akshay
>
> From: ted.dunn...@gmail.com At: 05/27/21 12:57:07 UTC-4:00
> To: Akshay Bhasin (BLOOMBERG/ 731 LEX)
> Cc: dev@drill.apache.org
> Subject: Re: Feature/Question
>
> Akshay,
>
> I don't understand why you can't use Drill to create the Parquet files. Can you say more?
>
> Is there a language constraint? A process constraint?
>
> As I hear it, you are asking "I don't want to use Drill to create Parquet; I want to use something else." The problem is that there are tons of other ways. I start from not understanding your needs (because I think Drill is the easiest way for me to create Parquet files) and then have no idea which direction you are headed.
>
> Just a little more definition could help me (and others) help you.
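On the load-balancing sub-question in point 1: Drillbits register themselves in ZooKeeper, so clients usually connect through the ZooKeeper quorum rather than a fixed node IP, and any available Drillbit can become the foreman for the query. A sketch with hypothetical hostnames (zk1, zk2, zk3 are assumptions):

```
# Hypothetical ZooKeeper hosts; the cluster ID (drillbits1 by default)
# comes from drill.exec.cluster-id in drill-override.conf.
sqlline -u "jdbc:drill:zk=zk1:2181,zk2:2181,zk3:2181/drill/drillbits1"
```

Because the connection goes through ZooKeeper rather than one machine's address, a single node going down does not take the cluster endpoint with it.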
> On Thu, May 27, 2021 at 8:18 AM Akshay Bhasin (BLOOMBERG/ 731 LEX) <abhasi...@bloomberg.net> wrote:
>
> Hi Drill Team,
>
> I have another question: is there a Python Parquet module you provide/support which I can leverage to create the .parquet and .parquet.crc files that Drill creates?
>
> I currently have a Drill cluster and I want to use it for reading the data, but not for creating the Parquet files.
>
> I'm aware of other modules, but I want to preserve the speed and optimization of Drill, so I'm particularly looking for the module Drill uses to convert files to .parquet and .parquet.crc.
>
> My end goal here is to have a Drill cluster reading data from S3, and a separate process to convert data to .parquet and .parquet.crc files and upload it to S3.
>
> Best,
> Akshay
>
> From: ted.dunn...@gmail.com At: 04/27/21 17:37:43 UTC-4:00
> To: Akshay Bhasin (BLOOMBERG/ 731 LEX)
> Cc: dev@drill.apache.org
> Subject: Re: Feature/Question
>
> Akshay,
>
> That's great news!
>
> On Tue, Apr 27, 2021 at 1:10 PM Akshay Bhasin (BLOOMBERG/ 731 LEX) <abhasi...@bloomberg.net> wrote:
>
> Hi Ted,
>
> Thanks for reaching out. Yes, the below worked successfully.
>
> I was able to create different objects in S3 like 'XXX/YYY/filename' and 'XXX/ZZZ/filename', and was able to query them like
> SELECT * FROM XXX.
>
> Thanks!
>
> Best,
> Akshay
>
> From: ted.dunn...@gmail.com At: 04/21/21 17:21:42
> To: Akshay Bhasin (BLOOMBERG/ 731 LEX), dev@drill.apache.org
> Subject: Re: Feature/Question
>
> Akshay,
>
> Yes, it is possible to do what you want, from a few different angles.
>
> As you have noted, S3 doesn't have directories. Not really. On the other hand, people simulate them using naming schemes, and S3 has some support for this.
>
> One of the simplest ways to deal with this is to create a view that explicitly mentions every S3 object that you have in your table. The contents of this view can get a bit cumbersome, but that shouldn't be a problem since users never need to know.
> You will need to set up a scheduled action to update this view occasionally, but that is pretty simple.
>
> The other way is to use a naming scheme with a delimiter such as /. This is described at
> https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-prefixes.html
> If you do that, have files named (for instance) foo/a.json, foo/b.json, and foo/c.json, and you query
>
> select * from s3.`foo`
>
> you should see the contents of a.json, b.json, and c.json.
>
> I haven't tried this, however, so I am simply going on the reports of others. If this works for you, please report your success back here.
>
> On Wed, Apr 21, 2021 at 11:34 AM Akshay Bhasin (BLOOMBERG/ 731 LEX) <abhasi...@bloomberg.net> wrote:
>
> Hi Drill Community,
>
> I'm Akshay, and I'm using Drill for a project I'm working on.
>
> There is a particular use case I want to implement, and I want to know if it's possible.
>
> 1) Currently, we have a partitioned file system and we create a view on top of it. For example, we have the directory structure below:
>
> /home/product/product_name/year/month/day/*.parquet
> /home/product/product_name_2/year/month/day/*.parquet
> /home/product/product_name_3/year/month/day/*.parquet
>
> Then we create a view over it:
>
> CREATE VIEW temp AS SELECT `dir0` AS prod, `dir1` AS year, `dir2` AS month, `dir3` AS day, * FROM dfs.`/home/product`;
>
> Then we can query all the data dynamically:
>
> SELECT * FROM temp LIMIT 5;
>
> 2) Now I want to replicate this behavior via S3, and I want to ask if it's possible. I was able to create a logical directory, but S3 inherently does not support directories, only objects.
>
> Therefore, I was curious to know if this is supported, or whether there is a way to do it. I was unable to find any documentation on your website about partitioning data on S3.
>
> Thanks for your help.
> Best,
> Akshay
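Given a prefix scheme like the one Ted describes, the dfs view from the original question carries over to S3 largely unchanged, since Drill exposes each / segment of the object key as dir0, dir1, and so on. A sketch, assuming an S3 storage plugin named `s3` rooted at the bucket, object keys shaped like product_name/year/month/day/file.parquet, and a writable dfs.tmp workspace for the view (all assumptions, not from the thread):

```sql
-- Sketch only: `s3` plugin root and key layout are assumptions.
CREATE VIEW dfs.tmp.s3_temp AS
SELECT `dir0` AS prod, `dir1` AS `year`, `dir2` AS `month`, `dir3` AS `day`, *
FROM s3.`product`;

SELECT * FROM dfs.tmp.s3_temp LIMIT 5;
```

The delimiter-based prefixes behave like the directory partitions in the dfs example, so partition pruning on prod/year/month/day should work the same way.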