Hi Akshay,

Regarding point 2, Drill should be able to write to S3 buckets.

-- C
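To make the point concrete, writing Parquet to S3 from Drill is normally done with CREATE TABLE AS (CTAS). A minimal sketch, assuming an S3 storage plugin named `s3` has been configured with a writable workspace called `tmp` (both names are assumptions, not from the thread):

```sql
-- Sketch only: assumes an `s3` storage plugin with a writable `tmp` workspace.
-- Drill writes Parquet when store.format is 'parquet' (the default).
ALTER SESSION SET `store.format` = 'parquet';

CREATE TABLE s3.tmp.`product_snapshot` AS
SELECT * FROM dfs.`/home/product/product_name`;
```

With this, Drill's own writer produces the Parquet files directly in the bucket, so no separate convert-and-upload process is needed.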
> On May 27, 2021, at 1:09 PM, Akshay Bhasin (BLOOMBERG/ 731 LEX) <abhasi...@bloomberg.net> wrote:
>
> Hi Ted,
>
> Yes, sure. Below are the two reasons for it:
>
> 1) If I run 5 Drill machines in a cluster, all connected to a single endpoint at S3, I'll have to use the machines to create the Parquet files. There are two sub-questions here:
>
> - I'm not sure whether a single Drill endpoint is exposed for me to query, or whether I need to know the IP address/hostname of a machine to run the query from. For instance, if the 5 machines have 5 IPs (A, B, C, D, and E), I need to know the IP to access the Drill node. Isn't there a unique cluster ID I can use, where all requests will be load-balanced?
>
> - What if a node goes down? For instance, on a single node (say A in the example above), one user is running a read query and at the same time I run a CREATE TABLE query. That would block and congest the node.
>
> 2) This is a minor one, and I could be wrong: I'm not sure Drill can write to an S3 bucket. I think you can only put/upload files there; you cannot write to it.
>
> Best,
> Akshay
>
> From: ted.dunn...@gmail.com At: 05/27/21 12:57:07 UTC-4:00
> To: Akshay Bhasin (BLOOMBERG/ 731 LEX)
> Cc: dev@drill.apache.org
> Subject: Re: Feature/Question
>
> Akshay,
>
> I don't understand why you can't use Drill to create the Parquet files. Can you say more?
>
> Is there a language constraint? A process constraint?
>
> As I hear it, you are asking "I don't want to use Drill to create Parquet; I want to use something else." The problem is that there are tons of other ways. I start from not understanding your needs (because I think Drill is the easiest way for me to create Parquet files) and then have no idea which direction you are headed.
>
> Just a little more definition could help me (and others) help you.
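On the load-balancing sub-question in point 1: Drillbits register themselves in ZooKeeper, so clients usually connect through the ZooKeeper quorum rather than a fixed node IP, and any available Drillbit can become the foreman for the query. A sketch with hypothetical hostnames (zk1, zk2, zk3 are assumptions):

```
# Hypothetical ZooKeeper hosts; the cluster ID (drillbits1 by default)
# comes from drill.exec.cluster-id in drill-override.conf.
sqlline -u "jdbc:drill:zk=zk1:2181,zk2:2181,zk3:2181/drill/drillbits1"
```

Because the connection goes through ZooKeeper rather than one machine's address, a single node going down does not take the cluster endpoint with it.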
> On Thu, May 27, 2021 at 8:18 AM Akshay Bhasin (BLOOMBERG/ 731 LEX) <abhasi...@bloomberg.net> wrote:
>
> Hi Drill Team,
>
> I have another question: is there a Python Parquet module you provide/support which I can leverage to create the .parquet and .parquet.crc files that Drill creates?
>
> I currently have a Drill cluster and I want to use it for reading the data, but not for creating the Parquet files.
>
> I'm aware of other modules, but I want to preserve the speed and optimization of Drill, so I'm particularly looking for the module Drill uses to convert files to .parquet and .parquet.crc.
>
> My end goal here is to have a Drill cluster reading data from S3, and a separate process to convert data to .parquet and .parquet.crc files and upload it to S3.
>
> Best,
> Akshay
>
> From: ted.dunn...@gmail.com At: 04/27/21 17:37:43 UTC-4:00
> To: Akshay Bhasin (BLOOMBERG/ 731 LEX)
> Cc: dev@drill.apache.org
> Subject: Re: Feature/Question
>
> Akshay,
>
> That's great news!
>
> On Tue, Apr 27, 2021 at 1:10 PM Akshay Bhasin (BLOOMBERG/ 731 LEX) <abhasi...@bloomberg.net> wrote:
>
> Hi Ted,
>
> Thanks for reaching out. Yes, the below worked successfully.
>
> I was able to create different objects in S3 like 'XXX/YYY/filename' and 'XXX/ZZZ/filename', and was able to query them like
> SELECT * FROM XXX.
>
> Thanks!
>
> Best,
> Akshay
>
> From: ted.dunn...@gmail.com At: 04/21/21 17:21:42
> To: Akshay Bhasin (BLOOMBERG/ 731 LEX), dev@drill.apache.org
> Subject: Re: Feature/Question
>
> Akshay,
>
> Yes, it is possible to do what you want, from a few different angles.
>
> As you have noted, S3 doesn't have directories. Not really. On the other hand, people simulate them using naming schemes, and S3 has some support for this.
>
> One of the simplest ways to deal with this is to create a view that explicitly mentions every S3 object that you have in your table. The contents of this view can get a bit cumbersome, but that shouldn't be a problem since users never need to know.
> You will need to set up a scheduled action to update this view occasionally, but that is pretty simple.
>
> The other way is to use a naming scheme with a delimiter such as /. This is described at
> https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-prefixes.html
> If you do that, have files named (for instance) foo/a.json, foo/b.json, and foo/c.json, and you query
>
> select * from s3.`foo`
>
> you should see the contents of a.json, b.json, and c.json.
>
> I haven't tried this, however, so I am simply going on the reports of others. If this works for you, please report your success back here.
>
> On Wed, Apr 21, 2021 at 11:34 AM Akshay Bhasin (BLOOMBERG/ 731 LEX) <abhasi...@bloomberg.net> wrote:
>
> Hi Drill Community,
>
> I'm Akshay, and I'm using Drill for a project I'm working on.
>
> There is a particular use case I want to implement, and I want to know if it's possible.
>
> 1) Currently, we have a partitioned file system and we create a view on top of it. For example, we have the directory structure below:
>
> /home/product/product_name/year/month/day/*.parquet
> /home/product/product_name_2/year/month/day/*.parquet
> /home/product/product_name_3/year/month/day/*.parquet
>
> Then we create a view over it:
>
> CREATE VIEW temp AS SELECT `dir0` AS prod, `dir1` AS year, `dir2` AS month, `dir3` AS day, * FROM dfs.`/home/product`;
>
> Then we can query all the data dynamically:
>
> SELECT * FROM temp LIMIT 5;
>
> 2) Now I want to replicate this behavior via S3, and I want to ask if it's possible. I was able to create a logical directory, but S3 inherently does not support directories, only objects.
>
> Therefore, I was curious to know if this is supported, or whether there is a way to do it. I was unable to find any documentation on your website about partitioning data on S3.
>
> Thanks for your help.
> Best,
> Akshay
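Given a prefix scheme like the one Ted describes, the dfs view from the original question carries over to S3 largely unchanged, since Drill exposes each / segment of the object key as dir0, dir1, and so on. A sketch, assuming an S3 storage plugin named `s3` rooted at the bucket, object keys shaped like product_name/year/month/day/file.parquet, and a writable dfs.tmp workspace for the view (all assumptions, not from the thread):

```sql
-- Sketch only: `s3` plugin root and key layout are assumptions.
CREATE VIEW dfs.tmp.s3_temp AS
SELECT `dir0` AS prod, `dir1` AS `year`, `dir2` AS `month`, `dir3` AS `day`, *
FROM s3.`product`;

SELECT * FROM dfs.tmp.s3_temp LIMIT 5;
```

The delimiter-based prefixes behave like the directory partitions in the dfs example, so partition pruning on prod/year/month/day should work the same way.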