Hi Ted,

Yes, sure - below are the two reasons for it:
1) If I run 5 Drill machines in a cluster, all connected to a single S3 endpoint, I'd have to use those machines to create the parquet files. There are two sub-questions here:
- I'm not sure whether a single Drill endpoint is exposed for me to query, or whether I need to know the IP address/hostname of a specific machine to run the query from. For instance, if the 5 machines have 5 IPs - A, B, C, D & E - I need to know which IP to use to reach a Drill node. Isn't there a unique cluster ID I can use so that all requests are load-balanced?
- What if a node goes down? For instance, on a single node (say A in the above example), what if one user is running a read query and at the same time I run a create-table query? That would block and congest the node.

2) This is a minor one - and I could be wrong - but I'm not sure Drill can write to an S3 bucket. I think you can only put/upload files there; you cannot write to it.

Best,
Akshay

From: ted.dunn...@gmail.com
At: 05/27/21 12:57:07 UTC-4:00
To: Akshay Bhasin (BLOOMBERG/ 731 LEX)
Cc: dev@drill.apache.org
Subject: Re: Feature/Question

Akshay,

I don't understand why you can't use Drill to create the parquet files. Can you say more? Is there a language constraint? A process constraint? As I hear it, you are asking "I don't want to use Drill to create parquet, I want to use something else". The problem is that there are tons of other ways. I start with not understanding your needs (coz I think Drill is the easiest way for me to create parquet files) and then have no idea which direction you are headed. Just a little more definition could help me (and others) help you.

On Thu, May 27, 2021 at 8:18 AM Akshay Bhasin (BLOOMBERG/ 731 LEX) <abhasi...@bloomberg.net> wrote:

Hi Drill Team,

I've another question - is there a Python parquet module you provide/support which I can leverage to create the .parquet & .parquet.crc files which Drill creates? I currently have a Drill cluster and I want to use it for reading the data, but not for creating the parquet files.
I'm aware of other modules, but I want to preserve the speed and optimization of Drill - so I'm particularly looking for the module which Drill uses to convert files to parquet & parquet.crc. My end goal here is to have a Drill cluster reading data from S3, and a separate process converting data to parquet & parquet.crc files and uploading them to S3.

Best,
Akshay

From: ted.dunn...@gmail.com
At: 04/27/21 17:37:43 UTC-4:00
To: Akshay Bhasin (BLOOMBERG/ 731 LEX)
Cc: dev@drill.apache.org
Subject: Re: Feature/Question

Akshay,

That's great news!

On Tue, Apr 27, 2021 at 1:10 PM Akshay Bhasin (BLOOMBERG/ 731 LEX) <abhasi...@bloomberg.net> wrote:

Hi Ted,

Thanks for reaching out. Yes - the below worked successfully. I was able to create different objects in S3 like 'XXX/YYY/filename' and 'XXX/ZZZ/filename', and was able to query them like SELECT * FROM XXX. Thanks!

Best,
Akshay

From: ted.dunn...@gmail.com
At: 04/21/21 17:21:42
To: Akshay Bhasin (BLOOMBERG/ 731 LEX), dev@drill.apache.org
Subject: Re: Feature/Question

Akshay,

Yes. It is possible to do what you want from a few different angles. As you have noted, S3 doesn't have directories. Not really. On the other hand, people simulate this using naming schemes, and S3 has some support for this.

One of the simplest ways to deal with this is to create a view that explicitly mentions every S3 object that you have in your table. The contents of this view can get a bit cumbersome, but that shouldn't be a problem since users never need to know. You will need to set up a scheduled action to update this view occasionally, but that is pretty simple.

The other way is to use a naming scheme with a delimiter such as /. This is described at https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-prefixes.html

If you do that and have files named (for instance) foo/a.json, foo/b.json, foo/c.json and you query

select * from s3.`foo`

you should see the contents of a.json, b.json and c.json.
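[A small illustration of why this delimiter scheme works: S3 stores only flat keys, but listing with a Delimiter parameter groups them into "common prefixes" that behave like directories. The sketch below mimics that grouping in plain Python; the key names are the ones from the foo/a.json example above.]

```python
# Sketch: how a '/'-delimited naming scheme simulates directories on a
# flat S3 keyspace. S3's ListObjects with Delimiter='/' performs this
# grouping server-side; we reproduce it here for illustration only.
from collections import defaultdict

keys = ["foo/a.json", "foo/b.json", "foo/c.json", "bar/d.json"]

def common_prefixes(keys, delimiter="/"):
    """Group flat keys by their first path segment, the way S3's
    ListObjects does when given a Delimiter parameter."""
    groups = defaultdict(list)
    for key in keys:
        prefix, _, rest = key.partition(delimiter)
        groups[prefix + delimiter].append(rest)
    return dict(groups)

# 'foo/' behaves like a directory containing a.json, b.json, c.json,
# which is what lets a query over s3.`foo` pick up all three objects.
assert common_prefixes(keys)["foo/"] == ["a.json", "b.json", "c.json"]
```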
See here for commentary. I haven't tried this, however, so I am simply going on the reports of others. If this works for you, please report your success back here.

On Wed, Apr 21, 2021 at 11:34 AM Akshay Bhasin (BLOOMBERG/ 731 LEX) <abhasi...@bloomberg.net> wrote:

Hi Drill Community,

I'm Akshay and I'm using Drill for a project I'm working on. There is a particular use case I want to implement, and I want to know if it's possible.

1) Currently, we have a partitioned file system and we create a view on top of it. For example, we have the directory structure below:

/home/product/product_name/year/month/day/*parquet
/home/product/product_name_2/year/month/day/*parquet
/home/product/product_name_3/year/month/day/*parquet

Now, we create a view over it:

CREATE VIEW temp AS SELECT `dir0` AS prod, `dir1` AS year, `dir2` AS month, `dir3` AS day, * FROM dfs.`/home/product`;

Then, we can query all the data dynamically:

SELECT * FROM temp LIMIT 5;

2) Now I want to replicate this behavior via S3, and I want to ask if it's possible. I was able to create a logical directory, but S3 inherently does not support directories, only objects. Therefore, I was curious to know whether this is supported, or whether there is a way to do it. I was unable to find any documentation on your website related to partitioning data on S3.

Thanks for your help.

Best,
Akshay
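[For reference, a sketch of how Drill's implicit dir0..dirN columns map onto the partition layout quoted in the last email, since that mapping is what the `dir0 AS prod, dir1 AS year, ...` view relies on. The paths are the ones from the example above; the helper function is hypothetical, written only to make the mapping explicit.]

```python
# Sketch: deriving the dir0, dir1, ... values Drill exposes for files
# under a workspace root. Paths follow the /home/product example from
# the thread; implicit_dirs is an illustrative helper, not a Drill API.
import os

paths = [
    "/home/product/product_name/2021/05/27/part.parquet",
    "/home/product/product_name_2/2021/05/28/part.parquet",
]

def implicit_dirs(path, workspace_root="/home/product"):
    """Return the subdirectory names, outermost first, between the
    workspace root and the file - i.e. dir0, dir1, dir2, dir3."""
    rel = os.path.relpath(os.path.dirname(path), workspace_root)
    return rel.split(os.sep)

# dir0 = product, dir1 = year, dir2 = month, dir3 = day, matching
# `SELECT dir0 AS prod, dir1 AS year, ...` in the view above.
assert implicit_dirs(paths[0]) == ["product_name", "2021", "05", "27"]
```

[Replicating the same `prefix/year/month/day/file.parquet` key scheme on S3 is what the delimiter approach described earlier in the thread would give you.]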