Re: Feature/Question

Akshay Bhasin (BLOOMBERG/ 731 LEX) Thu, 27 May 2021 08:24:25 -0700

Hi Drill Team,

I have another question: is there a Python Parquet module you provide or support 
that I can use to create the .parquet and .parquet.crc files that Drill creates?


I currently have a drill cluster & I want to use it for reading the data but 
not creating the parquet files.

I'm aware of other modules, but I want to preserve the speed and optimization of 
Drill, so I'm particularly looking for the module that Drill uses to convert 
files to .parquet and .parquet.crc.

My end goal here is to have a drill cluster reading data from s3 & a separate 
process to convert data to parquet & parquet.crc files & upload it to s3. 

Best,
Akshay

From: [email protected] At: 04/27/21 17:37:43 UTC-4:00 To: Akshay Bhasin 
(BLOOMBERG/ 731 LEX)
Cc:  [email protected]
Subject: Re: Feature/Question


Akshay,

That's great news!

On Tue, Apr 27, 2021 at 1:10 PM Akshay Bhasin (BLOOMBERG/ 731 LEX) 
<[email protected]> wrote:

Hi Ted,

Thanks for reaching out. Yes - the below worked successfully. 

I was able to create objects in S3 such as 'XXX/YYY/filename' and 
'XXX/ZZZ/filename', and then query them with 
SELECT * FROM XXX.

Thanks!

Best,
Akshay

From: [email protected] At: 04/21/21 17:21:42 To: Akshay Bhasin 
(BLOOMBERG/ 731 LEX), [email protected]
Subject: Re: Feature/Question


Akshay,

Yes. It is possible to do what you want from a few different angles.

As you have noted, S3 doesn't have directories. Not really. On the other hand, 
people simulate this using naming schemes and S3 has some support for this.

One of the simplest ways to deal with this is to create a view that explicitly 
mentions every S3 object that you have in your table. The contents of this view 
can get a bit cumbersome, but that shouldn't be a problem since users never 
need to know. You will need to set up a scheduled action to update this view 
occasionally, but that is pretty simple.
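
As a rough illustration of the scheduled update described above, here is a 
minimal Python sketch that builds the view definition from a list of object 
keys. The view name dfs.tmp.`all_products` and the helper build_view_sql are 
hypothetical, the s3 schema name is taken from the query later in this message, 
and it assumes every object has a compatible schema so that UNION ALL over 
SELECT * is valid.

```python
def build_view_sql(view_name, schema, object_keys):
    """Return a CREATE OR REPLACE VIEW statement that unions every listed object.

    view_name and schema are passed through unchanged; both are assumptions
    about how the storage plugin and a writable workspace are configured.
    """
    if not object_keys:
        raise ValueError("at least one object key is required")
    selects = []
    for key in sorted(object_keys):
        # Backticks quote the object path, so a key containing one is rejected.
        if "`" in key:
            raise ValueError(f"unsupported character in key: {key!r}")
        selects.append(f"SELECT * FROM {schema}.`{key}`")
    body = "\nUNION ALL\n".join(selects)
    return f"CREATE OR REPLACE VIEW {view_name} AS\n{body}"


if __name__ == "__main__":
    # Hypothetical view name and keys, for illustration only.
    print(build_view_sql("dfs.tmp.`all_products`", "s3",
                         ["foo/a.json", "foo/b.json"]))
```

A scheduled job could regenerate this statement from the current object 
listing and submit it to Drill, so users only ever see the view name.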

The other way is to use a naming scheme with a delimiter such as /. This is 
described at 
https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-prefixes.html
If you do that and have files named (for instance) foo/a.json, foo/b.json, 
foo/c.json and you query 

    select * from s3.`foo`

you should see the contents of a.json, b.json and c.json. See here for 
commentary

I haven't tried this, however, so I am simply going on the reports of others. 
If this works for you, please report your success back here.
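
To make the naming scheme concrete, here is a small Python sketch of which 
object keys fall under a table name when / is used as the delimiter. It only 
models the naming convention; it does not call S3 or Drill, and the helper 
keys_under_prefix is hypothetical.

```python
def keys_under_prefix(all_keys, table_name, delimiter="/"):
    """Return the keys that sit "inside" table_name under the delimiter scheme."""
    # Appending the delimiter keeps "foo" from also matching "foobar/...".
    prefix = table_name.rstrip(delimiter) + delimiter
    return sorted(key for key in all_keys if key.startswith(prefix))


if __name__ == "__main__":
    keys = ["foo/a.json", "foo/b.json", "foo/c.json",
            "foobar/x.json", "bar/d.json"]
    print(keys_under_prefix(keys, "foo"))
```

Under this convention, the keys returned for foo are the ones a query against 
s3.`foo` would be expected to read.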


On Wed, Apr 21, 2021 at 11:34 AM Akshay Bhasin (BLOOMBERG/ 731 LEX) 
<[email protected]> wrote:

Hi Drill Community, 

I'm Akshay and I'm using Drill for a project I'm working on.

There is a particular use case I want to implement, and I'd like to know if 
it's possible.

1) Currently, we have a partitioned file system and we create a view on top of 
it. For example, we have the directory structure below:

/home/product/product_name/year/month/day/*parquet
/home/product/product_name_2/year/month/day/*parquet
/home/product/product_name_3/year/month/day/*parquet

Now, we create a view over it - 
Create view temp AS SELECT `dir0` AS prod, `dir1` as year, `dir2` as month, 
`dir3` as day, * from dfs.`/home/product`;

Then, we can query all the data dynamically - 
SELECT * from temp LIMIT 5;
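
For readers unfamiliar with the dir0..dir3 columns used in the view, this 
Python sketch shows how they correspond to the directory levels below the 
queried root. The helper dir_columns and the file name data.parquet are 
hypothetical; the sketch only mirrors the mapping and does not call Drill.

```python
from pathlib import PurePosixPath


def dir_columns(root, file_path):
    """Map each directory level below root to a dirN name, starting at dir0."""
    relative = PurePosixPath(file_path).relative_to(PurePosixPath(root))
    # Every path component except the final file name is a directory level.
    return {f"dir{i}": part for i, part in enumerate(relative.parts[:-1])}


if __name__ == "__main__":
    print(dir_columns("/home/product",
                      "/home/product/product_name/2021/04/21/data.parquet"))
```

With the layout above, dir0 carries the product name and dir1 through dir3 
carry the year, month, and day, which is what the view renames.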

2) Now I want to replicate this behavior on S3, and I'd like to ask whether 
it's possible. I was able to create a logical directory, but S3 inherently does 
not support directories, only objects.

Therefore, I was curious to know whether this is supported or whether there is 
a way to do it. I was unable to find any documentation on your website related 
to partitioning data on S3.

Thanks for your help.
Best,
Akshay
