This seems like a reasonable feature request. It could also be expanded to 
detect the underlying block size for the location being written to.
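
For the detection part, the Hadoop FileSystem API already exposes the
default block size for a given path, so a rough sketch could look like
this (the path name is just for illustration):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Look up the block size HDFS would use for the target location,
    // so the writer could match it instead of a fixed configured value.
    Path target = new Path("/data/output/part-0.parquet");  // hypothetical
    FileSystem fs = target.getFileSystem(new Configuration());
    long hdfsBlockSize = fs.getDefaultBlockSize(target);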


Could you file a JIRA for this?


Thanks

Kunal

________________________________
From: François Méthot <[email protected]>
Sent: Thursday, March 23, 2017 9:08:51 AM
To: [email protected]
Subject: Re: Single Hdfs block per parquet file

After further investigation, Drill uses the parquet-hadoop ParquetFileWriter (
https://github.com/Parquet/parquet-mr/blob/master/parquet-hadoop/src/main/java/parquet/hadoop/ParquetFileWriter.java
).
This is where the file creation occurs, so it might be tricky after all.

However, ParquetRecordWriter.java (
https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetRecordWriter.java)
in Drill creates the ParquetFileWriter with a Hadoop Configuration object.

So something to explore: could the block size be set as a property
on the Configuration object before it is passed to the ParquetFileWriter
constructor?
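
Untested sketch of that idea, assuming the HDFS client picks up
dfs.blocksize from the Configuration when the writer creates the file
(newWriter is just an illustrative helper, not actual Drill code):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import parquet.hadoop.ParquetFileWriter;
    import parquet.schema.MessageType;

    // Set the desired block size on the Configuration before handing it
    // to ParquetFileWriter, so the create() call underneath would use it.
    ParquetFileWriter newWriter(MessageType schema, Path file, long blockSize)
        throws IOException {
      Configuration conf = new Configuration();
      conf.setLong("dfs.blocksize", blockSize);  // e.g. store.parquet.block-size
      return new ParquetFileWriter(conf, schema, file);
    }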

François

On Wed, Mar 22, 2017 at 11:55 PM, Padma Penumarthy <[email protected]>
wrote:

> Yes, seems like it is possible to create files with different block sizes.
> We could potentially pass the configured store.parquet.block-size to the
> create call.
> I will try it out and see; will let you know.
>
> Thanks,
> Padma
>
>
> > On Mar 22, 2017, at 4:16 PM, François Méthot <[email protected]>
> wrote:
> >
> > Here is the link I could find:
> >
> > http://archive.cloudera.com/cdh4/cdh/4/hadoop/api/org/apache/hadoop/fs/FileSystem.html#create(org.apache.hadoop.fs.Path,%20boolean,%20int,%20short,%20long)
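> >
> > That create() overload takes the block size as its last argument, so a
> > file could be created with a block size matching the parquet block size.
> > A rough sketch (the file name and buffer size are just for illustration):
> >
> >     import org.apache.hadoop.conf.Configuration;
> >     import org.apache.hadoop.fs.FSDataOutputStream;
> >     import org.apache.hadoop.fs.FileSystem;
> >     import org.apache.hadoop.fs.Path;
> >
> >     Path file = new Path("/data/out.parquet");   // hypothetical path
> >     FileSystem fs = file.getFileSystem(new Configuration());
> >     FSDataOutputStream out = fs.create(
> >         file,
> >         true,                            // overwrite if it exists
> >         4096,                            // io buffer size
> >         fs.getDefaultReplication(file),  // keep default replication
> >         128L * 1024 * 1024);             // block size: 128 MB per file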
> >
> > Francois
> >
> > On Wed, Mar 22, 2017 at 4:29 PM, Padma Penumarthy <[email protected]>
> > wrote:
> >
> >> I think we create one file for each parquet block.
> >> If the underlying HDFS block size is 128 MB and the parquet block size
> >> is > 128 MB, it will create more blocks on HDFS.
> >> Can you let me know which HDFS API would allow you to
> >> do otherwise?
> >>
> >> Thanks,
> >> Padma
> >>
> >>
> >>> On Mar 22, 2017, at 11:54 AM, François Méthot <[email protected]>
> >> wrote:
> >>>
> >>> Hi,
> >>>
> >>> Is there a way to force Drill to store a CTAS-generated parquet file as
> >>> a single block when using HDFS? The Java HDFS API allows this: files
> >>> could be created with the Parquet block size.
> >>>
> >>> We are using Drill on HDFS configured with a block size of 128 MB.
> >>> Changing this size is not an option at this point.
> >>>
> >>> It would be ideal for us to have a single parquet file per HDFS block.
> >>> Setting store.parquet.block-size to 128 MB would fix our issue, but we
> >>> would end up with a lot more files to deal with.
> >>>
> >>> Thanks
> >>> Francois
> >>
> >>
>
>
