This seems like a reasonable feature request. It could also be expanded to detect the underlying block size for the location being written to.
Could you file a JIRA for this?

Thanks,
Kunal

________________________________
From: François Méthot <[email protected]>
Sent: Thursday, March 23, 2017 9:08:51 AM
To: [email protected]
Subject: Re: Single Hdfs block per parquet file

After further investigation: Drill uses the Hadoop ParquetFileWriter
(https://github.com/Parquet/parquet-mr/blob/master/parquet-hadoop/src/main/java/parquet/hadoop/ParquetFileWriter.java).
This is where the file creation occurs, so it might be tricky after all.
However, ParquetRecordWriter.java
(https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetRecordWriter.java)
in Drill creates the ParquetFileWriter with a Hadoop Configuration object.

Something to explore: could the block size be set as a property within the
Configuration object before passing it to the ParquetFileWriter constructor?

François

On Wed, Mar 22, 2017 at 11:55 PM, Padma Penumarthy <[email protected]> wrote:
> Yes, it seems like it is possible to create files with different block sizes.
> We could potentially pass the configured store.parquet.block-size to the
> create call. I will try it out and see; I will let you know.
>
> Thanks,
> Padma
>
>> On Mar 22, 2017, at 4:16 PM, François Méthot <[email protected]> wrote:
>>
>> Here are 2 links I could find:
>>
>> http://archive.cloudera.com/cdh4/cdh/4/hadoop/api/org/apache/hadoop/fs/FileSystem.html#create(org.apache.hadoop.fs.Path,%20boolean,%20int,%20short,%20long)
>>
>> http://archive.cloudera.com/cdh4/cdh/4/hadoop/api/org/apache/hadoop/fs/FileSystem.html#create(org.apache.hadoop.fs.Path,%20boolean,%20int,%20short,%20long)
>>
>> Francois
>>
>> On Wed, Mar 22, 2017 at 4:29 PM, Padma Penumarthy <[email protected]> wrote:
>>
>>> I think we create one file for each parquet block.
>>> If the underlying HDFS block size is 128 MB and the parquet block size is > 128 MB,
>>> it will create more blocks on HDFS.
>>> Can you let me know what HDFS API would allow you to do otherwise?
>>>
>>> Thanks,
>>> Padma
>>>
>>>> On Mar 22, 2017, at 11:54 AM, François Méthot <[email protected]> wrote:
>>>>
>>>> Hi,
>>>>
>>>> Is there a way to force Drill to store a CTAS-generated parquet file as a
>>>> single block when using HDFS? The Java HDFS API allows that; files could
>>>> be created with the Parquet block size.
>>>>
>>>> We are using Drill on HDFS configured with a block size of 128 MB.
>>>> Changing this size is not an option at this point.
>>>>
>>>> It would be ideal for us to have a single parquet file per HDFS block;
>>>> setting store.parquet.block-size to 128 MB would fix our issue, but we
>>>> would end up with a lot more files to deal with.
>>>>
>>>> Thanks,
>>>> Francois
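
For reference, a minimal sketch of the approach discussed above: detect the block size
of the destination (as Kunal suggests) and pass the Parquet block size straight to the
five-argument FileSystem.create overload from the linked javadoc. The output path, the
128 MB value, and the class name are placeholders for illustration, not Drill's actual code.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SingleBlockFileSketch {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical destination; Drill would supply the real CTAS output path.
        Path dest = new Path("/tmp/ctas_output/0_0_0.parquet");

        // "Detect the underlying block size for the location being written to":
        // the filesystem reports its default block size for a given path.
        long hdfsBlockSize = fs.getDefaultBlockSize(dest);

        // The store.parquet.block-size value (128 MB here) would come from Drill's
        // session/system options; hard-coded only for illustration. It could also
        // be rounded up to a multiple of hdfsBlockSize if desired.
        long parquetBlockSize = 128L * 1024 * 1024;

        // The FileSystem.create overload linked in the thread lets the caller pick
        // the block size per file, so the parquet block size can be used directly.
        FSDataOutputStream out = fs.create(
            dest,
            true,                                        // overwrite
            conf.getInt("io.file.buffer.size", 4096),    // buffer size
            fs.getDefaultReplication(dest),              // replication
            parquetBlockSize);                           // per-file block size
        out.close();

        System.out.println("filesystem default block size: " + hdfsBlockSize
            + ", requested block size: " + parquetBlockSize);
      }
    }
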

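And a rough sketch of the Configuration-property idea from François's note, assuming
fs.defaultFS points at an actual HDFS cluster (dfs.blocksize is an HDFS client property
and has no effect on the local filesystem). Whether ParquetFileWriter's internal create()
honors the property is exactly the open question above, so the constructor call is shown
only as a comment.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockSizeViaConfSketch {
      public static void main(String[] args) throws IOException {
        // Placeholder for the configured store.parquet.block-size (128 MB).
        long parquetBlockSize = 128L * 1024 * 1024;

        // Set the client-side default block size before the writer's FileSystem is
        // created. "dfs.blocksize" is the standard HDFS client property (older name:
        // "dfs.block.size").
        Configuration conf = new Configuration();
        conf.setLong("dfs.blocksize", parquetBlockSize);

        // A DistributedFileSystem built from this Configuration reports the new
        // default, so a create(Path) call that does not pass an explicit block size
        // should use it. newInstance() bypasses the FileSystem cache so the modified
        // conf is actually applied.
        FileSystem fs = FileSystem.newInstance(conf);
        System.out.println("default block size now: "
            + fs.getDefaultBlockSize(new Path("/tmp/ctas_output")));

        // In Drill, this conf is what ParquetRecordWriter would hand to the
        // ParquetFileWriter constructor, roughly:
        //   new ParquetFileWriter(conf, schema, path);
        // Whether that writer's internal create() picks up the property, rather than
        // passing its own explicit block size, is what needs to be verified.
        fs.close();
      }
    }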