[ https://issues.apache.org/jira/browse/DRILL-5379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16014792#comment-16014792 ]

ASF GitHub Bot commented on DRILL-5379:
---------------------------------------

Github user ppadma commented on a diff in the pull request:

    https://github.com/apache/drill/pull/826#discussion_r117112378
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetRecordWriter.java ---
    @@ -380,14 +384,21 @@ public void endRecord() throws IOException {
     
           // since ParquetFileWriter will overwrite empty output file (append is not supported)
           // we need to re-apply file permission
    -      parquetFileWriter = new ParquetFileWriter(conf, schema, path, ParquetFileWriter.Mode.OVERWRITE);
    +      if (useConfiguredBlockSize) {
    +        // Round up blockSize to multiple of 64K.
    +        long writeBlockSize = ((long) ceil((double)blockSize/BLOCKSIZE_MULTIPLE)) * BLOCKSIZE_MULTIPLE;
    +        // Passing writeBlockSize creates files with this blockSize instead of filesystem default blockSize.
    +        // Currently, this is supported only by filesystems included in
    +        // BLOCK_FS_SCHEMES (ParquetFileWriter.java in parquet-mr), which includes HDFS.
    +        // For other filesystems, it uses the default blockSize configured for the file system.
    +        parquetFileWriter = new ParquetFileWriter(conf, schema, path, ParquetFileWriter.Mode.OVERWRITE, writeBlockSize, 0);
    +      } else {
    +        parquetFileWriter = new ParquetFileWriter(conf, schema, path, ParquetFileWriter.Mode.OVERWRITE);
    --- End diff ---
    
    No, it does not. If the user had previously changed the Parquet block size, 
    that value is still used in checkBlockSizeReached as before, so the Parquet 
    row group size will match the configured Parquet block size (not the default). 
    This option only changes how that Parquet row group is written onto disk, 
    i.e., whether it lands in a single HDFS block or not.
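
    For illustration, a minimal sketch of the rounding the diff above performs.
    BLOCKSIZE_MULTIPLE is assumed here to be 64 * 1024, matching the "64K" named
    in the diff's comment, and the helper name is hypothetical:

    import static java.lang.Math.ceil;

    public class BlockSizeRounding {
      // Assumed value; the diff's comment says "multiple of 64K".
      private static final long BLOCKSIZE_MULTIPLE = 64L * 1024;

      // Hypothetical helper mirroring the expression in the diff:
      // rounds blockSize up to the nearest multiple of BLOCKSIZE_MULTIPLE.
      static long roundedWriteBlockSize(long blockSize) {
        return ((long) ceil((double) blockSize / BLOCKSIZE_MULTIPLE)) * BLOCKSIZE_MULTIPLE;
      }

      public static void main(String[] args) {
        // A 512 MB block size is already 64K-aligned and passes through unchanged;
        // one byte more rounds up to the next 64K boundary.
        System.out.println(roundedWriteBlockSize(512L * 1024 * 1024));     // 536870912
        System.out.println(roundedWriteBlockSize(512L * 1024 * 1024 + 1)); // 536936448
      }
    }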


> Set Hdfs Block Size based on Parquet Block Size
> -----------------------------------------------
>
>                 Key: DRILL-5379
>                 URL: https://issues.apache.org/jira/browse/DRILL-5379
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Storage - Parquet
>    Affects Versions: 1.9.0
>            Reporter: F Méthot
>             Fix For: Future
>
>
> It seems there is a way to force Drill to store a CTAS-generated Parquet file as 
> a single block when using HDFS. The Java HDFS API allows this: files could be 
> created with the Parquet block size set in a session or system config.
> It is ideal to have a single Parquet file per HDFS block.
> Here is the HDFS API that allows this:
> http://archive.cloudera.com/cdh4/cdh/4/hadoop/api/org/apache/hadoop/fs/FileSystem.html#create(org.apache.hadoop.fs.Path,%20boolean,%20int,%20short,%20long)
> Drill uses the Hadoop ParquetFileWriter 
> (https://github.com/Parquet/parquet-mr/blob/master/parquet-hadoop/src/main/java/parquet/hadoop/ParquetFileWriter.java).
> This is where the file creation occurs, so it might be tricky.
> However, ParquetRecordWriter.java 
> (https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetRecordWriter.java)
> in Drill creates the ParquetFileWriter with a Hadoop Configuration object.
> Something to explore: could the block size be set as a property on the 
> Configuration object before passing it to the ParquetFileWriter constructor? 
> (See the sketch below.)
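
A minimal sketch of the two approaches the description points at, assuming a
Hadoop Configuration is already at hand. The path, bufferSize (4096), and
replication (3) values are illustrative, not Drill's defaults:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SingleBlockCreate {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Approach 1 (the "something to explore" above): set the block size
        // on the Configuration before it reaches ParquetFileWriter.
        // "dfs.blocksize" is the standard HDFS property name.
        long parquetBlockSize = 512L * 1024 * 1024; // e.g. store.parquet.block-size
        conf.setLong("dfs.blocksize", parquetBlockSize);

        // Approach 2: create the file with an explicit block size via the
        // FileSystem.create overload linked above.
        FileSystem fs = FileSystem.get(conf);
        try (FSDataOutputStream out = fs.create(
            new Path("/tmp/example.parquet"), /* overwrite */ true,
            /* bufferSize */ 4096, /* replication */ (short) 3,
            /* blockSize */ parquetBlockSize)) {
          // ... write the Parquet bytes; the file occupies a single HDFS block
          // as long as its size stays within parquetBlockSize.
        }
      }
    }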



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
