[jira] [Commented] (DRILL-5379) Set Hdfs Block Size based on Parquet Block Size

ASF GitHub Bot (JIRA) Wed, 17 May 2017 12:53:17 -0700

    [ 
https://issues.apache.org/jira/browse/DRILL-5379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16014682#comment-16014682
 ]


ASF GitHub Bot commented on DRILL-5379:
---------------------------------------

Github user parthchandra commented on a diff in the pull request:

    https://github.com/apache/drill/pull/826#discussion_r117076766
  
    --- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/ExecConstants.java ---
    @@ -160,6 +160,9 @@
       OptionValidator OUTPUT_FORMAT_VALIDATOR = new 
StringValidator(OUTPUT_FORMAT_OPTION, "parquet");
       String PARQUET_BLOCK_SIZE = "store.parquet.block-size";
       OptionValidator PARQUET_BLOCK_SIZE_VALIDATOR = new 
LongValidator(PARQUET_BLOCK_SIZE, 512*1024*1024);
    +  String PARQUET_WRITER_USE_CONFIGURED_BLOCK_SIZE = 
"store.parquet.writer.use_configured_block-size";
    --- End diff --
    
    This could be better named. The Parquet writer already uses the configured 
block size (the Parquet block size). You really want an option to create the 
file with the same HDFS block size as the Parquet row group for performance 
reasons. Why not name it to indicate that?  


> Set Hdfs Block Size based on Parquet Block Size
> -----------------------------------------------
>
>                 Key: DRILL-5379
>                 URL: https://issues.apache.org/jira/browse/DRILL-5379
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Storage - Parquet
>    Affects Versions: 1.9.0
>            Reporter: F Méthot
>             Fix For: Future
>
>
> It seems there a way to force Drill to store CTAS generated parquet file as a 
> single block when using HDFS. Java HDFS API allows to do that, files could be 
> created with the Parquet block-size set in a session or system config.
> Since it is ideal  to have single parquet file per hdfs block.
> Here is the HDFS API that allow to do that:
> http://archive.cloudera.com/cdh4/cdh/4/hadoop/api/org/apache/hadoop/fs/FileSystem.html#create(org.apache.hadoop.fs.Path,%20boolean,%20int,%20short,%20long)
> http://archive.cloudera.com/cdh4/cdh/4/hadoop/api/org/apache/hadoop/fs/FileSystem.html#create(org.apache.hadoop.fs.Path,%20boolean,%20int,%20short,%20long)
> Drill uses the hadoop ParquetFileWriter 
> (https://github.com/Parquet/parquet-mr/blob/master/parquet-hadoop/src/main/java/parquet/hadoop/ParquetFileWriter.java).
> This is where the file creation occurs so it might be tricky.
> However, ParquetRecordWriter.java 
> (https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetRecordWriter.java)
>  in Drill creates the ParquetFileWriter with an hadoop configuration object.
> something to explore: Could the block size be set as a property within the 
> Configuration object before passing it to ParquetFileWriter constructor?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (DRILL-5379) Set Hdfs Block Size based on Parquet Block Size

Reply via email to