[ https://issues.apache.org/jira/browse/DRILL-5379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16013293#comment-16013293 ]

ASF GitHub Bot commented on DRILL-5379:
---------------------------------------

Github user parthchandra commented on a diff in the pull request:

    https://github.com/apache/drill/pull/826#discussion_r116886162
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetRecordWriter.java ---
    @@ -380,14 +384,21 @@ public void endRecord() throws IOException {
     
      // since ParquetFileWriter will overwrite empty output file (append is not supported)
           // we need to re-apply file permission
    -      parquetFileWriter = new ParquetFileWriter(conf, schema, path, ParquetFileWriter.Mode.OVERWRITE);
    +      if (useConfiguredBlockSize) {
    --- End diff --
    
    The API `ParquetFileWriter(conf, schema, path, ParquetFileWriter.Mode.OVERWRITE)` will cause the Parquet file writer to set the file block size to the greater of the configured file system block size or 128 MB (the ParquetWriter's row group size).
    Drill's Parquet writer uses the block size specified in Drill's options to start a new Parquet row group when that limit is reached (see `ParquetRecordWriter.checkBlockSizeReached()`). If you set Drill's Parquet block size to the larger of the configured file system block size or 128 MB, the row group will match the HDFS block size, which is what the current code does.
    Isn't this what the original JIRA wanted?



> Set Hdfs Block Size based on Parquet Block Size
> -----------------------------------------------
>
>                 Key: DRILL-5379
>                 URL: https://issues.apache.org/jira/browse/DRILL-5379
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Storage - Parquet
>    Affects Versions: 1.9.0
>            Reporter: F Méthot
>             Fix For: Future
>
>
> It seems there is a way to force Drill to store a CTAS-generated Parquet file as a
> single block when using HDFS. The Java HDFS API allows this: files can be
> created with the Parquet block size set in a session or system config.
> It is ideal to have a single Parquet file per HDFS block.
> Here is the HDFS API that allows this:
> http://archive.cloudera.com/cdh4/cdh/4/hadoop/api/org/apache/hadoop/fs/FileSystem.html#create(org.apache.hadoop.fs.Path,%20boolean,%20int,%20short,%20long)
> Drill uses the Hadoop ParquetFileWriter
> (https://github.com/Parquet/parquet-mr/blob/master/parquet-hadoop/src/main/java/parquet/hadoop/ParquetFileWriter.java).
> This is where the file creation occurs, so it might be tricky.
> However, ParquetRecordWriter.java
> (https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetRecordWriter.java)
> in Drill creates the ParquetFileWriter with a Hadoop Configuration object.
> Something to explore: could the block size be set as a property on the
> Configuration object before passing it to the ParquetFileWriter constructor?
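
A minimal sketch of the two approaches mentioned in the ticket (not part of the original report; the 512 MB value is arbitrary, and whether ParquetFileWriter actually honors dfs.blocksize from the Configuration it is given would need to be verified against the Parquet version Drill ships):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockSizeSketch {
      public static void main(String[] args) throws Exception {
        long parquetBlockSize = 512L * 1024 * 1024;  // e.g. taken from a Drill session/system option
        Path path = new Path(args[0]);
        Configuration conf = new Configuration();

        // Option 1: set the client-side HDFS block size on the Configuration before it
        // is handed to the ParquetFileWriter constructor; files created through this
        // configuration should use this block size.
        conf.setLong("dfs.blocksize", parquetBlockSize);

        // Option 2: create the file explicitly with a per-file block size, using the
        // FileSystem.create(Path, boolean, int, short, long) overload linked above.
        FileSystem fs = path.getFileSystem(conf);
        try (FSDataOutputStream out = fs.create(path, true /* overwrite */,
            4096 /* io buffer size */, fs.getDefaultReplication(path),
            parquetBlockSize /* HDFS block size for this file */)) {
          // ... write the Parquet data through this stream ...
        }
      }
    }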



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
