[jira] [Commented] (DRILL-5379) Set Hdfs Block Size based on Parquet Block Size

ASF GitHub Bot (JIRA) Wed, 28 Jun 2017 22:47:02 -0700

    [ 
https://issues.apache.org/jira/browse/DRILL-5379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16067777#comment-16067777
 ]


ASF GitHub Bot commented on DRILL-5379:
---------------------------------------

Github user ppadma commented on the issue:

    https://github.com/apache/drill/pull/826
  
    @kkhatua HDFS allows specifying a block size at file creation time, which 
overrides the file system's default block size. With this PR, each Parquet file 
can occupy a single HDFS block that is larger than the file system's default 
block size, without changing that default. I know it is confusing, but it is 
possible to create a file with a block size that differs from the file system 
default block size. 
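
The per-file override described above can be sketched with the Hadoop `FileSystem.create(Path, boolean, int, short, long)` overload, whose last argument is the block size. This is a minimal sketch, not the code from the PR: the class name, the helper name, the output path, and the 512 MB figure are assumptions chosen for illustration.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PerFileBlockSize {

    // Hypothetical helper: creates (or overwrites) a file whose block size is
    // passed explicitly instead of being taken from the file system default.
    static FSDataOutputStream createWithBlockSize(FileSystem fs, Path path, long blockSize)
            throws IOException {
        // Reuse the configured I/O buffer size and the default replication;
        // only the block size is overridden for this one file.
        int bufferSize = fs.getConf().getInt("io.file.buffer.size", 4096);
        short replication = fs.getDefaultReplication(path);
        return fs.create(path, true, bufferSize, replication, blockSize);
    }

    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Assumed values: an example path and a 512 MB block size.
        Path path = new Path("/tmp/example.parquet");
        long blockSize = 512L * 1024 * 1024;

        try (FSDataOutputStream out = createWithBlockSize(fs, path, blockSize)) {
            out.writeBytes("example");
        }

        // On HDFS this reports the per-file block size; other file systems
        // (for example the local one) may ignore the requested value.
        System.out.println(fs.getFileStatus(path).getBlockSize());
    }
}
```

The file system default stays untouched here; only the file created through the helper carries the larger block size.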
    
     


> Set Hdfs Block Size based on Parquet Block Size
> -----------------------------------------------
>
>                 Key: DRILL-5379
>                 URL: https://issues.apache.org/jira/browse/DRILL-5379
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Storage - Parquet
>    Affects Versions: 1.9.0
>            Reporter: F Méthot
>            Assignee: Padma Penumarthy
>              Labels: ready-to-commit
>             Fix For: Future, 1.11.0
>
>
> It seems there is a way to force Drill to store a CTAS-generated Parquet file 
> as a single block when using HDFS. The Java HDFS API allows this: files could 
> be created with the HDFS block size set to the Parquet block size configured 
> at the session or system level, since it is ideal to have a single Parquet 
> file per HDFS block.
> Here is the HDFS API that allows this:
> http://archive.cloudera.com/cdh4/cdh/4/hadoop/api/org/apache/hadoop/fs/FileSystem.html#create(org.apache.hadoop.fs.Path,%20boolean,%20int,%20short,%20long)
> Drill uses the Hadoop ParquetFileWriter 
> (https://github.com/Parquet/parquet-mr/blob/master/parquet-hadoop/src/main/java/parquet/hadoop/ParquetFileWriter.java).
> This is where the file creation occurs, so it might be tricky.
> However, ParquetRecordWriter.java 
> (https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetRecordWriter.java)
> in Drill creates the ParquetFileWriter with a Hadoop Configuration object.
> Something to explore: could the block size be set as a property within the 
> Configuration object before passing it to the ParquetFileWriter constructor?
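
The idea raised in the question above could look roughly like the following. This is only a sketch under assumptions: `dfs.blocksize` is the Hadoop 2.x client property for the default block size of new files, the class and helper names are hypothetical, and whether ParquetFileWriter actually honors the value depends on how that version creates the file, which is not verified here.

```java
import org.apache.hadoop.conf.Configuration;

public class BlockSizeInConfiguration {

    // Hypothetical helper: returns a copy of the given configuration with the
    // HDFS client default block size overridden. The original is not modified.
    static Configuration withBlockSize(Configuration base, long blockSizeBytes) {
        Configuration conf = new Configuration(base); // copy constructor
        conf.setLong("dfs.blocksize", blockSizeBytes);
        return conf;
    }

    public static void main(String[] args) {
        // Assumed value: a 512 MB Parquet block size taken from a session option.
        long parquetBlockSize = 512L * 1024 * 1024;
        Configuration conf = withBlockSize(new Configuration(), parquetBlockSize);

        // The adjusted configuration is what would be handed to the writer.
        System.out.println(conf.getLong("dfs.blocksize", -1L));
    }
}
```

One caveat with this approach: `FileSystem.get(conf)` caches instances, so a file system object created earlier with a different configuration may not pick up the new value.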



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
