[ 
https://issues.apache.org/jira/browse/PARQUET-1549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16800659#comment-16800659
 ] 

Luis Fernando Kauer commented on PARQUET-1549:
----------------------------------------------

I am interested in this fix too.
The recommendation to use only one row group per file also applies to other 
software, such as Dremio and Drill:
https://docs.dremio.com/advanced-administration/parquet-files.html
https://drill.apache.org/docs/parquet-format/
I think the option should at least be available to users who need this 
feature.

> Option for one block per file in MapReduce output
> -------------------------------------------------
>
>                 Key: PARQUET-1549
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1549
>             Project: Parquet
>          Issue Type: New Feature
>          Components: parquet-mr
>    Affects Versions: 1.10.0
>            Reporter: Gustavo Figueiredo
>            Priority: Minor
>
> When we create PARQUET files using a MapReduce application with the current 
> ParquetOutputFormat implementation, we don't have any option to reliably 
> limit the number of blocks (row groups) we want to generate per file.
> The implemented configuration option 'parquet.block.size' 
> (ParquetOutputFormat.BLOCK_SIZE) refers to the amount of data that goes into 
> one block of data, but there are no guarantees that this will be the only 
> block in a file. If one sets this configuration option to a very high value, 
> it's likely there will be a single block per PARQUET file. However, this 
> approach might lead to undesirably big files, so this would not be a good 
> option in some scenarios.
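A rough back-of-the-envelope sketch of the point above, assuming a row group is flushed at exactly the configured block size (the real writer flushes based on size estimates, so actual counts can differ). The class and method names are hypothetical and not part of parquet-mr:

```java
class RowGroupEstimate {
    // Estimated number of row groups one writer would put into a single file,
    // assuming every row group is flushed at exactly blockSizeBytes.
    public static long estimateRowGroups(long bytesWritten, long blockSizeBytes) {
        if (blockSizeBytes <= 0) {
            throw new IllegalArgumentException("blockSizeBytes must be positive");
        }
        if (bytesWritten <= 0) {
            return 0;
        }
        // Ceiling division: a partially filled last row group still counts.
        return (bytesWritten + blockSizeBytes - 1) / blockSizeBytes;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // A task writing about 300 MB with a 128 MB block size: several row groups.
        System.out.println(estimateRowGroups(300 * mb, 128 * mb)); // prints 3
        // A very large block size gives one row group, but also one 300 MB file.
        System.out.println(estimateRowGroups(300 * mb, 1024 * mb)); // prints 1
    }
}
```

Under these assumptions, raising the block size is the only way to reach one row group per file, at the cost of the file growing to whatever the task writes.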
> This behaviour can't be achieved by the client's 'mapper' either. Although 
> there are some helpful classes in the Hadoop API, such as 'MultipleOutputs', 
> we don't have enough information available in the 'mapper' code to have 
> this kind of control, unless one uses unsafe 'hacks' to gather information 
> from private fields.
> For instance, suppose we have an ETL application that loads data from HBASE 
> regions (might be one or more MAPs per region) and produces PARQUET files to 
> be consumed in IMPALA tables (might be one or more PARQUET files per MAP 
> task). To simplify, let's say there is no 'REDUCE' task in this application.
> For concreteness, let's say such a job could use 
> 'org.apache.hadoop.hbase.mapreduce.TableInputFormat' as input and 
> 'org.apache.parquet.hadoop.ParquetOutputFormat' as output. 
> Following the guidelines for maximum Impala query performance in the 
> HADOOP ecosystem, each PARQUET file should be approximately equal in size to 
> an HDFS block and should contain a single block of data (row group) 
> (see 
> https://impala.apache.org/docs/build/html/topics/impala_perf_cookbook.html#perf_cookbook__perf_cookbook_parquet_block_size).
> Currently we are only able to do this by trial and error with different 
> configuration options.
> It would be nice to have a new boolean configuration option (let's call it 
> 'parquet.split.file.per.block') related to the existing one 
> 'parquet.block.size'. If it's set to false (default value), we would have the 
> current behaviour. If it's set to true, we would have a separate PARQUET file 
> generated for each 'block' created, all coming from the same 
> ParquetRecordWriter.
> In doing so, we would only have to worry about tuning the 
> 'parquet.block.size' parameter in order to generate PARQUET files with a 
> single block per file whose size is close to the configured HDFS block size.
>  
> In order to implement this new feature, we only need to change a few classes 
> in 'org.apache.parquet.hadoop' package, namely:
>  InternalParquetRecordWriter
>  ParquetFileWriter
>  ParquetOutputFormat
>  ParquetRecordWriter
> Briefly, these are the changes needed:
>  InternalParquetRecordWriter:
>  The field 'ParquetFileWriter parquetFileWriter' should not be 'final' 
> anymore, since we want to be able to change this throughout the task.
>  The method 'checkBlockSizeReached' should call a new function 'startNewFile' 
> just after a call to 'flushRowGroupToStore'.
>  The new method 'startNewFile' should have all the logic for closing the 
> current file and starting a new one at the same location with a proper 
> filename.
>  
>  ParquetFileWriter
>  The constructor argument 'OutputFile file' should be persisted as a new 
> member field and made available by a new public method. This information is 
> useful for the 'startNewFile' implementation mentioned above.
>  The field 'MessageType schema' should be available by a new public method. 
> This information is also useful for the 'startNewFile' implementation.
>  
>  ParquetOutputFormat
>  The existing private method 'getMaxPaddingSize' should be made 'public' or 
> at least package-private. This information is useful for the 
> 'startNewFile' implementation mentioned above.
>  The new configuration option 'parquet.split.file.per.block' should be 
> specified here like the other ones. The new behaviour in 
> 'InternalParquetRecordWriter' is conditioned on this configuration option.
>  
>  ParquetRecordWriter
>  Just pass the configuration option along to the internal 
> InternalParquetRecordWriter instance.
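
The changes described above could be sketched roughly as follows. This is a simplified, self-contained stand-in and not the parquet-mr code: the class SplittingWriterSketch, the byte counting, and the "-N.parquet" file naming are assumptions made for illustration. Only the method names 'checkBlockSizeReached', 'flushRowGroupToStore' and 'startNewFile' and the option 'parquet.split.file.per.block' come from the description.

```java
import java.util.ArrayList;
import java.util.List;

class SplittingWriterSketch {
    private final String baseName;           // hypothetical output name prefix
    private final long blockSizeBytes;       // stands in for 'parquet.block.size'
    private final boolean splitFilePerBlock; // stands in for 'parquet.split.file.per.block'
    private final List<String> fileNames = new ArrayList<>();
    private final List<Integer> rowGroupsPerFile = new ArrayList<>();
    private long bufferedBytes = 0;

    SplittingWriterSketch(String baseName, long blockSizeBytes, boolean splitFilePerBlock) {
        this.baseName = baseName;
        this.blockSizeBytes = blockSizeBytes;
        this.splitFilePerBlock = splitFilePerBlock;
    }

    // Buffer one record of the given size, then check the block size threshold.
    public void write(long recordBytes) {
        bufferedBytes += recordBytes;
        checkBlockSizeReached();
    }

    private void checkBlockSizeReached() {
        if (bufferedBytes >= blockSizeBytes) {
            flushRowGroupToStore();
        }
    }

    // Open the next file at the same location with a numbered name.
    private void startNewFile() {
        fileNames.add(baseName + "-" + fileNames.size() + ".parquet");
        rowGroupsPerFile.add(0);
    }

    // Write the buffered data as one row group, rolling to a new file first
    // when splitting is enabled and the current file already holds a row group.
    private void flushRowGroupToStore() {
        if (bufferedBytes == 0) {
            return; // nothing buffered, nothing to flush
        }
        boolean noOpenFile = fileNames.isEmpty();
        boolean currentFileHasRowGroup =
            !noOpenFile && rowGroupsPerFile.get(rowGroupsPerFile.size() - 1) > 0;
        if (noOpenFile || (splitFilePerBlock && currentFileHasRowGroup)) {
            startNewFile();
        }
        int last = rowGroupsPerFile.size() - 1;
        rowGroupsPerFile.set(last, rowGroupsPerFile.get(last) + 1);
        bufferedBytes = 0;
    }

    // Flush whatever is still buffered as a final row group.
    public void close() {
        flushRowGroupToStore();
    }

    public List<String> fileNames() {
        return fileNames;
    }

    public List<Integer> rowGroupsPerFile() {
        return rowGroupsPerFile;
    }

    // Writes ten 40-byte records with a 100-byte block size.
    public static List<Integer> run(boolean splitFilePerBlock) {
        SplittingWriterSketch writer =
            new SplittingWriterSketch("part-m-00000", 100, splitFilePerBlock);
        for (int i = 0; i < 10; i++) {
            writer.write(40);
        }
        writer.close();
        return writer.rowGroupsPerFile();
    }

    public static void main(String[] args) {
        System.out.println(run(false)); // prints [4]
        System.out.println(run(true));  // prints [1, 1, 1, 1]
    }
}
```

One difference from the description: the sketch opens the next file lazily, when the next row group is flushed, rather than immediately after 'flushRowGroupToStore'. This avoids an empty trailing file when the input ends exactly on a block boundary.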



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
