[
https://issues.apache.org/jira/browse/PARQUET-100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14147229#comment-14147229
]
Tongjie Chen commented on PARQUET-100:
--------------------------------------
In Hive, things are a bit more complicated.
1) Parquet Hive's MapredParquetOutputFormat uses the old mapred API
(org.apache.hadoop.mapred), while ParquetOutputCommitter uses the new mapreduce
API (org.apache.hadoop.mapreduce.lib.output). So some code duplication is
inevitable.
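The mismatch in point 1 can be sketched with simplified stand-ins for the two Hadoop APIs. The context and committer classes below are hypothetical stubs mirroring the shape of the real interfaces, not the actual Hadoop classes:

```java
// Minimal sketch of why the old org.apache.hadoop.mapred and new
// org.apache.hadoop.mapreduce committer APIs force duplication: their
// commitJob methods take unrelated context types, so one implementation
// cannot serve both. All types here are simplified stand-ins.
public class CommitterMismatch {

    // Stand-in for org.apache.hadoop.mapred.JobContext (old API).
    static class OldJobContext {}

    // Stand-in for org.apache.hadoop.mapreduce.JobContext (new API).
    static class NewJobContext {}

    // Old-API committer shape, as MapredParquetOutputFormat would need.
    static abstract class OldOutputCommitter {
        abstract String commitJob(OldJobContext ctx);
    }

    // New-API committer shape, as ParquetOutputCommitter follows.
    static abstract class NewOutputCommitter {
        abstract String commitJob(NewJobContext ctx);
    }

    // The footer-merging step must be duplicated (or factored into a
    // shared helper) because neither committer can subclass the other.
    static class OldParquetCommitter extends OldOutputCommitter {
        String commitJob(OldJobContext ctx) { return writeMetadataFile(); }
    }

    static class NewParquetCommitter extends NewOutputCommitter {
        String commitJob(NewJobContext ctx) { return writeMetadataFile(); }
    }

    static String writeMetadataFile() {
        return "merging footers into _metadata";
    }

    public static void main(String[] args) {
        // Both code paths end up invoking the same merge logic.
        System.out.println(new OldParquetCommitter().commitJob(new OldJobContext()));
        System.out.println(new NewParquetCommitter().commitJob(new NewJobContext()));
    }
}
```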
2) In Hive, for partitioned tables, the _metadata file cannot be generated in
the commitTask method: with dynamic partitioning, one partition's files can be
written by many tasks, and one task can write to many partitions. If we
generate it in commitJob instead, that basically just moves the slow part from
read time to write time (commitJob would need to read all the files and merge
their footers, which could be very slow).
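The many-to-many relationship in point 2 can be illustrated with a toy model (the task IDs and partition names below are made up for illustration):

```java
import java.util.*;

// Toy model of dynamic partitioning: each task writes files into
// several partitions, and each partition receives files from several
// tasks. A per-task commit therefore never sees a complete partition,
// which is why _metadata cannot be finalized in commitTask.
public class DynamicPartitioning {

    // Hypothetical task -> partitions-written mapping.
    static final Map<String, List<String>> TASK_WRITES = Map.of(
        "task_0", List.of("ds=2014-09-01", "ds=2014-09-02"),
        "task_1", List.of("ds=2014-09-01", "ds=2014-09-03"));

    // Invert the mapping: partition -> set of tasks that wrote into it.
    static Map<String, Set<String>> writersPerPartition() {
        Map<String, Set<String>> writers = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : TASK_WRITES.entrySet())
            for (String part : e.getValue())
                writers.computeIfAbsent(part, p -> new TreeSet<>())
                       .add(e.getKey());
        return writers;
    }

    public static void main(String[] args) {
        // ds=2014-09-01 has two writers, so neither task's commitTask
        // could write that partition's _metadata on its own.
        writersPerPartition().forEach((part, tasks) ->
            System.out.println(part + " written by " + tasks));
    }
}
```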
3) A Hive "insert into" statement would make the existing _metadata file stale
unless it is regenerated.
> provide an option in parquet-pig to avoid reading footers in client side
> ------------------------------------------------------------------------
>
> Key: PARQUET-100
> URL: https://issues.apache.org/jira/browse/PARQUET-100
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-mr
> Affects Versions: parquet-mr_1.6.0
> Reporter: Tongjie Chen
>
> Parquet Pig reads footers on the client side, to calculate splits and
> retrieve the schema etc.
> In an HCatalog environment, if there is a large number of files generated by
> Hive, Parquet-Pig will spend a significant chunk of time processing those
> footers on the client side (before the job is submitted to the cluster).
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)