cdmikechen opened a new pull request #1122: [HUDI-29]: Support hudi COW table to use *ANALYZE TABLE table_name COMPUTE STATISTICS* to get table current rows
URL: https://github.com/apache/incubator-hudi/pull/1122

link https://issues.apache.org/jira/projects/HUDI/issues/HUDI-29

## *Tips*
- *Thank you very much for contributing to Apache Hudi.*
- *Please review https://hudi.apache.org/contributing.html before opening a pull request.*

## What is the purpose of the pull request

When `ANALYZE TABLE table_name COMPUTE STATISTICS` is used to get the row count of a Hudi table, Hive collects every parquet file under the table path, so superseded file versions are counted as well. This change lets the Hudi table identify which files are the latest Hudi files, so that Hive gets the correct result for its stats.

## Brief change log

```shell
 hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieParquetInputFormat.java | 56 +++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/hive/NoneParquetRecordReaderWrapper.java | 69 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 124 insertions(+), 1 deletion(-)
 create mode 100644 hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/hive/NoneParquetRecordReaderWrapper.java
```

## Verify this pull request

Tests are in `org.apache.hudi.hadoop.TestHoodieInputFormat`, and the build was run with `mvn clean package -DskipTests -DskipITs`.

I have a Hudi COW table with 750 rows that has been updated several times.

```shell
hudi->connect --path /hive/warehouse/lims.db/lims_method
19/12/23 10:09:10 INFO table.HoodieTableMetaClient: Loading HoodieTableMetaClient from /hive/warehouse/lims.db/lims_method
19/12/23 10:09:10 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
19/12/23 10:09:11 INFO util.FSUtils: Hadoop Configuration: fs.defaultFS: [hdfs://bdcluster1:9000/], Config:[Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml], FileSystem: [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_1269021288_12, ugi=hdfs (auth:SIMPLE)]]]
19/12/23 10:09:11 INFO table.HoodieTableConfig: Loading dataset properties from /hive/warehouse/lims.db/lims_method/.hoodie/hoodie.properties
19/12/23 10:09:11 INFO table.HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE(version=org.apache.hudi.common.model.TimelineLayoutVersion@20) from /hive/warehouse/lims.db/lims_method
Metadata for table lims_method loaded
hudi:lims_method->commits show
19/12/23 10:09:22 INFO timeline.HoodieActiveTimeline: Loaded instants [[20190801100644__clean__COMPLETED], [20190801100644__commit__COMPLETED], [20190807152831__clean__COMPLETED], [20190807152831__commit__COMPLETED], [20190807153023__clean__COMPLETED], [20190807153023__commit__COMPLETED], [20190808160401__clean__COMPLETED], [20190808160401__commit__COMPLETED], [20190924090925__clean__COMPLETED], [20190924090925__commit__COMPLETED], [20190924092639__clean__COMPLETED], [20190924092639__commit__COMPLETED], [20191104150324__clean__COMPLETED], [20191104150324__commit__COMPLETED], [20191104150629__clean__COMPLETED], [20191104150629__commit__COMPLETED], [20191104165039__clean__COMPLETED], [20191104165039__commit__COMPLETED]]
╔════════════════╤═════════════════════╤═══════════════════╤═════════════════════╤══════════════════════════╤═══════════════════════╤══════════════════════════════╤══════════════╗
║ CommitTime     │ Total Bytes Written │ Total Files Added │ Total Files Updated │ Total Partitions Written │ Total Records Written │ Total Update Records Written │ Total Errors ║
╠════════════════╪═════════════════════╪═══════════════════╪═════════════════════╪══════════════════════════╪═══════════════════════╪══════════════════════════════╪══════════════╣
║ 20191104165039 │ 457.4 KB            │ 0                 │ 1                   │ 1                        │ 750                   │ 1                            │ 0            ║
╟────────────────┼─────────────────────┼───────────────────┼─────────────────────┼──────────────────────────┼───────────────────────┼──────────────────────────────┼──────────────╢
║ 20191104150629 │ 457.4 KB            │ 0                 │ 1                   │ 1                        │ 750                   │ 1                            │ 0            ║
╟────────────────┼─────────────────────┼───────────────────┼─────────────────────┼──────────────────────────┼───────────────────────┼──────────────────────────────┼──────────────╢
║ 20191104150324 │ 457.4 KB            │ 0                 │ 1                   │ 1                        │ 750                   │ 1                            │ 0            ║
╟────────────────┼─────────────────────┼───────────────────┼─────────────────────┼──────────────────────────┼───────────────────────┼──────────────────────────────┼──────────────╢
║ 20190924092639 │ 457.3 KB            │ 0                 │ 1                   │ 1                        │ 750                   │ 2                            │ 0            ║
╟────────────────┼─────────────────────┼───────────────────┼─────────────────────┼──────────────────────────┼───────────────────────┼──────────────────────────────┼──────────────╢
║ 20190924090925 │ 457.3 KB            │ 0                 │ 1                   │ 1                        │ 750                   │ 1                            │ 0            ║
╟────────────────┼─────────────────────┼───────────────────┼─────────────────────┼──────────────────────────┼───────────────────────┼──────────────────────────────┼──────────────╢
║ 20190808160401 │ 457.2 KB            │ 0                 │ 1                   │ 1                        │ 750                   │ 1                            │ 0            ║
╟────────────────┼─────────────────────┼───────────────────┼─────────────────────┼──────────────────────────┼───────────────────────┼──────────────────────────────┼──────────────╢
║ 20190807153023 │ 457.1 KB            │ 0                 │ 1                   │ 1                        │ 750                   │ 1                            │ 0            ║
╟────────────────┼─────────────────────┼───────────────────┼─────────────────────┼──────────────────────────┼───────────────────────┼──────────────────────────────┼──────────────╢
║ 20190807152831 │ 457.1 KB            │ 0                 │ 1                   │ 1                        │ 750                   │ 1                            │ 0            ║
╟────────────────┼─────────────────────┼───────────────────┼─────────────────────┼──────────────────────────┼───────────────────────┼──────────────────────────────┼──────────────╢
║ 20190801100644 │ 457.2 KB            │ 1                 │ 0                   │ 1                        │ 750                   │ 0                            │ 0            ║
╚════════════════╧═════════════════════╧═══════════════════╧═════════════════════╧══════════════════════════╧═══════════════════════╧══════════════════════════════╧══════════════╝
```

I ran the `ANALYZE TABLE` command in Hive beeline (Hive here can run on Tez or MR; I tested both earlier, and this run uses Tez) and then checked the result with `select count(1)`:

```
0: jdbc:hive2://localhost:10000> ANALYZE TABLE lims.lims_method COMPUTE STATISTICS;
No rows affected (4.569 seconds)
0: jdbc:hive2://localhost:10000> select count(1) from lims.lims_method;
+------+
| _c0  |
+------+
| 750  |
+------+
1 row selected (0.632 seconds)
```

## Committer checklist

- [x] Has a corresponding JIRA in PR title & commit
- [x] Commit message is descriptive of the change
- [ ] CI is green
- [ ] Necessary doc changes done or have another open PR
- [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
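
For readers unfamiliar with why the stats go wrong: a COW table keeps older versions of each file next to the newest one, so counting every parquet file under the table path counts the same rows several times. The idea stated in the purpose section, keeping only the latest file per file group, can be sketched as below. This is a hypothetical illustration and not the code in this PR: it assumes base file names follow the `<fileId>_<writeToken>_<commitTime>.parquet` layout with fixed-width commit timestamps, and the names `LatestBaseFileSketch` and `latestPerFileGroup` are invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: pick the newest base file of every file group.
// Assumed name layout: <fileId>_<writeToken>_<commitTime>.parquet
class LatestBaseFileSketch {

    // Everything before the first underscore identifies the file group.
    static String fileId(String name) {
        return name.substring(0, name.indexOf('_'));
    }

    // Everything between the last underscore and the extension is the commit time.
    static String commitTime(String name) {
        return name.substring(name.lastIndexOf('_') + 1, name.lastIndexOf('.'));
    }

    // Returns one file name per file group: the one with the greatest commit time.
    static List<String> latestPerFileGroup(List<String> names) {
        Map<String, String> latest = new TreeMap<>();
        for (String name : names) {
            // Skip anything that does not look like a base file.
            if (!name.endsWith(".parquet") || name.indexOf('_') < 0) {
                continue;
            }
            // Commit times are fixed-width digit strings (assumed), so
            // lexicographic comparison matches chronological order.
            latest.merge(fileId(name), name,
                (a, b) -> commitTime(a).compareTo(commitTime(b)) >= 0 ? a : b);
        }
        return new ArrayList<>(latest.values());
    }

    public static void main(String[] args) {
        // Made-up file names; the commit times reuse instants from the log above.
        List<String> files = Arrays.asList(
            "f1_0-1-1_20190801100644.parquet",
            "f1_0-2-2_20191104165039.parquet",
            "f2_0-1-1_20190807152831.parquet",
            ".hoodie_partition_metadata");
        System.out.println(latestPerFileGroup(files));
    }
}
```

With a filter of this kind applied before stats collection, only one file per file group contributes rows, which is consistent with the count of 750 reported above rather than a multiple of it.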
