xiarixiaoyao commented on a change in pull request #3330:
URL: https://github.com/apache/hudi/pull/3330#discussion_r684762348
##########
File path:
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/SparkRDDWriteClient.java
##########
@@ -365,6 +385,13 @@ private void
completeClustering(HoodieReplaceCommitMetadata metadata, JavaRDD<Wr
}
finalizeWrite(table, clusteringCommitTime, writeStats);
try {
+ // try to save statistics info to hudi
+ if (config.getOptimizeEnableDataSkipping() &&
!config.getOptimizeSortColumns().isEmpty()) {
+ String basePath = table.getMetaClient().getBasePath();
Review comment:
@satishkotha
no, only the optimize operation produces statistics,
and we save that statistics info under the path `.hoodie/.index`, with the
commitTime as the name:
```
/tmp/mytest/.hoodie/.index/20210808123645
```
Here `20210808123645` is the index table name.
If the indexPath has no index table, we save the statistics info directly as a
parquet table named after the commitTime.
If the indexPath has an old index table, we update it with the new statistics
info using a full outer join, then save the merged result into a new parquet
table named after the commitTime.
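To illustrate the merge step above, here is a minimal, hypothetical sketch (not the actual Hudi code; file names and column names are made up) of full-outer-join semantics over per-file statistics, where the newer statistics win for files present on both sides:

```python
# Old index table and new statistics from the latest optimize commit.
# Each entry maps a data file to its min/max statistics for column c1.
old_index = {
    "file_a.parquet": {"c1_min": 1, "c1_max": 9},
    "file_b.parquet": {"c1_min": 5, "c1_max": 20},
}
new_stats = {
    "file_b.parquet": {"c1_min": 4, "c1_max": 18},   # rewritten file: new stats win
    "file_c.parquet": {"c1_min": 30, "c1_max": 42},  # newly written file
}

def full_outer_merge(old, new):
    # Full-outer-join semantics: keep every file seen on either side,
    # preferring the newer statistics when a file appears on both.
    merged = {}
    for f in old.keys() | new.keys():
        merged[f] = new.get(f, old.get(f))
    return merged

merged = full_outer_merge(old_index, new_stats)
# merged now contains file_a (old), file_b (updated), and file_c (new)
```

In the real implementation this would be a Spark full outer join between the old index parquet table and the new statistics, written out as a new parquet table named after the commitTime.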
In the HoodieFileIndex, we do data skipping using the latest index table:
filters from the query statement are converted into filters on the index
table, which selects the candidate files, and then the file listing is pruned
accordingly.
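As a rough, hypothetical sketch of that pruning step (again, not the actual Hudi code; the index layout and column names are assumptions), a query predicate like `c1 = 7` becomes a min/max range check against the index table:

```python
# Index table rows: one row per data file, with min/max stats for column c1.
index_table = [
    {"file": "file_a.parquet", "c1_min": 1,  "c1_max": 9},
    {"file": "file_b.parquet", "c1_min": 4,  "c1_max": 18},
    {"file": "file_c.parquet", "c1_min": 30, "c1_max": 42},
]

def candidate_files(index, value):
    # Keep only files whose [min, max] range could contain the value;
    # all other files can be skipped without being read.
    return [row["file"] for row in index
            if row["c1_min"] <= value <= row["c1_max"]]

files = candidate_files(index_table, 7)
# file_c is skipped: its c1 range [30, 42] cannot contain 7
```

The benefit of z-order/Hilbert clustering is precisely that it keeps these per-file min/max ranges narrow across several columns, so this simple check skips many files.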
Of course this method is simple, but it is enough to do data skipping for
z-order/Hilbert optimization. RFC-27 is a promising feature for data skipping,
but it is not yet completed. Once RFC-27 has been completed, I will adapt to
it.
If possible, I would like to participate in the development of RFC-27.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]