Hexiaoqiao commented on code in PR #6384:
URL: https://github.com/apache/hudi/pull/6384#discussion_r946453375
##########
hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java:
##########
@@ -64,19 +64,17 @@
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.regex.Matcher;
-import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;
+import static org.apache.hudi.common.model.HoodieLogFile.LOG_FILE_PATTERN;
+
/**
* Utility functions related to accessing the file storage.
*/
public class FSUtils {
private static final Logger LOG = LogManager.getLogger(FSUtils.class);
- // Log files are of this pattern - .b5068208-e1a4-11e6-bf01-fe55135034f3_20170101134598.log.1
- private static final Pattern LOG_FILE_PATTERN =
- Pattern.compile("\\.(.*)_(.*)\\.(.*)\\.([0-9]*)(_(([0-9]*)-([0-9]*)-([0-9]*)))?");
private static final String LOG_FILE_PREFIX = ".";
Review Comment:
@danny0405 Hi, the original case is described in
https://issues.apache.org/jira/browse/HUDI-4613. Here is a short summary.
a. Building FileGroups is invoked frequently by both the write and read paths;
it groups base files and log files into the proper file slices.
(https://github.com/apache/hudi/blob/af9f09047d7324269cb00a7cb2547a6461c178f6/hudi-common/src/main/java/org/apache/hudi/common/table/view/AbstractTableFileSystemView.java#L166)
b. FileSlice uses a `TreeSet` to manage log files.
c. Adding one log file through
org.apache.hudi.common.model.FileSlice#addLogFile therefore has to order it
against the existing entries, and that ordering is decided by
org.apache.hudi.common.model.HoodieLogFile.LogFileComparator#compare.
d. In the bad case, adding a single log file currently runs the regex parsing
logic 6 times.
e. Regex parsing is relatively slow. As @ThinkerLei said above, it took over
30,000 ms to resolve 60,000 files and group them.
f. Back to this PR: it proposes parsing all of this information once, when the
HoodieLogFile object is initialized, so that later calls do not have to parse
the name again. The core improvement is in
[hudi-common/src/main/java/org/apache/hudi/common/model/HoodieLogFile.java](https://github.com/apache/hudi/pull/6384/files/be94781340ba821d5de240c1a4eed249efa2e0db#diff-e9106a31964518eb378948a0446d3e602cca7f2de09514203cd3bb9e7ce22ebb).
I think it is a good example of trading space for time.
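To illustrate points c to f, here is a minimal sketch of the idea, not the actual change in this PR. `ParsedLogFile`, `BY_CACHED_FIELDS` and `parseCount` are hypothetical names, and the comparator is a simplified assumption rather than Hudi's real `LogFileComparator`. Only the regex is copied from the diff above.

```java
import java.util.Comparator;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LogFileSketch {
    // Same pattern that the diff above removes from FSUtils.
    static final Pattern LOG_FILE_PATTERN =
        Pattern.compile("\\.(.*)_(.*)\\.(.*)\\.([0-9]*)(_(([0-9]*)-([0-9]*)-([0-9]*)))?");

    // Illustration only: counts how many times the regex has been run.
    static int parseCount = 0;

    /** Hypothetical stand-in for HoodieLogFile: the name is parsed once, in the constructor. */
    static final class ParsedLogFile {
        final String fileName;
        final String fileId;
        final String baseCommitTime;
        final int logVersion;

        ParsedLogFile(String fileName) {
            Matcher m = LOG_FILE_PATTERN.matcher(fileName);
            parseCount++;
            if (!m.matches()) {
                throw new IllegalArgumentException("Not a log file name: " + fileName);
            }
            this.fileName = fileName;
            this.fileId = m.group(1);
            this.baseCommitTime = m.group(2);
            // Throws NumberFormatException if the version part is empty.
            this.logVersion = Integer.parseInt(m.group(4));
        }
    }

    // Simplified ordering (an assumption): base commit time, then log version.
    // It reads cached fields only, so a comparison never touches the regex.
    static final Comparator<ParsedLogFile> BY_CACHED_FIELDS =
        Comparator.comparing((ParsedLogFile f) -> f.baseCommitTime)
                  .thenComparingInt(f -> f.logVersion);

    public static void main(String[] args) {
        TreeSet<ParsedLogFile> logFiles = new TreeSet<>(BY_CACHED_FIELDS);
        logFiles.add(new ParsedLogFile(".b5068208-e1a4-11e6-bf01-fe55135034f3_20170101134598.log.2"));
        logFiles.add(new ParsedLogFile(".b5068208-e1a4-11e6-bf01-fe55135034f3_20170101134598.log.1"));
        System.out.println("first = " + logFiles.first().fileName + ", regex runs = " + parseCount);
    }
}
```

With this shape the number of regex runs equals the number of log files, no matter how many comparisons the `TreeSet` performs on insert. The cost is a few extra fields per object.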
Thanks.