Hexiaoqiao commented on code in PR #6384:
URL: https://github.com/apache/hudi/pull/6384#discussion_r946453375


##########
hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java:
##########
@@ -64,19 +64,17 @@
 import java.util.function.Function;
 import java.util.function.Predicate;
 import java.util.regex.Matcher;
-import java.util.regex.Pattern;
 import java.util.stream.Collectors;
 import java.util.stream.Stream;
 
+import static org.apache.hudi.common.model.HoodieLogFile.LOG_FILE_PATTERN;
+
 /**
  * Utility functions related to accessing the file storage.
  */
 public class FSUtils {
 
   private static final Logger LOG = LogManager.getLogger(FSUtils.class);
-  // Log files are of this pattern - .b5068208-e1a4-11e6-bf01-fe55135034f3_20170101134598.log.1
-  private static final Pattern LOG_FILE_PATTERN =
-      Pattern.compile("\\.(.*)_(.*)\\.(.*)\\.([0-9]*)(_(([0-9]*)-([0-9]*)-([0-9]*)))?");
   private static final String LOG_FILE_PREFIX = ".";

Review Comment:
   @danny0405 Hi, the original case is described in 
https://issues.apache.org/jira/browse/HUDI-4613. Here is a short summary.
   a. Building file groups 
(https://github.com/apache/hudi/blob/af9f09047d7324269cb00a7cb2547a6461c178f6/hudi-common/src/main/java/org/apache/hudi/common/table/view/AbstractTableFileSystemView.java#L166) 
is invoked frequently by both the write and read paths; it groups base files 
and log files into the proper file slice.
   b. `FileSlice` uses a `TreeSet` to manage its log files.
   c. Adding a log file via 
org.apache.hudi.common.model.FileSlice#addLogFile therefore has to order it 
against the existing entries, and that ordering is decided by 
org.apache.hudi.common.model.HoodieLogFile.LogFileComparator#compare.
   d. In the worst case, adding a single log file currently runs the regex 
parsing logic 6 times.
   e. Regex parsing is relatively slow. As @ThinkerLei said above, it took 
over 30,000 ms to parse and group 60,000 files.
   f. This PR parses all of this information once, when the HoodieLogFile 
object is initialized, so that later calls do not have to parse it again. The 
core improvement is in 
[hudi-common/src/main/java/org/apache/hudi/common/model/HoodieLogFile.java](https://github.com/apache/hudi/pull/6384/files/be94781340ba821d5de240c1a4eed249efa2e0db#diff-e9106a31964518eb378948a0446d3e602cca7f2de09514203cd3bb9e7ce22ebb).
 I think it is a good example of trading space for time.
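
To make point (f) concrete, here is a minimal sketch of the parse-once idea. It is not the code from this PR: `ParsedLogFile` and `BY_COMMIT_THEN_VERSION` are hypothetical names, and the ordering is a simplified assumption rather than the real `LogFileComparator` logic. The only thing taken from the source is the regex, copied from the pattern removed in the diff above.

```java
import java.util.Comparator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class for illustration only; not the actual HoodieLogFile.
final class ParsedLogFile {
  // Same pattern as the one removed from FSUtils in the diff above.
  static final Pattern LOG_FILE_PATTERN =
      Pattern.compile("\\.(.*)_(.*)\\.(.*)\\.([0-9]*)(_(([0-9]*)-([0-9]*)-([0-9]*)))?");

  final String fileName;
  final String fileId;
  final String baseCommitTime;
  final int version;

  // The regex runs exactly once, here. Every later read is a plain field access.
  ParsedLogFile(String fileName) {
    Matcher m = LOG_FILE_PATTERN.matcher(fileName);
    if (!m.matches()) {
      throw new IllegalArgumentException("Not a log file name: " + fileName);
    }
    this.fileName = fileName;
    this.fileId = m.group(1);
    this.baseCommitTime = m.group(2);
    // Throws NumberFormatException if the version group is empty.
    this.version = Integer.parseInt(m.group(4));
  }

  // Simplified ordering (an assumption): base commit time, then version,
  // then file name as a tie-breaker. No regex work happens during comparison.
  static final Comparator<ParsedLogFile> BY_COMMIT_THEN_VERSION =
      Comparator.comparing((ParsedLogFile f) -> f.baseCommitTime)
          .thenComparingInt(f -> f.version)
          .thenComparing(f -> f.fileName);
}
```

With a comparator like this, a `TreeSet` insert still performs several comparisons per added file, but each comparison only reads cached fields, so the regex cost is paid once per file rather than once per comparison. The cost is a few extra fields held on each object.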
   Thanks.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

