vinothchandar commented on a change in pull request #689: [HUDI-25] Optimize 
HoodieInputFormat.listStatus for faster Hive Incremental queries
URL: https://github.com/apache/incubator-hudi/pull/689#discussion_r288608448
 
 

 ##########
 File path: 
hoodie-hadoop-mr/src/main/java/com/uber/hoodie/hadoop/HoodieInputFormat.java
 ##########
 @@ -143,29 +137,32 @@ private HoodieDataFile checkFileStatus(HoodieDataFile 
dataFile) throws IOExcepti
     }
   }
 
-  private Map<HoodieTableMetaClient, List<FileStatus>> 
groupFileStatus(FileStatus[] fileStatuses)
-      throws IOException {
+  /**
+   * Takes in a list of filesStatus and a list of table metadatas. Groups the 
files status list
+   * based on given table metadata.
+   * @param fileStatuses
+   * @param metaClientList
+   * @return
+   * @throws IOException
+   */
+  private Map<HoodieTableMetaClient, List<FileStatus>> 
groupFileStatusForNonIncrementalPaths(
+      FileStatus[] fileStatuses, Collection<HoodieTableMetaClient> 
metaClientList) throws IOException {
     // This assumes the paths for different tables are grouped together
     Map<HoodieTableMetaClient, List<FileStatus>> grouped = new HashMap<>();
     HoodieTableMetaClient metadata = null;
-    String nonHoodieBasePath = null;
     for (FileStatus status : fileStatuses) {
       if (!status.getPath().getName().endsWith(".parquet")) {
         //FIXME(vc): skip non parquet files for now. This wont be needed once 
log file name start
         // with "."
         continue;
       }
-      if ((metadata == null && nonHoodieBasePath == null) || (metadata == null 
&& !status.getPath()
-          .toString().contains(nonHoodieBasePath)) || (metadata != null && 
!status.getPath()
-          .toString().contains(metadata.getBasePath()))) {
-        try {
-          metadata = getTableMetaClient(status.getPath().getFileSystem(conf),
-              status.getPath().getParent());
-          nonHoodieBasePath = null;
-        } catch (DatasetNotFoundException | InvalidDatasetException e) {
-          LOG.info("Handling a non-hoodie path " + status.getPath());
-          metadata = null;
-          nonHoodieBasePath = status.getPath().getParent().toString();
+      if ((metadata == null) || 
(!status.getPath().toString().contains(metadata.getBasePath()))) {
+        for (HoodieTableMetaClient metaClient : metaClientList) {
+          if (!status.getPath().toString().contains(metaClient.getBasePath())) 
{
 
 Review comment:
   I know this predates you or this diff. but since you are touching it anyway, 
can we restructure this block to be more readable
   
   ```
   if (status.getPath().toString().contains(metaClient.getBasePath())) {
     metadata = metaClient;
     break;
   }
   ```
   
   also do you think we can simplify `(metadata == null) || 
(!status.getPath().toString().contains(metadata.getBasePath())))`. All this 
method now does is to check if status matches any of the base paths we already 
classified..?
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

Reply via email to