bhasudha commented on a change in pull request #689: [HUDI-25] Optimize 
HoodieInputFormat.listStatus for faster Hive Incremental queries
URL: https://github.com/apache/incubator-hudi/pull/689#discussion_r291478871
 
 

 ##########
 File path: 
hoodie-hadoop-mr/src/main/java/com/uber/hoodie/hadoop/HoodieInputFormat.java
 ##########
 @@ -143,29 +137,32 @@ private HoodieDataFile checkFileStatus(HoodieDataFile 
dataFile) throws IOExcepti
     }
   }
 
-  private Map<HoodieTableMetaClient, List<FileStatus>> 
groupFileStatus(FileStatus[] fileStatuses)
-      throws IOException {
+  /**
+   * Takes in a list of filesStatus and a list of table metadatas. Groups the 
files status list
+   * based on given table metadata.
+   * @param fileStatuses
+   * @param metaClientList
+   * @return
+   * @throws IOException
+   */
+  private Map<HoodieTableMetaClient, List<FileStatus>> 
groupFileStatusForNonIncrementalPaths(
+      FileStatus[] fileStatuses, Collection<HoodieTableMetaClient> 
metaClientList) throws IOException {
     // This assumes the paths for different tables are grouped together
     Map<HoodieTableMetaClient, List<FileStatus>> grouped = new HashMap<>();
     HoodieTableMetaClient metadata = null;
-    String nonHoodieBasePath = null;
     for (FileStatus status : fileStatuses) {
       if (!status.getPath().getName().endsWith(".parquet")) {
         //FIXME(vc): skip non parquet files for now. This wont be needed once 
log file name start
         // with "."
         continue;
       }
-      if ((metadata == null && nonHoodieBasePath == null) || (metadata == null 
&& !status.getPath()
-          .toString().contains(nonHoodieBasePath)) || (metadata != null && 
!status.getPath()
-          .toString().contains(metadata.getBasePath()))) {
-        try {
-          metadata = getTableMetaClient(status.getPath().getFileSystem(conf),
-              status.getPath().getParent());
-          nonHoodieBasePath = null;
-        } catch (DatasetNotFoundException | InvalidDatasetException e) {
-          LOG.info("Handling a non-hoodie path " + status.getPath());
-          metadata = null;
-          nonHoodieBasePath = status.getPath().getParent().toString();
+      if ((metadata == null) || 
(!status.getPath().toString().contains(metadata.getBasePath()))) {
+        for (HoodieTableMetaClient metaClient : metaClientList) {
+          if (!status.getPath().toString().contains(metaClient.getBasePath())) 
{
 
 Review comment:
   Yeah. This method groups FileStatuses belonging to same table into one entry 
in a HashMap.  I believe there is not much room for improvement apart from your 
suggestion. Please let me know if you can think of any.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

Reply via email to