[ https://issues.apache.org/jira/browse/DRILL-4846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15423562#comment-15423562 ]

ASF GitHub Bot commented on DRILL-4846:
---------------------------------------

Github user jinfengni commented on a diff in the pull request:

    https://github.com/apache/drill/pull/569#discussion_r75037246
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java ---
    @@ -470,41 +483,38 @@ private ParquetTableMetadataBase readBlockMeta(String path) throws IOException {
         mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false);
         FSDataInputStream is = fs.open(p);
     
    -    ParquetTableMetadataBase parquetTableMetadata = mapper.readValue(is, ParquetTableMetadataBase.class);
    -    logger.info("Took {} ms to read metadata from cache file", timer.elapsed(TimeUnit.MILLISECONDS));
    -    timer.stop();
    -    if (tableModified(parquetTableMetadata, p)) {
    -      parquetTableMetadata =
    -          (createMetaFilesRecursively(Path.getPathWithoutSchemeAndAuthority(p.getParent()).toString())).getLeft();
    -    }
    -    return parquetTableMetadata;
    -  }
    -
    -  private ParquetTableMetadataDirs readMetadataDirs(String path) throws IOException {
    -    Stopwatch timer = Stopwatch.createStarted();
    -    Path p = new Path(path);
    -    ObjectMapper mapper = new ObjectMapper();
    -
    -    final SimpleModule serialModule = new SimpleModule();
    -    serialModule.addDeserializer(SchemaPath.class, new SchemaPath.De());
    -
    -    AfterburnerModule module = new AfterburnerModule();
    -    module.setUseOptimizedBeanDeserializer(true);
    +    boolean alreadyCheckedModification = false;
    +    boolean newMetadata = false;
     
    -    mapper.registerModule(serialModule);
    -    mapper.registerModule(module);
    -    mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false);
    -    FSDataInputStream is = fs.open(p);
    +    if (metaContext != null) {
    +      alreadyCheckedModification = metaContext.getStatus(parentDir.toString());
    +    }
     
    -    ParquetTableMetadataDirs parquetTableMetadataDirs = mapper.readValue(is, ParquetTableMetadataDirs.class);
    -    logger.info("Took {} ms to read directories from directory cache file", timer.elapsed(TimeUnit.MILLISECONDS));
    -    timer.stop();
    +    if (dirsOnly) {
    +      parquetTableMetadataDirs = mapper.readValue(is, ParquetTableMetadataDirs.class);
    +      logger.info("Took {} ms to read directories from directory cache file", timer.elapsed(TimeUnit.MILLISECONDS));
    +      timer.stop();
    +      if (!alreadyCheckedModification && tableModified(parquetTableMetadataDirs.getDirectories(), p, parentDir, metaContext)) {
    --- End diff --
    
    tableModified compares the directory's modification time against the
metadata cache file's modification time.
    
    In the first call, the metadata file is .drill.parquet_metadata_directories,
and we keep track of the directory that was compared. In the second call, we
should presumably compare .drill.parquet_metadata against the directory. But
MetaContext keeps track of the comparison based on
.drill.parquet_metadata_directories. Could this cause a problem, since
.drill.parquet_metadata and .drill.parquet_metadata_directories are different
files?
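
    To make the question concrete, here is a minimal, self-contained sketch.
It is not Drill's code: DirKeyedContext, DirAndFileKeyedContext, the
two-argument tableModified, and the timestamps are all hypothetical stand-ins.
It assumes the status is recorded per directory only, and it constructs a case
where the directories cache file is newer than the directory while the
metadata cache file is older. Whether that state can actually occur depends on
how the two cache files are written.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch only; every class and method here is hypothetical. */
public class ModificationCheckSketch {

  static final String DIRS_FILE = ".drill.parquet_metadata_directories";
  static final String META_FILE = ".drill.parquet_metadata";

  /** Records "already checked" per directory, regardless of which cache file was compared. */
  static class DirKeyedContext {
    private final Map<String, Boolean> checked = new HashMap<>();
    void setStatus(String dir, boolean status) { checked.put(dir, status); }
    boolean getStatus(String dir) { return checked.getOrDefault(dir, false); }
  }

  /** Alternative: records "already checked" per (directory, cache file) pair. */
  static class DirAndFileKeyedContext {
    private final Map<String, Boolean> checked = new HashMap<>();
    void setStatus(String dir, String file, boolean status) { checked.put(dir + "/" + file, status); }
    boolean getStatus(String dir, String file) { return checked.getOrDefault(dir + "/" + file, false); }
  }

  /** Simplified comparison: the directory changed after the cache file was written. */
  static boolean tableModified(long dirModTime, long cacheFileModTime) {
    return dirModTime > cacheFileModTime;
  }

  /** Returns whether the second call detects a stale metadata file when status is keyed by directory. */
  static boolean secondCheckWithDirKey(long dirMod, long dirsFileMod, long metaFileMod) {
    DirKeyedContext ctx = new DirKeyedContext();
    String dir = "/data/t";
    // First call: compare the directory against the directories cache file, then record the directory.
    tableModified(dirMod, dirsFileMod);
    ctx.setStatus(dir, true);
    // Second call: the comparison is skipped because the directory is already marked as checked.
    return !ctx.getStatus(dir) && tableModified(dirMod, metaFileMod);
  }

  /** Same two calls, but the status is keyed by directory and cache file name. */
  static boolean secondCheckWithDirAndFileKey(long dirMod, long dirsFileMod, long metaFileMod) {
    DirAndFileKeyedContext ctx = new DirAndFileKeyedContext();
    String dir = "/data/t";
    tableModified(dirMod, dirsFileMod);
    ctx.setStatus(dir, DIRS_FILE, true);
    // Second call: the metadata file has not been checked yet, so the comparison runs.
    return !ctx.getStatus(dir, META_FILE) && tableModified(dirMod, metaFileMod);
  }

  public static void main(String[] args) {
    long dirMod = 200L;       // directory modification time (made-up value)
    long dirsFileMod = 250L;  // directories cache file is newer than the directory
    long metaFileMod = 100L;  // metadata cache file is older than the directory
    System.out.println("dir-keyed detects stale file: "
        + secondCheckWithDirKey(dirMod, dirsFileMod, metaFileMod));
    System.out.println("dir+file-keyed detects stale file: "
        + secondCheckWithDirAndFileKey(dirMod, dirsFileMod, metaFileMod));
  }
}
```

    If the two cache files are always written together, the directory-only key
is likely sufficient and avoids the second check. The sketch only shows what
would be missed if they can diverge.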
     
    



> Eliminate extra operations during metadata cache pruning
> --------------------------------------------------------
>
>                 Key: DRILL-4846
>                 URL: https://issues.apache.org/jira/browse/DRILL-4846
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Metadata
>    Affects Versions: 1.7.0
>            Reporter: Aman Sinha
>            Assignee: Aman Sinha
>             Fix For: 1.8.0
>
>
> While doing performance testing for DRILL-4530 using a new data set and 
> queries, we found two potential performance issues: (a) the metadata cache 
> was being read twice in some cases and (b) the check for directory 
> modification time was being done twice, once as part of the first phase of 
> directory-based pruning and again after the second phase of pruning. 
> This check gets expensive for a large number of directories. Creating this 
> JIRA to track fixes for these issues. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
