[
https://issues.apache.org/jira/browse/DRILL-4846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15423562#comment-15423562
]
ASF GitHub Bot commented on DRILL-4846:
---------------------------------------
Github user jinfengni commented on a diff in the pull request:
https://github.com/apache/drill/pull/569#discussion_r75037246
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java ---
@@ -470,41 +483,38 @@ private ParquetTableMetadataBase readBlockMeta(String path) throws IOException {
mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false);
FSDataInputStream is = fs.open(p);
- ParquetTableMetadataBase parquetTableMetadata = mapper.readValue(is, ParquetTableMetadataBase.class);
- logger.info("Took {} ms to read metadata from cache file", timer.elapsed(TimeUnit.MILLISECONDS));
- timer.stop();
- if (tableModified(parquetTableMetadata, p)) {
- parquetTableMetadata =
- (createMetaFilesRecursively(Path.getPathWithoutSchemeAndAuthority(p.getParent()).toString())).getLeft();
- }
- return parquetTableMetadata;
- }
-
- private ParquetTableMetadataDirs readMetadataDirs(String path) throws IOException {
- Stopwatch timer = Stopwatch.createStarted();
- Path p = new Path(path);
- ObjectMapper mapper = new ObjectMapper();
-
- final SimpleModule serialModule = new SimpleModule();
- serialModule.addDeserializer(SchemaPath.class, new SchemaPath.De());
-
- AfterburnerModule module = new AfterburnerModule();
- module.setUseOptimizedBeanDeserializer(true);
+ boolean alreadyCheckedModification = false;
+ boolean newMetadata = false;
- mapper.registerModule(serialModule);
- mapper.registerModule(module);
- mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false);
- FSDataInputStream is = fs.open(p);
+ if (metaContext != null) {
+ alreadyCheckedModification = metaContext.getStatus(parentDir.toString());
+ }
- ParquetTableMetadataDirs parquetTableMetadataDirs = mapper.readValue(is, ParquetTableMetadataDirs.class);
- logger.info("Took {} ms to read directories from directory cache file", timer.elapsed(TimeUnit.MILLISECONDS));
- timer.stop();
+ if (dirsOnly) {
+ parquetTableMetadataDirs = mapper.readValue(is, ParquetTableMetadataDirs.class);
+ logger.info("Took {} ms to read directories from directory cache file", timer.elapsed(TimeUnit.MILLISECONDS));
+ timer.stop();
+ if (!alreadyCheckedModification && tableModified(parquetTableMetadataDirs.getDirectories(), p, parentDir, metaContext)) {
--- End diff --
tableModified compares the directory's modification time against the
metadata cache file's modification time.
In the first call, the metadata file is .drill.parquet_metadata_directories,
and we keep track of the directory that was compared. In the second call, we
should presumably compare .drill.parquet_metadata against the directory. But
MetaContext keeps track of the comparison based on
.drill.parquet_metadata_directories. Could this cause a problem, since
.drill.parquet_metadata and .drill.parquet_metadata_directories are
different files?
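
To make the concern concrete, here is a minimal sketch, not Drill's actual
code. ModificationCheckSketch is a hypothetical stand-in for MetaContext, and
the "directory is newer than the cache file" rule in tableModified is an
assumption about how the comparison works. It contrasts tracking the check per
directory with tracking it per (directory, cache file) pair:

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical stand-in for MetaContext; names and behavior are assumptions. */
public class ModificationCheckSketch {
    // Tracks checks by directory only, as the comment above describes.
    private final Set<String> checkedDirs = new HashSet<>();
    // Alternative: tracks checks by directory plus the cache file compared.
    private final Set<String> checkedDirAndFile = new HashSet<>();

    /** Records that dir was compared against the given cache file. */
    public void markChecked(String dir, String cacheFile) {
        checkedDirs.add(dir);
        checkedDirAndFile.add(dir + "|" + cacheFile);
    }

    /** True if dir was compared against any cache file. */
    public boolean isCheckedByDir(String dir) {
        return checkedDirs.contains(dir);
    }

    /** True only if dir was compared against this specific cache file. */
    public boolean isCheckedByDirAndFile(String dir, String cacheFile) {
        return checkedDirAndFile.contains(dir + "|" + cacheFile);
    }

    /** Assumed rule: the table is modified if the directory is newer than the cache file. */
    public static boolean tableModified(long dirModTime, long cacheFileModTime) {
        return dirModTime > cacheFileModTime;
    }
}
```

With directory-only tracking, a check recorded against
.drill.parquet_metadata_directories also suppresses the later check against
.drill.parquet_metadata; with the pair key, the second check would still run.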
> Eliminate extra operations during metadata cache pruning
> --------------------------------------------------------
>
> Key: DRILL-4846
> URL: https://issues.apache.org/jira/browse/DRILL-4846
> Project: Apache Drill
> Issue Type: Bug
> Components: Metadata
> Affects Versions: 1.7.0
> Reporter: Aman Sinha
> Assignee: Aman Sinha
> Fix For: 1.8.0
>
>
> While doing performance testing for DRILL-4530 using a new data set and
> queries, we found two potential performance issues: (a) the metadata cache
> was being read twice in some cases and (b) the checking for directory
> modification time was being done twice, once as part of the first phase of
> directory-based pruning and subsequently after the second phase pruning.
> This check gets expensive for a large number of directories. Creating this
> JIRA to track fixes for these issues.