[
https://issues.apache.org/jira/browse/DRILL-4530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15373801#comment-15373801
]
ASF GitHub Bot commented on DRILL-4530:
---------------------------------------
Github user jinfengni commented on a diff in the pull request:
https://github.com/apache/drill/pull/519#discussion_r70532442
--- Diff: exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestParquetMetadataCache.java ---
@@ -211,9 +217,76 @@ public void testNoSupportedError() throws Exception {
.go();
}
+ @Test // DRILL-4530
+ public void testDrill4530_1() throws Exception {
+ // create metadata cache
+ test(String.format("refresh table metadata dfs_test.`%s/%s`", getDfsTestTmpSchemaLocation(), tableName2));
+ checkForMetadataFile(tableName2);
+
+ // run query and check correctness
+ String query1 = String.format("select dir0, dir1, o_custkey, o_orderdate from dfs_test.`%s/%s` " +
+ " where dir0=1995 and dir1='Q3'",
+ getDfsTestTmpSchemaLocation(), tableName2);
+ int expectedRowCount = 20;
+ int expectedNumFiles = 2;
+
+ int actualRowCount = testSql(query1);
+ assertEquals(expectedRowCount, actualRowCount);
+ String numFilesPattern = "numFiles=" + expectedNumFiles;
+ String usedMetaPattern = "usedMetadataFile=true";
+ String cacheFileRootPattern = String.format("%s/%s/1995/Q3", getDfsTestTmpSchemaLocation(), tableName2);
--- End diff --
The verification of cacheFileRootPattern probably needs "cacheFileRoot="
as a prefix. Otherwise, the list of files in the GroupScan will always
produce a match for the cacheFileRoot pattern, right?
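To illustrate the concern, here is a minimal sketch. The plan text below is a made-up stand-in for what the GroupScan prints (its exact layout, the /tmp/t2 paths, and the class and method names are assumptions for illustration, not Drill's actual output or API). It only shows that a bare path pattern matches the file list no matter what cacheFileRoot is, while a prefixed pattern does not.

```java
public class CacheFileRootPatternCheck {

  // Hypothetical helper: true if the plan text contains the literal string.
  static boolean planContains(String plan, String literal) {
    return plan.contains(literal);
  }

  // Made-up plan fragment (assumed layout): two files under 1995/Q3,
  // plus whatever cacheFileRoot value the caller supplies.
  static String samplePlan(String cacheFileRoot) {
    return "ParquetGroupScan [entries=["
        + "ReadEntryWithPath [path=/tmp/t2/1995/Q3/a.parquet], "
        + "ReadEntryWithPath [path=/tmp/t2/1995/Q3/b.parquet]], "
        + "numFiles=2, usedMetadataFile=true, "
        + "cacheFileRoot=" + cacheFileRoot + "]";
  }

  public static void main(String[] args) {
    String expectedRoot = "/tmp/t2/1995/Q3";

    // Plan where the cache was (wrongly) read from the table root.
    String wrongPlan = samplePlan("/tmp/t2");

    // Bare path: matches anyway, because the file entries contain it.
    System.out.println(planContains(wrongPlan, expectedRoot));

    // Prefixed path: does not match, so the test would catch the problem.
    System.out.println(planContains(wrongPlan, "cacheFileRoot=" + expectedRoot));
  }
}
```

With the prefix, the check passes only when the cacheFileRoot attribute itself points at the 1995/Q3 subdirectory.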
> Improve metadata cache performance for queries with single partition
> ---------------------------------------------------------------------
>
> Key: DRILL-4530
> URL: https://issues.apache.org/jira/browse/DRILL-4530
> Project: Apache Drill
> Issue Type: Improvement
> Components: Query Planning & Optimization
> Affects Versions: 1.6.0
> Reporter: Aman Sinha
> Assignee: Aman Sinha
> Fix For: Future
>
>
> Consider two types of queries which are run with Parquet metadata caching:
> {noformat}
> query 1:
> SELECT col FROM `A/B/C`;
> query 2:
> SELECT col FROM `A` WHERE dir0 = 'B' AND dir1 = 'C';
> {noformat}
> For a certain dataset, the elapsed time of query 1 is 1 sec whereas that of
> query 2 is 9 sec, even though both access the same amount of data. Users
> expect the two to perform roughly the same. The main difference comes from
> query 2 reading the bigger metadata cache file at the root level 'A' and then
> applying the partitioning filter, whereas query 1 reads a much smaller
> metadata cache file at the subdirectory level.
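One way to picture the improvement the issue asks for: when the filter pins the leading directory levels with equality conditions, the planner could read the cache file from the matching subdirectory instead of the root. The sketch below is only an illustration of that idea under stated assumptions; the class name, the method, and the map-based filter representation are hypothetical and are not Drill's implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CacheRootSketch {

  // Hypothetical: walk dir0, dir1, ... while each level has an equality
  // filter, appending its value to the table root. Stops at the first
  // level without a filter, since deeper levels cannot be pinned then.
  static String cacheFileRoot(String tableRoot, Map<String, String> dirFilters) {
    StringBuilder path = new StringBuilder(tableRoot);
    for (int level = 0; ; level++) {
      String value = dirFilters.get("dir" + level);
      if (value == null) {
        break;
      }
      path.append('/').append(value);
    }
    return path.toString();
  }

  public static void main(String[] args) {
    Map<String, String> filters = new LinkedHashMap<>();
    filters.put("dir0", "B");
    filters.put("dir1", "C");
    // For query 2 in the description this resolves to the same
    // directory that query 1 names directly.
    System.out.println(cacheFileRoot("A", filters));
  }
}
```

A filter on dir1 alone leaves the root unchanged in this sketch, because the dir0 level is not pinned to a single directory.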
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)