[ https://issues.apache.org/jira/browse/DRILL-4720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16068670#comment-16068670 ]
Arina Ielchiieva commented on DRILL-4720:
-----------------------------------------
To retrieve the sub-partition list, {{FileSystemSchema}} uses the following code:
{code}
@Override
public Iterable<String> getSubPartitions(String table,
                                         List<String> partitionColumns,
                                         List<String> partitionValues
                                        ) throws PartitionNotFoundException {
  List<FileStatus> fileStatuses;
  try {
    fileStatuses = defaultSchema.getFS().list(false,
        new Path(defaultSchema.getDefaultLocation(), table));
  } catch (IOException e) {
    throw new PartitionNotFoundException("Error finding partitions for table " + table, e);
  }
  return new SubDirectoryList(fileStatuses);
}
{code}
{{DrillFileSystem.list(boolean recursive, Path... paths)}} is used to return the
list of file statuses.
[The behavior of this method is not obvious,
though|https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/dfs/DrillFileSystem.java#L750].
If it is called with the recursive flag set to false, it returns all
directories and files in the given path.
If it is called with the recursive flag set to true, it returns only the
files in the given path, including nested files, and also filters out all files
and directories that are excluded by the Drill file system. When reading data from
a table, [Drill excludes all files and directories that start with a dot or an
underscore|https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/dfs/DrillPathFilter.java].
Since {{getSubPartitions}} passes false, the {{.drill.parquet_metadata}} file ends
up in the sub-partition list, which matches the
{{=($1, '.drill.parquet_metadata')}} filter condition in the plan quoted below.
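
The effect of the missing filter can be illustrated with a minimal, self-contained sketch. It does not use Drill or Hadoop classes: the class {{HiddenEntryFilterSketch}} and its methods are hypothetical names, the entry names are taken from the directory listing in the issue description, and it assumes the minimum directory is chosen by plain string comparison.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only; not Drill's actual implementation.
public class HiddenEntryFilterSketch {

    // Rule described above: entries starting with '.' or '_' are excluded.
    static boolean isVisible(String name) {
        return !name.startsWith(".") && !name.startsWith("_");
    }

    // Returns only the entries that pass the visibility rule, in input order.
    static List<String> visibleEntries(List<String> names) {
        List<String> result = new ArrayList<>();
        for (String name : names) {
            if (isVisible(name)) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Entry names as they appear under /tmp/querylogs_4 in the issue.
        List<String> listing = Arrays.asList(
            ".drill.parquet_metadata", "1985", "1999", "2005", "2014", "2016");

        // Unfiltered: '.' sorts before digits, so the cache file is the minimum.
        System.out.println(Collections.min(listing));

        // Filtered: the minimum is a real partition directory.
        System.out.println(Collections.min(visibleEntries(listing)));
    }
}
```

Under these assumptions the unfiltered minimum is the metadata cache file rather than a partition directory, so a comparison against {{dir0}} can never match any row.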
> MINDIR() and IMINDIR() functions return no results with metadata cache
> ----------------------------------------------------------------------
>
> Key: DRILL-4720
> URL: https://issues.apache.org/jira/browse/DRILL-4720
> Project: Apache Drill
> Issue Type: Bug
> Components: Functions - Drill
> Affects Versions: 1.7.0
> Reporter: Krystal
> Assignee: Arina Ielchiieva
>
> Parquet directories with meta data cache return 0 rows for MINDIR and IMINDIR
> functions.
> hadoop fs -ls /tmp/querylogs_4
> Found 6 items
> -rwxr-xr-x 3 mapr mapr 15406 2016-06-13 10:18 /tmp/querylogs_4/.drill.parquet_metadata
> drwxr-xr-x - root root 4 2016-06-13 10:18 /tmp/querylogs_4/1985
> drwxr-xr-x - root root 3 2016-06-13 10:18 /tmp/querylogs_4/1999
> drwxr-xr-x - root root 3 2016-06-13 10:18 /tmp/querylogs_4/2005
> drwxr-xr-x - root root 4 2016-06-13 10:18 /tmp/querylogs_4/2014
> drwxr-xr-x - root root 6 2016-06-13 10:18 /tmp/querylogs_4/2016
> hadoop fs -ls /tmp/querylogs_4/1985
> Found 4 items
> -rwxr-xr-x 3 mapr mapr 3634 2016-06-13 10:18 /tmp/querylogs_4/1985/.drill.parquet_metadata
> drwxr-xr-x - root root 2 2016-06-13 10:18 /tmp/querylogs_4/1985/Feb
> drwxr-xr-x - root root 2 2016-06-13 10:18 /tmp/querylogs_4/1985/apr
> drwxr-xr-x - root root 2 2016-06-13 10:18 /tmp/querylogs_4/1985/jan
> SELECT * FROM `dfs.tmp`.`querylogs_4` WHERE dir0 = MINDIR('dfs.tmp','querylogs_4');
> +-----------+-------+------+---------------+----------------+------------+------------+-------+-------+-------+
> | voter_id | name | age | registration | contributions | voterzone | date_time | dir0 | dir1 | dir2 |
> +-----------+-------+------+---------------+----------------+------------+------------+-------+-------+-------+
> +-----------+-------+------+---------------+----------------+------------+------------+-------+-------+-------+
> No rows selected (0.803 seconds)
> If the meta cache is removed, expected data is returned.
> Here is the physical plan:
> {code}
> 00-00 Screen : rowType = RecordType(ANY *): rowcount = 3.75, cumulative cost = {54.125 rows, 169.125 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 664191
> 00-01 Project(*=[$0]) : rowType = RecordType(ANY *): rowcount = 3.75, cumulative cost = {53.75 rows, 168.75 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 664190
> 00-02 Project(T51¦¦*=[$0]) : rowType = RecordType(ANY T51¦¦*): rowcount = 3.75, cumulative cost = {53.75 rows, 168.75 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 664189
> 00-03 SelectionVectorRemover : rowType = RecordType(ANY T51¦¦*, ANY dir0): rowcount = 3.75, cumulative cost = {53.75 rows, 168.75 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 664188
> 00-04 Filter(condition=[=($1, '.drill.parquet_metadata')]) : rowType = RecordType(ANY T51¦¦*, ANY dir0): rowcount = 3.75, cumulative cost = {50.0 rows, 165.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 664187
> 00-05 Project(T51¦¦*=[$0], dir0=[$1]) : rowType = RecordType(ANY T51¦¦*, ANY dir0): rowcount = 25.0, cumulative cost = {25.0 rows, 50.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 664186
> 00-06 Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=/tmp/querylogs_4/2005/May/voter25.parquet/0_0_0.parquet]], selectionRoot=/tmp/querylogs_4, numFiles=1, usedMetadataFile=true, columns=[`*`]]]) : rowType = (DrillRecordRow[*, dir0]): rowcount = 25.0, cumulative cost = {25.0 rows, 50.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 664185
> {code}
> Here is the plan for the same query against the same directory structure
> without meta data cache:
> {code}
> 00-00 Screen : rowType = RecordType(ANY *): rowcount = 75.0, cumulative cost = {82.5 rows, 157.5 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 664312
> 00-01 Project(*=[$0]) : rowType = RecordType(ANY *): rowcount = 75.0, cumulative cost = {75.0 rows, 150.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 664311
> 00-02 Project(*=[$0]) : rowType = RecordType(ANY *): rowcount = 75.0, cumulative cost = {75.0 rows, 150.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 664310
> 00-03 Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:///tmp/querylogs_1/1985/Feb/voter10.parquet/0_0_0.parquet], ReadEntryWithPath [path=maprfs:///tmp/querylogs_1/1985/jan/voter5.parquet/0_0_0.parquet], ReadEntryWithPath [path=maprfs:///tmp/querylogs_1/1985/apr/voter65.parquet/0_0_0.parquet]], selectionRoot=maprfs:/tmp/querylogs_1, numFiles=3, usedMetadataFile=false, columns=[`*`]]]) : rowType = (DrillRecordRow[*, dir0]): rowcount = 75.0, cumulative cost = {75.0 rows, 150.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 664309
> {code}