[
https://issues.apache.org/jira/browse/TAJO-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jaehwa Jung updated TAJO-1974:
------------------------------
Affects Version/s: (was: 0.12.0)
> When calculating partitioned table volume, avoid to list partition
> directories.
> -------------------------------------------------------------------------------
>
> Key: TAJO-1974
> URL: https://issues.apache.org/jira/browse/TAJO-1974
> Project: Tajo
> Issue Type: Improvement
> Components: Physical Operator, QueryMaster
> Reporter: Jaehwa Jung
> Assignee: Jaehwa Jung
>
> Currently, after storing the data of a partitioned table, Tajo calculates the
> volume of the table by listing its partition directories. To list them,
> Tajo uses FileSystem::getContentSummary from the HDFS generic API.
> With a small to medium number of partition directories, this should not be a
> problem. With a large number of partition directories, it becomes a
> problem. For example, three years of data, organized into hourly directories,
> results in 26,280 directories. If each directory contains 5 files, this
> makes a grand total of 131,400 files. That seems to be a moderate load for
> HDFS, but it might result in very poor performance on S3. Thus we need to
> avoid listing partition directories.
> I think we can get the volume of each partition directory in
> PhysicalOperator. If every task reports the volume of the partitions it
> wrote, Query doesn’t need to list partition directories through the HDFS API.
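
A minimal sketch of the proposal above, assuming each task records the bytes it writes per partition and the query side only sums those reports. The names PartitionVolumeSketch, TaskReport, addWrittenBytes, mergeReports and totalVolume are hypothetical and are not Tajo classes or methods; the partition paths are made up for illustration. The sketch uses plain Java collections and does not touch the file system at all, which is the point of the change.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of the Tajo code base.
final class PartitionVolumeSketch {

    /** Hypothetical per-task report: bytes written to each partition path. */
    static final class TaskReport {
        final Map<String, Long> bytesByPartition = new LinkedHashMap<>();

        /** Called by the (hypothetical) writer each time it flushes data. */
        void addWrittenBytes(String partitionPath, long bytes) {
            bytesByPartition.merge(partitionPath, bytes, Long::sum);
        }
    }

    /** Sums the per-partition volumes of all task reports; no directory listing. */
    static Map<String, Long> mergeReports(List<TaskReport> reports) {
        Map<String, Long> totals = new LinkedHashMap<>();
        for (TaskReport report : reports) {
            report.bytesByPartition.forEach(
                (path, bytes) -> totals.merge(path, bytes, Long::sum));
        }
        return totals;
    }

    /** Total table volume, derived from the merged per-partition volumes. */
    static long totalVolume(Map<String, Long> totals) {
        long sum = 0L;
        for (long bytes : totals.values()) {
            sum += bytes;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Directory and file counts from the example in the description.
        long directories = 3L * 365 * 24;   // three years of hourly directories
        long files = directories * 5;       // five files per directory
        System.out.println(directories + " directories, " + files + " files");

        // Two tasks that both wrote into the same hourly partition.
        TaskReport first = new TaskReport();
        first.addWrittenBytes("dt=2015-01-01/hr=00", 100L);
        first.addWrittenBytes("dt=2015-01-01/hr=01", 200L);
        TaskReport second = new TaskReport();
        second.addWrittenBytes("dt=2015-01-01/hr=01", 50L);

        List<TaskReport> reports = new ArrayList<>();
        reports.add(first);
        reports.add(second);

        Map<String, Long> totals = mergeReports(reports);
        System.out.println(totals + " total=" + totalVolume(totals));
    }
}
```

With this shape, the cost of computing the table volume grows with the number of task reports rather than with the number of directories and files that the storage layer would have to enumerate.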