[
https://issues.apache.org/jira/browse/DRILL-4706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15637850#comment-15637850
]
ASF GitHub Bot commented on DRILL-4706:
---------------------------------------
Github user ppadma commented on the issue:
https://github.com/apache/drill/pull/639
Parallelization logic is affected for the following reasons:
Depending on how many rowGroups a node has to scan (based on locality
information), i.e. how much work the node has to do, we want to adjust the
number of fragments on that node (subject to the usual global and per-node
limits).
We do not want to schedule fragment(s) on a node that does not have data.
Because we want pure locality, we may end up with fewer fragments, each doing
more work.
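The three points above can be sketched roughly as follows. This is only an illustrative Python model, not the code in the pull request: the function names, the `rowgroups_per_fragment` knob, and the greedy least-loaded assignment are all assumptions made for the example.

```python
import math
from collections import defaultdict


def assign_row_groups(replicas):
    """Assign each row group to a host that holds a local copy.

    `replicas` maps a row group id to the hosts storing it (hypothetical input).
    Greedy heuristic: handle the row groups with the fewest replicas first and
    give each one to the least-loaded host among its replicas.
    """
    load = defaultdict(int)
    assignment = {}
    for rg in sorted(replicas, key=lambda r: (len(replicas[r]), r)):
        host = min(replicas[rg], key=lambda h: (load[h], h))
        assignment[rg] = host
        load[host] += 1
    return assignment


def fragments_per_node(assignment, rowgroups_per_fragment, max_per_node, max_global):
    """Size the number of fragments on each node by the work assigned to it.

    Nodes with no assigned row groups never appear in the result, so no
    fragment is scheduled where there is no data.
    """
    work = defaultdict(int)
    for host in assignment.values():
        work[host] += 1
    widths = {
        host: min(max_per_node, math.ceil(count / rowgroups_per_fragment))
        for host, count in work.items()
    }
    # Enforce the global limit by trimming the widest node, but keep at least
    # one fragment on every node that has data (pure locality). If the limit
    # is smaller than the number of such nodes, it cannot be met in this model.
    while sum(widths.values()) > max_global and any(w > 1 for w in widths.values()):
        widest = max(widths, key=lambda h: (widths[h], h))
        widths[widest] -= 1
    return widths


# Hypothetical cluster of hosts "a", "b" and "c"; "c" holds no data.
replicas = {"rg0": ["a", "b"], "rg1": ["a"], "rg2": ["a", "b"], "rg3": ["b"]}
assignment = assign_row_groups(replicas)
widths = fragments_per_node(assignment, rowgroups_per_fragment=1,
                            max_per_node=4, max_global=3)
```

In this toy run every row group is read from a host that stores it, host "c" gets no fragment, and the global limit leaves fewer fragments than row groups, so one fragment scans more than one row group.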
> Fragment planning causes Drillbits to read remote chunks when local copies
> are available
> ----------------------------------------------------------------------------------------
>
> Key: DRILL-4706
> URL: https://issues.apache.org/jira/browse/DRILL-4706
> Project: Apache Drill
> Issue Type: Bug
> Components: Query Planning & Optimization
> Affects Versions: 1.6.0
> Environment: CentOS, RHEL
> Reporter: Kunal Khatua
> Assignee: Sorabh Hamirwasia
> Labels: performance, planning
>
> When a table (datasize=70GB) of 160 parquet files (each having a single
> rowgroup and fitting within one chunk) is available on a 10-node setup with
> replication=3, a pure data scan query causes about 2% of the data to be read
> remotely.
> Even after the metadata cache is created, the planner selects a sub-optimal
> plan for executing the SCAN fragments, so that some of the data is served
> from a remote server.