[
https://issues.apache.org/jira/browse/IMPALA-7214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16729187#comment-16729187
]
Tim Armstrong commented on IMPALA-7214:
---------------------------------------
I started to have a look at this, and I think it's not just a terminology
issue: a lot of the docs pervasively assume a particular topology, with a
coordinator+executor colocated with each HDFS DataNode. One example is this
sentence: "You can submit a query to the Impala daemon running on any
DataNode", which is outdated both because of the coordinator/executor
separation and because of decoupled compute clusters.
So I think we need to first update docs to reflect the coordinator/executor
separation. Maybe they need to be defined in topics/impala_components.xml. I
think that introductory material also needs to introduce the traditional
colocated HDFS storage/compute model, where each executor runs on the same host
as a DataNode, but also mention that other deployment models are possible (like
a compute cluster that reads remotely from HDFS, S3, ADLS, etc.).
Then there are a bunch of instances of "DataNode" that I think are really
referring to an Impala daemon, or are unnecessary, e.g.:
* "For information about establishing a connection to a DataNode running the
<codeph>impalad</codeph> daemon.." should probably read "For information about
establishing a connection to a coordinator Impala Daemon..."
* "Where practical, co-locate the tablet servers on the same hosts as the
DataNodes, although that is not required."-> "Where practical, co-locate the
tablet servers on the same hosts as the Impala Daemons, although that is not
required."
* "If necessary, specify one of the following configuration options when
starting the <cmdname>impalad</cmdname> daemon on each DataNode:" -> "If
necessary, specify one of the following configuration options when starting the
<cmdname>impalad</cmdname> daemon:"
* In impala_order_by.xml, I think each instance of DataNode actually should be
"Impala Daemon" or, probably more precisely, "Executor Impala Daemon"
* "For some operations, such as joins and combining intermediate results into a
final result set, data is transmitted across the network from one DataNode to
another." -> this should be "Impala Daemon"
There are a bunch of instances of "DataNode" that are referring to the HDFS
process or service, or the host that is running the HDFS process, which should
be ok to leave for now (there are some that are a little unclear).
I found one instance where it was referring to the host running the Impala
Daemon: "Increase the overall memory capacity of each DataNode at the hardware
level." I'd rephrase that as "Add more memory to the hosts running Impala
Daemons".
[~arodoni_cloudera] Maybe we can fix up the instances where it's definitely
referring to the Impala daemon, then do another pass to see how it looks?
[~philip][~joemcdonnell] FYI, just an example of some of the things we might
need to update in docs for remote reads.
> Update Impala docs to reflect coordinator/executor separation and decoupling
> from DataNodes.
> --------------------------------------------------------------------------------------------
>
> Key: IMPALA-7214
> URL: https://issues.apache.org/jira/browse/IMPALA-7214
> Project: IMPALA
> Issue Type: Bug
> Components: Docs
> Affects Versions: Impala 2.12.0
> Reporter: Tim Armstrong
> Assignee: Tim Armstrong
> Priority: Major
>
> The docs tend to conflate DataNodes (an HDFS service) and Impala daemons. I
> think this stems from the original deployment practice of always colocating
> Impala daemons with HDFS datanodes so that HDFS data could always be read
> from a local DataNode.
> I'm a bit pedantic so the conflation feels wrong to me regardless, but I
> think this will become increasingly confusing as alternative deployments
> without colocated HDFS DataNodes become more common (e.g. running against S3,
> running with a separate HDFS service).
> E.g. picking an example at random:
> {noformat}
> In Impala 1.4.0 and higher, the <codeph>LIMIT</codeph> clause is now
> optional (rather than required) for
> queries that use the <codeph>ORDER BY</codeph> clause. Impala
> automatically uses a temporary disk work area
> to perform the sort if the sort operation would otherwise exceed the
> Impala memory limit for a particular
> DataNode.
> {noformat}
> This is wrong because the memory limit is for an Impala daemon, which is the
> process that does the actual sorting. So here I think it should be "Impala
> daemon" instead of "DataNode".