[ 
https://issues.apache.org/jira/browse/IMPALA-7214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16729187#comment-16729187
 ] 

Tim Armstrong commented on IMPALA-7214:
---------------------------------------

I started to have a look at this, and I think it's not just a terminology 
issue: a lot of the docs pervasively assume a particular topology, with a 
coordinator+executor colocated with each HDFS DataNode. For example, the 
sentence "You can submit a query to the Impala daemon running on any DataNode" 
is outdated both because of the coordinator/executor separation and because of 
decoupled compute clusters.

So I think we need to first update the docs to reflect the coordinator/executor 
separation. Maybe the two roles need to be defined in 
topics/impala_components.xml. I think the introductory material also needs to 
introduce the traditional colocated HDFS storage/compute model, where each 
executor runs on the same host as a DataNode, but also mention that other 
deployment models are possible (like a compute cluster that reads remotely from 
HDFS, S3, ADLS, etc).

Then there are a bunch of instances of "DataNode" that I think are really 
referring to an Impala daemon, or are unnecessary. For example:
* "For information about establishing a connection to a DataNode running the 
<codeph>impalad</codeph> daemon.." should probably read "For information about 
establishing a connection to a coordinator Impala Daemon..."
* "Where practical, co-locate the tablet servers on the same hosts as the 
DataNodes, although that is not required." -> "Where practical, co-locate the 
tablet servers on the same hosts as the Impala Daemons, although that is not 
required."
* "If necessary, specify one of the following configuration options when 
starting the <cmdname>impalad</cmdname> daemon on each DataNode:" -> "If 
necessary, specify one of the following configuration options when starting the 
<cmdname>impalad</cmdname> daemon:"
* In impala_order_by.xml, I think each instance of DataNode should actually be 
"Impala Daemon" or, probably more precisely, "Executor Impala Daemon"
* "For some operations, such as joins and combining intermediate results into a 
final result set, data is transmitted across the network from one DataNode to 
another." -> here "DataNode" should be "Impala Daemon"

There are a bunch of instances of "DataNode" that are referring to the HDFS 
process or service, or the host that is running the HDFS process, which should 
be ok to leave for now (there are some that are a little unclear).

I found one instance where it was referring to the host running the Impala 
Daemon: "Increase the overall memory capacity of each DataNode at the hardware 
level." I'd rephrase that as "Add more memory to the hosts running Impala 
Daemons".

[~arodoni_cloudera] Maybe we can fix up the instances where it's definitely 
referring to the Impala daemon, then do another pass to see how it looks?

[~philip] [~joemcdonnell] FYI, just an example of some of the things we might 
need to update in docs for remote reads.

> Update Impala docs to reflect coordinator/executor separation and decoupling 
> from DataNodes.
> --------------------------------------------------------------------------------------------
>
>                 Key: IMPALA-7214
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7214
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Docs
>    Affects Versions: Impala 2.12.0
>            Reporter: Tim Armstrong
>            Assignee: Tim Armstrong
>            Priority: Major
>
> The docs tend to conflate DataNodes (an HDFS service) and Impala daemons. I 
> think this stems from the original deployment practice of always colocating 
> Impala daemons with HDFS DataNodes so that HDFS data could always be read 
> from a local DataNode. 
> I'm a bit pedantic so the conflation feels wrong to me regardless, but I 
> think this will become increasingly confusing as alternative deployments 
> without colocated HDFS DataNodes become more common (e.g. running against S3, 
> running with a separate HDFS service).
> E.g. picking an example at random:
> {noformat}
>         In Impala 1.4.0 and higher, the <codeph>LIMIT</codeph> clause is now 
> optional (rather than required) for
>         queries that use the <codeph>ORDER BY</codeph> clause. Impala 
> automatically uses a temporary disk work area
>         to perform the sort if the sort operation would otherwise exceed the 
> Impala memory limit for a particular
>         DataNode.
> {noformat}
> This is wrong because the memory limit is for an Impala daemon, which is the 
> process that does the actual sorting. So here I think it should be "Impala 
> daemon" instead of "DataNode".


