Hi Alexey,

You should see in the logs a locality measure like NODE_LOCAL,
PROCESS_LOCAL, ANY, etc.  If your Spark workers each have an HDFS DataNode
on them and you're reading from HDFS, then you should be seeing almost
all NODE_LOCAL accesses.  One cause of mismatches I've seen is Spark
using short hostnames while Hadoop uses FQDNs -- in that case Spark doesn't
think the data is local, so it does remote reads instead, which really
kills performance.
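One quick way to check is to tally the locality levels in the driver log. A rough sketch (the log lines below are illustrative, modeled on Spark's default INFO-level TaskSetManager output; the log path is an example, not a fixed location):

```shell
# Illustrative sample of task-launch log lines (format may vary by Spark version)
cat > /tmp/spark-driver.log <<'EOF'
INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 0, worker1, NODE_LOCAL, 1234 bytes)
INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 1, worker2, ANY, 1234 bytes)
EOF

# Count task launches by locality level; a large ANY count suggests remote reads
grep -o 'PROCESS_LOCAL\|NODE_LOCAL\|RACK_LOCAL\|ANY' /tmp/spark-driver.log | sort | uniq -c
```

If most tasks show up as ANY rather than NODE_LOCAL, that points at the hostname mismatch described above.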

Hope that helps!
Andrew

On Thu, Sep 25, 2014 at 12:09 AM, Alexey Romanchuk <
alexey.romanc...@gmail.com> wrote:

> Hello again spark users and developers!
>
> I have a standalone Spark cluster (1.1.0) with Spark SQL running on it. My
> cluster consists of 4 datanodes, and the replication factor of files is 3.
>
> I use the Thrift server to access Spark SQL and have 1 table with 30+
> partitions. When I run a query on the whole table (something simple like
> select count(*) from t), Spark produces a lot of network activity,
> saturating the 1 Gb link. It looks like Spark sends data over the network
> instead of reading it locally.
>
> Is there any way to log which blocks were accessed locally and which were
> not?
>
> Thanks!
>
