Hi Alexey,

You should see in the logs a locality measure like NODE_LOCAL, PROCESS_LOCAL, ANY, etc. If each of your Spark workers has an HDFS data node on it and you're reading out of HDFS, then you should be seeing almost all NODE_LOCAL accesses.

One cause of mismatches I've seen is Spark using short hostnames while Hadoop uses FQDNs -- in that case Spark doesn't realize the data is local and does remote reads, which really kills performance.
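As a quick sketch, you can tally those locality levels straight out of the driver log with grep. The log lines below are illustrative stand-ins for Spark's TaskSetManager output (the exact format varies a bit by version); point the grep at your real log instead:

```shell
# Illustrative sample of TaskSetManager lines; replace with your actual driver log.
cat > /tmp/sample_spark.log <<'EOF'
INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 10, node1, NODE_LOCAL, 1900 bytes)
INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 11, node2, ANY, 1900 bytes)
INFO TaskSetManager: Starting task 2.0 in stage 1.0 (TID 12, node1, NODE_LOCAL, 1900 bytes)
EOF

# Count how many tasks launched at each locality level.
# A large proportion of ANY means tasks are reading their data remotely.
grep -o 'PROCESS_LOCAL\|NODE_LOCAL\|RACK_LOCAL\|ANY' /tmp/sample_spark.log \
  | sort | uniq -c
```

If the counts are dominated by ANY even though every worker hosts a data node, comparing the hostnames Spark reports against the ones the HDFS namenode reports is a good next step for spotting the short-name/FQDN mismatch.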
Hope that helps!

Andrew

On Thu, Sep 25, 2014 at 12:09 AM, Alexey Romanchuk <alexey.romanc...@gmail.com> wrote:

> Hello again spark users and developers!
>
> I have a standalone Spark cluster (1.1.0) with Spark SQL running on it. My
> cluster consists of 4 datanodes, and the replication factor of files is 3.
>
> I use the thrift server to access Spark SQL and have 1 table with 30+
> partitions. When I run a query on the whole table (something simple like
> select count(*) from t), Spark produces a lot of network activity, filling
> the entire available 1 Gb link. It looks like Spark sends the data over the
> network instead of reading it locally.
>
> Is there any way to log which blocks were accessed locally and which were
> not?
>
> Thanks!