Hello Andrew! Thanks for the reply. Which logs should I check, and at what log level? Driver, master, or worker?
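Meanwhile I tried raising the scheduler logging on the driver with a log4j.properties along these lines (the TaskSetManager logger name is my assumption from the Spark 1.1 class names; as far as I can tell it logs each launched task's locality level at INFO, and DEBUG adds the valid locality levels per task set):

# driver-side conf/log4j.properties (Spark 1.x ships log4j 1.2)
log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
# locality level (PROCESS_LOCAL, NODE_LOCAL, ANY, ...) of every launched task
log4j.logger.org.apache.spark.scheduler.TaskSetManager=DEBUG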
I found this in the master node's logs, but it shows only the ANY locality level. Here is the driver (Spark SQL) log - https://gist.github.com/13h3r/c91034307caa33139001 - and one of the worker logs - https://gist.github.com/13h3r/6e5053cf0dbe33f2aaaa

Do you have any idea where to look? Thanks!

On Fri, Sep 26, 2014 at 10:35 AM, Andrew Ash <and...@andrewash.com> wrote:

> Hi Alexey,
>
> You should see in the logs a locality measure like NODE_LOCAL,
> PROCESS_LOCAL, ANY, etc. If your Spark workers each have an HDFS data node
> on them and you're reading out of HDFS, then you should be seeing almost
> all NODE_LOCAL accesses. One cause I've seen for mismatches is if Spark
> uses short hostnames and Hadoop uses FQDNs -- in that case Spark doesn't
> think the data is local and does remote reads, which really kills
> performance.
>
> Hope that helps!
> Andrew
>
> On Thu, Sep 25, 2014 at 12:09 AM, Alexey Romanchuk <
> alexey.romanc...@gmail.com> wrote:
>
>> Hello again, Spark users and developers!
>>
>> I have a standalone Spark cluster (1.1.0) with Spark SQL running on it.
>> My cluster consists of 4 datanodes, and the replication factor of the
>> files is 3.
>>
>> I use the Thrift server to access Spark SQL and have one table with 30+
>> partitions. When I run a query over the whole table (something simple
>> like select count(*) from t), Spark produces a lot of network activity,
>> saturating the available 1 Gb link. It looks like Spark sends the data
>> over the network instead of reading it locally.
>>
>> Is there any way to log which blocks were accessed locally and which
>> were not?
>>
>> Thanks!
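P.S. To check the short-hostname vs. FQDN theory directly, I am going to compare what HDFS and Spark call the same machines from the spark-shell, roughly like the sketch below (the path is just a placeholder for one partition file of the table, and sc is the shell's SparkContext):

import org.apache.hadoop.fs.{FileSystem, Path}

// placeholder path - substitute one real file of the table
val path = new Path("/user/hive/warehouse/t/part-00000")
val fs = FileSystem.get(sc.hadoopConfiguration)
val status = fs.getFileStatus(path)

// hostnames the HDFS namenode reports for the file's blocks
val hdfsHosts = fs.getFileBlockLocations(status, 0, status.getLen)
  .flatMap(_.getHosts).distinct.sorted

// hostnames Spark registered its block managers under; keys look like
// "host:port" and may include the driver as well as the executors
val sparkHosts = sc.getExecutorMemoryStatus.keys
  .map(_.split(":")(0)).toArray.distinct.sorted

println("HDFS block hosts:     " + hdfsHosts.mkString(", "))
println("Spark executor hosts: " + sparkHosts.mkString(", "))

If one list comes back as FQDNs and the other as short names, that would explain why every task ends up at the ANY level.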