Hello Andrew!

Thanks for the reply. Which logs should I check, and at what log level? Driver,
master, or worker?

I did find locality information in the master node's logs, but the only
locality level there is ANY. Here is the driver (spark sql) log -
https://gist.github.com/13h3r/c91034307caa33139001 - and one of the worker
logs - https://gist.github.com/13h3r/6e5053cf0dbe33f2aaaa
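
For what it's worth, here is a minimal sketch of what I can run from the
spark-shell to see which hosts Spark itself considers preferred for each
partition, next to the executor hosts it actually knows about (the HDFS path
below is only a placeholder, not my real table location):

  // Compare Spark's preferred locations per partition with the hosts of the
  // registered executors (getExecutorMemoryStatus keys are "host:port").
  val executorHosts = sc.getExecutorMemoryStatus.keys.map(_.split(":")(0)).toSet
  val rdd = sc.textFile("hdfs:///path/to/table")
  rdd.partitions.foreach { p =>
    val preferred = rdd.preferredLocations(p)
    println(s"partition ${p.index}: preferred=$preferred, " +
      s"matching executors=${preferred.filter(executorHosts.contains)}")
  }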

Do you have any idea where to look?
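
Regarding the hostname mismatch you mentioned, I can also print the hostnames
HDFS reports for the blocks of one of the files and compare them with the
executor hostnames above - a short-name vs FQDN difference there would explain
the ANY locality. A rough sketch (the part file name is only a placeholder):

  import org.apache.hadoop.fs.{FileSystem, Path}
  val fs = FileSystem.get(sc.hadoopConfiguration)
  val status = fs.getFileStatus(new Path("hdfs:///path/to/table/part-00000"))
  fs.getFileBlockLocations(status, 0, status.getLen)
    .foreach(loc => println(loc.getHosts.mkString(", ")))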

Thanks!

On Fri, Sep 26, 2014 at 10:35 AM, Andrew Ash <and...@andrewash.com> wrote:

> Hi Alexey,
>
> You should see in the logs a locality measure like NODE_LOCAL,
> PROCESS_LOCAL, ANY, etc.  If your Spark workers each have an HDFS data node
> on them and you're reading out of HDFS, then you should be seeing almost
> all NODE_LOCAL accesses.  One cause I've seen for mismatches is if Spark
> uses short hostnames and Hadoop uses FQDNs -- in that case Spark doesn't
> think the data is local and does remote reads which really kills
> performance.
>
> Hope that helps!
> Andrew
>
> On Thu, Sep 25, 2014 at 12:09 AM, Alexey Romanchuk <
> alexey.romanc...@gmail.com> wrote:
>
>> Hello again spark users and developers!
>>
>> I have a standalone Spark cluster (1.1.0) with Spark SQL running on it. My
>> cluster consists of 4 datanodes, and the replication factor of the files is 3.
>>
>> I use the thrift server to access Spark SQL and have one table with 30+
>> partitions. When I run a query over the whole table (something simple like
>> select count(*) from t), Spark produces a lot of network activity, saturating
>> the available 1 Gb link. It looks like Spark sends the data over the network
>> instead of reading it locally.
>>
>> Is there any way to log which blocks were accessed locally and which were not?
>>
>> Thanks!
>>
>
>
