Note also that early versions of short-circuit reads were actually a
net negative for performance.  Only with the second Hadoop release of the
feature did it become a net positive.  See the earlier threads on this
mailing list where short-circuit reads are discussed.
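
For reference, here is a rough sketch of the client-side settings involved,
assuming the newer domain-socket implementation of short-circuit reads. The
two property names are the ones the Hadoop short-circuit documentation uses
as far as I know; the socket path and the build_hdfs_site helper are
placeholders made up for illustration, so check both against your own
distribution before relying on them.

```python
import xml.etree.ElementTree as ET

# Assumed settings for illustration only. The socket path is a placeholder
# and must match whatever the DataNodes on your cluster are configured with.
SHORT_CIRCUIT_PROPS = {
    "dfs.client.read.shortcircuit": "true",
    "dfs.domain.socket.path": "/var/lib/hadoop-hdfs/dn_socket",
}


def build_hdfs_site(props):
    """Render a dict of properties as an hdfs-site.xml style fragment."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


xml_text = build_hdfs_site(SHORT_CIRCUIT_PROPS)
print(xml_text)
```

The printed fragment would go into hdfs-site.xml on both the DataNodes and
the client side (here, wherever the Spark executors pick up their Hadoop
configuration).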

On Fri, Jan 9, 2015 at 3:57 PM, Ted Yu <yuzhih...@gmail.com> wrote:

> I was looking for related information and found:
> http://spark-summit.org/wp-content/uploads/2013/10/Spark-Ops-Final.pptx
>
> See also http://hbase.apache.org/book.html#perf.hdfs.configs.localread
> for how short circuit read is enabled.
>
> Cheers
>
> On Fri, Jan 9, 2015 at 3:50 PM, Sean Owen <so...@cloudera.com> wrote:
>
>> Spark uses MapReduce InputFormat implementations to read data from
>> disk, so in that sense it has access to, and uses, the same locality
>> info that things like MR do. Yes, tasks go to the data, and you want
>> to run Spark on top of the HDFS DataNodes. (Locality isn't always the
>> only priority that determines where tasks are scheduled, but it
>> certainly matters.) I'm not qualified enough to explain it in more
>> detail, compared to others here.
>>
>> On Fri, Jan 9, 2015 at 10:13 PM, zfry <z...@palantir.com> wrote:
>> > I am running Spark 1.1.1 built against CDH4 and have a few questions
>> > regarding Spark performance related to co-location with HDFS nodes.
>> >
>> > I want to know whether (and how efficiently) Spark takes advantage of
>> > being co-located with an HDFS node.
>> >
>> > What I mean by this is: if a file is being read by a Spark executor and
>> > that file (or most of its blocks) is located in an HDFS DataNode on the
>> > same machine as a Spark worker, will it read directly off of disk, or
>> > does that data have to travel through the network in some way? Is there
>> > a distinct advantage to putting HDFS and Spark on the same box if it is
>> > possible or, due to the way blocks are distributed about a cluster, are
>> > we so likely to be moving files over the network that co-location
>> > doesn’t really make that much of a difference?
>> >
>> > Also, do you know of any papers/books/other resources (other than
>> > trying to dig through the Spark code) which do a good job of explaining
>> > the Spark/HDFS data workflow (i.e., how data moves from disk -> HDFS ->
>> > Spark -> HDFS)?
>> >
>> > Thanks!
>> > Zach
>> >
>> >
>> >
>> >
>> > --
>> > View this message in context:
>> > http://apache-spark-user-list.1001560.n3.nabble.com/Questions-about-Spark-and-HDFS-co-location-tp21070.html
>> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> > For additional commands, e-mail: user-h...@spark.apache.org
>> >
>>
>
