Note also that early versions of short-circuit reads were actually net-negative for performance. Only after a second Hadoop release of the feature did it become a net-positive change. See earlier threads on this mailing list where short-circuit reads are discussed.
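For concreteness, a minimal sketch of what "enabling short-circuit reads" means on Hadoop 2.x (the newer HDFS-347 implementation): two properties in hdfs-site.xml on both the DataNodes and the clients (Spark executors). The socket path shown is illustrative, not prescribed:

```xml
<!-- hdfs-site.xml: enable HDFS short-circuit local reads (sketch) -->
<property>
  <name>dfs.client.read.shortcircuit</name>
  <value>true</value>
</property>
<property>
  <name>dfs.domain.socket.path</name>
  <!-- illustrative path; must be creatable by the DataNode user -->
  <value>/var/lib/hadoop-hdfs/dn_socket</value>
</property>
```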
On Fri, Jan 9, 2015 at 3:57 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> I was looking for related information and found:
> http://spark-summit.org/wp-content/uploads/2013/10/Spark-Ops-Final.pptx
>
> See also http://hbase.apache.org/book.html#perf.hdfs.configs.localread
> for how short-circuit read is enabled.
>
> Cheers
>
> On Fri, Jan 9, 2015 at 3:50 PM, Sean Owen <so...@cloudera.com> wrote:
>> Spark uses MapReduce InputFormat implementations to read data from
>> disk, so in that sense it has access to, and uses, the same locality
>> info that things like MR do. Yes, tasks go to the data, and you want
>> to run Spark on top of the HDFS DataNodes. (Locality isn't always the
>> only priority that determines where tasks are scheduled, but it
>> certainly matters.) I'm not qualified enough to explain it in more
>> detail, compared to others here.
>>
>> On Fri, Jan 9, 2015 at 10:13 PM, zfry <z...@palantir.com> wrote:
>>> I am running Spark 1.1.1 built against CDH4 and have a few questions
>>> regarding Spark performance related to co-location with HDFS nodes.
>>>
>>> I want to know whether (and how efficiently) Spark takes advantage
>>> of being co-located with an HDFS node.
>>>
>>> What I mean by this is: if a file is being read by a Spark executor
>>> and that file (or most of its blocks) is located in an HDFS DataNode
>>> on the same machine as a Spark worker, will it read directly off of
>>> disk, or does that data have to travel through the network in some
>>> way? Is there a distinct advantage to putting HDFS and Spark on the
>>> same box if it is possible or, due to the way blocks are distributed
>>> about a cluster, are we so likely to be moving files over the network
>>> that co-location doesn't really make that much of a difference?
>>>
>>> Also, do you know of any papers/books/other resources (other than
>>> trying to dig through the Spark code) which do a good job of
>>> explaining the Spark/HDFS data workflow (i.e. how data moves from
>>> disk -> HDFS -> Spark -> HDFS)?
>>>
>>> Thanks!
>>> Zach
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Questions-about-Spark-and-HDFS-co-location-tp21070.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
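Sean's point that Spark picks up locality from the MapReduce InputFormats can be seen directly in the RDD API: each partition's preferred hosts are the DataNodes holding that input split's blocks. A minimal sketch (requires a running Spark + HDFS cluster; the input path is hypothetical):

```scala
// Sketch: print the data-local hosts Spark derives from HDFS block
// locations for each partition of an input file. Cluster and path
// below are hypothetical examples.
import org.apache.spark.{SparkConf, SparkContext}

object LocalityCheck {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("locality-check"))
    val rdd = sc.textFile("hdfs:///data/example.txt")
    // preferredLocations comes from the InputFormat's split metadata,
    // i.e. the DataNodes holding each block's replicas. The scheduler
    // uses these hints to place tasks on (or near) those nodes.
    rdd.partitions.foreach { p =>
      println(s"partition ${p.index} -> ${rdd.preferredLocations(p).mkString(", ")}")
    }
    sc.stop()
  }
}
```

If the listed hosts match your Spark worker nodes, the scheduler can run each task node-local and read blocks from local disk rather than over the network.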