[ https://issues.apache.org/jira/browse/HDFS-516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12738651#action_12738651 ]
Jay Booth commented on HDFS-516:
--------------------------------

Wow, thanks Raghu, that's awesome and will save me a ton of time. A couple of points for discussion:

* The random 4k byte grabber is awesome and I will be using it as part of my benchmarking at the first opportunity. However, I think it's worth also testing some likely applications to really show the strength of client-side caching. 10MB or so of properly warmed cache could mean your first 20 lookups in a binary search are almost free, and having the frontmost 10% of a Lucene index in cache would mean that almost all of the scoring portion of the search is computed against local memory. Meanwhile, for truly random reads, a cache that's, say, 5-10% of the size of the data will only buy a small improvement. So I'd like to get numbers for use cases that really thrive on caching, in addition to truly random access. That said, the random reader will be extremely useful for tuning the IO layer and establishing a baseline for cache-miss performance, so thanks for the heads up.

* I have a feeling that my implementation is significantly slower than the default when it comes to streaming, since it relies on successive small positioned reads and a heavy memory footprint rather than a simple stream of bytes. Watching my unit tests run on my laptop, with a ton of confounding factors, it sure seemed that way, although that's not a scientific measurement (one more item to benchmark). So while I agree with the urge for simplicity, I feel we need to make that performance tradeoff clear; otherwise we could end up with a lot of very slow MapReduce jobs. Given that MapReduce is the primary use case for Hadoop, my instinct was to make RadFileSystem a non-default implementation.
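The binary-search point above can be made concrete: the first few probes of a binary search over a sorted file always land on the same small, fixed set of offsets (the 1/2 mark, then the 1/4 or 3/4 mark, and so on), regardless of the key being looked up, so a warmed cache absorbs them. A hypothetical sketch (not code from radfs.patch) that counts the distinct record offsets touched by the first few probes:

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

public class ProbeOffsets {
    // Record the offsets touched by the first `depth` probes of a binary
    // search for `target` over `records` fixed-size sorted records.
    public static void recordProbes(long records, int depth, long target,
                                    Set<Long> touched) {
        long lo = 0, hi = records - 1;
        for (int i = 0; i < depth && lo <= hi; i++) {
            long mid = lo + (hi - lo) / 2;
            touched.add(mid);              // this record must be read
            if (mid < target)      lo = mid + 1;
            else if (mid > target) hi = mid - 1;
            else break;
        }
    }

    public static void main(String[] args) {
        long records = 1L << 30;           // ~1 billion records
        Random rnd = new Random(42);
        Set<Long> touched = new HashSet<>();
        for (int i = 0; i < 100_000; i++) {
            long target = Math.floorMod(rnd.nextLong(), records);
            recordProbes(records, 4, target, touched);
        }
        // The first 4 probe depths can touch at most 1 + 2 + 4 + 8 = 15
        // distinct offsets, no matter how many different keys are searched.
        System.out.println(touched.size());
    }
}
```

So a cache of 15 blocks covers the first 4 probes of every possible lookup, which is where the "almost free" early lookups come from.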
Point very well taken about the BlockLocations and CRC verification. Maybe the best way to handle future integration with DataNode is to develop separately, reuse as much code as possible, and then revisit a merge with DistributedFileSystem once RadFileSystem is mature and benchmarked? Thanks again. I'll try to write a post later tonight with an explicit benchmarking plan, and then people can comment and poke holes in it as they see fit.

> Low Latency distributed reads
> -----------------------------
>
>                 Key: HDFS-516
>                 URL: https://issues.apache.org/jira/browse/HDFS-516
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>            Reporter: Jay Booth
>            Priority: Minor
>         Attachments: radfs.patch
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> I created a method for low latency random reads using NIO on the server side
> and simulated OS paging with LRU caching and lookahead on the client side.
> Some applications could include Lucene searching (term->doc and doc->offset
> mappings are likely to be in local cache, thus much faster than Nutch's
> current FsDirectory impl) and binary search through record files (bytes at
> 1/2, 1/4, 1/8 marks are likely to be cached).
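The client-side design the issue describes (simulated OS paging with LRU caching and lookahead) could be sketched roughly as below. This is an illustrative sketch only; the class and method names (BlockCache, fetchBlock) are hypothetical and not taken from radfs.patch:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Rough sketch of the client-side idea: a fixed-size LRU cache of file
// blocks with one-block lookahead on each miss. Names are illustrative,
// not from radfs.patch.
public class BlockCache {
    public interface BlockFetcher {
        byte[] fetchBlock(long blockIndex);   // e.g. a remote positioned read
    }

    private final int blockSize;
    private final BlockFetcher fetcher;
    private final LinkedHashMap<Long, byte[]> cache;
    int misses = 0;                           // demand misses, for testing

    public BlockCache(int blockSize, final int maxBlocks, BlockFetcher fetcher) {
        this.blockSize = blockSize;
        this.fetcher = fetcher;
        // accessOrder=true makes the map track recency; removeEldestEntry
        // evicts the least-recently-used block once capacity is exceeded.
        this.cache = new LinkedHashMap<Long, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, byte[]> eldest) {
                return size() > maxBlocks;
            }
        };
    }

    // Read the byte at `pos`, going to the fetcher only on a miss and
    // prefetching the next block so near-sequential access stays cheap.
    public synchronized int readByte(long pos) {
        long block = pos / blockSize;
        byte[] data = cache.get(block);
        if (data == null) {
            misses++;
            data = fetcher.fetchBlock(block);
            cache.put(block, data);
            if (!cache.containsKey(block + 1)) {          // lookahead
                cache.put(block + 1, fetcher.fetchBlock(block + 1));
            }
        }
        return data[(int) (pos % blockSize)] & 0xff;
    }
}
```

An access-ordered LinkedHashMap with removeEldestEntry gives LRU eviction for free, which keeps the sketch small; the real patch presumably layers this over positioned reads against the NIO server side.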