[ https://issues.apache.org/jira/browse/HDFS-516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12738651#action_12738651 ]

Jay Booth commented on HDFS-516:
--------------------------------

Wow, thanks Raghu, that's awesome and will save me a ton of time.  A couple 
points for discussion:

* The random 4k byte grabber is awesome, and I will be using it as part of my 
benchmarking at the first opportunity.  However, I think it's also worth testing 
some likely applications to really show the strength of client-side caching.  
10MB or so of properly warmed cache could mean your first 20 lookups in a 
binary search are almost free, and having the frontmost 10% of a Lucene index 
in cache would mean that almost all of the scoring portion of a search is 
computed against local memory.  Meanwhile, for truly random reads, a cache 
that's, say, 5-10% of the size of the data will only get you a small 
improvement.  So I'd like to get some numbers for use cases that really thrive 
on caching, in addition to truly random access.  Either way, the random-read 
test will be extremely useful for tuning the IO layer and establishing a 
baseline for cache-miss performance, so thanks for the heads up.

* I have a feeling that my implementation is significantly slower than the 
default when it comes to streaming, since it relies on successive small 
positioned reads and a heavy memory footprint rather than a simple stream of 
bytes.  Watching my unit tests run on my laptop, with a ton of confounding 
factors, it sure seemed that way, although that's not a scientific measurement 
(one more item to benchmark).  So while I agree with the urge for simplicity, I 
feel like we need to make that performance tradeoff clear; otherwise, we could 
end up with a lot of very slow MapReduce jobs.  Given that MapReduce is the 
primary use case for Hadoop, my instinct was to make RadFileSystem a 
non-default implementation.  Point very well taken about the BlockLocations and 
CRC verification; maybe the best way to handle future integration with the 
DataNode would be to develop separately, reuse as much code as possible, and 
then revisit a merge with DistributedFileSystem once RadFileSystem is mature 
and benchmarked?
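
To make the first point concrete, here's a rough back-of-the-envelope sketch (plain Python, nothing to do with the radfs code itself; the 16-byte record width and block counts are my own assumptions): it counts the distinct 4KB blocks touched by the first probes of many random binary searches over a large sorted record file. Since the first k probes of any search can only land on 2^k - 1 distinct midpoints, a modest warmed cache covers them no matter how many searches you run.

```python
import random

BLOCK = 4096   # cache-page size, matching the 4k read granularity
RECORD = 16    # hypothetical fixed record width (an assumption, not from radfs)

def probe_blocks(n_records, target):
    """Return the 4KB block index touched by each probe of one binary search."""
    blocks = []
    lo, hi = 0, n_records - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        blocks.append(mid * RECORD // BLOCK)  # block holding this record
        if mid < target:
            lo = mid + 1
        elif mid > target:
            hi = mid - 1
        else:
            break
    return blocks

def hot_blocks(n_records, searches, depth, seed=42):
    """Distinct blocks touched by the first `depth` probes of many searches."""
    rng = random.Random(seed)
    hot = set()
    for _ in range(searches):
        bs = probe_blocks(n_records, rng.randrange(n_records))
        hot.update(bs[:depth])
    return hot

if __name__ == "__main__":
    n = 1 << 30  # ~1 billion 16-byte records, ~16 GB of data
    hot = hot_blocks(n, searches=1000, depth=11)
    # The first 11 probes can touch at most 2**11 - 1 distinct midpoints,
    # so at most ~2047 blocks (~8 MB of cache), regardless of search count.
    print(len(hot), "hot blocks,", len(hot) * BLOCK // 1024, "KB")
```

The early probes concentrate on the same few blocks while only the deep probes scatter, which is the "frontmost slice of the file stays warm" effect described above.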

Thanks again.  I'll try to write a post later tonight with an explicit plan for 
benchmarking, and then people can comment and poke holes in it as they see fit.

> Low Latency distributed reads
> -----------------------------
>
>                 Key: HDFS-516
>                 URL: https://issues.apache.org/jira/browse/HDFS-516
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>            Reporter: Jay Booth
>            Priority: Minor
>         Attachments: radfs.patch
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> I created a method for low latency random reads using NIO on the server side 
> and simulated OS paging with LRU caching and lookahead on the client side.  
> Some applications could include Lucene searching (term->doc and doc->offset 
> mappings are likely to be in local cache, thus much faster than Nutch's 
> current FsDirectory impl) and binary search through record files (bytes at 
> the 1/2, 1/4, 1/8 marks are likely to be cached).
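
A minimal sketch of the client-side behavior the quoted description names (an LRU block cache with one-block lookahead over positioned reads) might look like this; it reads a local file rather than a DataNode, and every name and parameter here is hypothetical, not taken from radfs.patch:

```python
import collections
import os

BLOCK = 4096

class LruBlockCache:
    """Toy client-side cache: LRU over fixed-size 4KB blocks, with one-block
    lookahead so sequential-ish access patterns warm the next page."""

    def __init__(self, path, capacity_blocks=2560):  # ~10 MB at 4 KB/block
        self.fd = os.open(path, os.O_RDONLY)
        self.capacity = capacity_blocks
        self.cache = collections.OrderedDict()       # block index -> bytes

    def _load(self, block_idx):
        if block_idx in self.cache:
            self.cache.move_to_end(block_idx)        # mark most recently used
            return self.cache[block_idx]
        data = os.pread(self.fd, BLOCK, block_idx * BLOCK)  # positioned read
        self.cache[block_idx] = data
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)           # evict least recently used
        return data

    def read(self, offset, length):
        """Read `length` bytes at `offset`, going through the block cache."""
        out = bytearray()
        first, last = offset // BLOCK, (offset + length - 1) // BLOCK
        for b in range(first, last + 1):
            out += self._load(b)
        self._load(last + 1)  # lookahead: prefetch the next block eagerly
        start = offset - first * BLOCK
        return bytes(out[start:start + length])

    def close(self):
        os.close(self.fd)
```

Repeated reads against warm offsets (binary-search midpoints, the front of an index) never leave local memory, while truly random reads fall through to a positioned read each time, which is the tradeoff discussed in the comments above.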

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.