[
https://issues.apache.org/jira/browse/HADOOP-3638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12630461#action_12630461
]
chris.douglas edited comment on HADOOP-3638 at 9/11/08 7:24 PM:
----------------------------------------------------------------
This looks good. A few suggestions:
* In IndexCache::getIndexInformation, the write to the timestamp isn't
guaranteed to be
[atomic|http://java.sun.com/docs/books/jls/third_edition/html/memory.html#17.7].
Moving it into the synchronized block above it, using an AtomicLong, or
making it volatile should all work (realistically, the race is hugely
unlikely to occur, to be significant, or to have a measurable performance
impact, but it's worth noting).
* Synchronizing on the intern'd value of a String isn't safe. String maintains
its pool of intern'd Strings until the JVM shuts down, so using it here will
cause each mapId to add an entry to this pool without ever removing it.
Further, depending on the semantics of intern() for thread safety seems risky.
It makes more sense to use a more conventional synchronization strategy.
* Why is DataOutputStream imported in IFileInputStream?
* The comparator sorting the LRU cache isn't consistent: if o1 is null and o2
is not, it reports both o1 < o2 and o2 < o1, violating the Comparator
requirement that sgn(compare(o1, o2)) == -sgn(compare(o2, o1)).
* Is it necessary to copy readLong and writeLong from java.io classes? Since
both are Filter\*Stream classes, isn't it sufficient to wrap each in the
appropriate Data\*Stream?
* readIndexFile doesn't need to set the timeStamp; it's set in
getIndexInformation after the call to readIndexFileToCache, which is more
appropriate. If the LRU cache were replaced with some other metric, it would
make sense to pass in an IndexInformation object from cache to readIndexFile,
and fill in the timestamp in a subclass associated with the cache.
IndexInformation::readIndexFile would fill in the index table and mapId, but
setting the timestamp field could be (and effectively is) left to the cache.
We're unlikely to add new caches, but this keeps the timestamp associated with
the LRU cache and not the index reader.
* In IndexCache::readIndexFileToCache, the "excess" memory is calculated behind
a block synchronized on the IndexCache instance, then that amount is freed in a
block synchronized on a member field. It's possible, even likely, that multiple
threads will free more memory than is necessary. For example, let the cache
have a large, cached index in memory M0 with the lowest timestamp. If T1..Tn
call readIndexFileToCache, say T1 is the first to grab the lock on cache in
IndexCache::freeIndexInformation. It's legal for T2..Tn to get blocked in
freeIndexInformation before T1 frees M0. T2..Tn will have calculated the number
of bytes to free assuming M0 was still present, which is overly aggressive.
> Cache the iFile index files in memory to reduce seeks during map output
> serving
> -------------------------------------------------------------------------------
>
> Key: HADOOP-3638
> URL: https://issues.apache.org/jira/browse/HADOOP-3638
> Project: Hadoop Core
> Issue Type: Improvement
> Components: mapred
> Affects Versions: 0.17.0
> Reporter: Devaraj Das
> Assignee: Jothi Padmanabhan
> Fix For: 0.19.0
>
> Attachments: hadoop-3638-v1.patch, hadoop-3638-v2.patch
>
>
> The iFile index files can be cached in memory to reduce seeks during map
> output serving.