[
https://issues.apache.org/jira/browse/HADOOP-3638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12630461#action_12630461
]
chris.douglas edited comment on HADOOP-3638 at 9/11/08 7:24 PM:
----------------------------------------------------------------
This looks good. A few suggestions:
* In IndexCache::getIndexInformation, the write to the timestamp isn't
guaranteed to be
[atomic|http://java.sun.com/docs/books/jls/third_edition/html/memory.html#17.7].
Moving it into the synchronized block above it, using an AtomicLong, or
making it volatile should all work (realistically, the race is hugely
unlikely to occur, to be significant, or to have a measurable performance
impact, but it's worth noting).
* Synchronizing on the intern'd value of a String isn't safe. String maintains
its pool of intern'd Strings until the JVM shuts down, so using it here will
cause each mapId to add an entry to this pool without ever removing it.
Further, depending on the semantics of intern() for thread safety seems risky.
It makes more sense to use a more conventional synchronization strategy.
* Why is DataOutputStream imported in IFileInputStream?
* The comparator sorting the LRU cache isn't consistent: if o1 is null and o2
is not, it reports both o1 < o2 and o2 < o1, violating the Comparator
requirement that sgn(compare(o1, o2)) == -sgn(compare(o2, o1)).
* Is it necessary to copy readLong and writeLong from java.io classes? Since
both are Filter\*Stream classes, isn't it sufficient to wrap each in the
appropriate Data\*Stream?
* readIndexFile doesn't need to set the timeStamp; it's set in
getIndexInformation after the call to readIndexFileToCache, which is more
appropriate. If the LRU cache were replaced with some other metric, it would
make sense to pass in an IndexInformation object from cache to readIndexFile,
and fill in the timestamp in a subclass associated with the cache.
IndexInformation::readIndexFile would fill in the index table and mapId, but
setting the timestamp field could be (and effectively is) left to the cache.
We're unlikely to add new caches, but this keeps the timestamp associated with
the LRU cache and not the index reader.
* In IndexCache::readIndexFileToCache, the "excess" memory is calculated behind
a block synchronized on the IndexCache instance, then that amount is freed in a
block synchronized on a member field. It's possible, even likely, that multiple
threads will free more memory than is necessary. For example, let the cache
have a large, cached index in memory M0 with the lowest timestamp. If T1..Tn
call readIndexFileToCache, say T1 is the first to grab the lock on cache in
IndexCache::freeIndexInformation. It's legal for T2..Tn to get blocked in
freeIndexInformation before T1 frees M0. T2..Tn will have calculated the number
of bytes to free assuming M0 was still present, which is overly aggressive.
> Cache the iFile index files in memory to reduce seeks during map output
> serving
> -------------------------------------------------------------------------------
>
> Key: HADOOP-3638
> URL: https://issues.apache.org/jira/browse/HADOOP-3638
> Project: Hadoop Core
> Issue Type: Improvement
> Components: mapred
> Affects Versions: 0.17.0
> Reporter: Devaraj Das
> Assignee: Jothi Padmanabhan
> Fix For: 0.19.0
>
> Attachments: hadoop-3638-v1.patch, hadoop-3638-v2.patch
>
>
> The iFile index files can be cached in memory to reduce seeks during map
> output serving.