[
https://issues.apache.org/jira/browse/MAHOUT-633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011879#comment-13011879
]
Dmitriy Lyubimov edited comment on MAHOUT-633 at 3/28/11 12:42 AM:
-------------------------------------------------------------------
Seems good to me.
I am not stuck on particular style issues; I just used Eclipse's Ctrl-Shift-F
autoformat, so it's fine, it's just hard for me to tell 'good' from 'bad'
sometimes. For instance, I don't use a compound block { .. } when there's only one
statement in it; it's just something I do. I thought that was also Sun's
convention, but I don't have a strong opinion.
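Just to show what I mean, a trivial example (not from the patch):
{code}
// single-statement body without a compound block -- what I tend to write
if (closed)
  throw new IllegalStateException("iterator already closed");

// same thing with a compound block -- what the autoformatter tends to produce
if (closed) {
  throw new IllegalStateException("iterator already closed");
}
{code}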
I see that you don't support glob expressions for multiple files yet? Did you
decide not to support multiple files, is it still TBD, or am I missing some extra
helper functionality here?
I see that you applied that to loading the (R_1, R_2...R_z). The way it is done
here seems technically correct to me. The irony, though, is that it couldn't be
parameterized with a glob expression, because the order of the side R files
matters. So if you ever switch to multiple files via a Hadoop glob, that piece
would not be covered. Perhaps there is room for a contract that supplies both a
glob and a comparator for file selection, so that the files are opened in a
predefined order (like in BtJob).
Perhaps a constructor version
{code}
SequenceFileIterator(String (or Path) glob, Comparator<String or Path> openSequenceComparator)
{code}
would solve that in one bite.
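Roughly, the glob-plus-comparator piece such a constructor would need might look
like this (a sketch only, with hypothetical names; not part of the patch):
{code}
import java.io.IOException;
import java.util.Arrays;
import java.util.Comparator;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Sketch: expand a glob and fix the order in which the matched files are opened. */
public final class GlobOrderSketch {

  private GlobOrderSketch() {
  }

  public static Path[] sortedGlob(FileSystem fs, Path glob, Comparator<Path> openSequenceComparator)
      throws IOException {
    FileStatus[] statuses = fs.globStatus(glob);   // expand the glob
    Path[] paths = new Path[statuses.length];
    for (int i = 0; i < statuses.length; i++) {
      paths[i] = statuses[i].getPath();
    }
    Arrays.sort(paths, openSequenceComparator);    // caller-defined open order
    return paths;                                  // open the sequence files in this order
  }
}
{code}
The iterator could then just walk that array and open one reader at a time.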
There's a helper method in SSVDSolver that does the same thing (loads matrices
into memory). That probably gets extensive use -- mostly in tests, I think, but it
might be used elsewhere too. So it might be refactored to use this as well,
but I can attend to that myself later.
I am kind of inclined to frown on not reusing Writables. That's what Hadoop
does, and there's a rather good reason for it: GC thrashing. The effects on
massive datasets are felt especially if the job is CPU-bound (which is what
SSVD is). I actually see up to 20% improvement in running times when I equip
VectorWritable with the ability to not create a new vector on every iteration --
just because of GC thrashing. The SSVD solver relies on that to load the side R
sequences -- currently all at once into RAM, which is currently the bottleneck
for RAM scalability. That can thus be much more data than a Hadoop split, so it
had better be done as efficiently as possible. GC thrashing is probably going to
be especially bad in this case, since we are loading a lot of stuff into memory
that all has the same (tenured) lifespan, and GC overhead in the young generation
gets especially bad as we approach the RAM limit for the R data.
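For reference, the reuse pattern I mean is the usual one for the old
SequenceFile.Reader API -- allocate the key/value Writables once and let next()
refill them. A sketch with illustrative names (not code from the patch); note
that stock VectorWritable may still allocate a fresh Vector inside readFields,
which is exactly the part I had to change:
{code}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.math.VectorWritable;

public final class ReusingReadSketch {

  private ReusingReadSketch() {
  }

  public static void readAll(Configuration conf, Path file) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, file, conf);
    try {
      IntWritable key = new IntWritable();          // allocated once
      VectorWritable value = new VectorWritable();  // allocated once, refilled by next()
      while (reader.next(key, value)) {
        // use value.get() here; don't keep a reference past this iteration,
        // because the same Writable instance is overwritten on the next call
      }
    } finally {
      reader.close();
    }
  }
}
{code}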
The problem usually is that it is o.k. to approach RAM limits when you don't need
to relinquish objects. But if there is a lot of (small, dead) GC material floating
around, full GCs, including memory defragmentation, occur more often and take more
time, to the degree that you eventually spend more time doing GC than anything
else on 64-bit systems with large heaps (>2G -Xmx). This can go on for quite a
long time before you finally converge on an OOM; and if your RAM use prediction is
tuned accurately, you never hit OOM at all, it's just painfully slow in Java (I
have a lot of experience seeing that). The only remedy I know of is to preallocate
the near-limit RAM you need in advance, in big chunks, and not to create any (if
possible at all) 'dead small' GC garbage per iteration (IMO). That said, this
comes purely from observing the effects; I am not actually very strong on the
inner workings of the GC.
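A toy contrast of the two allocation patterns (purely illustrative, not code from
the patch):
{code}
public final class PreallocationSketch {

  private PreallocationSketch() {
  }

  /** New short-lived array every pass: lots of 'dead small' garbage for the GC. */
  public static double sumWithGarbage(int iterations, int width) {
    double total = 0;
    for (int i = 0; i < iterations; i++) {
      double[] row = new double[width];   // becomes garbage right after this pass
      fill(row, i);
      total += row[0];
    }
    return total;
  }

  /** One big allocation up front, reused for every pass: nothing for the GC to chase. */
  public static double sumPreallocated(int iterations, int width) {
    double[] row = new double[width];     // allocated once, near the known limit
    double total = 0;
    for (int i = 0; i < iterations; i++) {
      fill(row, i);
      total += row[0];
    }
    return total;
  }

  private static void fill(double[] row, int seed) {
    for (int j = 0; j < row.length; j++) {
      row[j] = seed + j;
    }
  }
}
{code}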
> Add SequenceFileIterable; put Iterable stuff in one place
> ---------------------------------------------------------
>
> Key: MAHOUT-633
> URL: https://issues.apache.org/jira/browse/MAHOUT-633
> Project: Mahout
> Issue Type: Improvement
> Components: Classification, Clustering, Collaborative Filtering
> Affects Versions: 0.4
> Reporter: Sean Owen
> Assignee: Sean Owen
> Priority: Minor
> Labels: iterable, iterator, sequence-file
> Fix For: 0.5
>
> Attachments: MAHOUT-633.patch, MAHOUT-633.patch
>
>
> In another project I have a useful little class, SequenceFileIterable, which
> simplifies iterating over a sequence file. It's like FileLineIterable. I'd
> like to add it, then use it throughout the code. See patch, which for now
> merely has the proposed new classes.
> Well it also moves some other iterator-related classes that seemed to be
> outside their rightful home in common.iterator.
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira