[
https://issues.apache.org/jira/browse/MAHOUT-633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011879#comment-13011879
]
Dmitriy Lyubimov edited comment on MAHOUT-633 at 3/28/11 12:42 AM:
-------------------------------------------------------------------
Seems good to me.
I am not stuck on particular style issues; I just used Eclipse's Ctrl-Shift-F
autoformat, so it's fine, it's just hard for me to tell 'good' from 'bad'
sometimes. For instance, I don't use a compound block { .. } when there's only one
statement in it; it's just something I do. I thought that was also Sun's
convention, but I don't have a strong opinion.
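Just to show what I mean, a trivial example (not from the patch):
{code}
// single-statement body without a compound block -- what I tend to write
if (closed)
  throw new IllegalStateException("iterator already closed");

// same thing with a compound block -- what the autoformatter tends to produce
if (closed) {
  throw new IllegalStateException("iterator already closed");
}
{code}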
I see that you don't support glob expressions for multiple files yet? Did you
decide not to support multiple files, is it still TBD, or am I missing some extra
helper functionality here?
I see that you applied that to loading the (R_1, R_2...R_z). The way it is done
here seems technically correct to me. The irony, though, is that it couldn't be
parameterized with a glob expression, because the order of the side R files
matters. So if you ever switch to multiple files via a Hadoop glob, that piece
would not be covered. Perhaps there is room for a contract that supplies both a
glob and a comparator for file selection, so that the files are opened in a
predefined order (like in BtJob).
Perhaps a constructor version
{code}
SequenceFileIterator(String (or Path) glob, Comparator<String or Path> openSequenceComparator)
{code}
would solve that in one bite.
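Roughly, the glob-plus-comparator piece such a constructor would need might look
like this (a sketch only, with hypothetical names; not part of the patch):
{code}
import java.io.IOException;
import java.util.Arrays;
import java.util.Comparator;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Sketch: expand a glob and fix the order in which the matched files are opened. */
public final class GlobOrderSketch {

  private GlobOrderSketch() {
  }

  public static Path[] sortedGlob(FileSystem fs, Path glob, Comparator<Path> openSequenceComparator)
      throws IOException {
    FileStatus[] statuses = fs.globStatus(glob);   // expand the glob
    Path[] paths = new Path[statuses.length];
    for (int i = 0; i < statuses.length; i++) {
      paths[i] = statuses[i].getPath();
    }
    Arrays.sort(paths, openSequenceComparator);    // caller-defined open order
    return paths;                                  // open the sequence files in this order
  }
}
{code}
The iterator could then just walk that array and open one reader at a time.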
There's a helper method in SSVDSolver that does the same thing (loads matrices
into memory). That probably gets extensive use -- mostly in tests, I think, but it
might be used elsewhere too. So it might be refactored to use this as well,
but I can attend to that myself later.
I am kind of inclined to frown on not reusing Writables. That's what Hadoop
does, and there's a rather good reason for it: GC thrashing. The effects on
massive datasets are felt especially if the job is CPU-bound (which is what
SSVD is). I actually see up to 20% improvement in running times when I equip
VectorWritable with the ability to not create a new vector on every iteration --
just because of GC thrashing. The SSVD solver relies on that to load the side R
sequences -- currently all at once into RAM, which is currently the bottleneck
for RAM scalability. That can thus be much more data than a Hadoop split, so it
had better be done as efficiently as possible. GC thrashing is probably going to
be especially bad in this case, since we are loading a lot of stuff into memory
that all has the same (tenured) lifespan, and GC overhead in the young generation
gets especially bad as we approach the RAM limit for the R data.
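For reference, the reuse pattern I mean is the usual one for the old
SequenceFile.Reader API -- allocate the key/value Writables once and let next()
refill them. A sketch with illustrative names (not code from the patch); note
that stock VectorWritable may still allocate a fresh Vector inside readFields,
which is exactly the part I had to change:
{code}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.math.VectorWritable;

public final class ReusingReadSketch {

  private ReusingReadSketch() {
  }

  public static void readAll(Configuration conf, Path file) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, file, conf);
    try {
      IntWritable key = new IntWritable();          // allocated once
      VectorWritable value = new VectorWritable();  // allocated once, refilled by next()
      while (reader.next(key, value)) {
        // use value.get() here; don't keep a reference past this iteration,
        // because the same Writable instance is overwritten on the next call
      }
    } finally {
      reader.close();
    }
  }
}
{code}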
The problem usually is that it is o.k. to approach RAM limits when you don't need
to relinquish objects. But if there is a lot of (small, dead) GC material floating
around, full GCs, including memory defragmentation, occur more often and take more
time, to the degree that you eventually spend more time doing GC than anything
else on 64-bit systems with large heaps (>2G -Xmx). This can go on for quite a
long time before you finally converge on an OOM; and if your RAM use prediction is
tuned accurately, you never hit OOM at all, it's just painfully slow in Java (I
have a lot of experience seeing that). The only remedy I know of is to preallocate
the near-limit RAM you need in advance, in big chunks, and not to create any (if
possible at all) 'dead small' GC garbage per iteration (IMO). That said, this
comes purely from observing the effects; I am not actually very strong on the
inner workings of the GC.
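A toy contrast of the two allocation patterns (purely illustrative, not code from
the patch):
{code}
public final class PreallocationSketch {

  private PreallocationSketch() {
  }

  /** New short-lived array every pass: lots of 'dead small' garbage for the GC. */
  public static double sumWithGarbage(int iterations, int width) {
    double total = 0;
    for (int i = 0; i < iterations; i++) {
      double[] row = new double[width];   // becomes garbage right after this pass
      fill(row, i);
      total += row[0];
    }
    return total;
  }

  /** One big allocation up front, reused for every pass: nothing for the GC to chase. */
  public static double sumPreallocated(int iterations, int width) {
    double[] row = new double[width];     // allocated once, near the known limit
    double total = 0;
    for (int i = 0; i < iterations; i++) {
      fill(row, i);
      total += row[0];
    }
    return total;
  }

  private static void fill(double[] row, int seed) {
    for (int j = 0; j < row.length; j++) {
      row[j] = seed + j;
    }
  }
}
{code}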
> Add SequenceFileIterable; put Iterable stuff in one place
> ---------------------------------------------------------
>
> Key: MAHOUT-633
> URL: https://issues.apache.org/jira/browse/MAHOUT-633
> Project: Mahout
> Issue Type: Improvement
> Components: Classification, Clustering, Collaborative Filtering
> Affects Versions: 0.4
> Reporter: Sean Owen
> Assignee: Sean Owen
> Priority: Minor
> Labels: iterable, iterator, sequence-file
> Fix For: 0.5
>
> Attachments: MAHOUT-633.patch, MAHOUT-633.patch
>
>
> In another project I have a useful little class, SequenceFileIterable, which
> simplifies iterating over a sequence file. It's like FileLineIterable. I'd
> like to add it, then use it throughout the code. See patch, which for now
> merely has the proposed new classes.
> Well it also moves some other iterator-related classes that seemed to be
> outside their rightful home in common.iterator.
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira