Ethan Yi created MAHOUT-1700:
--------------------------------
Summary: OutOfMemory Problem in ABtDenseOutJob in Distributed SSVD
Key: MAHOUT-1700
URL: https://issues.apache.org/jira/browse/MAHOUT-1700
Project: Mahout
Issue Type: Bug
Components: Math
Affects Versions: 0.10.0, 0.9
Reporter: Ethan Yi
Fix For: 0.10.1
Recently, I tried Mahout's Hadoop SSVD job (mahout-0.9 or mahout-1.0). It fails with a
Java heap space OutOfMemoryError in ABtDenseOutJob. I tracked down the cause;
the ABtDenseOutJob map code is as below:
protected void map(Writable key, VectorWritable value, Context context)
    throws IOException, InterruptedException {
  Vector vec = value.get();
  int vecSize = vec.size();
  // aCols caches the columns of A seen so far, one Vector per column index.
  if (aCols == null) {
    aCols = new Vector[vecSize];
  } else if (aCols.length < vecSize) {
    aCols = Arrays.copyOf(aCols, vecSize);
  }
  if (vec.isDense()) {
    for (int i = 0; i < vecSize; i++) {
      extendAColIfNeeded(i, aRowCount + 1);
      aCols[i].setQuick(aRowCount, vec.getQuick(i));
    }
  } else if (vec.size() > 0) {
    for (Vector.Element vecEl : vec.nonZeroes()) {
      int i = vecEl.index();
      extendAColIfNeeded(i, aRowCount + 1);
      aCols[i].setQuick(aRowCount, vecEl.get());
    }
  }
  aRowCount++;
}
If the input is a RandomAccessSparseVector, as is usual with big data, vec.size()
returns the vector's nominal cardinality, typically Integer.MAX_VALUE (2^31 - 1),
not its non-zero count. Then aCols = new Vector[vecSize] tries to allocate an
array with 2^31 slots, which introduces the OutOfMemory problem.
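To see the failure mode in isolation, here is a minimal standalone sketch against
mahout-math (the class name is mine, and the exact OutOfMemoryError message
varies by JVM):

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

public class AColsOomDemo {
  public static void main(String[] args) {
    // Sparse rows are typically created with a huge nominal cardinality...
    Vector vec = new RandomAccessSparseVector(Integer.MAX_VALUE);
    vec.setQuick(42, 1.0);

    // ...and size() returns that cardinality, not the non-zero count.
    int vecSize = vec.size(); // 2^31 - 1
    System.out.println("vec.size() = " + vecSize);

    // This mirrors the allocation in ABtDenseOutJob: 2^31 references need
    // roughly 8-16 GB, so the JVM throws java.lang.OutOfMemoryError
    // ("Java heap space" or "Requested array size exceeds VM limit").
    Vector[] aCols = new Vector[vecSize];
    System.out.println("allocated " + aCols.length + " slots"); // never reached
  }
}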
The obvious remedy is to enlarge every TaskTracker's maximum child-JVM heap:
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1024m</value>
</property>
However, if you are NOT the Hadoop administrator or in ops, you have no permission
to modify that config. So I modified the ABtDenseOutJob map code to support the
RandomAccessSparseVector situation: I use a HashMap in place of the original
Vector[] aCols array, so a column is only allocated once its index actually
appears in the input. The modified code is as below:
private Map<Integer, Vector> aColsMap = new HashMap<Integer, Vector>();

protected void map(Writable key, VectorWritable value, Context context)
    throws IOException, InterruptedException {
  Vector vec = value.get();
  int vecSize = vec.size();
  if (vec.isDense()) {
    for (int i = 0; i < vecSize; i++) {
      //extendAColIfNeeded(i, aRowCount + 1);
      // Allocate a column lazily, only when its index is first seen.
      if (aColsMap.get(i) == null) {
        aColsMap.put(i, new RandomAccessSparseVector(Integer.MAX_VALUE, 100));
      }
      aColsMap.get(i).setQuick(aRowCount, vec.getQuick(i));
      //aCols[i].setQuick(aRowCount, vec.getQuick(i));
    }
  } else if (vec.size() > 0) {
    for (Vector.Element vecEl : vec.nonZeroes()) {
      int i = vecEl.index();
      //extendAColIfNeeded(i, aRowCount + 1);
      if (aColsMap.get(i) == null) {
        aColsMap.put(i, new RandomAccessSparseVector(Integer.MAX_VALUE, 100));
      }
      aColsMap.get(i).setQuick(aRowCount, vecEl.get());
      //aCols[i].setQuick(aRowCount, vecEl.get());
    }
  }
  aRowCount++;
}
With that change, the OutOfMemory problem no longer occurs.
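For what it's worth, here is a minimal standalone sketch of why the map-based
representation stays small (the class name and the 10,000-column figure are
assumptions for illustration): memory now grows with the number of distinct
column indices actually seen, not with the declared cardinality.

import java.util.HashMap;
import java.util.Map;

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

public class AColsFootprintSketch {
  public static void main(String[] args) {
    // Pretend the mapper only ever sees 10,000 distinct column indices,
    // scattered anywhere in [0, Integer.MAX_VALUE).
    int distinctColumns = 10000;

    Map<Integer, Vector> aColsMap = new HashMap<Integer, Vector>();
    for (int i = 0; i < distinctColumns; i++) {
      int col = i * 1000; // widely scattered indices cost nothing extra here
      Vector colVec = aColsMap.get(col);
      if (colVec == null) {
        colVec = new RandomAccessSparseVector(Integer.MAX_VALUE, 100);
        aColsMap.put(col, colVec);
      }
      colVec.setQuick(0, 1.0);
    }

    // 10,000 map entries versus a 2^31-element Vector[]: the map's size
    // tracks the data actually seen, not the nominal vector cardinality.
    System.out.println("columns materialized: " + aColsMap.size());
  }
}

One small note on the patch itself: it looks up the map twice per element
(get, then put, then get again); caching the looked-up Vector in a local
variable, as in the sketch above, avoids the extra lookup.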
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)