Ethan Yi created MAHOUT-1700:
--------------------------------
Summary: OutOfMemory Problem in ABtDenseOutJob in Distributed SSVD
Key: MAHOUT-1700
URL: https://issues.apache.org/jira/browse/MAHOUT-1700
Project: Mahout
Issue Type: Bug
Components: Math
Affects Versions: 0.10.0, 0.9
Reporter: Ethan Yi
Fix For: 0.10.1
Recently, I tried Mahout's Hadoop SSVD job (mahout-0.9 or mahout-1.0). It fails with a
Java heap space OutOfMemoryError in ABtDenseOutJob. I tracked down the cause;
the ABtDenseOutJob map code is as below:
protected void map(Writable key, VectorWritable value, Context context)
    throws IOException, InterruptedException {
  Vector vec = value.get();
  int vecSize = vec.size();
  // aCols caches the columns of A seen so far, one Vector per column index.
  if (aCols == null) {
    aCols = new Vector[vecSize];
  } else if (aCols.length < vecSize) {
    aCols = Arrays.copyOf(aCols, vecSize);
  }
  if (vec.isDense()) {
    for (int i = 0; i < vecSize; i++) {
      extendAColIfNeeded(i, aRowCount + 1);
      aCols[i].setQuick(aRowCount, vec.getQuick(i));
    }
  } else if (vec.size() > 0) {
    for (Vector.Element vecEl : vec.nonZeroes()) {
      int i = vecEl.index();
      extendAColIfNeeded(i, aRowCount + 1);
      aCols[i].setQuick(aRowCount, vecEl.get());
    }
  }
  aRowCount++;
}
If the input is a RandomAccessSparseVector, as is usual with big data, vec.size()
returns the vector's nominal cardinality, typically Integer.MAX_VALUE (2^31 - 1),
not its non-zero count. Then aCols = new Vector[vecSize] tries to allocate an
array with 2^31 slots, which introduces the OutOfMemory problem.
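To see the failure mode in isolation, here is a minimal standalone sketch against
mahout-math (the class name is mine, and the exact OutOfMemoryError message
varies by JVM):

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

public class AColsOomDemo {
  public static void main(String[] args) {
    // Sparse rows are typically created with a huge nominal cardinality...
    Vector vec = new RandomAccessSparseVector(Integer.MAX_VALUE);
    vec.setQuick(42, 1.0);

    // ...and size() returns that cardinality, not the non-zero count.
    int vecSize = vec.size(); // 2^31 - 1
    System.out.println("vec.size() = " + vecSize);

    // This mirrors the allocation in ABtDenseOutJob: 2^31 references need
    // roughly 8-16 GB, so the JVM throws java.lang.OutOfMemoryError
    // ("Java heap space" or "Requested array size exceeds VM limit").
    Vector[] aCols = new Vector[vecSize];
    System.out.println("allocated " + aCols.length + " slots"); // never reached
  }
}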
The obvious remedy is to enlarge every TaskTracker's maximum child-JVM heap:
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1024m</value>
</property>
However, if you are NOT the Hadoop administrator or in ops, you have no permission
to modify that config. So I modified the ABtDenseOutJob map code to support the
RandomAccessSparseVector situation: I use a HashMap in place of the original
Vector[] aCols array, so a column is only allocated once its index actually
appears in the input. The modified code is as below:
private Map<Integer, Vector> aColsMap = new HashMap<Integer, Vector>();

protected void map(Writable key, VectorWritable value, Context context)
    throws IOException, InterruptedException {
  Vector vec = value.get();
  int vecSize = vec.size();
  if (vec.isDense()) {
    for (int i = 0; i < vecSize; i++) {
      //extendAColIfNeeded(i, aRowCount + 1);
      // Allocate a column lazily, only when its index is first seen.
      if (aColsMap.get(i) == null) {
        aColsMap.put(i, new RandomAccessSparseVector(Integer.MAX_VALUE, 100));
      }
      aColsMap.get(i).setQuick(aRowCount, vec.getQuick(i));
      //aCols[i].setQuick(aRowCount, vec.getQuick(i));
    }
  } else if (vec.size() > 0) {
    for (Vector.Element vecEl : vec.nonZeroes()) {
      int i = vecEl.index();
      //extendAColIfNeeded(i, aRowCount + 1);
      if (aColsMap.get(i) == null) {
        aColsMap.put(i, new RandomAccessSparseVector(Integer.MAX_VALUE, 100));
      }
      aColsMap.get(i).setQuick(aRowCount, vecEl.get());
      //aCols[i].setQuick(aRowCount, vecEl.get());
    }
  }
  aRowCount++;
}
With that change, the OutOfMemory problem no longer occurs.
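For what it's worth, here is a minimal standalone sketch of why the map-based
representation stays small (the class name and the 10,000-column figure are
assumptions for illustration): memory now grows with the number of distinct
column indices actually seen, not with the declared cardinality.

import java.util.HashMap;
import java.util.Map;

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

public class AColsFootprintSketch {
  public static void main(String[] args) {
    // Pretend the mapper only ever sees 10,000 distinct column indices,
    // scattered anywhere in [0, Integer.MAX_VALUE).
    int distinctColumns = 10000;

    Map<Integer, Vector> aColsMap = new HashMap<Integer, Vector>();
    for (int i = 0; i < distinctColumns; i++) {
      int col = i * 1000; // widely scattered indices cost nothing extra here
      Vector colVec = aColsMap.get(col);
      if (colVec == null) {
        colVec = new RandomAccessSparseVector(Integer.MAX_VALUE, 100);
        aColsMap.put(col, colVec);
      }
      colVec.setQuick(0, 1.0);
    }

    // 10,000 map entries versus a 2^31-element Vector[]: the map's size
    // tracks the data actually seen, not the nominal vector cardinality.
    System.out.println("columns materialized: " + aColsMap.size());
  }
}

One small note on the patch itself: it looks up the map twice per element
(get, then put, then get again); caching the looked-up Vector in a local
variable, as in the sketch above, avoids the extra lookup.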
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)