Sean Owen created MAHOUT-1771:
---------------------------------
Summary: Cluster dumper omits indices and 0 elements for dense
vectors
Key: MAHOUT-1771
URL: https://issues.apache.org/jira/browse/MAHOUT-1771
Project: Mahout
Issue Type: Bug
Components: Clustering, mrlegacy
Affects Versions: 0.9
Reporter: Sean Owen
Priority: Minor
Blast from the past -- are patches still being accepted for "mrlegacy" code?
Something turned up incidentally when working with a customer that looks like a
minor bug in the cluster dumper code.
In {{AbstractCluster.java}}:
{code}
public static List<Object> formatVectorAsJson(Vector v, String[] bindings)
throws IOException {
boolean hasBindings = bindings != null;
boolean isSparse = !v.isDense() && v.getNumNondefaultElements() != v.size();
// we assume sequential access in the output
Vector provider = v.isSequentialAccess() ? v : new
SequentialAccessSparseVector(v);
List<Object> terms = new LinkedList<>();
String term = "";
for (Element elem : provider.nonZeroes()) {
if (hasBindings && bindings.length >= elem.index() + 1 &&
bindings[elem.index()] != null) {
term = bindings[elem.index()];
} else if (hasBindings || isSparse) {
term = String.valueOf(elem.index());
}
Map<String, Object> term_entry = new HashMap<>();
double roundedWeight = (double) Math.round(elem.get() * 1000) / 1000;
if (hasBindings || isSparse) {
term_entry.put(term, roundedWeight);
terms.add(term_entry);
} else {
terms.add(roundedWeight);
}
}
return terms;
}
{code}
Imagine a {{DenseVector}} with 5 elements, of which two are 0. It's considered
dense in this method since the number of non-default elements is 5 (all
elements are "non default" in a dense vector).
However the iteration is over non-zero elements only. And indices are only
printed if it's sparse (or has bindings). So the result will be the 3 non-zero
elements printed without indices. Which dimensions they are can't be determined.
The fix seems to be either:
- Compare number of _non-zero_ elements to the size when determining if it's
sparse
- Iterate over all elements if non-sparse
I think the first is the intent? it would be a one-line change if so.
{code}
boolean isSparse = !v.isDense() && v.getNumZeroElements() != v.size();
{code}
Pretty straightforward, and minor, but wanted to check with everyone before
making a change.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)