On Tue, Sep 8, 2015 at 1:38 AM, Sean Owen (JIRA) <[email protected]> wrote:
> Sean Owen created MAHOUT-1771: > --------------------------------- > > Summary: Cluster dumper omits indices and 0 elements for > dense vectors > Key: MAHOUT-1771 > URL: https://issues.apache.org/jira/browse/MAHOUT-1771 > Project: Mahout > Issue Type: Bug > Components: Clustering, mrlegacy > Affects Versions: 0.9 > Reporter: Sean Owen > Priority: Minor > > > Blast from the past -- are patches still being accepted for "mrlegacy" > code? Something turned up incidentally when working with a customer that > looks like a minor bug in the cluster dumper code. > > In {{AbstractCluster.java}}: > > {code} > public static List<Object> formatVectorAsJson(Vector v, String[] bindings) > throws IOException { > > boolean hasBindings = bindings != null; > boolean isSparse = !v.isDense() && v.getNumNondefaultElements() != > v.size(); > > // we assume sequential access in the output > Vector provider = v.isSequentialAccess() ? v : new > SequentialAccessSparseVector(v); > > List<Object> terms = new LinkedList<>(); > String term = ""; > > for (Element elem : provider.nonZeroes()) { > > if (hasBindings && bindings.length >= elem.index() + 1 && > bindings[elem.index()] != null) { > term = bindings[elem.index()]; > } else if (hasBindings || isSparse) { > term = String.valueOf(elem.index()); > } > > Map<String, Object> term_entry = new HashMap<>(); > double roundedWeight = (double) Math.round(elem.get() * 1000) / 1000; > if (hasBindings || isSparse) { > term_entry.put(term, roundedWeight); > terms.add(term_entry); > } else { > terms.add(roundedWeight); > } > } > > return terms; > } > {code} > > Imagine a {{DenseVector}} with 5 elements, of which two are 0. It's > considered dense in this method since the number of non-default elements is > 5 (all elements are "non default" in a dense vector). > > However the iteration is over non-zero elements only. And indices are only > printed if it's sparse (or has bindings). So the result will be the 3 > non-zero elements printed without indices. Which dimensions they are can't > be determined. > > The fix seems to be either: > - Compare number of _non-zero_ elements to the size when determining if > it's sparse > - Iterate over all elements if non-sparse > > I think the first is the intent? it would be a one-line change if so. > > {code} > boolean isSparse = !v.isDense() && v.getNumZeroElements() != v.size(); > {code} > > Pretty straightforward, and minor, but wanted to check with everyone > before making a change. > > > > -- > This message was sent by Atlassian JIRA > (v6.3.4#6332) >
