[ https://issues.apache.org/jira/browse/MAHOUT-402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Drew Farris updated MAHOUT-402:
-------------------------------
Attachment: MAHOUT-402.patch
Modifies VectorHelper to include the name in vectorToString() if the input
vector is a NamedVector. Also implements asFormatString in NamedVector to
create JSON output that includes a NamedVector's name in addition to the
contents of the delegate vector.
The VectorDumper's --printKey option can sometimes be used to achieve the same
effect, but the key is not always the same as the NamedVector name, notably in
the k-means vector output.
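As a rough illustration of the vectorToString() change described above, the
sketch below shows an instanceof check that emits the name before the
elements. It is not the code from the attached patch: Vec, PlainVec and
NamedVec are hypothetical stand-ins for Mahout's vector classes, and the
"name: ... elts: {...}" layout is an assumption made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only: Vec, PlainVec and NamedVec are hypothetical
// stand-ins, not Mahout's Vector, RandomAccessSparseVector or NamedVector.
final class NamedVectorDumpSketch {

    /** Minimal vector abstraction: a map from index to value. */
    interface Vec {
        Map<Integer, Double> elements();
    }

    /** Stand-in for a plain sparse vector. */
    static final class PlainVec implements Vec {
        private final Map<Integer, Double> elements = new LinkedHashMap<>();

        PlainVec set(int index, double value) {
            elements.put(index, value);
            return this;
        }

        @Override
        public Map<Integer, Double> elements() {
            return elements;
        }
    }

    /** Stand-in for a named vector: a name wrapped around a delegate. */
    static final class NamedVec implements Vec {
        private final String name;
        private final Vec delegate;

        NamedVec(String name, Vec delegate) {
            this.name = name;
            this.delegate = delegate;
        }

        String getName() {
            return name;
        }

        @Override
        public Map<Integer, Double> elements() {
            return delegate.elements();
        }
    }

    /** Renders a vector as text, prefixing the name when one is present. */
    static String vectorToString(Vec vector) {
        StringBuilder sb = new StringBuilder();
        if (vector instanceof NamedVec) {
            // Assumed output layout; the real patch may format this differently.
            sb.append("name: ").append(((NamedVec) vector).getName()).append(' ');
        }
        sb.append("elts: {");
        boolean first = true;
        for (Map.Entry<Integer, Double> e : vector.elements().entrySet()) {
            if (!first) {
                sb.append(", ");
            }
            sb.append(e.getKey()).append(':').append(e.getValue());
            first = false;
        }
        return sb.append('}').toString();
    }

    public static void main(String[] args) {
        Vec plain = new PlainVec().set(1026, 3.0).set(16150, 1.0);
        Vec named = new NamedVec("/reut2-000.sgm-0.txt", plain);
        System.out.println(vectorToString(plain));
        System.out.println(vectorToString(named));
    }
}
```

Because the name comes from the vector itself rather than from the sequence
file key, this approach would still identify the vector in cases where the
key and the name differ.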
This should also address the case recently mentioned on the user list, where it
was not clear that the vectors were NamedVectors. See:
http://www.lucidimagination.com/search/document/713e7c8349727a29/reading_vectors_created_from_a_lucene_index#90f94a8d5bc78610
> NamedVectors are not readily identifiable in vectordumper output
> ----------------------------------------------------------------
>
> Key: MAHOUT-402
> URL: https://issues.apache.org/jira/browse/MAHOUT-402
> Project: Mahout
> Issue Type: Bug
> Components: Utils
> Affects Versions: 0.4
> Reporter: Drew Farris
> Priority: Minor
> Attachments: MAHOUT-402.patch
>
>
> When dumping a sequence file of Writable,NamedVector using vectordumper in
> either JSON or standard format, it is not apparent in the output that the
> vectors are indeed named vectors.
> For example, after applying MAHOUT-401 to produce NamedVectors from
> seq2sparse, I run:
> {code}
> ./bin/mahout vectordump -j -p -s ~/mahout/reuters-out-seqdir-sparse/tf-vectors/part-00000
> {code}
> And get:
> {code}
> Input Path: /home/drew/mahout/reuters-out-seqdir-sparse/tf-vectors/part-00000
> /reut2-000.sgm-0.txt
> {"class":"org.apache.mahout.math.RandomAccessSparseVector","vector" [...]
> {code}
> or when removing the -j argument:
> {code}
> /reut2-000.sgm-0.txt elts: {1026:3.0, 16150:1.0, 3338:3.0, 16147:1.0,
> 3339:1.0, 12240:1.0, [...]
> {code}
> The first case, when dumping JSON, occurs because NamedVector simply calls
> its delegate's asFormatString method. Granted, the naive approach of
> implementing asFormatString in NamedVector also produces some nasty output:
> {code}
> /reut2-001.sgm-468.txt
> {"class":"org.apache.mahout.math.NamedVector","vector":"{\"delegate\":{\"class\":\"org.apache.mahout.math.RandomAccessSparseVector\"
> [...]
> {code}
> So a little more thought needs to be given to that approach.
> For the non-JSON format, VectorHelper.vectorToString(..) is the culprit.
> Would it be OK to do an instanceof NamedVector check here and emit the name?