[ 
https://issues.apache.org/jira/browse/MAHOUT-402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Drew Farris updated MAHOUT-402:
-------------------------------

    Attachment: MAHOUT-402.patch

Modifies VectorHelper to include the name in vectorToString() if the input 
vector is a NamedVector. Also implements asFormatString in NamedVector to 
create JSON output that includes a NamedVector's name in addition to the 
contents of the delegate vector.
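
For illustration only, the vectorToString() change can be sketched roughly as follows. This is not the code in the attached patch: SimpleVector, SparseVector, and the NamedVector class below are hypothetical stand-ins for the Mahout types, and the "name: " prefix is an assumed output format.

```java
import java.util.Map;
import java.util.TreeMap;

public class NamedVectorDumpSketch {

  /** Hypothetical stand-in for Mahout's Vector; exposes only what the sketch needs. */
  interface SimpleVector {
    Map<Integer, Double> elements();
  }

  /** Hypothetical sparse vector backed by a sorted map of index to value. */
  static final class SparseVector implements SimpleVector {
    private final Map<Integer, Double> elements = new TreeMap<>();

    SparseVector set(int index, double value) {
      elements.put(index, value);
      return this;
    }

    @Override
    public Map<Integer, Double> elements() {
      return elements;
    }
  }

  /** Hypothetical named wrapper that forwards element access to its delegate. */
  static final class NamedVector implements SimpleVector {
    private final String name;
    private final SimpleVector delegate;

    NamedVector(String name, SimpleVector delegate) {
      this.name = name;
      this.delegate = delegate;
    }

    String getName() {
      return name;
    }

    @Override
    public Map<Integer, Double> elements() {
      return delegate.elements();
    }
  }

  /** Emits the name first when the vector is a NamedVector (assumed format). */
  static String vectorToString(SimpleVector vector) {
    StringBuilder sb = new StringBuilder();
    if (vector instanceof NamedVector) {
      sb.append("name: ").append(((NamedVector) vector).getName()).append(' ');
    }
    sb.append("elts: {");
    boolean first = true;
    for (Map.Entry<Integer, Double> e : vector.elements().entrySet()) {
      if (!first) {
        sb.append(", ");
      }
      sb.append(e.getKey()).append(':').append(e.getValue());
      first = false;
    }
    return sb.append('}').toString();
  }

  public static void main(String[] args) {
    SimpleVector v = new NamedVector("/reut2-000.sgm-0.txt",
        new SparseVector().set(1026, 3.0).set(3338, 3.0));
    System.out.println(vectorToString(v));
  }
}
```

An unnamed vector passed to the same method produces the "elts: {...}" form unchanged, so existing output for plain vectors would not be affected under these assumptions.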

In some cases VectorDumper's --printKey option achieves the same effect, but 
the key is not always the same as the NamedVector name, notably in the 
k-means vector output.

This should also address the case recently raised on the user list, where it 
was not clear that the vectors were NamedVectors; see:
http://www.lucidimagination.com/search/document/713e7c8349727a29/reading_vectors_created_from_a_lucene_index#90f94a8d5bc78610

> NamedVectors are not readily identifiable in vectordumper output
> ----------------------------------------------------------------
>
>                 Key: MAHOUT-402
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-402
>             Project: Mahout
>          Issue Type: Bug
>          Components: Utils
>    Affects Versions: 0.4
>            Reporter: Drew Farris
>            Priority: Minor
>         Attachments: MAHOUT-402.patch
>
>
> When dumping a sequence file of Writable,NamedVector using vectordumper in 
> either JSON or standard format, it is not apparent in the output that the 
> vectors are indeed named vectors.
> For example, after applying MAHOUT-401 to produce NamedVectors from 
> seq2sparse, I run:
> {code}
> ./bin/mahout vectordump -j -p -s 
> ~/mahout/reuters-out-seqdir-sparse/tf-vectors/part-00000
> {code}
> And get: 
> {code}
> Input Path: /home/drew/mahout/reuters-out-seqdir-sparse/tf-vectors/part-00000
> /reut2-000.sgm-0.txt    
> {"class":"org.apache.mahout.math.RandomAccessSparseVector","vector" [...]
> {code}
> or when removing the -j argument:
> {code}
> /reut2-000.sgm-0.txt    elts: {1026:3.0, 16150:1.0, 3338:3.0, 16147:1.0, 
> 3339:1.0, 12240:1.0, [...]
> {code}
> The first case, when dumping JSON, occurs because NamedVector simply calls 
> its delegate's asFormatString method. Granted, the naive approach of 
> implementing asFormatString in NamedVector also produces some nasty output:
> {code}
> /reut2-001.sgm-468.txt        
> {"class":"org.apache.mahout.math.NamedVector","vector":"{\"delegate\":{\"class\":\"org.apache.mahout.math.RandomAccessSparseVector\"
>  [...]
> {code}
> So a little more thought needs to be given to that approach.
> For the non-json format, VectorHelper.vectorToString(..) is the culprit. 
> Would it be ok to do an instanceof NamedVector here and emit the name?
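
A rough sketch of why the naive JSON output above is nested and escaped, and of one possible alternative that places the name beside the delegate's JSON. Nothing here is Mahout code: the class, the method names, the "name" field, and the contents of the delegate's "vector" field are all assumptions made for illustration.

```java
public class NamedVectorJsonSketch {

  /** Minimal JSON string quoting: escapes backslashes and double quotes only. */
  static String quote(String s) {
    return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
  }

  /** Assumed JSON produced by the delegate; the element content is made up. */
  static String delegateJson() {
    return "{\"class\":\"org.apache.mahout.math.RandomAccessSparseVector\","
        + "\"vector\":{\"1026\":3.0}}";
  }

  /** Naive form: the delegate's JSON is embedded as a string, so its quotes are escaped. */
  static String naiveJson() {
    String inner = "{\"delegate\":" + delegateJson() + "}";
    return "{\"class\":\"org.apache.mahout.math.NamedVector\",\"vector\":"
        + quote(inner) + "}";
  }

  /** Alternative: the name is a sibling field and the delegate's JSON stays an object. */
  static String flatJson(String name) {
    return "{\"class\":\"org.apache.mahout.math.NamedVector\",\"name\":"
        + quote(name) + ",\"vector\":" + delegateJson() + "}";
  }

  public static void main(String[] args) {
    System.out.println(naiveJson());
    System.out.println(flatJson("/reut2-001.sgm-468.txt"));
  }
}
```

The second form keeps the delegate's structure readable while still identifying the vector by name; whether the patch uses this exact layout is not stated here.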

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
