[
https://issues.apache.org/jira/browse/MAHOUT-398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Drew Farris updated MAHOUT-398:
-------------------------------
Attachment: MAHOUT-398.patch
Jeff, I agree; it makes sense to make this a bit more consistent across outputs.
After the minor changes you propose, the output produced by the reuters example
when constructing tfidf vectors looks like this:
{code}
.../reuters-out-seqdir-sparse/dictionary.file-0
.../reuters-out-seqdir-sparse/tfidf
.../reuters-out-seqdir-sparse/tfidf/frequency.file-0
.../reuters-out-seqdir-sparse/tfidf/df-count
.../reuters-out-seqdir-sparse/tfidf/df-count/part-00000
.../reuters-out-seqdir-sparse/tfidf/tfidf-vectors
.../reuters-out-seqdir-sparse/tfidf/tfidf-vectors/part-00000
.../reuters-out-seqdir-sparse/tf-vectors
.../reuters-out-seqdir-sparse/tf-vectors/part-00000
.../reuters-out-seqdir-sparse/tokenized-documents
.../reuters-out-seqdir-sparse/tokenized-documents/part-00000
.../reuters-out-seqdir-sparse/wordcount
.../reuters-out-seqdir-sparse/wordcount/part-00000
{code}
How about we put the tfidf-vectors and tf-vectors output directories at the
same level? It seems that putting frequency.file and dictionary.file at the
same level might make some sense too. I know there's been some talk about
standardizing input, working and output directory creation for jobs, but I
haven't followed it. Might that provide some suggestion about what to do here?
Here's a patch that includes Jeff's changes and pushes the tfidf stuff up a
level. The output is:
{code}
reuters-out-seqdir-sparse/dictionary.file-0
reuters-out-seqdir-sparse/frequency.file-0
reuters-out-seqdir-sparse/tf-vectors
reuters-out-seqdir-sparse/tf-vectors/part-00000
reuters-out-seqdir-sparse/tokenized-documents
reuters-out-seqdir-sparse/tokenized-documents/part-00000
reuters-out-seqdir-sparse/df-count
reuters-out-seqdir-sparse/df-count/part-00000
reuters-out-seqdir-sparse/tfidf-vectors
reuters-out-seqdir-sparse/tfidf-vectors/part-00000
reuters-out-seqdir-sparse/wordcount
reuters-out-seqdir-sparse/wordcount/part-00000
{code}
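With this layout, code that consumes the final vectors only needs the output
directory and the weight switch to find them. Here is a minimal sketch of that
lookup, not taken from the patch: the class, enum and method names are
hypothetical, and only the directory names tf-vectors and tfidf-vectors come
from the listing above.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class VectorOutputLayout {

    /** The two weighting modes discussed in this issue. */
    public enum Weight { TF, TFIDF }

    /**
     * Hypothetical helper (not a Mahout API): returns the directory that
     * holds the final vectors, assuming the flattened layout listed above,
     * where tf-vectors and tfidf-vectors sit side by side under the output
     * directory.
     */
    public static Path finalVectorsDir(Path outputDir, Weight weight) {
        String name = (weight == Weight.TFIDF) ? "tfidf-vectors" : "tf-vectors";
        return outputDir.resolve(name);
    }

    public static void main(String[] args) {
        Path out = Paths.get("reuters-out-seqdir-sparse");
        // Both results are direct children of the same output directory.
        System.out.println(finalVectorsDir(out, Weight.TF));
        System.out.println(finalVectorsDir(out, Weight.TFIDF));
    }
}
```

Changing the weight switch only changes the last path component, so both
results stay at the same level.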
> Seq2sparse outputs final vectors to different directories depending upon the
> TF/TFIDF weight switch. This is confusing to users.
> --------------------------------------------------------------------------------------------------------------------------------
>
> Key: MAHOUT-398
> URL: https://issues.apache.org/jira/browse/MAHOUT-398
> Project: Mahout
> Issue Type: Improvement
> Components: Utils
> Affects Versions: 0.3
> Reporter: Jeff Eastman
> Fix For: 0.4
>
> Attachments: MAHOUT-398.patch
>
>
> In TF mode, seq2sparse puts the output vectors into <output>/vectors. In
> TFIDF mode, however, it puts the output vectors into <output>/tfidf/vectors.
> This happens because the IDF calculation - if it is selected - happens after
> TF and uses the TF vectors for its input.
> Seems like both modes ought to output to a consistent directory structure so
> changing the switch does not change the final output location: perhaps as
> simple as changing TF to output to <output>/tf/vectors so that the contents
> of both directories when present are more obvious from their nomenclature.