I agree too, yours is even nicer. I'm going to pull together all the options and all the attendant file taxonomies under the MAHOUT-294 banner. It's clear this all needs an overall plan and cannot be approached well piecemeal.

On 5/23/10 7:16 PM, Drew Farris (JIRA) wrote:
      [ 
https://issues.apache.org/jira/browse/MAHOUT-398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Drew Farris updated MAHOUT-398:
-------------------------------

     Attachment: MAHOUT-398.patch

Jeff, I agree, it makes sense to make this a bit more consistent across outputs.

After the minor changes you propose, the output produced by the reuters example 
when constructing tfidf vectors looks like this:

{code}
.../reuters-out-seqdir-sparse/dictionary.file-0
.../reuters-out-seqdir-sparse/tfidf
.../reuters-out-seqdir-sparse/tfidf/frequency.file-0
.../reuters-out-seqdir-sparse/tfidf/df-count
.../reuters-out-seqdir-sparse/tfidf/df-count/part-00000
.../reuters-out-seqdir-sparse/tfidf/tfidf-vectors
.../reuters-out-seqdir-sparse/tfidf/tfidf-vectors/part-00000
.../reuters-out-seqdir-sparse/tf-vectors
.../reuters-out-seqdir-sparse/tf-vectors/part-00000
.../reuters-out-seqdir-sparse/tokenized-documents
.../reuters-out-seqdir-sparse/tokenized-documents/part-00000
.../reuters-out-seqdir-sparse/wordcount
.../reuters-out-seqdir-sparse/wordcount/part-00000
{code}

How about we put the tfidf-vectors and tf-vectors output directories at the same 
level? It seems that putting frequency.file and dictionary.file at the same 
level might make some sense too. I know there's been some talk about standardizing 
input, working and output directory creation for jobs but I haven't followed it 
-- might that provide some suggestion of what to do here?

Here's a patch that includes Jeff's changes and pushes the tfidf stuff up a 
level. The output is:

{code}
reuters-out-seqdir-sparse/dictionary.file-0
reuters-out-seqdir-sparse/frequency.file-0
reuters-out-seqdir-sparse/tf-vectors
reuters-out-seqdir-sparse/tf-vectors/part-00000
reuters-out-seqdir-sparse/tokenized-documents
reuters-out-seqdir-sparse/tokenized-documents/part-00000
reuters-out-seqdir-sparse/df-count
reuters-out-seqdir-sparse/df-count/part-00000
reuters-out-seqdir-sparse/tfidf-vectors
reuters-out-seqdir-sparse/tfidf-vectors/part-00000
reuters-out-seqdir-sparse/wordcount
reuters-out-seqdir-sparse/wordcount/part-00000
{code}

Seq2sparse outputs final vectors to different directories depending upon the 
TF/TFIDF weight switch. This is confusing to users.
--------------------------------------------------------------------------------------------------------------------------------

                 Key: MAHOUT-398
                 URL: https://issues.apache.org/jira/browse/MAHOUT-398
             Project: Mahout
          Issue Type: Improvement
          Components: Utils
    Affects Versions: 0.3
            Reporter: Jeff Eastman
             Fix For: 0.4

         Attachments: MAHOUT-398.patch


In TF mode, seq2sparse puts the output vectors into <output>/vectors. In TFIDF mode, 
however, it puts the output vectors into <output>/tfidf/vectors. This happens because 
the IDF calculation - if it is selected - happens after TF and uses the TF vectors for its 
input.
Seems like both modes ought to output to a consistent directory structure so changing 
the switch does not change the final output location: perhaps as simple as changing 
TF to output to <output>/tf/vectors so that the contents of both directories, 
when present, are more obvious from their nomenclature.
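
As a rough illustration of the consistent layout discussed in this issue, here is a minimal sketch in Python. It is not Mahout code: the function name final_vectors_dir and the constants are hypothetical, and the directory names tf-vectors and tfidf-vectors are taken from the listing produced by the attached patch. The only point is that both weighting modes resolve to sibling directories under the same <output> root.

```python
from pathlib import PurePosixPath

# Directory names as they appear in the patched reuters output listing.
# The constants and the function below are hypothetical, for illustration only.
TF_DIR = "tf-vectors"
TFIDF_DIR = "tfidf-vectors"


def final_vectors_dir(output: str, weight: str) -> PurePosixPath:
    """Return the directory holding the final vectors for a weighting mode.

    Both modes resolve to a direct child of `output`, so switching the
    weight changes only the directory name, never the nesting depth.
    """
    base = PurePosixPath(output)
    mode = weight.lower()
    if mode == "tf":
        return base / TF_DIR
    if mode == "tfidf":
        return base / TFIDF_DIR
    raise ValueError(f"unknown weight: {weight!r}")
```

With a layout like this, a downstream job only needs to know the output root and the weighting mode to locate the vectors, and the directory name itself says which kind of vectors it holds.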
