[ 
https://issues.apache.org/jira/browse/MAHOUT-590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13435773#comment-13435773
 ] 

sakurai commented on MAHOUT-590:
--------------------------------

Hi, I downloaded mahout-distribution-0.7-src.tar.gz, and after unpacking and 
compiling it I can run Mahout. I want to use your patch so that I can specify 
TSV input via "mahout seqdirectory --inputType TSV". Could you show me how to 
apply the patch? I did not check out the trunk. Thank you.
                
> add TSV (Tab Separate Value) input file support to SequenceFilesFromDirectory
> -----------------------------------------------------------------------------
>
>                 Key: MAHOUT-590
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-590
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Integration
>    Affects Versions: 0.4
>         Environment: Mac OS X 10.6.6, java version "1.6.0_22"
> RHL Linux 2.6.18
>            Reporter: Shige Takeda
>            Assignee: Sean Owen
>            Priority: Minor
>             Fix For: 0.5
>
>         Attachments: 0001-added-TSV-input-file-support.patch, 
> MAHOUT-590.patch, MAHOUT-590.patch
>
>
> I would like to add TSV (Tab Separated Value) input file type support to 
> SequenceFilesFromDirectory.
> Here is my real use case:
> I have 36M input records, each consisting of an ID, CONTENT, and various 
> other attributes, and I want to convert them to sequence files so that I can 
> cluster the records by the term vectors of CONTENT. The problem is that I 
> cannot create 36M files under my home directory because of a quota limit of 
> 50k files, so I was not able to convert them to sequence files with the 
> SequenceFilesFromDirectory utility. Meanwhile, the source data format is TSV, 
> where each line of a file contains ID\tCONTENT\t..., as this format is 
> convenient input and output for Pig and most Hadoop streaming programs. NOTE: 
> CONTENT size is up to around 2k bytes. Hence I think it would be better for 
> SequenceFilesFromDirectory to support TSV directly, instead of taking two 
> steps: TSV to text files, then text files to sequence files.
> I'm attaching the patch.
> I hope this makes sense to other folks.
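
As a rough illustration of the single-step conversion described above, the
following Python sketch reads ID\tCONTENT\t... lines and yields (ID, CONTENT)
pairs, which are the key/value records a sequence file would hold. It is not
the attached patch and uses no Mahout or Hadoop APIs; the function name
tsv_to_pairs and its column parameters are hypothetical, and the sample data
is made up for the example.

```python
import io
from typing import Iterable, Iterator, Tuple


def tsv_to_pairs(lines: Iterable[str],
                 key_col: int = 0,
                 value_col: int = 1) -> Iterator[Tuple[str, str]]:
    """Yield (key, value) pairs from tab-separated lines.

    Hypothetical helper: key_col and value_col select the ID and CONTENT
    columns. Lines with too few columns are skipped.
    """
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) <= max(key_col, value_col):
            continue  # malformed row: not enough columns
        yield fields[key_col], fields[value_col]


# Made-up sample input: two valid rows and one malformed row.
sample = io.StringIO(
    "id1\tfirst document text\tattrA\n"
    "id2\tsecond document text\tattrB\n"
    "broken-line\n"
)
pairs = list(tsv_to_pairs(sample))
```

Because the pairs are produced straight from the TSV lines, no intermediate
per-record text files are needed, which is the point of the request.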

