[
https://issues.apache.org/jira/browse/MAHOUT-590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13435773#comment-13435773
]
sakurai commented on MAHOUT-590:
--------------------------------
Hi, I downloaded mahout-distribution-0.7-src.tar.gz, and after unpacking and
compiling everything I can run mahout. I want to use your patch so that I can
specify TSV input through "mahout seqdirectory --inputType TSV". Can you show
me how to apply the patch? I did not check out the trunk. Thank you.
> add TSV (Tab Separate Value) input file support to SequenceFilesFromDirectory
> -----------------------------------------------------------------------------
>
> Key: MAHOUT-590
> URL: https://issues.apache.org/jira/browse/MAHOUT-590
> Project: Mahout
> Issue Type: Improvement
> Components: Integration
> Affects Versions: 0.4
> Environment: Mac OS X 10.6.6, java version "1.6.0_22"
> RHL Linux 2.6.18
> Reporter: Shige Takeda
> Assignee: Sean Owen
> Priority: Minor
> Fix For: 0.5
>
> Attachments: 0001-added-TSV-input-file-support.patch,
> MAHOUT-590.patch, MAHOUT-590.patch
>
>
> I would like to add TSV (Tab Separated Value) input file type support to
> SequenceFilesFromDirectory.
> Here is my real use case:
> I have 36M input records, each consisting of an ID, CONTENT, and various
> other attributes, and I want to convert them to sequence files in order to
> cluster the records by the term vectors of CONTENT. The problem is that I
> cannot create 36M files under my home directory because of a quota limit of
> 50k files, so I was not able to convert them with the
> SequenceFilesFromDirectory utility. Meanwhile, the source data is in TSV
> format, where each line of a file contains ID\tCONTENT\t..., since that is
> convenient input and output for Pig and most Hadoop streaming programs. NOTE:
> each CONTENT value is up to around 2k bytes. I would therefore prefer that
> SequenceFilesFromDirectory support TSV directly instead of requiring two
> steps: TSV to text files, then text files to sequence files.
> I'm attaching the patch.
> Hope this makes sense to other folks.
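
To illustrate the record layout described in the quoted issue, here is a
minimal sketch of turning ID\tCONTENT\t... lines into (ID, CONTENT) pairs, the
kind of key/value records one would then write to a sequence file. This is not
the code from the attached patch: the class name TsvRecords and its methods are
hypothetical, and the sketch uses only the Java standard library rather than
Hadoop's SequenceFile API.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical helper, for illustration only; not part of Mahout.
public class TsvRecords {

    // Split one TSV line into (ID, CONTENT); any further columns are ignored.
    // Returns null when the line has fewer than two tab-separated fields.
    public static Map.Entry<String, String> parseLine(String line) {
        String[] fields = line.split("\t", -1); // -1 keeps trailing empty fields
        if (fields.length < 2) {
            return null;
        }
        return new SimpleEntry<>(fields[0], fields[1]);
    }

    // Parse many lines, skipping those that do not contain ID and CONTENT.
    public static List<Map.Entry<String, String>> parseAll(Iterable<String> lines) {
        List<Map.Entry<String, String>> records = new ArrayList<>();
        for (String line : lines) {
            Map.Entry<String, String> record = parseLine(line);
            if (record != null) {
                records.add(record);
            }
        }
        return records;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "1\tfirst document text\tattr",
                "malformed line without tabs",
                "2\tsecond document text");
        for (Map.Entry<String, String> r : parseAll(lines)) {
            // In a real job, key and value would be appended to a sequence file here.
            System.out.println(r.getKey() + " => " + r.getValue());
        }
    }
}
```

Because many records share one input file, this layout avoids creating one
file per record, which is the quota problem the issue describes.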