Wrong thread, but important nonetheless. IMO one of the greatest strengths of Pig is its carefree approach to data formats. Text in a simple delimited format is just fine by Pig.
What Mahout needs is an import/export framework that will take non-fussy formats in and hand the same back. The Jira for this is https://issues.apache.org/jira/browse/MAHOUT-1507. It has evolved from a plea for the math internals to support external IDs into the above discussion of import/export.

Support for Mahout's consumption and production of flexible formats seems really important to me. The complexity of creating a completely scalable ID dictionary alone (a task every user must now do for themselves), not to mention the language-agnostic part of this, makes it important. Do this right and Mahout could take input directly from Pig or a bash script and produce true JSON, as you are trying to do for clusterdump.

Seems like the goals should be:
1) input and output format flexibility
2) user-specified ID preservation at the job level
3) language agnostic, meaning input can be created by any language (not just Java and Scala), and likewise for output
4) dictionaries need not be known about by the user, maintaining a black-box boundary around Mahout jobs

The question is: should we support Pig-like data formats, with UDFs to support more specific ones, or should we go with something like Avro, which has universal language support but is fussier about on-disk formats? I'm new to Avro; mostly I've used text files in non-fussy delimited formats, sometimes mixing a couple of different delimiters to get more than just a table structure.

You are a consultant and have seen a lot of this issue. Any opinions? Probably a good question for the user list.
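To make goals 1, 2, and 4 concrete, here is a rough sketch of the round trip I have in mind, written in Python only to show that nothing about it needs Java or Scala. Everything in it is hypothetical: `IdDictionary`, `import_triples`, and `export_triples` are illustration names, not Mahout APIs, and the tab-delimited (row ID, column ID, value) layout is just an assumed non-fussy input format. The dictionary is an in-memory map, so it says nothing about the hard part, which is making this scale.

```python
import csv
import io


class IdDictionary:
    """Hypothetical two-way map between external IDs and dense integer indices."""

    def __init__(self):
        self._to_index = {}  # external ID -> integer index
        self._to_id = []     # integer index -> external ID

    def index_of(self, external_id):
        # Assign the next free index the first time an ID is seen.
        idx = self._to_index.get(external_id)
        if idx is None:
            idx = len(self._to_id)
            self._to_index[external_id] = idx
            self._to_id.append(external_id)
        return idx

    def id_of(self, index):
        return self._to_id[index]


def import_triples(text, delimiter="\t"):
    """Read (row ID, column ID, value) lines into integer-indexed cells."""
    rows, cols = IdDictionary(), IdDictionary()
    cells = []
    for record in csv.reader(io.StringIO(text), delimiter=delimiter):
        if not record:
            continue  # tolerate blank lines
        row_id, col_id, value = record
        cells.append((rows.index_of(row_id), cols.index_of(col_id), float(value)))
    return cells, rows, cols


def export_triples(cells, rows, cols, delimiter="\t"):
    """Write cells back out using the caller's original IDs."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
    for r, c, v in cells:
        writer.writerow([rows.id_of(r), cols.id_of(c), v])
    return out.getvalue()
```

The job internals would only ever see the integer indices in `cells`; the user hands in their own IDs and gets the same IDs back, without ever touching the dictionaries.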
On Apr 10, 2014, at 7:15 PM, Andrew Musselman (JIRA) <[email protected]> wrote:

[ https://issues.apache.org/jira/browse/MAHOUT-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13966137#comment-13966137 ]

Andrew Musselman commented on MAHOUT-1489:
------------------------------------------

I was thinking the other day that what I may want out of this is the kind of clear data flow I get when I write Pig.

For example:

a = load 'u';
b = load 'v';
c = a%.%b
store c into 'matrix-mult';

Is this the right thread for that conversation?

> Interactive Scala & Spark Bindings Shell & Script processor
> -----------------------------------------------------------
>
> Key: MAHOUT-1489
> URL: https://issues.apache.org/jira/browse/MAHOUT-1489
> Project: Mahout
> Issue Type: New Feature
> Affects Versions: 1.0
> Reporter: Saikat Kanjilal
> Assignee: Dmitriy Lyubimov
> Fix For: 1.0
>
>
> Build an interactive shell /scripting (just like spark shell). Something very
> similar in R interactive/script runner mode.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
