Re: [jira] [Commented] (MAHOUT-1489) Interactive Scala & Spark Bindings Shell & Script processor

Pat Ferrel Fri, 11 Apr 2014 08:13:31 -0700

Wrong thread, but important nonetheless.

IMO one of the greatest strengths of Pig is its carefree approach to data 
formats. Text in a simple delimited format is just fine by Pig.

What Mahout needs is an import/export framework that will take non-fussy 
formats in and hand the same back. 

The Jira for this is here:
https://issues.apache.org/jira/browse/MAHOUT-1507

It has evolved from a plea for the math internals to support external IDs into 
the above discussion of import/export. Support for Mahout’s consumption and 
production of flexible formats seems really important to me, for two reasons: 
the complexity of creating a completely scalable ID dictionary (a task every 
user must now do for themselves) and the need to be language agnostic. Do this 
right and Mahout could take input directly from Pig or a bash script and 
produce true JSON, as you are trying to do for clusterdump.

Seems like the goals should be:
1) input and output format flexibility
2) user specified ID preservation at the job level
3) language agnostic, meaning input can be created by any language (not 
just Java and Scala) and likewise for output.
4) dictionaries need not be known about by the user, maintaining a black-box 
boundary around Mahout jobs.
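
To make goals 2 and 4 concrete, here is a minimal sketch in Python (chosen 
only because it is neutral with respect to the Java/Scala internals). It is 
not Mahout code: the names IdDictionary, import_triples and export_triples 
are hypothetical, the "row, column, value" triple layout is an assumption, 
and the dictionary lives in memory, so it does not address the scalability 
problem. It only shows the round trip: external IDs go in, integer indices 
are used internally, and the same external IDs come back out.

```python
import csv
import io


class IdDictionary:
    """Hypothetical two-way map between external string IDs and integer indices."""

    def __init__(self):
        self._to_index = {}
        self._to_id = []

    def index_of(self, external_id):
        # Assign the next free index the first time an ID is seen.
        if external_id not in self._to_index:
            self._to_index[external_id] = len(self._to_id)
            self._to_id.append(external_id)
        return self._to_index[external_id]

    def id_of(self, index):
        return self._to_id[index]


def import_triples(text, delimiter="\t"):
    """Read 'row<delim>col<delim>value' lines into integer-indexed cells."""
    rows, cols = IdDictionary(), IdDictionary()
    cells = {}
    for record in csv.reader(io.StringIO(text), delimiter=delimiter):
        if not record:
            continue  # skip blank lines
        row_id, col_id, value = record
        cells[(rows.index_of(row_id), cols.index_of(col_id))] = float(value)
    return cells, rows, cols


def export_triples(cells, rows, cols, delimiter="\t"):
    """Write cells back out using the original external IDs."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
    for (r, c), value in sorted(cells.items()):
        writer.writerow([rows.id_of(r), cols.id_of(c), value])
    return out.getvalue()
```

The caller only ever sees the text going in and the text coming out; the two 
dictionaries stay behind the job boundary.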

The question is whether we should support Pig-like data formats, with UDFs to 
support more specific ones, or go with something like Avro, which has 
universal language support but is fussier about on-disk formats. I’m new to 
Avro; mostly I’ve used delimited text files of non-fussy format, sometimes 
mixing a couple of different delimiters to get more than just a table 
structure.
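
As an illustration of what I mean by mixing delimiters, here is a small 
Python sketch. The layout (a tab between the key and its items, commas 
between items, a colon between an item and its weight) and the function name 
parse_line are assumptions made up for this example, as is the choice to 
default a missing weight to 1.0.

```python
def parse_line(line, field_sep="\t", item_sep=",", kv_sep=":"):
    """Parse 'key<TAB>item:weight,item:weight' into (key, {item: weight})."""
    key, _, rest = line.rstrip("\n").partition(field_sep)
    values = {}
    # filter(None, ...) drops the empty string produced by an empty item list.
    for item in filter(None, rest.split(item_sep)):
        name, _, weight = item.partition(kv_sep)
        # Assumption: an item without an explicit weight counts as 1.0.
        values[name] = float(weight) if weight else 1.0
    return key, values
```

Three separators are enough to carry a sparse row per line rather than a flat 
table, and any language that can split strings can produce or consume it.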

You are a consultant and have seen a lot of this issue. Any opinions?

Probably a good question for the user list.

On Apr 10, 2014, at 7:15 PM, Andrew Musselman (JIRA) <[email protected]> wrote:

   [ 
https://issues.apache.org/jira/browse/MAHOUT-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13966137#comment-13966137
 ] 

Andrew Musselman commented on MAHOUT-1489:
------------------------------------------

I was thinking the other day that what I may want out of this is the kind of 
clear data flow I get when I write Pig.

For example:

a = load 'u';
b = load 'v';
c = a%.%b;
store c into 'matrix-mult';

Is this the right thread for that conversation?

> Interactive Scala & Spark Bindings Shell & Script processor
> -----------------------------------------------------------
> 
>                Key: MAHOUT-1489
>                URL: https://issues.apache.org/jira/browse/MAHOUT-1489
>            Project: Mahout
>         Issue Type: New Feature
>   Affects Versions: 1.0
>           Reporter: Saikat Kanjilal
>           Assignee: Dmitriy Lyubimov
>            Fix For: 1.0
> 
> 
> Build an interactive shell /scripting (just like spark shell). Something very 
> similar in R interactive/script runner mode.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
