Pat,

I was thinking of something like:
https://github.com/gcapan/mahout/compare/cellin

It's just an example of where I believe new input formats should go (the
example reads a DRM from a text file of <row_id,col_id,value> lines).
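
To make the idea concrete, here is a minimal sketch of what such a reader
might do. It is plain Python rather than Mahout's Scala DSL, and it builds
an in-memory map instead of a distributed DRM. The name `read_cells`, the
comma delimiter, and the sample data are assumptions for illustration only,
not the code on the branch.

```python
from collections import defaultdict
from typing import Dict, Iterable


def read_cells(lines: Iterable[str], sep: str = ",") -> Dict[str, Dict[str, float]]:
    """Group <row_id,col_id,value> lines into sparse rows keyed by row ID.

    Hypothetical helper: IDs are kept as text, values are parsed as floats.
    """
    rows: Dict[str, Dict[str, float]] = defaultdict(dict)
    for line_no, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        parts = line.split(sep)
        if len(parts) != 3:
            raise ValueError(f"line {line_no}: expected 3 fields, got {len(parts)}")
        row_id, col_id, value = (p.strip() for p in parts)
        # A later line for the same cell overwrites the earlier value.
        rows[row_id][col_id] = float(value)
    return dict(rows)


# Made-up sample input, one matrix cell per line.
sample = ["u1,i1,1.0", "u1,i3,2.5", "u2,i2,4.0"]
matrix = read_cells(sample)
```

Here `matrix` maps each row ID to a sparse {col_id: value} row. Because the
IDs stay as text, any key that can be written as a string would work.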

Best


Gokhan


On Thu, Jul 31, 2014 at 12:00 AM, Pat Ferrel <[email protected]> wrote:

> Some work on this is being done as part of MAHOUT-1568, which is still at
> a very early stage; see https://github.com/apache/mahout/pull/36
>
> The idea there covers only text-delimited files. It proposes a standard
> DRM-ish format but supports a configurable schema. The default is:
>
> rowID<tab>itemID1:value1<space>itemID2:value2…
>
> The IDs can be mahout keys of any type since they are written as text or
> they can be application specific IDs meaningful in a particular usage, like
> a user ID hash, or SKU from a catalog, or URL.
>
> As far as dataframe-ish requirements go, it seems to me there are two
> different things needed. The dataframe is needed while performing an
> algorithm or calculation and is kept in distributed data structures. There
> probably won’t be a lot of files kept around with the new engines. Text
> files can be used for pipelines in a pinch but generally would be for
> import/export. Therefore MAHOUT-1568 concentrates on import/export, not
> dataframes, though it could use them when they are ready.
>
>
> On Jul 30, 2014, at 7:53 AM, Gokhan Capan <[email protected]>
> wrote:
>
> I believe the next step should be standardizing minimal Matrix I/O
> capability (i.e. a couple of file formats other than [row_id,
> VectorWritable] SequenceFiles) required for a distributed computation
> engine, and adding data-frame-like structures that allow text columns.
>
>
>
