Some work on this is being done as part of MAHOUT-1568, which is still at a very early stage; see https://github.com/apache/mahout/pull/36
The idea there covers only text-delimited files. It proposes a standard DRM-ish format but supports a configurable schema; the default is: rowID<tab>itemID1:value1<space>itemID2:value2… The IDs can be Mahout keys of any type, since they are written as text, or they can be application-specific IDs that are meaningful in a particular usage, such as a user-ID hash, a SKU from a catalog, or a URL.

As far as dataframe-ish requirements go, it seems to me there are two different things needed. The dataframe is needed while performing an algorithm or calculation, and it is kept in distributed data structures. There probably won't be many files kept around with the new engines. Text files can be used for pipelines in a pinch but would generally be for import/export. Therefore MAHOUT-1568 concentrates on import/export, not dataframes, though it could use them when they are ready.

> On Jul 30, 2014, at 7:53 AM, Gokhan Capan <[email protected]> wrote:
>
> I believe the next step should be standardizing minimal Matrix I/O capability
> (i.e. a couple of file formats other than [row_id, VectorWritable]
> SequenceFiles) required for a distributed computation engine, and adding
> dataframe-like structures that allow text columns.
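For illustration, here is a minimal sketch of reading and writing the default schema described above. The function names and the separator defaults are my own, not from the PR; the actual implementation makes the schema (separators, ID handling) configurable.

```python
def parse_drm_line(line, field_sep="\t", elem_sep=" ", kv_sep=":"):
    """Parse one line of the default text-delimited DRM-ish format:
    rowID<tab>itemID1:value1<space>itemID2:value2...
    Returns (row_id, {item_id: value}). IDs stay as opaque strings,
    so they can be Mahout keys or application-specific IDs alike."""
    row_id, _, rest = line.rstrip("\n").partition(field_sep)
    vector = {}
    for elem in rest.split(elem_sep):
        if not elem:
            continue
        # rpartition so item IDs containing ':' (e.g. URLs) still parse
        item_id, _, value = elem.rpartition(kv_sep)
        vector[item_id] = float(value)
    return row_id, vector


def format_drm_line(row_id, vector, field_sep="\t", elem_sep=" ", kv_sep=":"):
    """Inverse of parse_drm_line: serialize a (row_id, vector) pair."""
    elems = elem_sep.join(f"{item}{kv_sep}{value}" for item, value in vector.items())
    return f"{row_id}{field_sep}{elems}"
```

Round-tripping through these two functions is lossless for string IDs, which is the point of an import/export format: the engine-side distributed structures hold the real data, and text files only cross the boundary.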
