This is great. We should definitely talk. What I’ve done is a first cut at a 
data prep pipeline. It takes DRMs or cells and creates an RDD-backed DRM, but 
it also maintains dictionaries so external IDs can be preserved and 
re-attached when written, after any math or algo is done. It also has driver 
and option-processing code.
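
To make the dictionary idea concrete, here’s a rough sketch (all names are 
illustrative, not the actual classes):

    import org.apache.mahout.math.Vector
    import org.apache.spark.rdd.RDD

    // An RDD-backed DRM paired with the dictionaries that map
    // application-specific IDs to the internal Int keys Mahout uses.
    case class DatasetWithIds(
      drm: RDD[(Int, Vector)],       // RDD-backed DRM, keyed by internal index
      rowDict: Map[String, Int],     // external row ID -> internal row index
      colDict: Map[String, Int])     // external column ID -> internal column index

    // On write, invert the row dictionary to re-attach external IDs.
    def withExternalRowIds(ds: DatasetWithIds): RDD[(String, Vector)] = {
      val inverse = ds.rowDict.map(_.swap) // internal index -> external ID
      ds.drm.map { case (idx, v) => (inverse(idx), v) }
    }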

There’s no hard-coded “,”; you’d get that by using the default file schema, 
but the user can change it if they want. This is especially useful when using 
existing files, like log files, as input where appropriate. It’s also the 
beginnings of writing to DBs: since the Schema class is pretty flexible, it 
can contain DB connections and schema info. I was planning to put some 
examples in an example dir. I need Mongo but have also done Cassandra in a 
previous life.
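
Roughly what I mean by a flexible Schema (just a sketch; the real thing may 
differ):

    import scala.collection.mutable

    // A Schema is just a mutable bag of named options, so a text flavor
    // can carry delimiters while a DB flavor could carry connection info.
    class Schema(params: (String, Any)*) extends mutable.HashMap[String, Any] {
      this ++= params
    }

    // Default text-delimited schema; the "," is only a default, not hard-coded.
    val csvSchema = new Schema(
      "fieldDelimiter" -> ",",       // override with "\t" etc. for log files
      "elementDelimiter" -> ":")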

I like some of your nomenclature better and agree that cells and DRMs are the 
primary data types to read. I am working on reading DRMs now for a Spark RSJ 
(MAHOUT-1541 is itemsimilarity), so I may use part of your code but add the 
schema to it and use dictionaries to preserve application-specific IDs. It’s 
tied to RDD textFile, so input and output are parallel.
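
For example, the row-ID dictionary can be built in parallel straight from 
textFile (a sketch, with an assumed field layout):

    import org.apache.spark.SparkContext

    // Build the external-ID -> internal-index dictionary in parallel;
    // sc.textFile splits the read across the cluster.
    def buildRowDict(sc: SparkContext, path: String, delim: String = "\t")
        : scala.collection.Map[String, Int] = {
      sc.textFile(path)                          // one task per input split
        .map(_.split(delim)(0))                  // external row ID is the first field
        .distinct()
        .zipWithIndex()                          // assign a stable internal index
        .map { case (id, idx) => (id, idx.toInt) }
        .collectAsMap()
    }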

MAHOUT-1541 is already merged; maybe we can find a way to get this stuff 
together. 

Thanks to Comcast I only have internet in Starbucks, so be patient. 

On Aug 4, 2014, at 1:30 AM, Gokhan Capan <[email protected]> wrote:

Pat,

I was thinking of something like:
https://github.com/gcapan/mahout/compare/cellin

It's just an example of where I believe new input formats should go (the
example reads a DRM from a text file of <row_id,col_id,value> lines).
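
A minimal version of that reader might look like this (just a sketch of the 
idea, not the code in the branch):

    import org.apache.mahout.math.{RandomAccessSparseVector, Vector}
    import org.apache.spark.SparkContext

    // Parse "row_id,col_id,value" lines and group cells into sparse row vectors.
    def readCells(sc: SparkContext, path: String, ncol: Int) =
      sc.textFile(path)
        .map { line =>
          val Array(r, c, v) = line.split(",")
          (r.toInt, (c.toInt, v.toDouble))
        }
        .groupByKey()
        .map { case (row, cells) =>
          val vec: Vector = new RandomAccessSparseVector(ncol)
          cells.foreach { case (col, x) => vec.setQuick(col, x) }
          (row, vec)
        }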

Best


Gokhan


On Thu, Jul 31, 2014 at 12:00 AM, Pat Ferrel <[email protected]> wrote:

> Some work on this is being done as part of MAHOUT-1568, which is currently
> very early and in https://github.com/apache/mahout/pull/36
> 
> The idea there only covers text-delimited files and proposes a standard
> DRM-ish format but supports a configurable schema. Default is:
> 
> rowID<tab>itemID1:value1<space>itemID2:value2…
> 
> The IDs can be Mahout keys of any type, since they are written as text, or
> they can be application-specific IDs meaningful in a particular usage, like
> a user ID hash, an SKU from a catalog, or a URL.
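> 
> A tiny parser for one line of that default schema might look like this
> (just a sketch using the defaults above):
> 
>     def parseLine(line: String): (String, Seq[(String, Double)]) = {
>       val Array(rowId, elements) = line.split("\t", 2)   // rowID<tab>elements
>       val pairs = elements.split(" ").toSeq.map { e =>   // space-separated itemID:value
>         val Array(itemId, value) = e.split(":")
>         (itemId, value.toDouble)
>       }
>       (rowId, pairs)
>     }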
> 
> As far as dataframe-ish requirements, it seems to me there are two
> different things needed. The dataframe is needed while performing an
> algorithm or calculation and is kept in distributed data structures. There
> probably won’t be a lot of files kept around with the new engines. Text
> files can be used for pipelines in a pinch but would generally be for
> import/export. Therefore MAHOUT-1568 concentrates on import/export, not
> dataframes, though it could use them when they are ready.
> 
> 
> On Jul 30, 2014, at 7:53 AM, Gokhan Capan <[email protected]>
> wrote:
> 
> I believe the next step should be standardizing a minimal Matrix I/O
> capability (i.e. a couple of file formats other than [row_id,
> VectorWritable] SequenceFiles) required for a distributed computation
> engine, and adding dataframe-like structures that allow text columns.
> 
> 
> 
