Oh, and how about calling a single value from a matrix an "Element" as we do in 
Vector.Element? This would only apply to naming the reader functions, e.g. 
"readElements" or some derivative.

Sent from my iPhone

> On Aug 5, 2014, at 8:34 AM, Pat Ferrel <[email protected]> wrote:
> 
> The benefit of your read/write is that there are no dictionaries to take up 
> memory. This is an optimization that I haven’t done yet. The purpose of mine 
> was specifically to preserve external/non-Mahout IDs. So yours is more like 
> drm.writeDrm, which writes seqfiles (also sc.readDrm). 
> 
> The benefit of the stuff currently in mahout.drivers in the Spark module is 
> that even in a pipeline it will preserve external IDs or use Mahout 
> sequential Int keys as requested. The downside is that it requires a Schema, 
> though there are several default ones defined (in the PR) that would support 
> your exact use case. And it is not yet optimized for use without 
> dictionaries. 
> 
> How should we resolve the overlap? Pragmatically, if you were to merge your 
> code I could call it in the case where I don’t need dictionaries, solving my 
> optimization issue, but this would result in some duplicated code. Not sure if 
> this is a problem. Maybe yours could take a Schema, defaulted to the one we 
> agree has the correct delimiters?
> 
> The stuff in drivers does not read a text DRM yet. That will be part of 
> MAHOUT-1604.
> 
> On Aug 4, 2014, at 8:32 AM, Pat Ferrel <[email protected]> wrote:
> 
> This is great. We should definitely talk. What I’ve done is a first cut at a 
> data prep pipeline. It takes DRMs or cells and creates an RDD-backed DRM, but 
> it also maintains dictionaries so external IDs can be preserved and 
> re-attached when written, after any math or algo is done. It also has driver 
> and option processing stuff.
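[A minimal sketch of the dictionary idea described above, in language-neutral Python rather than the actual Scala driver code; the function name and shapes are illustrative, not from the PR. External IDs are mapped to sequential Int keys on read, and the reverse map is kept so the original IDs can be re-attached on write.]

```python
def build_dictionary(external_ids):
    """Map external IDs (user hashes, SKUs, URLs) to sequential Int keys.

    Returns the forward dictionary and its reverse; the reverse map is
    what lets the original IDs be re-attached after math/algos are done.
    Illustrative only -- not the actual Mahout driver code.
    """
    id_to_int = {}
    for ext_id in external_ids:
        if ext_id not in id_to_int:
            id_to_int[ext_id] = len(id_to_int)  # next sequential Int key
    int_to_id = {i: ext_id for ext_id, i in id_to_int.items()}
    return id_to_int, int_to_id
```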
> 
> No hard-coded “,”; you’d get that by using the default file schema, but the 
> user can change it if they want. This is especially useful for using existing 
> files like log files as input, where appropriate. It’s also the beginnings of 
> writing to DBs: since the Schema class is pretty flexible, it can contain DB 
> connections and schema info. I was planning to put some in an example dir. I 
> need Mongo but have also done Cassandra in a previous life.
> 
> I like some of your nomenclature better and agree that cells and DRMs are the 
> primary data types to read. I am working on reading DRMs now for a Spark RSJ 
> (MAHOUT-1541 is itemsimilarity), so I may use part of your code but add the 
> schema to it and use dictionaries to preserve application-specific IDs. It’s 
> tied to RDD textFile, so it is parallel for input and output.
> 
> MAHOUT-1541 is already merged, maybe we can find a way to get this stuff 
> together. 
> 
> Thanks to Comcast I only have internet in Starbucks so be patient. 
> 
> On Aug 4, 2014, at 1:30 AM, Gokhan Capan <[email protected]> wrote:
> 
> Pat,
> 
> I was thinking of something like:
> https://github.com/gcapan/mahout/compare/cellin
> 
> It's just an example of where I believe new input formats should go (the
> example is to input a DRM from a text file of <row_id,col_id,value> lines).
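[For reference, a hedged sketch of what such a cell reader does, as plain Python over an iterable of lines; the real code in the linked branch is Scala against Spark. The name `read_cells` and the dict-of-dicts return shape are mine, chosen only to illustrate the `<row_id,col_id,value>` format.]

```python
from collections import defaultdict

def read_cells(lines, delim=","):
    """Parse <row_id,col_id,value> lines into a sparse row-major dict of dicts."""
    rows = defaultdict(dict)
    for line in lines:
        row_id, col_id, value = line.strip().split(delim)
        rows[int(row_id)][int(col_id)] = float(value)
    return dict(rows)
```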
> 
> Best
> 
> 
> Gokhan
> 
> 
>> On Thu, Jul 31, 2014 at 12:00 AM, Pat Ferrel <[email protected]> wrote:
>> 
>> Some work on this is being done as part of MAHOUT-1568, which is currently
>> very early and in https://github.com/apache/mahout/pull/36
>> 
>> The idea there only covers text-delimited files and proposes a standard
>> DRM-ish format but supports a configurable schema. Default is:
>> 
>> rowID<tab>itemID1:value1<space>itemID2:value2…
>> 
>> The IDs can be mahout keys of any type since they are written as text or
>> they can be application specific IDs meaningful in a particular usage, like
>> a user ID hash, or SKU from a catalog, or URL.
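[A small sketch of parsing one line of that default schema, with the tab/space/colon delimiters passed as parameters to mirror the configurable-schema idea; the function name is illustrative, not from MAHOUT-1568.]

```python
def parse_drm_line(line, row_delim="\t", elem_delim=" ", kv_delim=":"):
    """Parse rowID<tab>itemID1:value1<space>itemID2:value2 into (rowID, {itemID: value}).

    Delimiters are parameters, mirroring the configurable schema;
    rsplit on ":" tolerates IDs that themselves contain a colon (e.g. URLs).
    """
    row_id, elems = line.rstrip("\n").split(row_delim, 1)
    vector = {}
    for elem in elems.split(elem_delim):
        item_id, value = elem.rsplit(kv_delim, 1)
        vector[item_id] = float(value)
    return row_id, vector
```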
>> 
>> As far as dataframe-ish requirements, it seems to me there are two
>> different things needed. The dataframe is needed while performing an
>> algorithm or calculation and is kept in distributed data structures. There
>> probably won’t be a lot of files kept around with the new engines. Any text
>> files can be used for pipelines in a pinch but generally would be for
>> import/export. Therefore MAHOUT-1568 concentrates on import/export not
>> dataframes, though it could use them when they are ready.
>> 
>> 
>> On Jul 30, 2014, at 7:53 AM, Gokhan Capan <[email protected]>
>> wrote:
>> 
>> I believe the next step should be standardizing minimal Matrix I/O
>> capability (i.e. a couple of file formats other than [row_id, VectorWritable]
>> SequenceFiles) required for a distributed computation engine, and adding
>> data-frame-like structures that allow text columns.
> 
> 
