> I've written the M/R job in DistributedRowMatrix to do transpose, but our
> document matrices produced by SparseVectorsFromSequenceFiles don't have
> integer-valued keys for the rows, so transpose doesn't yet make sense.
> Fooey. More work to do.

I suppose it's a simple enough M/R job. See the MeanShiftCanopyCreator in MAHOUT-307: we will M/R over the dataset, assign ids based on the map attempt, and output int id => vector (the vector itself has the document ids anyway).
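To make the id-assignment idea concrete, here is a rough sketch in plain Python (not Hadoop or Mahout code). It assumes each map task knows its own index and the total number of map tasks, and it interleaves ids so that tasks never collide without talking to each other. The names `assign_row_ids` and `run_all_tasks` are made up for illustration, and the sparse vector is modeled as a plain dict.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# Assumption: a sparse vector is modeled as {column index: weight}.
Vector = Dict[int, float]


def assign_row_ids(
    records: Iterable[Tuple[str, Vector]],
    task_index: int,
    num_tasks: int,
) -> Iterator[Tuple[int, str, Vector]]:
    """Simulate one map task: replace each text key with an int row id.

    Ids are interleaved (task_index, task_index + num_tasks, ...), so two
    tasks can never emit the same id even though they do not coordinate.
    """
    for local_count, (doc_key, vector) in enumerate(records):
        yield task_index + local_count * num_tasks, doc_key, vector


def run_all_tasks(
    splits: List[List[Tuple[str, Vector]]],
) -> Dict[int, Tuple[str, Vector]]:
    """Run every simulated map task and collect int id => (doc key, vector)."""
    rows: Dict[int, Tuple[str, Vector]] = {}
    for task_index, split in enumerate(splits):
        for row_id, doc_key, vector in assign_row_ids(split, task_index, len(splits)):
            rows[row_id] = (doc_key, vector)
    return rows
```

With uneven splits the ids are unique but not necessarily contiguous, so the row count of the resulting matrix would have to be taken from the largest id rather than from the number of documents.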
My mind was wandering, and I was thinking of giving the record attempt a better purpose than just creating junk n-gram data (though that is good enough for a record attempt). There are a couple of datasets we can explore, like the genome dataset: http://aws.amazon.com/publicdatasets/

These are all 150-200GB datasets. There is also the Wikipedia edits dataset (1TB+), which has all versions of all the documents.

- *Annotated Human Genome Data provided by ENSEMBL*
  <http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2315>
  The Ensembl project produces genome databases for human as well as almost 50 other species, and makes this information freely available.

- *Various US Census Databases from The US Census Bureau*
  <http://developer.amazonwebservices.com/connect/kbcategory.jspa?categoryID=248>
  United States demographic data from the 1980, 1990, and 2000 US Censuses, summary information about Business and Industry, and 2003-2006 Economic Household Profile Data.

- *UniGene provided by the National Center for Biotechnology Information*
  <http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2283>
  A set of transcript sequences of well-characterized genes and hundreds of thousands of expressed sequence tags (EST) that provide an organized view of the transcriptome.

- *Freebase Data Dump from Freebase.com*
  <http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2320>
  A data dump of all the current facts and assertions in the Freebase system. Freebase <http://www.freebase.com> is an open database of the world’s information, covering millions of topics in hundreds of categories. Drawing from large open data sets like Wikipedia, MusicBrainz, and the SEC archives, it contains structured information on many popular topics, including movies, music, people and locations – all reconciled and freely available.

Any thoughts on these? For Freebase, say, could we generate a topic -> item matrix?

Robin