You've convinced me that this is probably a bad idea. You never know when this might come back to bite us later.
On 12 Dec, 2011, at 12:50 AM, Dmitriy Lyubimov wrote:

> Oh, now I remember what the deal with NullWritable was.
>
> Yes, SequenceFile would read it, as in:
>
>     Configuration conf = new Configuration();
>     FileSystem fs = FileSystem.getLocal(new Configuration());
>     Path testPath = new Path("name.seq");
>
>     IntWritable iw = new IntWritable();
>     SequenceFile.Writer w =
>         SequenceFile.createWriter(fs,
>                                   conf,
>                                   testPath,
>                                   NullWritable.class,
>                                   IntWritable.class);
>     w.append(NullWritable.get(), iw);
>     w.close();
>
>     SequenceFile.Reader r = new SequenceFile.Reader(fs, testPath, conf);
>     while (r.next(NullWritable.get(), iw));
>     r.close();
>
> but SequenceFileInputFormat would not. I.e. it is OK if you read it explicitly, but I don't think one can use such files as an input for other MR jobs.
>
> But since in this case there's no MR job to consume that output (and unlikely ever will be), I guess it is OK to save NullWritable in this case...
>
> -d
>
> On Mon, Dec 12, 2011 at 12:30 AM, jirapos...@reviews.apache.org (Commented) (JIRA) <j...@apache.org> wrote:
>>
>> [ https://issues.apache.org/jira/browse/MAHOUT-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13167406#comment-13167406 ]
>>
>> jirapos...@reviews.apache.org commented on MAHOUT-923:
>> ------------------------------------------------------
>>
>> bq. On 2011-12-12 02:10:01, Dmitriy Lyubimov wrote:
>> bq. > Hm. I hope I did not misread the code or miss something.
>> bq. >
>> bq. > 1 -- I am not sure this will actually work as intended unless the # of reducers is coerced to 1, of which I see no mention in the code.
>> bq. > 2 -- Mappers do nothing, passing all the row pressure on to the sort, which is absolutely not necessary even if you use combiners. This is going to be especially the case if you coerce 1 reducer and no combiners. IMO the mean computation should be pushed up to the mappers to avoid the sort pressure of map reduce. Then the reduction becomes largely symbolic (but you do need to pass the # of rows each mapper has seen on to the reducer, in order for that operation to apply correctly).
>> bq. > 3 -- I am not sure -- is NullWritable as a key legit? In my experience the sequence file reader cannot instantiate it, because NullWritable is a singleton and its creation is prohibited by making the constructor private.
>> bq.
>> bq. Raphael Cendrillon wrote:
>> bq.     Thanks Dmitriy.
>> bq.
>> bq.     Regarding 1, if I understand correctly the number of reducers depends on the number of unique keys. Since all keys are set to the same value (null), all of the mapper outputs should arrive at the same reducer. This seems to work in the unit test, but I may be missing something?
>> bq.
>> bq.     Regarding 2, that makes a lot of sense. I'm wondering how many rows should be processed per mapper? I guess there is a trade-off between scalability (processing more rows within a single map job means that each row must have fewer columns) and speed? Is there someplace in the SSVD code where the matrix is split into slices of rows that I could use as a reference?
>> bq.
>> bq.     Regarding 3, I believe NullWritable is OK. It's used pretty extensively in TimesSquaredJob in DistributedRowMatrix. However, if you feel there is some disadvantage to this I could replace "NullWritable.get()" with "new IntWritable(1)" (that is, set all of the keys to 1). Would that be more suitable?
>> bq.
>>
>> The NullWritable objection is withdrawn. Apparently I haven't looked into Hadoop for too long; amazingly, it seems to work now.
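(For concreteness, the downstream half of that check -- whether a NullWritable-keyed sequence file can feed another MR job through SequenceFileInputFormat -- could be staged with a trivial pass-through job along the lines below. This is only a sketch: the class name, job name, and output path are made up, it reuses the "name.seq" file written in the snippet above, and it relies on the default identity Mapper and Reducer with a single reduce task so the NullWritable keys also go through shuffle and sort.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class NullWritableKeyCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "nullwritable-key-check");
    job.setJarByClass(NullWritableKeyCheck.class);
    // default (identity) Mapper and Reducer are enough; one reduce task
    // forces the NullWritable keys through shuffle and sort as well
    job.setNumReduceTasks(1);
    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path("name.seq"));
    FileOutputFormat.setOutputPath(job, new Path("name-check-out"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}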
>>
>> 1 -- I don't think your statement about the # of reduce tasks is true.
>>
>> The job (or, rather, the user) sets the number of reduce tasks via a config property. All users will follow the Hadoop recommendation to set that to about 95% of the capacity they want to take (usually the whole cluster). So in a production environment you are virtually _guaranteed_ to get something like 75 reducers on a 40-node cluster, and consequently 75 output files (unless users really want to read the details of your job and figure out you meant it to be just 1). Now, it is true that only one file will actually end up containing anything, and the rest of the task slots will just be occupied doing nothing.
>>
>> So there are two problems with that scheme: a) a job that allocates so many task slots that do nothing is not a good citizen, since a real production cluster is always shared with multiple jobs; b) your code assumes the result will end up in partition 0, whereas contractually it may end up in any of the 75 files (in reality, with the default hash partitioner and key 1 it will wind up in partition 0001, unless there's only one reducer, as I guess there was in your test).
>>
>> 2 -- It is simple. When you send n rows to the reducers, they are shuffled -- and sorted. Sending massive sets to reducers has two effects: first, even if they all group under the same key, they are still sorted at a cost of ~ n log(n/p), where p is the number of partitions, assuming a uniform distribution (which it is not, because you are sending everything to the same place). Just because we can run a distributed sort doesn't mean we should. Secondly, all these rows are physically moved to the reduce tasks, which is still ~ n rows. Finally, what makes your case especially problematic is that you are sending everything to the same reducer, i.e. you are not actually doing the sort in a distributed way but rather a simple single-threaded sort at the one reducer that happens to get all the input.
>>
>> So that would allocate a lot of task slots that are not used; do a sort that is not needed; and do it in a single reducer thread for the entire input, which is not parallel at all.
>>
>> Instead, consider this: the map has a state consisting of (sum(X), k). It keeps updating it -- sum += x, k++ -- for every new x. At the end of the cycle (in cleanup) it writes only one tuple, (sum(X), k), as output. So we just reduced the complexity of the sort and the IO from millions of elements to just the # of maps (which is perhaps just a handful and in reality rarely overshoots 500 mappers). That is, at least about 4 orders of magnitude.
>>
>> Now, we send that handful of tuples to a single reducer and just do the combining (sum(X) += sum_i(X); n += n_i, where i indexes the tuples seen by the reducer). And because it is only a handful, the reducer also runs very quickly, so the fact that we coerced it to be 1 is pretty benign. A volume of anywhere between 1 and 500 vectors to sum up doesn't warrant distributed computation.
>>
>> But you have to make sure there's only 1 reducer no matter what the user put into the config, and you have to make sure you do all the heavy lifting in the mappers.
>>
>> Finally, you don't even need to coerce to 1 reducer. You could still have several (but uniformly distributed) and do the final combine in the front end of the method. However, given the small size and triviality of the reduction processing, it is probably not warranted. Coercing to 1 reducer is OK in this case IMO.
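(Roughly, the mapper-side accumulation described above could look like the sketch below. It is only a sketch: class names are illustrative, it assumes the DRM rows arrive as IntWritable / VectorWritable pairs, and it packs the (row count, column sums) pair into a single vector with the count in slot 0 purely to avoid defining the IntVectorTupleWritable mentioned in the review. The driver would still have to call job.setNumReduceTasks(1) regardless of what the user configured.)

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class RowMeanSketch {

  public static class SumMapper
      extends Mapper<IntWritable, VectorWritable, IntWritable, VectorWritable> {

    private Vector sum;   // running column-wise sum for this map task
    private long rows;    // number of rows seen by this map task

    @Override
    protected void map(IntWritable key, VectorWritable row, Context ctx) {
      sum = (sum == null) ? new DenseVector(row.get()) : sum.plus(row.get());
      rows++;
    }

    @Override
    protected void cleanup(Context ctx) throws IOException, InterruptedException {
      if (sum == null) {
        return;  // empty split: emit nothing
      }
      // pack (rows, sum) into one vector: slot 0 holds the row count
      Vector packed = new DenseVector(sum.size() + 1);
      packed.set(0, rows);
      for (int i = 0; i < sum.size(); i++) {
        packed.set(i + 1, sum.get(i));
      }
      // one record per map task, all under the same key
      ctx.write(new IntWritable(1), new VectorWritable(packed));
    }
  }

  public static class MeanReducer
      extends Reducer<IntWritable, VectorWritable, IntWritable, VectorWritable> {

    @Override
    protected void reduce(IntWritable key, Iterable<VectorWritable> partials, Context ctx)
        throws IOException, InterruptedException {
      Vector total = null;
      for (VectorWritable vw : partials) {
        total = (total == null) ? new DenseVector(vw.get()) : total.plus(vw.get());
      }
      if (total == null) {
        return;
      }
      double n = total.get(0);                     // total row count
      Vector mean = new DenseVector(total.size() - 1);
      for (int i = 0; i < mean.size(); i++) {
        mean.set(i, total.get(i + 1) / n);         // column-wise mean
      }
      ctx.write(key, new VectorWritable(mean));
    }
  }
}

A real patch would presumably fold this into MatrixRowMeanJob and use a proper (int, vector) writable instead of the packed vector, but the shape -- all accumulation in the mappers, a single near-trivial reduce -- is the point being made above.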
>>
>> 3 -- I guess any writable is OK but NullWritable. Maybe something has changed; I remember falling into that pitfall several generations of Hadoop ago. You can verify by staging a simple experiment: write a sequence file with NullWritable as either the key or the value and try to read it back. In my test long ago it would write OK but not read back. I believe a similar approach is used with the keys in shuffle and sort: there is a reflection-based writable factory inside which tries to use the default constructor of the class to instantiate it, which is (was) not available for NullWritable.
>>
>> - Dmitriy
>>
>>
>> -----------------------------------------------------------
>> This is an automatically generated e-mail. To reply, visit:
>> https://reviews.apache.org/r/3147/#review3838
>> -----------------------------------------------------------
>>
>>
>> On 2011-12-12 00:30:24, Raphael Cendrillon wrote:
>> bq.
>> bq.  -----------------------------------------------------------
>> bq.  This is an automatically generated e-mail. To reply, visit:
>> bq.  https://reviews.apache.org/r/3147/
>> bq.  -----------------------------------------------------------
>> bq.
>> bq.  (Updated 2011-12-12 00:30:24)
>> bq.
>> bq.
>> bq.  Review request for mahout.
>> bq.
>> bq.
>> bq.  Summary
>> bq.  -------
>> bq.
>> bq.  Here's a patch with a simple job to calculate the row mean (column-wise mean). One outstanding issue is the combiner; this requires a writable class, IntVectorTupleWritable, where the int stores the number of rows and the vector stores the column-wise sum.
>> bq.
>> bq.
>> bq.  This addresses bug MAHOUT-923.
>> bq.      https://issues.apache.org/jira/browse/MAHOUT-923
>> bq.
>> bq.
>> bq.  Diffs
>> bq.  -----
>> bq.
>> bq.    /trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java 1213095
>> bq.    /trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixRowMeanJob.java PRE-CREATION
>> bq.    /trunk/core/src/test/java/org/apache/mahout/math/hadoop/TestDistributedRowMatrix.java 1213095
>> bq.
>> bq.  Diff: https://reviews.apache.org/r/3147/diff
>> bq.
>> bq.
>> bq.  Testing
>> bq.  -------
>> bq.
>> bq.  JUnit test
>> bq.
>> bq.
>> bq.  Thanks,
>> bq.
>> bq.  Raphael
>>
>>
>>> Row mean job for PCA
>>> --------------------
>>>
>>>                 Key: MAHOUT-923
>>>                 URL: https://issues.apache.org/jira/browse/MAHOUT-923
>>>             Project: Mahout
>>>          Issue Type: Improvement
>>>          Components: Math
>>>    Affects Versions: 0.6
>>>            Reporter: Raphael Cendrillon
>>>            Assignee: Raphael Cendrillon
>>>             Fix For: Backlog
>>>
>>>         Attachments: MAHOUT-923.patch
>>>
>>>
>>> Add a map reduce job for calculating the mean row (column-wise mean) of a Distributed Row Matrix, for use in PCA.