Thank you, Avery. I wish I had found the bug earlier.

On 14.04.2013 23:25, "Avery Ching" <[email protected]> wrote:
> Thanks for your input, Sebastian. Given the choice between removing
> PageRankVertex and adding the fix, I've added your fix and will cut RC2 a
> bit later today. I really hope this is the last RC.
>
> Avery
>
> On 4/14/13 9:34 AM, Sebastian Schelter wrote:
>
>> Hi Avery,
>>
>> I see your concerns. The benchmarking question is difficult; we had very
>> bad experiences with Mahout in that regard. E.g., we once had an
>> M/R-based PageRank implementation in Mahout that used our integer-based
>> vectors, and we removed it after we got public complaints that you can't
>> fit the whole web into the range of an integer. Personally, I'd also
>> refrain from using floats instead of doubles for benchmarks, as this
>> simply means you give up on accuracy.
>>
>> Regarding benchmarks, I guess the best thing we could do is publish our
>> own numbers. The current runtimes I've seen are already very good:
>> Giraph beat a very optimized Stratosphere implementation that we did for
>> a recent paper by approx. 25%.
>>
>> To conclude, I in no way want to hold up the current release. I'm
>> perfectly fine with not including the patch and optimizing the
>> implementation for a 1.0.1 release, but then we should remove the
>> current examples.PageRankVertex from the 1.0 release, as the convergence
>> detection is broken and we should not knowingly ship buggy code.
>>
>> Best,
>> Sebastian
>>
>>
>> On 14.04.2013 18:18, Avery Ching wrote:
>>
>>> Hi Sebastian,
>>>
>>> Thanks for the patch. I'll try to take a look at it.
>>>
>>> The only reason I bring the optimizations up is that a lot of folks tend
>>> to compare PageRank performance. The optimizations I'm referring to are
>>> Giraph ones, not algorithmic ones. We use ints and floats for ids and
>>> messages, respectively, instead of longs and doubles (1/2 the network
>>> traffic), and IntNullArrayEdges vertex edges (efficient array-backed
>>> edges) instead of ByteArrayEdges.
>>> You can see
>>> https://issues.apache.org/jira/browse/giraph-543 for
>>> more details.
>>>
>>> Anyway, given that we are going to ship a 1.0.1 release in a few weeks
>>> for a variety of reasons, should this really hold up the current
>>> release? I would prefer not to cut any more RCs unless things are
>>> totally broken (i.e. profiles not compiling, major Giraph bugs, etc.).
>>> There are still a lot of outstanding issues in JIRA; we can't fix them
>>> all for the 1.0 release.
>>>
>>> Let me know what you think.
>>>
>>> Avery
>>>
>>> On 4/13/13 10:46 AM, Sebastian Schelter wrote:
>>>
>>>> Hi Avery,
>>>>
>>>> I found the bug and I can provide a patch today or tomorrow, so
>>>> hopefully we can include that in the release (to not knowingly ship
>>>> buggy code). Furthermore, I improved the code to protect against
>>>> rounding errors.
>>>>
>>>> I don't really get what you mean by the missing optimization in
>>>> comparison to the benchmark PageRank implementation.
>>>>
>>>> The implementation in o.a.g.examples.PageRankVertex aims to be a robust
>>>> real-world implementation. As an optimization, it dismisses edge weights
>>>> and reuses objects where possible. Furthermore, it is able to handle
>>>> dangling vertices, which are present in almost every real-world network,
>>>> and it automatically detects the number of supersteps to run. With the
>>>> patch, it should also provide improved numerical stability.
>>>>
>>>> If the runtimes don't look good enough when compared to the benchmark
>>>> implementation, this might also be caused by the dataset, which has a
>>>> skewed degree distribution (like most real-world networks). The
>>>> benchmark uses a uniform degree distribution AFAIK.
>>>>
>>>> Best,
>>>> Sebastian
>>>>
>>>> On 13.04.2013 15:46, Avery Ching wrote:
>>>>
>>>>> That's great, Sebastian. I would also recommend taking a look at the
>>>>> PageRankBenchmark for a performance comparison.
>>>>> It has had a lot of speed improvements and should be a bunch faster
>>>>> than PageRankVertex. Even that, though, is not totally optimized.
>>>>> Hopefully we'll be adding a "how to optimize performance" guide in
>>>>> the near future. Should we delay the release or simply ship a 1.1,
>>>>> say in the next month, with this fix and supporting YARN's 2.0.4?
>>>>> I'd like to get on a more normal release cycle rather than once a
>>>>> year =).
>>>>>
>>>>> Avery
>>>>>
>>>>> On 4/13/13 3:02 AM, Sebastian Schelter wrote:
>>>>>
>>>>>> Hi there,
>>>>>>
>>>>>> I've got some good and bad news. I tested PageRankVertex (not the
>>>>>> Benchmark but the example implementation
>>>>>> o.a.g.examples.PageRankVertex) from trunk, compiled for Hadoop 1.0,
>>>>>> on a cluster of 26 machines with 208 cores.
>>>>>>
>>>>>> I used the Webbase2001 dataset [1], which has 115M vertices and more
>>>>>> than 1B edges, and got some awesome running times: the average
>>>>>> superstep takes 15 seconds (!!!). Awesome work, I have to say!
>>>>>>
>>>>>> Unfortunately, there seems to be an issue with the convergence
>>>>>> detection, as I didn't get the correct convergence behavior. I'd like
>>>>>> to have a look into that this week, so we can ship a performant
>>>>>> PageRank implementation which automatically runs an appropriate
>>>>>> number of supersteps. Hope this doesn't delay the release too much.
>>>>>>
>>>>>> Best,
>>>>>> Sebastian
>>>>>>
>>>>>>
>>>>>> [1] http://law.di.unimi.it/webdata/webbase-2001/
>>>>>>
>>>>>>
>>>>>> On 13.04.2013 07:39, Avery Ching wrote:
>>>>>>
>>>>>>> Thanks to the quick feedback from Roman and Lewis, we have cut a
>>>>>>> new RC1 that addresses the following issues.
>>>>>>>
>>>>>>> * Got rid of .git repo in tarball
>>>>>>> * Fixed issue with not compiling without git repo (GIRAPH-628)
>>>>>>> * Used gnutar in OSX rather than tar to generate the tarball and
>>>>>>>   get rid of warnings
>>>>>>> * Pushed GIRAPH-627 to support the yarn profile better
>>>>>>> * Tarball name changed to the final artifact name (giraph-1.0.tar.gz)
>>>>>>>
>>>>>>> Release notes:
>>>>>>> http://people.apache.org/~aching/giraph-1.0-RC1/RELEASE_NOTES.html
>>>>>>>
>>>>>>> Release artifacts:
>>>>>>> http://people.apache.org/~aching/giraph-1.0-RC1/
>>>>>>>
>>>>>>> Corresponding git tag:
>>>>>>> https://git-wip-us.apache.org/repos/asf?p=giraph.git;a=shortlog;h=refs/tags/release-1.0-RC1
>>>>>>>
>>>>>>> Signing keys:
>>>>>>> http://people.apache.org/keys/group/giraph.asc
>>>>>>>
>>>>>>> The vote runs for 72 hours, until Monday 11pm PST.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Avery
>>>>>>>
>>>>>>> Original message below regarding rc0:
>>>>>>>
>>>>>>> -------------------------------
>>>>>>>
>>>>>>> Fellow Giraphers,
>>>>>>>
>>>>>>> We have our first release candidate since graduating from
>>>>>>> incubation. This is a source release, primarily due to the different
>>>>>>> versions of Hadoop we support with munge (similar to the 0.1
>>>>>>> release). Since 0.1, we've made A TON of progress on overall
>>>>>>> performance, optimizing memory use, split vertex/edge inputs, easy
>>>>>>> interoperability with Apache Hive, and a bunch of other areas. In
>>>>>>> many ways, this is an almost totally different codebase. Thanks
>>>>>>> everyone for your hard work!
>>>>>>>
>>>>>>> Apache Giraph has been running in production at Facebook (against
>>>>>>> Facebook's Corona implementation of Hadoop -
>>>>>>> https://github.com/facebook/hadoop-20/tree/master/src/contrib/corona)
>>>>>>> since around last December. It has proven to be very scalable,
>>>>>>> performant, and enables a bunch of new applications. Based on the
>>>>>>> drastic improvements and the use of Giraph in production, it seems
>>>>>>> appropriate to bump up our version to 1.0.
>>>>>>>
>>>>>>> While anyone can vote, the ASF requires majority approval from the
>>>>>>> PMC -- i.e., at least three PMC members must vote affirmatively for
>>>>>>> release, and there must be more positive than negative votes.
>>>>>>> Releases may not be vetoed. Before voting +1 PMC members are
>>>>>>> required to download the signed source code package, compile it as
>>>>>>> provided, and test the resulting executable on their own platform,
>>>>>>> along with also verifying that the package meets the requirements
>>>>>>> of the ASF policy on releases.
>>>>>>>
>>>>>>> Please test this against many other Hadoop versions and let us know
>>>>>>> how this goes!
>>>>>>>
>>>>>>> Release notes:
>>>>>>> http://people.apache.org/~aching/giraph-1.0-RC0/RELEASE_NOTES.html
>>>>>>>
>>>>>>> Release artifacts:
>>>>>>> http://people.apache.org/~aching/giraph-1.0-RC0/
>>>>>>>
>>>>>>> Corresponding git tag:
>>>>>>> https://git-wip-us.apache.org/repos/asf?p=giraph.git;a=shortlog;h=refs/tags/release-1.0-RC0
>>>>>>>
>>>>>>> Signing keys:
>>>>>>> http://people.apache.org/keys/group/giraph.asc
>>>>>>>
>>>>>>> The vote runs for 72 hours, until Monday 4pm PST.
>>>>>>>
>>>>>>> Thanks everyone for your patience with this release!
>>>>>>>
>>>>>>> Avery
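
For readers following the thread, the two PageRankVertex features discussed above (handling dangling vertices and automatically detecting how many supersteps to run) can be illustrated with a small sketch in plain Python. This is not the Giraph code and not the patch mentioned in the thread. The function name `pagerank`, the damping factor of 0.85, the tolerance, the L1-norm stopping rule, and the renormalization step are all assumptions chosen for the example.

```python
def pagerank(out_edges, damping=0.85, tolerance=1e-6, max_supersteps=200):
    """Illustrative power iteration over {vertex: [targets]}.

    Returns (ranks, supersteps_run). All parameter names and defaults are
    assumptions for this sketch, not values taken from Giraph.
    """
    vertices = list(out_edges)
    n = len(vertices)
    ranks = {v: 1.0 / n for v in vertices}

    for superstep in range(1, max_supersteps + 1):
        # Rank sitting on dangling vertices (no out-edges) would otherwise
        # leak out of the system, so it is redistributed uniformly.
        dangling = sum(ranks[v] for v in vertices if not out_edges[v])

        # Each vertex sends an equal share of its rank along every out-edge.
        incoming = {v: 0.0 for v in vertices}
        for v in vertices:
            targets = out_edges[v]
            if targets:
                share = ranks[v] / len(targets)
                for t in targets:
                    incoming[t] += share

        new_ranks = {
            v: (1.0 - damping) / n + damping * (incoming[v] + dangling / n)
            for v in vertices
        }

        # Renormalize so accumulated rounding error cannot make the total
        # drift away from 1 (an assumed safeguard for this sketch).
        total = sum(new_ranks.values())
        new_ranks = {v: r / total for v, r in new_ranks.items()}

        # Convergence detection: stop once the L1 change is small enough.
        delta = sum(abs(new_ranks[v] - ranks[v]) for v in vertices)
        ranks = new_ranks
        if delta < tolerance:
            return ranks, superstep

    return ranks, max_supersteps


if __name__ == "__main__":
    # Tiny hypothetical graph; "d" is a dangling vertex.
    graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": []}
    ranks, supersteps = pagerank(graph)
    print(supersteps, ranks)
```

In a vertex-centric system the dangling mass and the L1 change would typically be gathered through some global aggregation step rather than a local loop. The sketch keeps everything in one process only to make the stopping rule easy to see.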
