Hello,

Apologies, but there is one correction. Where I say "TinkerPop 3.1.2 will be this fast too" -- that is not right. I forgot that the GryoSerializer for SparkGraphComputer was only updated for 3.2.0. Thus, TinkerPop 3.1.2 should have speeds somewhere between 3.1.1 and 3.2.0 (leaning more towards 3.2.0 speeds).

Thanks,
Marko.

http://markorodriguez.com

On Feb 9, 2016, at 3:31 PM, Marko Rodriguez <okramma...@gmail.com> wrote:

> Hi,
>
> Two tickets were recently completed.
>     https://issues.apache.org/jira/browse/TINKERPOP-1131 (TinkerPop 3.1.2-SNAPSHOT & TinkerPop 3.2.0-SNAPSHOT)
>     https://issues.apache.org/jira/browse/TINKERPOP-962 (TinkerPop 3.2.0-SNAPSHOT)
>         - with updates to serialization as well in this push.
>
> With these merged, I benchmarked SparkGraphComputer against Friendster (2.5 billion edges) for the following queries:
>
> g.V().count() -- answer 125000000 (125 million vertices)
>     - TinkerPop 3.0.0.MX: 2.5 hours
>     - TinkerPop 3.0.0: 1.5 hours
>     - TinkerPop 3.1.1: 23 minutes
>     - TinkerPop 3.2.0: 6.8 minutes
>
> g.V().out().count() -- answer 2586147869 (2.5 billion length-1 paths (i.e. edges))
>     - TinkerPop 3.0.0.MX: unknown
>     - TinkerPop 3.0.0: 2.5 hours
>     - TinkerPop 3.1.1: 1.1 hours
>     - TinkerPop 3.2.0: 13 minutes (*** TinkerPop 3.1.2 will be this fast too)
>
> g.V().out().out().count() -- answer 640528666156 (640 billion length-2 paths)
>     - TinkerPop 3.0.0.MX: unknown
>     - TinkerPop 3.0.0: unknown
>     - TinkerPop 3.1.1: unknown
>     - TinkerPop 3.2.0: 55 minutes (*** TinkerPop 3.1.2 will be this fast too)
>
> g.V().out().out().out().count() -- answer 215664338057221 (215 trillion length 3-paths)
>     - TinkerPop 3.0.0.MX: 12.8 hours
>     - TinkerPop 3.0.0: 8.6 hours
>     - TinkerPop 3.1.1: 2.4 hours
>     - TinkerPop 3.2.0: 1.6 hours (*** TinkerPop 3.1.2 will be this fast too)
>
> For SparkGraphComputer, I no longer have to use DISK_ONLY because the memory optimizations have greatly reduced heap usage and thus, I can do MEMORY_AND_DISK_SER w/o causing the GC to go crazy. Moreover, because of TINKERPOP-1131, ReducingBarrierSteps (e.g. groupCount(), count(), sum(), max(), etc.) are significantly faster and use a minuscule amount of memory. Together, these updates have greatly improved GraphComputer as you can see specifically with the SparkGraphComputer benchmark above.
>
> Finally, check this out. I decided to test the speed of g.V().count() when the input graph is already partitioned to the Spark cluster. This will be what you see when you use PersistedOutputRDD/InputRDD or when you use a graph system that provides a Partitioner to their InputRDD and thus, avoids an initial partition by SparkGraphComputer.
>
> g.V().count() -- answer 125000000 (125 million vertices)
>     - TinkerPop 3.2.0: 5.2 minutes
>         … hmm, not as good as I was hoping. I thought this would be around 1-2 minutes. :| I bet there is something I'm doing wrong.
>
> Enjoy!,
> Marko.
>
> http://markorodriguez.com
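
To put the quoted timings in relative terms, here is a rough Python sketch (not part of the original benchmark) that turns the figures above into speedup factors against TinkerPop 3.2.0. The TIMINGS layout and the speedups() helper are made up for illustration only. Hour values were converted to minutes by hand, runs reported as "unknown" are stored as None and skipped, and 3.1.2 is left out because no measured number was given for it.

```python
# Wall-clock timings in minutes, copied from the benchmark quoted in this thread.
# None marks a run that was reported as "unknown".
TIMINGS = {
    "g.V().count()": {
        "3.0.0.MX": 150.0,  # 2.5 hours
        "3.0.0": 90.0,      # 1.5 hours
        "3.1.1": 23.0,
        "3.2.0": 6.8,
    },
    "g.V().out().count()": {
        "3.0.0.MX": None,
        "3.0.0": 150.0,     # 2.5 hours
        "3.1.1": 66.0,      # 1.1 hours
        "3.2.0": 13.0,
    },
    "g.V().out().out().count()": {
        "3.0.0.MX": None,
        "3.0.0": None,
        "3.1.1": None,
        "3.2.0": 55.0,
    },
    "g.V().out().out().out().count()": {
        "3.0.0.MX": 768.0,  # 12.8 hours
        "3.0.0": 516.0,     # 8.6 hours
        "3.1.1": 144.0,     # 2.4 hours
        "3.2.0": 96.0,      # 1.6 hours
    },
}


def speedups(timings, baseline="3.2.0"):
    """Return, per query, how many times slower each older version was
    than the baseline version (rounded to one decimal place)."""
    result = {}
    for query, runs in timings.items():
        base = runs[baseline]
        result[query] = {
            version: round(minutes / base, 1)
            for version, minutes in runs.items()
            if version != baseline and minutes is not None
        }
    return result


if __name__ == "__main__":
    for query, factors in speedups(TIMINGS).items():
        print(query)
        for version, factor in factors.items():
            print(f"    {version}: {factor}x slower than 3.2.0")
```

The factors are simple ratios of the reported wall-clock times, so they say nothing about cluster size or configuration, which the thread does not give.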