Hi Jem,

Join time scaling linearly with the size of the big table doesn't seem that surprising to me. What were you expecting?
I assume you're doing normalRDD.join(indexedRDD). If you were to replace the indexedRDD with a normal RDD, what times do you get?

On Tue, Jan 13, 2015 at 5:35 AM, Jem Tucker <jem.tuc...@gmail.com> wrote:

> Hi,
>
> I have been playing around with the indexedRDD
> (https://issues.apache.org/jira/browse/SPARK-2365,
> https://github.com/amplab/spark-indexedrdd) and have been very impressed
> with its performance. Some performance testing has revealed worse than
> expected scaling of the join performance*, and I was just wondering if
> anyone else has any experience using it and what they have found?
>
> Thanks,
>
> Jem
>
> *Table below shows some of my results when joining a small RDD to a large
> IndexedRDD. Each table consisted of a Long key and 15 character String
> value. Shows an almost linear time increase with the number of rows in the
> bigger table.
>
> Small Table Rows    Big Table Rows    Time (s)
> 50000               10000000          0.6
> 50000               50000000          0.8
> 50000               100000000         1.5
> 50000               150000000         2.1
> 50000               200000000         2.8
> 50000               500000000         7.2
> 50000               1000000000        12.2
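
As a rough sanity check on the "almost linear" observation, the quoted timings can be divided by the size of the big table to get a per-row cost. The sketch below is plain Python with no Spark involved; the numbers are copied from the table in the quoted message, and the names `results` and `ns_per_row` are made up purely for illustration.

```python
# Timings copied from the quoted message: (big table rows, join time in seconds).
# The small table was 50000 rows in every run.
results = [
    (10_000_000, 0.6),
    (50_000_000, 0.8),
    (100_000_000, 1.5),
    (150_000_000, 2.1),
    (200_000_000, 2.8),
    (500_000_000, 7.2),
    (1_000_000_000, 12.2),
]


def ns_per_row(rows, seconds):
    """Join time divided by big-table size, in nanoseconds per row."""
    return seconds / rows * 1e9


# Hypothetical helper list: per-row cost for each measurement.
per_row = [(rows, ns_per_row(rows, secs)) for rows, secs in results]

for rows, ns in per_row:
    print(f"{rows:>13,d} rows: {ns:5.1f} ns/row")
```

On these figures the per-row cost comes out at roughly 12 to 16 ns from 50M rows upward, against 60 ns at 10M rows. That would be consistent with a fixed overhead plus a cost that grows about linearly with the big table, though a handful of single measurements can't say much more than that.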