Re: SchemaRDD partition on specific column values?
Hi Michael,

I have opened the following JIRA for this: https://issues.apache.org/jira/browse/SPARK-4849

I am having a look at the code to see what can be done, and then we can have a discussion about the approach. Let me know if you have any comments/suggestions.

Thanks
-Nitin

On Sun, Dec 14, 2014 at 2:53 PM, Michael Armbrust wrote:
> I'm happy to discuss what it would take to make sure we can propagate this
> information correctly. Please open a JIRA (and mention me in it).
>
> Regarding including it in 1.2.1, it depends on how invasive the change
> ends up being, but it is certainly possible.
>
> On Thu, Dec 11, 2014 at 3:55 AM, nitin wrote:
>> Can we take this as a performance improvement task in Spark 1.2.1? I can
>> help contribute for this.
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/SchemaRDD-partition-on-specific-column-values-tp20350p20623.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org

--
Regards
Nitin Goyal
Re: SchemaRDD partition on specific column values?
I'm happy to discuss what it would take to make sure we can propagate this information correctly. Please open a JIRA (and mention me in it).

Regarding including it in 1.2.1, it depends on how invasive the change ends up being, but it is certainly possible.

On Thu, Dec 11, 2014 at 3:55 AM, nitin wrote:
> Can we take this as a performance improvement task in Spark 1.2.1? I can
> help contribute for this.
Re: SchemaRDD partition on specific column values?
Can we take this as a performance improvement task in Spark 1.2.1? I can help contribute for this.
Re: SchemaRDD partition on specific column values?
It does not appear that the in-memory caching currently preserves information about the partitioning of the data, so this optimization will probably not work.

On Thu, Dec 4, 2014 at 8:42 PM, nitin wrote:
> With some quick googling, I learnt that we can provide "distribute by
> <column>" in Hive QL to distribute data based on a column's values. My
> question now is: if I use "distribute by id", will there be any performance
> improvements? Will I be able to avoid data movement in the shuffle (Exchange
> before the JOIN step) and improve overall performance?
Re: SchemaRDD partition on specific column values?
With some quick googling, I learnt that we can provide "distribute by <column>" in Hive QL to distribute data based on a column's values. My question now is: if I use "distribute by id", will there be any performance improvements? Will I be able to avoid data movement in the shuffle (Exchange before the JOIN step) and improve overall performance?
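For readers finding this thread later, a minimal sketch of the idea being discussed, in Spark 1.2-era Scala (the table name `events` and column `id` are illustrative, not from the thread):

```scala
// Hypothetical sketch: pre-distributing cached data by a join key.
// Assumes an existing SparkContext `sc` and a Hive table `events`
// with a column `id` (both names are illustrative).
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)

// DISTRIBUTE BY hashes rows across reducers by the given column,
// so all rows with the same id land in the same partition.
val distributed = hiveContext.sql("SELECT * FROM events DISTRIBUTE BY id")
distributed.registerTempTable("events_by_id")
hiveContext.cacheTable("events_by_id")

// Caveat (Michael's point upthread): in Spark 1.2 the in-memory cache
// does not record this partitioning, so a subsequent join on `id` may
// still introduce an Exchange (shuffle) -- hence SPARK-4849.
val joined = hiveContext.sql(
  "SELECT a.id FROM events_by_id a JOIN events_by_id b ON a.id = b.id")
```

Whether the shuffle is actually avoided depends on the planner propagating the output partitioning through the cached relation, which is exactly what the JIRA proposes to fix.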