Spark repartition question...

2017-04-30 Thread Muthu Jayakumar
Hello there, I am trying to understand the difference between the following repartition()... a. def repartition(partitionExprs: Column*): Dataset[T] b. def repartition(numPartitions: Int, partitionExprs: Column*): Dataset[T] c. def repartition(numPartitions: Int): Dataset[T] My understanding is

Re: Repartition question

2015-08-04 Thread Richard Marscher
Hi, it is possible to control the number of partitions for the RDD without calling repartition by setting the max split size for the Hadoop input format used. Tracing through the code, XmlInputFormat extends FileInputFormat, which determines the number of splits (which NewHadoopRDD uses to

Repartition question

2015-08-03 Thread Naveen Madhire
Hi All, I am running the Wikipedia parsing example from the Advanced Analytics with Spark book. https://github.com/sryza/aas/blob/d3f62ef3ed43a59140f4ae8afbe2ef81fc643ef2/ch06-lsa/src/main/scala/com/cloudera/datascience/lsa/ParseWikipedia.scala#l112 The partitions of the RDD returned by