1.3.1: Persisting RDD in parquet - Conflicting partition column names
Hi,

I am getting the following error when persisting an RDD in Parquet format to an S3 location. This code was working in version 1.2; it fails in 1.3.1. Any help is appreciated.

Caused by: java.lang.AssertionError: assertion failed: Conflicting partition column names detected: ArrayBuffer(batch_id) ArrayBuffer()
    at scala.Predef$.assert(Predef.scala:179)
    at org.apache.spark.sql.parquet.ParquetRelation2$.resolvePartitions(newParquet.scala:933)
    at org.apache.spark.sql.parquet.ParquetRelation2$.parsePartitions(newParquet.scala:851)
    at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$7.apply(newParquet.scala:311)
    at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$7.apply(newParquet.scala:303)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:303)
    at org.apache.spark.sql.parquet.ParquetRelation2.insert(newParquet.scala:692)
    at org.apache.spark.sql.parquet.DefaultSource.createRelation(newParquet.scala:129)
    at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:240)
    at org.apache.spark.sql.DataFrame.save(DataFrame.scala:1196)
    at org.apache.spark.sql.DataFrame.saveAsParquetFile(DataFrame.scala:995)
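For context, here is a minimal sketch of the kind of call that reaches this code path in 1.3.1; the DataFrame, column name, and S3 paths are assumptions for illustration. The assertion is raised by partition discovery when some leaf directories under the target path carry a batch_id=... component and others carry none, so ParquetRelation2 cannot agree on a single set of partition columns.

    // Hypothetical sketch (spark-shell, Spark 1.3.1): paths and names are placeholders.
    val df = sqlContext.parquetFile("s3n://my-bucket/input/")

    // saveAsParquetFile goes through ParquetRelation2, which runs partition
    // discovery on the destination. If that directory already mixes
    // batch_id=NNN subdirectories with plain part files, resolvePartitions
    // fails with the assertion shown above.
    df.saveAsParquetFile("s3n://my-bucket/output/")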
Spark Streaming: HiveContext within Custom Actor
Hi,

Could Spark SQL be used from within a custom actor that acts as a receiver for a streaming application? If yes, what is the recommended way of passing the SparkContext to the actor?

Thanks for your help.

- Ranga
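For readers unfamiliar with the pattern, a minimal sketch of the kind of custom actor receiver the question refers to, using the Spark 1.x ActorHelper API; the class name and message type are assumptions, and no SparkContext is used inside the actor itself:

    import akka.actor.{Actor, Props}
    import org.apache.spark.streaming.receiver.ActorHelper

    // Hypothetical receiver actor: records pushed to it are handed to Spark
    // Streaming via store(); the message type and names are placeholders.
    class LineReceiverActor extends Actor with ActorHelper {
      def receive = {
        case line: String => store(line)
      }
    }

    // Registered on the StreamingContext, e.g.:
    //   val lines = ssc.actorStream[String](Props[LineReceiverActor], "LineReceiver")

Note that the receiver runs on an executor, which is part of why passing a SparkContext or HiveContext into it is not straightforward.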
Re: RDD Cache Cleanup
Just to close out this one: I noticed that the number of cached partitions was quite low for each of the RDDs (1 - 14). Increasing the number of partitions (to ~400) resolved this for me.
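A minimal sketch of the fix described above, assuming the RDD is repartitioned before it is persisted; the RDD name and the exact partition count are placeholders:

    import org.apache.spark.storage.StorageLevel

    // Hypothetical: raise the partition count before caching so each cached
    // block is smaller and more of the RDD can be retained in memory.
    val repartitioned = largeRdd.repartition(400)
    repartitioned.persist(StorageLevel.MEMORY_ONLY)
    repartitioned.count()   // materialize the cache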
RDD Cache Cleanup
Hi,

I am noticing that the RDDs that are persisted get cleaned up very quickly, usually within a few minutes. I tried setting the spark.cleaner.ttl property to 20 hours and still see the same behavior.

In my use case I have to persist about 20 RDDs, each about 10 GB in size. There is enough memory available (around 1 TB), and the spark.storage.memoryFraction property is set to 0.7.

How does the cleanup work? Any help is appreciated.

- Ranga
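For reference, a sketch of the setup described above; the values mirror the post, while the app name, input path, and storage level are assumptions:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    // Hypothetical configuration: 20-hour TTL (spark.cleaner.ttl is in seconds)
    // and storage fraction 0.7, matching the values mentioned above.
    val conf = new SparkConf()
      .setAppName("rdd-cache-test")
      .set("spark.cleaner.ttl", (20 * 60 * 60).toString)
      .set("spark.storage.memoryFraction", "0.7")
    val sc = new SparkContext(conf)

    // One of the ~20 large RDDs; the input path is a placeholder.
    val rdd = sc.textFile("hdfs:///data/large-dataset")
    rdd.persist(StorageLevel.MEMORY_ONLY)
    rdd.count()   // materialize the cache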
Re: Spark-Shell: OOM: GC overhead limit exceeded
Increasing the driver memory resolved this issue. Thanks to Nick for the hint. Here is how I am starting the shell:

    spark-shell --driver-memory 4g --driver-cores 4 --master local
Spark-Shell: OOM: GC overhead limit exceeded
Hi,

I am new to Spark and trying to develop an application that loads data from Hive. Here is my setup:

* Spark 1.1.0 (built using -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive)
* Executing spark-shell on a box with 16 GB RAM
* 4 cores, single processor
* OpenCSV library (SerDe)
* Hive table has 100K records

While trying to execute a query that does a group-by (select ... group by ...) on the Hive table, I get an OOM error. I tried setting the following parameters, but they don't seem to help:

    spark.executor.memory 2g
    spark.shuffle.memoryFraction 0.8
    spark.storage.memoryFraction 0.1
    spark.default.parallelism 24

Any help is appreciated. The stack trace of the error is given below.

- Ranga

== Stack trace ==
java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.util.Arrays.copyOf(Arrays.java:3332)
    at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
    at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
    at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:569)
    at java.lang.StringBuffer.append(StringBuffer.java:369)
    at java.io.BufferedReader.readLine(BufferedReader.java:370)
    at java.io.BufferedReader.readLine(BufferedReader.java:389)
    at au.com.bytecode.opencsv.CSVReader.getNextLine(CSVReader.java:266)
    at au.com.bytecode.opencsv.CSVReader.readNext(CSVReader.java:233)
    at com.bizo.hive.serde.csv.CSVSerde.deserialize(CSVSerde.java:129)
    at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$1.apply(TableReader.scala:279)
    at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$1.apply(TableReader.scala:278)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:157)
    at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:151)
    at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
    at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:54)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
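A minimal sketch of the kind of query described above, as it might be run in spark-shell against the OpenCSV-backed Hive table; the table and column names are placeholders:

    import org.apache.spark.sql.hive.HiveContext

    // Hypothetical reproduction (Spark 1.1, spark-shell): table and column
    // names are assumptions. The group-by over the CSV-SerDe table is where
    // the GC-overhead OOM appears in the stack trace above.
    val hiveContext = new HiveContext(sc)
    val grouped = hiveContext.sql(
      "SELECT some_key, COUNT(*) FROM csv_table GROUP BY some_key")
    grouped.collect().foreach(println)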