1.3.1: Persisting RDD in parquet - Conflicting partition column names

2015-04-27 Thread sranga
Hi

I am getting the following error when persisting an RDD in Parquet format to
an S3 location. This code worked in version 1.2 but fails in 1.3.1.
Any help is appreciated.

Caused by: java.lang.AssertionError: assertion failed: Conflicting partition column names detected:
ArrayBuffer(batch_id)
ArrayBuffer()
at scala.Predef$.assert(Predef.scala:179)
at org.apache.spark.sql.parquet.ParquetRelation2$.resolvePartitions(newParquet.scala:933)
at org.apache.spark.sql.parquet.ParquetRelation2$.parsePartitions(newParquet.scala:851)
at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$7.apply(newParquet.scala:311)
at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$7.apply(newParquet.scala:303)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:303)
at org.apache.spark.sql.parquet.ParquetRelation2.insert(newParquet.scala:692)
at org.apache.spark.sql.parquet.DefaultSource.createRelation(newParquet.scala:129)
at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:240)
at org.apache.spark.sql.DataFrame.save(DataFrame.scala:1196)
at org.apache.spark.sql.DataFrame.saveAsParquetFile(DataFrame.scala:995)
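For illustration only (this is not Spark's source, and the bucket and file names are hypothetical): Spark 1.3's partition discovery derives partition column names from the key=value directory segments of each leaf file under the output root, and the assertion fires when different files yield different column sets — for example, when a non-partitioned file from an earlier run is still sitting at the root:

```python
def partition_columns(path):
    # Collect partition column names from "key=value" directory segments.
    return [seg.split("=", 1)[0] for seg in path.split("/") if "=" in seg]

# Hypothetical leaf files under one output root: one written under a
# batch_id= directory, one non-partitioned leftover from an earlier run.
paths = [
    "s3://bucket/output/batch_id=20150427/part-00000.parquet",
    "s3://bucket/output/part-00001.parquet",
]

column_sets = {tuple(partition_columns(p)) for p in paths}
if len(column_sets) > 1:
    print("Conflicting partition column names detected:", sorted(column_sets))
```

If that is what is happening here, writing each run to a fresh directory (or cleaning stray files out of the output root before saving) may avoid the assertion.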



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/1-3-1-Persisting-RDD-in-parquet-Conflicting-partition-column-names-tp22678.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark Streaming: HiveContext within Custom Actor

2014-12-29 Thread sranga
Hi

Could Spark SQL be used from within a custom actor that acts as a receiver
for a streaming application? If so, what is the recommended way of passing
the SparkContext to the actor?
Thanks for your help.


- Ranga



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-HiveContext-within-Custom-Actor-tp20892.html



Re: RDD Cache Cleanup

2014-11-26 Thread sranga
Just to close out this one: I noticed that the number of cached partitions
was quite low (1-14) for each of the RDDs. Increasing the number of
partitions (to ~400) resolved this for me.
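A back-of-envelope check of why this helps, using the ~10 GB RDD size from the original question in this thread: with only a handful of partitions, each cached block is very large and hard to keep resident, while at ~400 partitions each block is small.

```python
def partition_size_mb(total_gb, num_partitions):
    # Average size of one cached partition, in MB.
    return total_gb * 1024 / num_partitions

print(round(partition_size_mb(10, 14)))   # roughly 731 MB per cached block
print(round(partition_size_mb(10, 400)))  # roughly 26 MB per cached block
```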



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/RDD-Cache-Cleanup-tp19772p19908.html



RDD Cache Cleanup

2014-11-25 Thread sranga
Hi

I am noticing that the RDDs that are persisted get cleaned up very quickly,
usually within a few minutes. I tried setting the spark.cleaner.ttl property
to 20 hours and still see the same behavior.
In my use case, I have to persist about 20 RDDs, each about 10 GB in size.
There is enough memory available (around 1 TB), and the
spark.storage.memoryFraction property is set to 0.7.
How does the cleanup work? Any help is appreciated.
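For reference, these properties can be set in conf/spark-defaults.conf. Note that spark.cleaner.ttl is specified in seconds, so 20 hours would be 72000; if a smaller unit was used by mistake, that alone could explain the early cleanup. The values below are a sketch of the configuration described above, not a recommendation:

```
spark.cleaner.ttl              72000
spark.storage.memoryFraction   0.7
```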


- Ranga



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/RDD-Cache-Cleanup-tp19771.html




Re: Spark-Shell: OOM: GC overhead limit exceeded

2014-10-08 Thread sranga
Increasing the driver memory resolved this issue. Thanks to Nick for the
hint. Here is how I am starting the shell:

spark-shell --driver-memory 4g --driver-cores 4 --master local



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Shell-OOM-GC-overhead-limit-exceeded-tp15890p15940.html



Spark-Shell: OOM: GC overhead limit exceeded

2014-10-07 Thread sranga
Hi

I am new to Spark and trying to develop an application that loads data from
Hive. Here is my setup:
* Spark-1.1.0 (built using -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0
-Phive)
* Spark-shell running on a box with 16 GB RAM and a single 4-core processor
* OpenCSV library (SerDe)
* Hive table has 100K records

While trying to execute a query that does a group-by (select ... group by
...) on a hive table, I get an OOM error. I tried setting the following
parameters, but they don't seem to help:
spark.executor.memory  2g
spark.shuffle.memoryFraction  0.8
spark.storage.memoryFraction  0.1
spark.default.parallelism 24
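For reference, the settings above in conf/spark-defaults.conf form. Note that with spark-shell on --master local, the driver and executor share a single JVM, so driver memory (raised via --driver-memory) is often the limit that matters in practice; that indeed turned out to be the fix, per the follow-up in this thread.

```
spark.executor.memory         2g
spark.shuffle.memoryFraction  0.8
spark.storage.memoryFraction  0.1
spark.default.parallelism     24
```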

Any help is appreciated. The stack trace of the error is given below.


- Ranga

== Stack trace ==
java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.Arrays.copyOf(Arrays.java:3332)
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:569)
at java.lang.StringBuffer.append(StringBuffer.java:369)
at java.io.BufferedReader.readLine(BufferedReader.java:370)
at java.io.BufferedReader.readLine(BufferedReader.java:389)
at au.com.bytecode.opencsv.CSVReader.getNextLine(CSVReader.java:266)
at au.com.bytecode.opencsv.CSVReader.readNext(CSVReader.java:233)
at com.bizo.hive.serde.csv.CSVSerde.deserialize(CSVSerde.java:129)
at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$1.apply(TableReader.scala:279)
at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$1.apply(TableReader.scala:278)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:157)
at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:151)
at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Shell-OOM-GC-overhead-limit-exceeded-tp15890.html