1.3.1: Persisting RDD in parquet - Conflicting partition column names

2015-04-27 Thread sranga
Hi

I am getting the following error when persisting an RDD in Parquet format to
an S3 location. This code worked in version 1.2 but fails in 1.3.1.
Any help is appreciated.

Caused by: java.lang.AssertionError: assertion failed: Conflicting partition column names detected:
ArrayBuffer(batch_id)
ArrayBuffer()
at scala.Predef$.assert(Predef.scala:179)
at org.apache.spark.sql.parquet.ParquetRelation2$.resolvePartitions(newParquet.scala:933)
at org.apache.spark.sql.parquet.ParquetRelation2$.parsePartitions(newParquet.scala:851)
at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$7.apply(newParquet.scala:311)
at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$7.apply(newParquet.scala:303)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:303)
at org.apache.spark.sql.parquet.ParquetRelation2.insert(newParquet.scala:692)
at org.apache.spark.sql.parquet.DefaultSource.createRelation(newParquet.scala:129)
at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:240)
at org.apache.spark.sql.DataFrame.save(DataFrame.scala:1196)
at org.apache.spark.sql.DataFrame.saveAsParquetFile(DataFrame.scala:995)
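For illustration only (this is not Spark's source, and the bucket and file names are hypothetical): Spark 1.3's partition discovery derives partition column names from the key=value directory segments of each leaf file under the output root, and the assertion fires when different files yield different column sets — for example, when a non-partitioned file from an earlier run is still sitting at the root:

```python
def partition_columns(path):
    # Collect partition column names from "key=value" directory segments.
    return [seg.split("=", 1)[0] for seg in path.split("/") if "=" in seg]

# Hypothetical leaf files under one output root: one written under a
# batch_id= directory, one non-partitioned leftover from an earlier run.
paths = [
    "s3://bucket/output/batch_id=20150427/part-00000.parquet",
    "s3://bucket/output/part-00001.parquet",
]

column_sets = {tuple(partition_columns(p)) for p in paths}
if len(column_sets) > 1:
    print("Conflicting partition column names detected:", sorted(column_sets))
```

If that is what is happening here, writing each run to a fresh directory (or cleaning stray files out of the output root before saving) may avoid the assertion.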



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/1-3-1-Persisting-RDD-in-parquet-Conflicting-partition-column-names-tp22678.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark Streaming: HiveContext within Custom Actor

2014-12-29 Thread sranga
Hi

Could Spark SQL be used from within a custom actor that acts as a receiver
for a streaming application? If so, what is the recommended way of passing
the SparkContext to the actor?
Thanks for your help.


- Ranga



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-HiveContext-within-Custom-Actor-tp20892.html



Re: RDD Cache Cleanup

2014-11-26 Thread sranga
Just to close out this one: I noticed that the number of cached partitions
was quite low (1-14) for each of the RDDs. Increasing the number of
partitions (to ~400) resolved this for me.
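A back-of-envelope check of why this helps, using the ~10 GB RDD size from the original question in this thread: with only a handful of partitions, each cached block is very large and hard to keep resident, while at ~400 partitions each block is small.

```python
def partition_size_mb(total_gb, num_partitions):
    # Average size of one cached partition, in MB.
    return total_gb * 1024 / num_partitions

print(round(partition_size_mb(10, 14)))   # roughly 731 MB per cached block
print(round(partition_size_mb(10, 400)))  # roughly 26 MB per cached block
```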



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/RDD-Cache-Cleanup-tp19772p19908.html



RDD Cache Cleanup

2014-11-25 Thread sranga
Hi

I am noticing that the RDDs that are persisted get cleaned up very quickly,
usually within a few minutes. I tried setting the spark.cleaner.ttl property
to 20 hours and still see the same behavior.
In my use case, I have to persist about 20 RDDs, each about 10 GB in size.
There is enough memory available (around 1 TB), and the
spark.storage.memoryFraction property is set to 0.7.
How does the cleanup work? Any help is appreciated.
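For reference, these properties can be set in conf/spark-defaults.conf. Note that spark.cleaner.ttl is specified in seconds, so 20 hours would be 72000; if a smaller unit was used by mistake, that alone could explain the early cleanup. The values below are a sketch of the configuration described above, not a recommendation:

```
spark.cleaner.ttl              72000
spark.storage.memoryFraction   0.7
```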


- Ranga



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/RDD-Cache-Cleanup-tp19771.html




Re: Spark-Shell: OOM: GC overhead limit exceeded

2014-10-08 Thread sranga
Increasing the driver memory resolved this issue. Thanks to Nick for the
hint. Here is how I am starting the shell:

spark-shell --driver-memory 4g --driver-cores 4 --master local



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Shell-OOM-GC-overhead-limit-exceeded-tp15890p15940.html



Spark-Shell: OOM: GC overhead limit exceeded

2014-10-07 Thread sranga
Hi

I am new to Spark and trying to develop an application that loads data from
Hive. Here is my setup:
* Spark-1.1.0 (built using -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0
-Phive)
* Spark-shell running on a box with 16 GB RAM and a single 4-core processor
* OpenCSV library (SerDe)
* Hive table has 100K records

While trying to execute a query that does a group-by (select ... group by
...) on a hive table, I get an OOM error. I tried setting the following
parameters, but they don't seem to help:
spark.executor.memory  2g
spark.shuffle.memoryFraction  0.8
spark.storage.memoryFraction  0.1
spark.default.parallelism 24
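For reference, the settings above in conf/spark-defaults.conf form. Note that with spark-shell on --master local, the driver and executor share a single JVM, so driver memory (raised via --driver-memory) is often the limit that matters in practice; that indeed turned out to be the fix, per the follow-up in this thread.

```
spark.executor.memory         2g
spark.shuffle.memoryFraction  0.8
spark.storage.memoryFraction  0.1
spark.default.parallelism     24
```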

Any help is appreciated. The stack trace of the error is given below.


- Ranga

== Stack trace ==
java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.Arrays.copyOf(Arrays.java:3332)
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:569)
at java.lang.StringBuffer.append(StringBuffer.java:369)
at java.io.BufferedReader.readLine(BufferedReader.java:370)
at java.io.BufferedReader.readLine(BufferedReader.java:389)
at au.com.bytecode.opencsv.CSVReader.getNextLine(CSVReader.java:266)
at au.com.bytecode.opencsv.CSVReader.readNext(CSVReader.java:233)
at com.bizo.hive.serde.csv.CSVSerde.deserialize(CSVSerde.java:129)
at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$1.apply(TableReader.scala:279)
at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$1.apply(TableReader.scala:278)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:157)
at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:151)
at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Shell-OOM-GC-overhead-limit-exceeded-tp15890.html