I tried the options below.

1) Increased executor memory, up to the maximum possible, 14 GB. Same error.
2) Tried the new version, spark-xml_2.10:0.4.1. Same error.
3) Tried lower-level rowTags. A lower-level rowTag worked and returned 16,000 rows.
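For reference, a sketch of how options 1 and 2 can be invoked (the jar path and package coordinates are the ones from this thread; the master and memory values are illustrative):

```shell
# Option 1: raise executor memory (I went up to 14 GB) with the local jar.
pyspark --master yarn \
  --executor-memory 14G \
  --jars /tmp/spark-xml_2.10-0.3.3.jar

# Option 2: pull in the newer package instead of the local jar.
pyspark --master yarn \
  --executor-memory 14G \
  --packages com.databricks:spark-xml_2.10:0.4.1

# Option 3 is a change in the read itself, not the launch command:
# sqlContext.read.format('com.databricks.spark.xml') \
#     .options(rowTag='<lower-level-tag>')   # placeholder; actual tag not shown in this thread
```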
Are there any workarounds for this issue? I tried playing with
spark.memory.fraction and spark.memory.storageFraction, but it did not
help. I'd appreciate your help on this!

On Tue, Nov 15, 2016 at 8:44 PM, Arun Patel <arunp.bigd...@gmail.com> wrote:

> Thanks for the quick response.
>
> It is a single XML file and I am using a top-level rowTag, so it creates
> only one row in a DataFrame with 5 columns. One of these columns contains
> most of the data as a StructType. Is there a limit on how much data a
> single cell of a DataFrame can store?
>
> I will check with the new version, try different rowTags, and increase
> executor memory tomorrow. I will open a new issue as well.
>
> On Tue, Nov 15, 2016 at 7:52 PM, Hyukjin Kwon <gurwls...@gmail.com> wrote:
>
>> Hi Arun,
>>
>> I have a few questions.
>>
>> Does your XML file have a few huge documents? If a single row is very
>> large (say 500 MB), it would consume a lot of memory, because, if I
>> remember correctly, it has to hold at least one whole row to iterate.
>> This happened to me before while processing a huge record for test
>> purposes.
>>
>> How about trying to increase --executor-memory?
>>
>> Also, if you don't mind, could you try selecting only a few fields with
>> the latest version to prune the data, just to be doubly sure?
>>
>> Lastly, if you still face this problem, do you mind opening an issue at
>> https://github.com/databricks/spark-xml/issues?
>>
>> I will try to take a look as best I can.
>>
>> Thank you.
>>
>> 2016-11-16 9:12 GMT+09:00 Arun Patel <arunp.bigd...@gmail.com>:
>>
>>> I am trying to read an XML file which is 1 GB in size. I am getting a
>>> 'java.lang.OutOfMemoryError: Requested array size exceeds VM limit'
>>> error after reading 7 partitions in local mode. In YARN mode, it
>>> throws 'java.lang.OutOfMemoryError: Java heap space' after reading 3
>>> partitions.
>>>
>>> Any suggestions?
>>>
>>> PySpark shell command: pyspark --master local[4] --driver-memory 3G --jars /tmp/spark-xml_2.10-0.3.3.jar
>>>
>>> DataFrame creation command: df = sqlContext.read.format('com.databricks.spark.xml').options(rowTag='GGL').load('GGL_1.2G.xml')
>>>
>>> 16/11/15 18:27:04 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 25978 ms on localhost (1/10)
>>> 16/11/15 18:27:04 INFO NewHadoopRDD: Input split: hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:268435456+134217728
>>> 16/11/15 18:27:55 INFO Executor: Finished task 2.0 in stage 0.0 (TID 2). 2309 bytes result sent to driver
>>> 16/11/15 18:27:55 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID 3, localhost, partition 3,ANY, 2266 bytes)
>>> 16/11/15 18:27:55 INFO Executor: Running task 3.0 in stage 0.0 (TID 3)
>>> 16/11/15 18:27:55 INFO TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 51001 ms on localhost (2/10)
>>> 16/11/15 18:27:55 INFO NewHadoopRDD: Input split: hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:402653184+134217728
>>> 16/11/15 18:28:19 INFO Executor: Finished task 3.0 in stage 0.0 (TID 3). 2309 bytes result sent to driver
>>> 16/11/15 18:28:19 INFO TaskSetManager: Starting task 4.0 in stage 0.0 (TID 4, localhost, partition 4,ANY, 2266 bytes)
>>> 16/11/15 18:28:19 INFO Executor: Running task 4.0 in stage 0.0 (TID 4)
>>> 16/11/15 18:28:19 INFO TaskSetManager: Finished task 3.0 in stage 0.0 (TID 3) in 24336 ms on localhost (3/10)
>>> 16/11/15 18:28:19 INFO NewHadoopRDD: Input split: hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:536870912+134217728
>>> 16/11/15 18:28:40 INFO Executor: Finished task 4.0 in stage 0.0 (TID 4). 2309 bytes result sent to driver
>>> 16/11/15 18:28:40 INFO TaskSetManager: Starting task 5.0 in stage 0.0 (TID 5, localhost, partition 5,ANY, 2266 bytes)
>>> 16/11/15 18:28:40 INFO Executor: Running task 5.0 in stage 0.0 (TID 5)
>>> 16/11/15 18:28:40 INFO TaskSetManager: Finished task 4.0 in stage 0.0 (TID 4) in 20895 ms on localhost (4/10)
>>> 16/11/15 18:28:40 INFO NewHadoopRDD: Input split: hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:671088640+134217728
>>> 16/11/15 18:29:01 INFO Executor: Finished task 5.0 in stage 0.0 (TID 5). 2309 bytes result sent to driver
>>> 16/11/15 18:29:01 INFO TaskSetManager: Starting task 6.0 in stage 0.0 (TID 6, localhost, partition 6,ANY, 2266 bytes)
>>> 16/11/15 18:29:01 INFO Executor: Running task 6.0 in stage 0.0 (TID 6)
>>> 16/11/15 18:29:01 INFO TaskSetManager: Finished task 5.0 in stage 0.0 (TID 5) in 20793 ms on localhost (5/10)
>>> 16/11/15 18:29:01 INFO NewHadoopRDD: Input split: hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:805306368+134217728
>>> 16/11/15 18:29:22 INFO Executor: Finished task 6.0 in stage 0.0 (TID 6). 2309 bytes result sent to driver
>>> 16/11/15 18:29:22 INFO TaskSetManager: Starting task 7.0 in stage 0.0 (TID 7, localhost, partition 7,ANY, 2266 bytes)
>>> 16/11/15 18:29:22 INFO Executor: Running task 7.0 in stage 0.0 (TID 7)
>>> 16/11/15 18:29:22 INFO TaskSetManager: Finished task 6.0 in stage 0.0 (TID 6) in 21306 ms on localhost (6/10)
>>> 16/11/15 18:29:22 INFO NewHadoopRDD: Input split: hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:939524096+134217728
>>> 16/11/15 18:29:43 INFO Executor: Finished task 7.0 in stage 0.0 (TID 7). 2309 bytes result sent to driver
>>> 16/11/15 18:29:43 INFO TaskSetManager: Starting task 8.0 in stage 0.0 (TID 8, localhost, partition 8,ANY, 2266 bytes)
>>> 16/11/15 18:29:43 INFO Executor: Running task 8.0 in stage 0.0 (TID 8)
>>> 16/11/15 18:29:43 INFO TaskSetManager: Finished task 7.0 in stage 0.0 (TID 7) in 21130 ms on localhost (7/10)
>>> 16/11/15 18:29:43 INFO NewHadoopRDD: Input split: hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:1073741824+134217728
>>> 16/11/15 18:29:48 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
>>> java.lang.OutOfMemoryError: Requested array size exceeds VM limit
>>>     at java.util.Arrays.copyOf(Arrays.java:2271)
>>>     at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
>>>     at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>>>     at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:122)
>>>     at java.io.DataOutputStream.write(DataOutputStream.java:88)
>>>     at com.databricks.spark.xml.XmlRecordReader.readUntilMatch(XmlInputFormat.scala:188)
>>>     at com.databricks.spark.xml.XmlRecordReader.next(XmlInputFormat.scala:156)
>>>     at com.databricks.spark.xml.XmlRecordReader.nextKeyValue(XmlInputFormat.scala:141)
>>>     at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:168)
>>>     at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>>>     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>>>     at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>>>     at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>>>     at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>>>     at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)
>>>     at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157)
>>>     at scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201)
>>>     at scala.collection.AbstractIterator.aggregate(Iterator.scala:1157)
>>>     at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$24.apply(RDD.scala:1142)
>>>     at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$24.apply(RDD.scala:1142)
>>>     at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$25.apply(RDD.scala:1143)
>>>     at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$25.apply(RDD.scala:1143)
>>>     at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$22.apply(RDD.scala:717)
>>>     at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$22.apply(RDD.scala:717)
>>>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
>>>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
>>>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
>>>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
>>>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>>>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>>> 16/11/15 18:29:48 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-0,5,main]
>>> java.lang.OutOfMemoryError: Requested array size exceeds VM limit
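A back-of-the-envelope check (a sketch of mine, not from the thread) of why a top-level rowTag fails while lower-level rowTags work: the ~1.2 GB file is read as ten 128 MiB input splits (matching the "(n/10)" counter in the log), but with rowTag='GGL' the record reader must buffer essentially the whole file as one record, and a byte buffer that doubles its capacity when full then requests a 2 GiB array, above the JVM's array-size limit.

```python
import math

# Assumptions: file is ~1.2 GiB, splits are 128 MiB (the offsets in the log),
# and the record buffer doubles its capacity when full, in the style of
# ByteArrayOutputStream.grow() seen in the stack trace above.
MIB = 1024 * 1024
file_size = int(1.2 * 1024 * MIB)   # ~1.2 GiB
split_size = 128 * MIB              # 134217728, as in the log offsets

# Number of input splits -> ten tasks, matching "(1/10) ... (7/10)" in the log.
print(math.ceil(file_size / split_size))   # -> 10

# With a top-level rowTag, one record spans nearly the whole file. A buffer
# that doubles from a small initial capacity ends up requesting 2 GiB:
cap = 32
while cap < file_size:
    cap *= 2
print(cap)                          # -> 2147483648 (2 GiB)

# Java arrays are int-indexed and HotSpot caps their length slightly below
# Integer.MAX_VALUE, so this allocation fails with
# "Requested array size exceeds VM limit".
print(cap > 2**31 - 1 - 8)          # -> True
```

This also matches the observation that a lower-level rowTag (16,000 small rows instead of one huge one) succeeds: each buffered record then stays far below the array-size limit.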