I tried the options below.

1) Increased executor memory, up to the maximum possible (14GB).  Same
error.
2) Tried the new version, spark-xml_2.10:0.4.1.  Same error.
3) Tried a lower-level rowTag.  That worked and returned 16000 rows (see
the sketch below).
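
For reference, the lower-level read looked roughly like this (a sketch;
'ITEM' is a placeholder, since I am not naming the actual child tag here):

    df = sqlContext.read.format('com.databricks.spark.xml') \
        .options(rowTag='ITEM') \
        .load('GGL_1.2G.xml')     # 'ITEM' stands in for the real child tag
    df.count()                    # returned 16000 rows with the lower-level tag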

Are there any workarounds for this issue?  I tried playing with
spark.memory.fraction and spark.memory.storageFraction, but neither
helped.  I'd appreciate any help on this!
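
For reference, this is roughly how I passed those settings (a sketch; the
values shown are just ones I experimented with, not recommendations):

    pyspark --master local[4] --driver-memory 3G \
        --conf spark.memory.fraction=0.8 \
        --conf spark.memory.storageFraction=0.3 \
        --jars /tmp/spark-xml_2.10-0.3.3.jar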



On Tue, Nov 15, 2016 at 8:44 PM, Arun Patel <arunp.bigd...@gmail.com> wrote:

> Thanks for the quick response.
>
> It's a single XML file, and I am using a top-level rowTag.  So it creates
> only one row in a DataFrame with 5 columns.  One of these columns
> contains most of the data as a StructType.  Is there a limit on how much
> data a single cell of a DataFrame can hold?
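>
> For reference, this is how I am inspecting it (the struct column name is
> a placeholder for my actual column):
>
>     df.printSchema()                  # shows the 5 top-level columns
>     df.select('bigStruct.*').show()   # expand the nested StructType column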
>
> I will check with the new version, try different rowTags, and increase
> executor-memory tomorrow.  I will open a new issue as well.
>
>
>
> On Tue, Nov 15, 2016 at 7:52 PM, Hyukjin Kwon <gurwls...@gmail.com> wrote:
>
>> Hi Arun,
>>
>>
>> I have a few questions.
>>
>> Does your XML file contain a few huge documents?  If a single row is very
>> large (say, 500MB), it would consume a lot of memory, because, if I
>> remember correctly, the reader has to hold at least one whole row in
>> memory while iterating.  I remember this happening to me before while
>> processing a huge record for testing purposes.
>>
>>
>> How about trying to increase --executor-memory?
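>>
>> For example, something along these lines (a sketch; the memory values are
>> placeholders to adjust for your cluster, and the jar path is the one from
>> your earlier mail):
>>
>>     pyspark --master yarn --executor-memory 8G --driver-memory 4G \
>>         --jars /tmp/spark-xml_2.10-0.3.3.jar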
>>
>>
>> Also, with the latest version, could you try selecting only a few fields
>> to prune the data, just to be doubly sure, if you don't mind?
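>>
>> For instance, something like this (hypothetical column names, since I
>> don't know your schema):
>>
>>     df = sqlContext.read.format('com.databricks.spark.xml') \
>>         .options(rowTag='GGL').load('GGL_1.2G.xml')
>>     df.select('field1', 'field2').count()   # touch only a couple of fields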
>>
>>
>> Lastly, if you still face this problem, would you mind opening an issue
>> at https://github.com/databricks/spark-xml/issues?
>>
>> I will do my best to take a look.
>>
>>
>> Thank you.
>>
>>
>> 2016-11-16 9:12 GMT+09:00 Arun Patel <arunp.bigd...@gmail.com>:
>>
>>> I am trying to read an XML file that is 1GB in size.  I am getting a
>>> 'java.lang.OutOfMemoryError: Requested array size exceeds VM limit'
>>> error after reading 7 partitions in local mode.  In YARN mode, it
>>> throws a 'java.lang.OutOfMemoryError: Java heap space' error after
>>> reading 3 partitions.
>>>
>>> Any suggestions?
>>>
>>> PySpark Shell Command:    pyspark --master local[4] --driver-memory 3G
>>> --jars /tmp/spark-xml_2.10-0.3.3.jar
>>>
>>>
>>>
>>> Dataframe Creation Command:   df = sqlContext.read.format('com.databricks.spark.xml').options(rowTag='GGL').load('GGL_1.2G.xml')
>>>
>>>
>>>
>>> 16/11/15 18:27:04 INFO TaskSetManager: Finished task 1.0 in stage 0.0
>>> (TID 1) in 25978 ms on localhost (1/10)
>>>
>>> 16/11/15 18:27:04 INFO NewHadoopRDD: Input split:
>>> hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:268435456+134217728
>>>
>>> 16/11/15 18:27:55 INFO Executor: Finished task 2.0 in stage 0.0 (TID 2).
>>> 2309 bytes result sent to driver
>>>
>>> 16/11/15 18:27:55 INFO TaskSetManager: Starting task 3.0 in stage 0.0
>>> (TID 3, localhost, partition 3,ANY, 2266 bytes)
>>>
>>> 16/11/15 18:27:55 INFO Executor: Running task 3.0 in stage 0.0 (TID 3)
>>>
>>> 16/11/15 18:27:55 INFO TaskSetManager: Finished task 2.0 in stage 0.0
>>> (TID 2) in 51001 ms on localhost (2/10)
>>>
>>> 16/11/15 18:27:55 INFO NewHadoopRDD: Input split:
>>> hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:402653184+134217728
>>>
>>> 16/11/15 18:28:19 INFO Executor: Finished task 3.0 in stage 0.0 (TID 3).
>>> 2309 bytes result sent to driver
>>>
>>> 16/11/15 18:28:19 INFO TaskSetManager: Starting task 4.0 in stage 0.0
>>> (TID 4, localhost, partition 4,ANY, 2266 bytes)
>>>
>>> 16/11/15 18:28:19 INFO Executor: Running task 4.0 in stage 0.0 (TID 4)
>>>
>>> 16/11/15 18:28:19 INFO TaskSetManager: Finished task 3.0 in stage 0.0
>>> (TID 3) in 24336 ms on localhost (3/10)
>>>
>>> 16/11/15 18:28:19 INFO NewHadoopRDD: Input split:
>>> hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:536870912+134217728
>>>
>>> 16/11/15 18:28:40 INFO Executor: Finished task 4.0 in stage 0.0 (TID 4).
>>> 2309 bytes result sent to driver
>>>
>>> 16/11/15 18:28:40 INFO TaskSetManager: Starting task 5.0 in stage 0.0
>>> (TID 5, localhost, partition 5,ANY, 2266 bytes)
>>>
>>> 16/11/15 18:28:40 INFO Executor: Running task 5.0 in stage 0.0 (TID 5)
>>>
>>> 16/11/15 18:28:40 INFO TaskSetManager: Finished task 4.0 in stage 0.0
>>> (TID 4) in 20895 ms on localhost (4/10)
>>>
>>> 16/11/15 18:28:40 INFO NewHadoopRDD: Input split:
>>> hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:671088640+134217728
>>>
>>> 16/11/15 18:29:01 INFO Executor: Finished task 5.0 in stage 0.0 (TID 5).
>>> 2309 bytes result sent to driver
>>>
>>> 16/11/15 18:29:01 INFO TaskSetManager: Starting task 6.0 in stage 0.0
>>> (TID 6, localhost, partition 6,ANY, 2266 bytes)
>>>
>>> 16/11/15 18:29:01 INFO Executor: Running task 6.0 in stage 0.0 (TID 6)
>>>
>>> 16/11/15 18:29:01 INFO TaskSetManager: Finished task 5.0 in stage 0.0
>>> (TID 5) in 20793 ms on localhost (5/10)
>>>
>>> 16/11/15 18:29:01 INFO NewHadoopRDD: Input split:
>>> hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:805306368+134217728
>>>
>>> 16/11/15 18:29:22 INFO Executor: Finished task 6.0 in stage 0.0 (TID 6).
>>> 2309 bytes result sent to driver
>>>
>>> 16/11/15 18:29:22 INFO TaskSetManager: Starting task 7.0 in stage 0.0
>>> (TID 7, localhost, partition 7,ANY, 2266 bytes)
>>>
>>> 16/11/15 18:29:22 INFO Executor: Running task 7.0 in stage 0.0 (TID 7)
>>>
>>> 16/11/15 18:29:22 INFO TaskSetManager: Finished task 6.0 in stage 0.0
>>> (TID 6) in 21306 ms on localhost (6/10)
>>>
>>> 16/11/15 18:29:22 INFO NewHadoopRDD: Input split:
>>> hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:939524096+134217728
>>>
>>> 16/11/15 18:29:43 INFO Executor: Finished task 7.0 in stage 0.0 (TID 7).
>>> 2309 bytes result sent to driver
>>>
>>> 16/11/15 18:29:43 INFO TaskSetManager: Starting task 8.0 in stage 0.0
>>> (TID 8, localhost, partition 8,ANY, 2266 bytes)
>>>
>>> 16/11/15 18:29:43 INFO Executor: Running task 8.0 in stage 0.0 (TID 8)
>>>
>>> 16/11/15 18:29:43 INFO TaskSetManager: Finished task 7.0 in stage 0.0
>>> (TID 7) in 21130 ms on localhost (7/10)
>>>
>>> 16/11/15 18:29:43 INFO NewHadoopRDD: Input split:
>>> hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:1073741824+134217728
>>>
>>> 16/11/15 18:29:48 ERROR Executor: Exception in task 0.0 in stage 0.0
>>> (TID 0)
>>>
>>> java.lang.OutOfMemoryError: Requested array size exceeds VM limit
>>>
>>>         at java.util.Arrays.copyOf(Arrays.java:2271)
>>>         at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
>>>         at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>>>         at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:122)
>>>         at java.io.DataOutputStream.write(DataOutputStream.java:88)
>>>         at com.databricks.spark.xml.XmlRecordReader.readUntilMatch(XmlInputFormat.scala:188)
>>>         at com.databricks.spark.xml.XmlRecordReader.next(XmlInputFormat.scala:156)
>>>         at com.databricks.spark.xml.XmlRecordReader.nextKeyValue(XmlInputFormat.scala:141)
>>>         at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:168)
>>>         at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>>>         at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>>>         at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>>>         at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>>>         at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>>>         at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)
>>>         at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157)
>>>         at scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201)
>>>         at scala.collection.AbstractIterator.aggregate(Iterator.scala:1157)
>>>         at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$24.apply(RDD.scala:1142)
>>>         at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$24.apply(RDD.scala:1142)
>>>         at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$25.apply(RDD.scala:1143)
>>>         at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$25.apply(RDD.scala:1143)
>>>         at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$22.apply(RDD.scala:717)
>>>         at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$22.apply(RDD.scala:717)
>>>         at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
>>>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
>>>         at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
>>>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
>>>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>>>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>>>
>>> 16/11/15 18:29:48 ERROR SparkUncaughtExceptionHandler: Uncaught
>>> exception in thread Thread[Executor task launch worker-0,5,main]
>>>
>>> java.lang.OutOfMemoryError: Requested array size exceeds VM limit
>>>
>>>
>>>
>>
>
