Hi,
I am using CarbonData 1.3 + Spark 2.1. My code is:

val df = carbonSession.sql("select * from t where name like 'aaa%'")
df.coalesce(n).write.saveAsTable("r") // you can set n = 1 to reproduce this issue
The job aborted with an OOM error. I analyzed the heap dump and found
hundreds of DimensionRawColumnChunk objects, each occupying about 50 MB of
memory, as the screenshot shows:
<http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/file/t174/Screenshot-1.png>
I investigated the source code of CarbonScanRDD and found that the root
cause of this issue is this code snippet:
context.addTaskCompletionListener { _ =>
  reader.close()
  close()
}
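To make the lifetime problem concrete, here is a small self-contained toy model of this pattern. ToyReader and ToyTaskContext are made-up stand-ins, not the real Spark or CarbonData classes; the point is only that each completion listener keeps its reader strongly reachable until the task ends:

```java
import java.util.ArrayList;
import java.util.List;

public class CoalesceLeakDemo {
    // Toy stand-in for the record reader; rawData stands in for the
    // ~50 MB DimensionRawColumnChunk buffers it transitively holds.
    static class ToyReader {
        byte[] rawData = new byte[1024];
        void close() { rawData = null; }
    }

    // Toy stand-in for TaskContext: it just accumulates listeners and
    // runs them all when the task completes.
    static class ToyTaskContext {
        private final List<Runnable> listeners = new ArrayList<>();
        void addTaskCompletionListener(Runnable r) { listeners.add(r); }
        void completeTask() {
            for (Runnable r : listeners) r.run();
            listeners.clear();
        }
        int pendingListeners() { return listeners.size(); }
    }

    public static void main(String[] args) {
        ToyTaskContext ctx = new ToyTaskContext();
        List<ToyReader> readers = new ArrayList<>();
        // coalesce(n) with a small n merges many CarbonSparkPartitions
        // into one task, so one task registers one listener (and thus
        // retains one reader) per merged partition:
        for (int i = 0; i < 100; i++) {
            ToyReader reader = new ToyReader();
            readers.add(reader);
            ctx.addTaskCompletionListener(reader::close);
        }
        // Every reader, and its rawData, stays reachable until the task ends.
        System.out.println(ctx.pendingListeners()); // prints 100
        ctx.completeTask();
        boolean allFreed = readers.stream().allMatch(r -> r.rawData == null);
        System.out.println(allFreed); // prints true
    }
}
```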
The TaskContext object holds the reader's reference until the task
finishes, and coalesce combines many CarbonSparkPartitions into a single
task, so one task can retain many readers at once. My proposals for this
issue are:
(1) Explicitly set objects to null as soon as they are no longer used so
that they can be garbage-collected as early as possible. For example, in
DimensionRawColumnChunk's freeMemory function, set rawData = null. I made a
test and this really works.
(2) The TaskContext object should not hold the reader's reference for the
whole task lifetime, or at least should not accumulate so many readers.
Currently, I have no idea how to implement this.
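A minimal sketch of what I mean by proposal (1). This is a simplified, hypothetical stand-in for DimensionRawColumnChunk (the real class does more in freeMemory); the field and method names follow the real class but the body here is illustrative:

```java
import java.nio.ByteBuffer;

public class FreeMemoryDemo {
    // Simplified stand-in for DimensionRawColumnChunk; in the real class,
    // rawData holds the ~50 MB raw column chunk read from disk.
    static class RawColumnChunk {
        private ByteBuffer rawData = ByteBuffer.allocate(1024);

        void freeMemory() {
            // ...existing cleanup would go here...
            rawData = null; // proposal (1): drop the reference eagerly
        }

        boolean isReleased() { return rawData == null; }
    }

    public static void main(String[] args) {
        RawColumnChunk chunk = new RawColumnChunk();
        chunk.freeMemory();
        // The big buffer is now unreachable and collectable, even though
        // the enclosing chunk/reader objects are still retained by the
        // task-completion listener until the task finishes.
        System.out.println(chunk.isReleased()); // prints true
    }
}
```

With this change, the large buffers become eligible for garbage collection as soon as freeMemory is called, instead of living until the listener fires at task completion.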
--
Sent from:
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/