Hong Shen created CARBONDATA-3626:
-------------------------------------
Summary: Improve performance when load data into carbondata
Key: CARBONDATA-3626
URL: https://issues.apache.org/jira/browse/CARBONDATA-3626
Project: CarbonData
Issue Type: Improvement
Components: spark-integration
Reporter: Hong Shen
Attachments: image-2019-12-21-21-18-46-134.png,
image-2019-12-21-21-20-19-603.png
I plan to use CarbonData to improve SparkSQL in our company, but I often found
that loading data takes a long time when the carbon table has many fields.
{code}
carbon.sql("insert into TABLE table1 select * from table2")
{code}
For example, with a production table2 that has more than 100 columns, a single
task of the above SQL takes about 10 minutes to load 200 MB of data (snappy
compressed). The log is:
{code}
2019-12-21 17:31:29 INFO UnsafeSortDataRows:416 - Time taken to sort row page
with size: 37975 is: 110
2019-12-21 17:31:35 INFO UnsafeSortDataRows:416 - Time taken to sort row page
with size: 37978 is: 64
2019-12-21 17:31:42 INFO UnsafeSortDataRows:416 - Time taken to sort row page
with size: 37977 is: 64
2019-12-21 17:31:48 INFO UnsafeSortDataRows:416 - Time taken to sort row page
with size: 37972 is: 66
2019-12-21 17:31:54 INFO UnsafeSortDataRows:416 - Time taken to sort row page
with size: 37979 is: 68
2019-12-21 17:32:00 INFO UnsafeSortDataRows:416 - Time taken to sort row page
with size: 37978 is: 62
2019-12-21 17:32:07 INFO UnsafeSortDataRows:416 - Time taken to sort row page
with size: 37981 is: 65
2019-12-21 17:32:13 INFO UnsafeSortDataRows:395 - Time taken to sort row page
with size37972 and write is: 226:
location:/home/hadoop/nm-local-dir/usercache/042986/appcache/application_1571110627213_192937/carbon19a2dc8d381442129dd0c7d906e7f51f_102100013100001/Fact/Part0/Segment_2/102100013100001/sortrowtmp/table2_0_21949613867659265.sorttemp,
sort temp file size in MB is 5.350312232971191
2019-12-21 17:32:19 INFO UnsafeSortDataRows:395 - Time taken to sort row page
with size37982 and write is: 172:
location:/home/hadoop/nm-local-dir/usercache/042986/appcache/application_1571110627213_192937/carbon19a2dc8d381442129dd0c7d906e7f51f_102100013100001/Fact/Part0/Segment_2/102100013100001/sortrowtmp/table2_0_21949620209578293.sorttemp,
sort temp file size in MB is 5.293270111083984
2019-12-21 17:32:26 INFO UnsafeSortDataRows:395 - Time taken to sort row page
with size37974 and write is: 175:
location:/home/hadoop/nm-local-dir/usercache/042986/appcache/application_1571110627213_192937/carbon19a2dc8d381442129dd0c7d906e7f51f_102100013100001/Fact/Part0/Segment_2/102100013100001/sortrowtmp/table2_0_21949626542521877.sorttemp,
sort temp file size in MB is 5.349262237548828
... ...
{code}
The task's jstack often looks like this:
{code}
"Executor task launch worker for task 164" #77 daemon prio=5 os_prio=0
tid=0x00002ab5768c3800 nid=0xb895 runnable [0x00002ab578afd000]
java.lang.Thread.State: RUNNABLE
at
scala.collection.LinearSeqOptimized$class.length(LinearSeqOptimized.scala:54)
at scala.collection.immutable.List.length(List.scala:84)
at
org.apache.spark.sql.execution.datasources.CarbonOutputWriter.writeCarbon(SparkCarbonTableFormat.scala:360)
at
org.apache.spark.sql.execution.datasources.AbstractCarbonOutputWriter$class.write(SparkCarbonTableFormat.scala:234)
at
org.apache.spark.sql.execution.datasources.CarbonOutputWriter.write(SparkCarbonTableFormat.scala:239)
at
org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask$$anonfun$execute$7.apply(FileFormatWriter.scala:717)
at
org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask$$anonfun$execute$7.apply(FileFormatWriter.scala:661)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at
org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask.execute(FileFormatWriter.scala:661)
at
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:334)
at
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:332)
at
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1418)
at
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:337)
at
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:215)
at
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:214)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at
org.apache.spark.executor.Executor$TaskRunner$$anon$2.run(Executor.scala:379)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:360)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1787)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:376)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:621)
at java.lang.Thread.run(Thread.java:849)
{code}
The code is:
!image-2019-12-21-21-18-46-134.png!
!image-2019-12-21-21-20-19-603.png!
This is because fieldTypes is a scala.List, and List.length is O(n), so
evaluating fieldTypes.length in the loop condition for every row takes a long
time when the table has many fields.
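As a standalone illustration (not CarbonData code; the names here are made up), the sketch below shows why a per-row loop that re-evaluates List.length and indexes into a List gets slow, and how materialising the list into an Array once avoids the repeated O(n) traversals:

```scala
// Hypothetical micro-benchmark to illustrate the cost, not CarbonData code.
object ListLengthCost {
  // Slow variant: List.length is O(n) and is evaluated on every iteration,
  // and List.apply(i) walks the linked list from the head, so each row costs
  // roughly O(n^2) in the number of fields.
  def sumSlow(fieldTypes: List[Int], rows: Int): Long = {
    var total = 0L
    var r = 0
    while (r < rows) {
      var i = 0
      while (i < fieldTypes.length) { // O(n) length check every iteration
        total += fieldTypes(i)        // O(i) list traversal per access
        i += 1
      }
      r += 1
    }
    total
  }

  // Fast variant: convert to an Array once, so length and indexed access
  // are both O(1) inside the per-row loop.
  def sumFast(fieldTypes: List[Int], rows: Int): Long = {
    val arr = fieldTypes.toArray
    var total = 0L
    var r = 0
    while (r < rows) {
      var i = 0
      while (i < arr.length) {
        total += arr(i)
        i += 1
      }
      r += 1
    }
    total
  }
}
```

Both variants compute the same result; only the per-row cost differs, which matches the behaviour seen in the jstack above where the task spends its time in List.length.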
After I changed the code as below, the writeCarbon time dropped from 7s to 1s.
{code}
def writeCarbon(row: InternalRow): Unit = {
  val data = new Array[AnyRef](fieldTypes.length + partitionData.length)
  var i = 0
  val fieldTypesLen = fieldTypes.length
  while (i < fieldTypesLen) {
    if (!row.isNullAt(i)) {
      fieldTypes(i) match {
        case StringType =>
          data(i) = row.getString(i)
        case d: DecimalType =>
          data(i) = row.getDecimal(i, d.precision, d.scale).toJavaBigDecimal
        case other =>
          data(i) = row.get(i, other)
      }
    }
    i += 1
  }
  ......
{code}
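Beyond hoisting the length into a local, another option is to materialise the schema as an Array when the writer is constructed, so every per-row access is O(1). This is a hypothetical sketch with stand-in types (the real code uses Spark's DataType and InternalRow), not the actual patch:

```scala
// Stand-in field types for illustration; the real code matches on Spark's
// DataType (StringType, DecimalType, ...).
sealed trait FieldType
case object StringT extends FieldType
case object IntT extends FieldType

// Hypothetical converter: the schema may arrive as a List, but it is stored
// as an Array once, so the per-row loop gets O(1) length and indexed access.
final class RowConverter(schema: List[FieldType]) {
  private val fieldTypes: Array[FieldType] = schema.toArray // convert once

  def convert(row: Array[Any]): Array[AnyRef] = {
    val data = new Array[AnyRef](fieldTypes.length) // Array.length is O(1)
    var i = 0
    while (i < fieldTypes.length) {
      data(i) = fieldTypes(i) match {
        case StringT => String.valueOf(row(i))
        case IntT    => Int.box(row(i).asInstanceOf[Int])
      }
      i += 1
    }
    data
  }
}
```

The conversion cost is paid once per writer instead of once per loop iteration, which matters when writeCarbon is called for every row of a wide table.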
Here is the new log:
{code}
2019-12-21 20:28:43 INFO UnsafeSortDataRows:416 - Time taken to sort row page
with size: 37973 is: 78
2019-12-21 20:28:44 INFO UnsafeSortDataRows:416 - Time taken to sort row page
with size: 37979 is: 48
2019-12-21 20:28:45 INFO UnsafeSortDataRows:416 - Time taken to sort row page
with size: 37977 is: 45
2019-12-21 20:28:47 INFO UnsafeSortDataRows:416 - Time taken to sort row page
with size: 37980 is: 45
2019-12-21 20:28:48 INFO UnsafeSortDataRows:416 - Time taken to sort row page
with size: 37977 is: 45
2019-12-21 20:28:49 INFO UnsafeSortDataRows:416 - Time taken to sort row page
with size: 37977 is: 44
2019-12-21 20:28:50 INFO UnsafeSortDataRows:416 - Time taken to sort row page
with size: 37976 is: 44
2019-12-21 20:28:52 INFO UnsafeSortDataRows:395 - Time taken to sort row page
with size37977 and write is: 166:
location:/home/hadoop/nm-local-dir/usercache/042986/appcache/application_1571110627213_192991/carbon67f270c8ba0a42f38dc7e20335e4999f_103100008100001/Fact/Part0/Segment_3/103100008100001/sortrowtmp/table2_0_1365393348305122.sorttemp,
sort temp file size in MB is 5.342463493347168
2019-12-21 20:28:53 INFO UnsafeSortDataRows:395 - Time taken to sort row page
with size37981 and write is: 134:
location:/home/hadoop/nm-local-dir/usercache/042986/appcache/application_1571110627213_192991/carbon67f270c8ba0a42f38dc7e20335e4999f_103100008100001/Fact/Part0/Segment_3/103100008100001/sortrowtmp/table2_0_1365394590239651.sorttemp,
sort temp file size in MB is 5.291025161743165
2019-12-21 20:28:54 INFO UnsafeSortDataRows:395 - Time taken to sort row page
with size37973 and write is: 131:
location:/home/hadoop/nm-local-dir/usercache/042986/appcache/application_1571110627213_192991/carbon67f270c8ba0a42f38dc7e20335e4999f_103100008100001/Fact/Part0/Segment_3/103100008100001/sortrowtmp/table2_0_1365395807353135.sorttemp,
sort temp file size in MB is 5.34185791015625
{code}
I will add a patch to improve it.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)