[
https://issues.apache.org/jira/browse/CARBONDATA-3626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hong Shen updated CARBONDATA-3626:
----------------------------------
Attachment: screenshot-1.png
> Improve performance when load data into carbondata
> --------------------------------------------------
>
> Key: CARBONDATA-3626
> URL: https://issues.apache.org/jira/browse/CARBONDATA-3626
> Project: CarbonData
> Issue Type: Improvement
> Components: spark-integration
> Reporter: Hong Shen
> Priority: Major
> Attachments: image-2019-12-21-21-18-46-134.png,
> image-2019-12-21-21-20-19-603.png, screenshot-1.png
>
>
> I plan to use CarbonData to improve Spark SQL in our company, but I often
> find that loading data takes a long time when the carbon table has many
> fields.
> {code}
> carbon.sql("insert into TABLE table1 select * from table2")
> {code}
> For example, with a production table2 that has more than 100 columns, one
> task of the above SQL takes 10 minutes to load 200 MB of data (with snappy
> compression). The log is:
> {code}
> 2019-12-21 17:31:29 INFO UnsafeSortDataRows:416 - Time taken to sort row page with size: 37975 is: 110
> 2019-12-21 17:31:35 INFO UnsafeSortDataRows:416 - Time taken to sort row page with size: 37978 is: 64
> 2019-12-21 17:31:42 INFO UnsafeSortDataRows:416 - Time taken to sort row page with size: 37977 is: 64
> 2019-12-21 17:31:48 INFO UnsafeSortDataRows:416 - Time taken to sort row page with size: 37972 is: 66
> 2019-12-21 17:31:54 INFO UnsafeSortDataRows:416 - Time taken to sort row page with size: 37979 is: 68
> 2019-12-21 17:32:00 INFO UnsafeSortDataRows:416 - Time taken to sort row page with size: 37978 is: 62
> 2019-12-21 17:32:07 INFO UnsafeSortDataRows:416 - Time taken to sort row page with size: 37981 is: 65
> 2019-12-21 17:32:13 INFO UnsafeSortDataRows:395 - Time taken to sort row page with size37972 and write is: 226: location:/home/hadoop/nm-local-dir/usercache/042986/appcache/application_1571110627213_192937/carbon19a2dc8d381442129dd0c7d906e7f51f_102100013100001/Fact/Part0/Segment_2/102100013100001/sortrowtmp/table2_0_21949613867659265.sorttemp, sort temp file size in MB is 5.350312232971191
> 2019-12-21 17:32:19 INFO UnsafeSortDataRows:395 - Time taken to sort row page with size37982 and write is: 172: location:/home/hadoop/nm-local-dir/usercache/042986/appcache/application_1571110627213_192937/carbon19a2dc8d381442129dd0c7d906e7f51f_102100013100001/Fact/Part0/Segment_2/102100013100001/sortrowtmp/table2_0_21949620209578293.sorttemp, sort temp file size in MB is 5.293270111083984
> 2019-12-21 17:32:26 INFO UnsafeSortDataRows:395 - Time taken to sort row page with size37974 and write is: 175: location:/home/hadoop/nm-local-dir/usercache/042986/appcache/application_1571110627213_192937/carbon19a2dc8d381442129dd0c7d906e7f51f_102100013100001/Fact/Part0/Segment_2/102100013100001/sortrowtmp/table2_0_21949626542521877.sorttemp, sort temp file size in MB is 5.349262237548828
> ... ...
> {code}
> A jstack of the task often looks like this:
> {code}
> "Executor task launch worker for task 164" #77 daemon prio=5 os_prio=0
> tid=0x00002ab5768c3800 nid=0xb895 runnable [0x00002ab578afd000]
> java.lang.Thread.State: RUNNABLE
> at
> scala.collection.LinearSeqOptimized$class.length(LinearSeqOptimized.scala:54)
> at scala.collection.immutable.List.length(List.scala:84)
> at
> org.apache.spark.sql.execution.datasources.CarbonOutputWriter.writeCarbon(SparkCarbonTableFormat.scala:360)
> at
> org.apache.spark.sql.execution.datasources.AbstractCarbonOutputWriter$class.write(SparkCarbonTableFormat.scala:234)
> at
> org.apache.spark.sql.execution.datasources.CarbonOutputWriter.write(SparkCarbonTableFormat.scala:239)
> at
> org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask$$anonfun$execute$7.apply(FileFormatWriter.scala:717)
> at
> org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask$$anonfun$execute$7.apply(FileFormatWriter.scala:661)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at
> org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask.execute(FileFormatWriter.scala:661)
> at
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:334)
> at
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:332)
> at
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1418)
> at
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:337)
> at
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:215)
> at
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:214)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:109)
> at
> org.apache.spark.executor.Executor$TaskRunner$$anon$2.run(Executor.scala:379)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:360)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1787)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:376)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:621)
> at java.lang.Thread.run(Thread.java:849)
> {code}
> The relevant code is shown in the attached screenshots:
> !image-2019-12-21-21-18-46-134.png!
> !image-2019-12-21-21-20-19-603.png!
> This is because fieldTypes is a scala.collection.immutable.List (as the
> jstack above shows), and List.length is O(n), so re-evaluating
> fieldTypes.length in the while condition for every field of every row takes
> a long time when the table has many fields.
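> Here is a minimal standalone sketch (my own illustration, not CarbonData
> code; the names ListLengthCost, fieldTypes, and rows are made up) showing
> the cost of calling List.length in a per-row loop condition versus hoisting
> it once:
> {code}
> object ListLengthCost {
>   def main(args: Array[String]): Unit = {
>     val fieldTypes: List[Int] = List.fill(100)(0) // ~100 columns, as in this report
>     val rows = 1000000
>     var sink = 0 // keeps the JIT from eliminating the loops
>
>     def time(label: String)(body: => Unit): Unit = {
>       val t0 = System.nanoTime()
>       body
>       println(s"$label: ${(System.nanoTime() - t0) / 1e6} ms")
>     }
>
>     // Slow pattern: List.length is O(n) and the while condition re-evaluates
>     // it on every iteration, so each row pays O(n^2) list traversals.
>     time("length in loop condition") {
>       var r = 0
>       while (r < rows) {
>         var i = 0
>         while (i < fieldTypes.length) { sink += i; i += 1 }
>         r += 1
>       }
>     }
>
>     // Fast pattern: hoist the length into a local val, as the fix below does.
>     time("length hoisted") {
>       var r = 0
>       val fieldTypesLen = fieldTypes.length
>       while (r < rows) {
>         var i = 0
>         while (i < fieldTypesLen) { sink += i; i += 1 }
>         r += 1
>       }
>     }
>     println(sink)
>   }
> }
> {code}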
> When I change the code to the following, the writeCarbon time drops from 7s to 1s.
> {code}
> def writeCarbon(row: InternalRow): Unit = {
>   val data = new Array[AnyRef](fieldTypes.length + partitionData.length)
>   var i = 0
>   // Hoist the O(n) List.length call out of the while condition, so it is
>   // evaluated once per row instead of once per field.
>   val fieldTypesLen = fieldTypes.length
>   while (i < fieldTypesLen) {
>     if (!row.isNullAt(i)) {
>       fieldTypes(i) match {
>         case StringType =>
>           data(i) = row.getString(i)
>         case d: DecimalType =>
>           data(i) = row.getDecimal(i, d.precision, d.scale).toJavaBigDecimal
>         case other =>
>           data(i) = row.get(i, other)
>       }
>     }
>     i += 1
>   }
>   ......
> {code}
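> A further note along the same lines (my own suggestion, not part of this
> patch): since fieldTypes is a List, the indexed access fieldTypes(i) inside
> the loop is also O(i). Materializing the List into an Array once, outside
> the per-row path, would make the per-field lookup O(1) as well. A minimal
> sketch, assuming fieldTypes is a List[DataType]:
> {code}
> import org.apache.spark.sql.types.DataType
>
> // Hypothetical variant: convert once, then index in O(1) inside writeCarbon.
> val fieldTypesArr: Array[DataType] = fieldTypes.toArray
> {code}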
> Here is the new log:
> {code}
> 2019-12-21 20:28:43 INFO UnsafeSortDataRows:416 - Time taken to sort row page with size: 37973 is: 78
> 2019-12-21 20:28:44 INFO UnsafeSortDataRows:416 - Time taken to sort row page with size: 37979 is: 48
> 2019-12-21 20:28:45 INFO UnsafeSortDataRows:416 - Time taken to sort row page with size: 37977 is: 45
> 2019-12-21 20:28:47 INFO UnsafeSortDataRows:416 - Time taken to sort row page with size: 37980 is: 45
> 2019-12-21 20:28:48 INFO UnsafeSortDataRows:416 - Time taken to sort row page with size: 37977 is: 45
> 2019-12-21 20:28:49 INFO UnsafeSortDataRows:416 - Time taken to sort row page with size: 37977 is: 44
> 2019-12-21 20:28:50 INFO UnsafeSortDataRows:416 - Time taken to sort row page with size: 37976 is: 44
> 2019-12-21 20:28:52 INFO UnsafeSortDataRows:395 - Time taken to sort row page with size37977 and write is: 166: location:/home/hadoop/nm-local-dir/usercache/042986/appcache/application_1571110627213_192991/carbon67f270c8ba0a42f38dc7e20335e4999f_103100008100001/Fact/Part0/Segment_3/103100008100001/sortrowtmp/table2_0_1365393348305122.sorttemp, sort temp file size in MB is 5.342463493347168
> 2019-12-21 20:28:53 INFO UnsafeSortDataRows:395 - Time taken to sort row page with size37981 and write is: 134: location:/home/hadoop/nm-local-dir/usercache/042986/appcache/application_1571110627213_192991/carbon67f270c8ba0a42f38dc7e20335e4999f_103100008100001/Fact/Part0/Segment_3/103100008100001/sortrowtmp/table2_0_1365394590239651.sorttemp, sort temp file size in MB is 5.291025161743165
> 2019-12-21 20:28:54 INFO UnsafeSortDataRows:395 - Time taken to sort row page with size37973 and write is: 131: location:/home/hadoop/nm-local-dir/usercache/042986/appcache/application_1571110627213_192991/carbon67f270c8ba0a42f38dc7e20335e4999f_103100008100001/Fact/Part0/Segment_3/103100008100001/sortrowtmp/table2_0_1365395807353135.sorttemp, sort temp file size in MB is 5.34185791015625
> {code}
> I will add a patch to improve it.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)