[
https://issues.apache.org/jira/browse/KUDU-2672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Grant Henke resolved KUDU-2672.
-------------------------------
Resolution: Fixed
Fix Version/s: 1.10.0
> Spark writes to Kudu: too many machines write to one tserver
> ------------------------------------------------------------
>
> Key: KUDU-2672
> URL: https://issues.apache.org/jira/browse/KUDU-2672
> Project: Kudu
> Issue Type: Improvement
> Components: java, spark
> Affects Versions: 1.8.0
> Reporter: yangz
> Priority: Major
> Labels: backup, performance
> Fix For: 1.10.0
>
>
> For the Spark use case, we sometimes use Spark to write data to Kudu, for
> example to import a Hive table into a Kudu table.
> There are two problems with the current implementation:
> # It uses FlushMode.AUTO_FLUSH_BACKGROUND, which makes error handling
> inefficient. When an error such as a timeout occurs, the session has already
> flushed all of the data buffered in the task, so the whole task fails and is
> retried at the task level (see the sketch after this list).
> # For the write path, Spark uses its default hash partitioning to split the
> data into partitions, and that hashing rarely matches the table's tablet
> distribution. For example, a 500 GB Hive table may produce 2000 tasks while
> we only have 20 tserver machines, so as many as 2000 executors may write to
> the 20 tservers at the same time. This hurts performance in two ways. First,
> the tserver takes a row lock on each primary key, so there is a lot of lock
> waiting; in the worst case the write operations keep timing out. Second,
> with so many machines writing to the tservers at once, there is no
> throttling anywhere in the code.
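> To make the first problem concrete, here is a minimal sketch of the write
> loop described above, using the Java client; the class and method names are
> hypothetical, not kudu-spark's actual code.
> {code:java}
> import java.util.Iterator;
> import org.apache.kudu.client.*;
>
> class AutoFlushTaskWriter {
>   // Hypothetical per-task write loop; names are illustrative only.
>   static void writeTaskRows(KuduClient client, Iterator<Operation> ops)
>       throws KuduException {
>     KuduSession session = client.newSession();
>     // Rows are buffered and flushed by a background thread, so apply()
>     // itself almost never reports a failure.
>     session.setFlushMode(SessionConfiguration.FlushMode.AUTO_FLUSH_BACKGROUND);
>     while (ops.hasNext()) {
>       session.apply(ops.next());
>     }
>     // A timeout on any buffered row only surfaces here, after everything
>     // has been flushed, so the whole Spark task fails and is retried.
>     session.close();
>     RowErrorsAndOverflowStatus pending = session.getPendingErrors();
>     if (pending.getRowErrors().length > 0) {
>       throw new RuntimeException(pending.getRowErrors().length + " rows failed");
>     }
>   }
> }
> {code}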
> So we suggest two things:
> # Change the flush mode to MANUAL_FLUSH and handle errors at the row level
> first, falling back to the task level only as a last resort.
> # Add an optional repartition step in Spark: we can repartition the data by
> the tablet distribution, so that only one task writes to each tablet and the
> row locks are no longer contended. Both suggestions are sketched below.
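> A minimal sketch of the first suggestion, again with the Java client and
> hypothetical names: switch to MANUAL_FLUSH, flush in small batches, and
> inspect each OperationResponse so that a single bad row does not fail the
> whole task.
> {code:java}
> import java.util.Iterator;
> import java.util.List;
> import org.apache.kudu.client.*;
>
> class ManualFlushTaskWriter {
>   static void writeTaskRows(KuduClient client, Iterator<Operation> ops)
>       throws KuduException {
>     KuduSession session = client.newSession();
>     session.setFlushMode(SessionConfiguration.FlushMode.MANUAL_FLUSH);
>     int buffered = 0;
>     while (ops.hasNext()) {
>       session.apply(ops.next());
>       buffered++;
>       if (buffered >= 1000 || !ops.hasNext()) {
>         // Flush a small batch and inspect each response individually.
>         List<OperationResponse> responses = session.flush();
>         for (OperationResponse resp : responses) {
>           if (resp.hasRowError()) {
>             // Handle the error at the row level (log, retry, or collect);
>             // fail the task only if the row cannot be recovered.
>             System.err.println("row error: " + resp.getRowError());
>           }
>         }
>         buffered = 0;
>       }
>     }
>     session.close();
>   }
> }
> {code}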
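> And a sketch of the second suggestion, assuming the Java client's
> KuduPartitioner helper, which maps a row's primary key to its tablet; the
> wrapper class below is hypothetical and glosses over serialization.
> {code:java}
> import org.apache.kudu.client.KuduPartitioner;
> import org.apache.kudu.client.PartialRow;
> import org.apache.spark.Partitioner;
>
> // Illustrative only: KuduPartitioner is not serializable as-is, so a real
> // implementation would have to rebuild it lazily on each executor.
> class KuduTabletPartitioner extends Partitioner {
>   private final KuduPartitioner partitioner;
>
>   KuduTabletPartitioner(KuduPartitioner partitioner) {
>     this.partitioner = partitioner;
>   }
>
>   @Override
>   public int numPartitions() {
>     return partitioner.numPartitions();
>   }
>
>   @Override
>   public int getPartition(Object key) {
>     try {
>       // Rows that belong to the same tablet land in the same Spark
>       // partition, so each task writes to a single tablet (and tserver).
>       return partitioner.partitionRow((PartialRow) key);
>     } catch (Exception e) {
>       throw new RuntimeException(e);
>     }
>   }
> }
> {code}
> With rdd.partitionBy(new KuduTabletPartitioner(...)), each task writes the
> rows of one tablet, so only a handful of tasks touch any given tserver and
> the primary-key lock contention described above goes away.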
> We have used this feature for some time, and it has solved several problems
> when writing large tables to Kudu. I hope it will be useful for the
> community of users who use Spark with Kudu heavily.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)