wangxiaobaidu11 commented on pull request #12159:
URL: https://github.com/apache/druid/pull/12159#issuecomment-1039963799

> @wangxiaobaidu11 there are a number of factors that affect the runtime of this connector. I don't know the specifics of your data, but it looks like you're trying to use a single-dimension partitioner on a timestamp. If that timestamp is the Druid time column, you don't need to do that partitioning yourself: all Druid segments are partitioned on the time column regardless of any other partitioning. In your case, the segments you're generating are probably OK size-wise (~200 MB), but if you wanted them to have fewer rows (and thus have more, smaller segments) you could use the numbered partitioner with a target row count. This would increase the parallelism of your Spark job and allow your writing to happen sooner, but could slow down your query speed. You'll have to use your judgement on what's more important to you. You might also want to look at the metrics for your import jobs and determine exactly where time is being spent. If the time it takes to read in data is small and the job spends most of its time writing to Druid, you could check if you're memory-bound on your job (in which case giving your executors more memory will help) or CPU-bound (in which case you'll need to trade off more executors for more files). If you're reading from an external system, you may also be able to shape your reads in such a way as to minimize or eliminate shuffling in Spark, which will greatly speed up your write. Keep in mind that the provided partitioners don't have any knowledge of your data and so will be slower than a partitioning approach that can take your data into account.
>
> More generally, the writing logic is some of the oldest in the connectors and there is likely substantial room for improvement in performance. Because the write performance has been mostly acceptable to users, I've been focused on getting these connectors merged into Druid rather than on further latency or throughput improvements, but hopefully Druid committers like @jihoonson will have some useful feedback in their reviews.

Thank you for your answer!
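
For anyone weighing the target-row-count suggestion quoted above, here is a rough back-of-the-envelope sketch of how a target row count translates into the number of segments (and so write tasks) for one time interval. The function name and the row counts are hypothetical and chosen only for illustration; this is plain arithmetic, not the connector's actual partitioning logic, which may distribute rows differently.

```python
def estimate_partitions(total_rows: int, target_rows_per_segment: int) -> int:
    """Estimate how many segments one time interval would be split into.

    Hypothetical helper: ceiling division of the interval's row count by the
    target row count, with a minimum of one segment.
    """
    if total_rows < 0:
        raise ValueError("total_rows must be non-negative")
    if target_rows_per_segment <= 0:
        raise ValueError("target_rows_per_segment must be positive")
    # Integer ceiling division avoids floating-point rounding on large counts.
    return max(1, (total_rows + target_rows_per_segment - 1) // target_rows_per_segment)


if __name__ == "__main__":
    rows_in_interval = 20_000_000  # assumed row count for one time chunk
    for target in (5_000_000, 2_000_000, 500_000):
        n = estimate_partitions(rows_in_interval, target)
        print(f"target={target:>9,} rows -> about {n} segments")
```

A smaller target gives more, smaller segments and therefore more write parallelism, at the possible cost of query speed, which is the trade-off described in the quoted reply.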
