[GitHub] [druid] wangxiaobaidu11 commented on pull request #12159: Add Spark Writer support.

GitBox Tue, 15 Feb 2022 10:47:23 -0800


wangxiaobaidu11 commented on pull request #12159:
URL: https://github.com/apache/druid/pull/12159#issuecomment-1039963799



   > @wangxiaobaidu11 there are a number of factors that affect the runtime of 
this connector. I don't know the specifics of your data, but it looks like 
you're trying to use a single-dimension partitioner on a timestamp. If that 
timestamp is the druid time column, you don't need to do that partitioning 
yourself - all druid segments are partitioned on the time column regardless of 
any other partitioning. In your case, the segments you're generating are 
probably ok size-wise (~200 MB) but if you wanted them to have fewer rows (and 
thus have more, smaller segments) you could use the numbered partitioner with a 
target row count. This would increase the parallelism of your spark job and 
allow your writing to happen sooner, but could slow down your query speed. 
You'll have to use your judgement on what's more important to you. You might 
also want to look at the metrics for your import jobs and determine exactly 
where time is being spent - if the time it takes to read in data is small and 
the job spends most of its time writing to druid, you could check if you're 
memory-bound on your job (in which case giving your executors more memory will 
help) or cpu-bound (in which case you'll need to trade off more executors for 
more files). If you're reading from an external system you also may be able to 
shape your reads in such a way as to minimize or eliminate shuffling in Spark, 
which will greatly speed up your write. Keep in mind that the provided 
partitioners don't have any knowledge of your data and so will be slower than a 
partitioning approach that can take your data into account.
   > 
   > More generally, the writing logic is some of the oldest in the connectors 
and there is likely substantial room for improvement in performance. Because 
the write performance has been mostly acceptable to users, I've been focused on 
getting these connectors merged into Druid rather than further latency or 
throughput improvements but hopefully Druid committers like @jihoonson will 
have some useful feedback in their reviews.
   
   Thank you for your answer!
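
For reference, the trade-off described in the quoted reply (a lower target row count gives more, smaller segments and therefore more parallel write tasks) can be sketched with some back-of-the-envelope arithmetic. This is only a rough illustration in plain Python and not the connector's API: the function name `estimate_segments` and the row counts below are hypothetical, and the real partitioners may split data differently.

```python
def estimate_segments(rows_in_interval: int, target_rows_per_segment: int) -> int:
    """Roughly estimate how many segments one time interval is split into.

    Hypothetical helper for illustration only; it assumes rows are spread
    evenly and that each segment holds at most `target_rows_per_segment` rows.
    """
    if rows_in_interval < 0:
        raise ValueError("rows_in_interval must be non-negative")
    if target_rows_per_segment <= 0:
        raise ValueError("target_rows_per_segment must be positive")
    # Integer ceiling division, avoiding float rounding on large counts.
    return -(-rows_in_interval // target_rows_per_segment)


# Made-up numbers: 20 million rows in one time interval.
rows = 20_000_000
for target in (5_000_000, 2_000_000):
    print(target, estimate_segments(rows, target))
```

Under these assumptions, lowering the target from 5 million to 2 million rows raises the estimate from 4 to 10 segments per interval, which means more write tasks can run in parallel but also more segments for queries to scan.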


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


