[
https://issues.apache.org/jira/browse/HADOOP-15616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16829230#comment-16829230
]
Steve Loughran commented on HADOOP-15616:
-----------------------------------------
I'm reluctant to look into the code internals right now, but I will comment on
that thread pool issue:
* make sure you have an HTTP connection pool whose upper limit is >= that of
the thread pool (maybe: drive them both with the same option).
* threads doing heavy IO put load on other work; threads blocking on
server-side ops (e.g. copy) or higher-latency ops are lower cost.
* Java has historically overcounted the # of cores available in a Docker image,
picking up the # of CPUs on the host rather than that of the container. This
means that prior to Java 8u131 it is possible to ask for many more threads than
are viable; that Java version and later fix this.
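Pulling those points together, here's a minimal sketch of the idea (not actual Hadoop code; the option name and the cores-based cap are my own assumptions): size the worker thread pool and the HTTP connection pool from the same configured value, and don't blindly trust availableProcessors() inside a container.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PoolSizing {
    // Hypothetical default for a "max threads" option; not a real Hadoop key.
    static final int DEFAULT_MAX_THREADS = 16;

    /**
     * Derive a thread pool size from the configured value, capped relative to
     * the reported core count. On older JVMs in containers,
     * availableProcessors() may report the host's CPU count, so treat it as
     * an upper-bound hint rather than a hard truth.
     */
    public static int maxThreads(int configured) {
        int cores = Runtime.getRuntime().availableProcessors();
        return Math.min(configured, Math.max(1, cores * 4));
    }

    public static void main(String[] args) {
        int threads = maxThreads(DEFAULT_MAX_THREADS);
        // Drive the HTTP pool from the same value so its limit is always
        // >= the thread pool size: every worker thread can get a connection.
        int httpConnections = threads + 2;
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        System.out.println(threads + " threads, "
                + httpConnections + " HTTP connections");
        pool.shutdown();
    }
}
```

The key property is the invariant, not the specific numbers: threads should never outnumber the connections they can use.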
Putting it together: I have no idea what makes a good thread pool size. Some
deployments (Impala on S3A) use thousands of threads because it's a single
large process making many requests of the same store. Spark workers need 1+
thread and HTTP connection per worker for best performance when writing data,
but too many use up resources. Hive LLAP suffers here: it creates different FS
instances for different users, then destroys them after the work. Many FS
instances == many threads and many HTTP connections. But the cost of
negotiating HTTPS connections argues in favour of a big pool of those too.
I would really love to see some numbers here. Maybe as [~DanielZhou] is
benchmarking ABFS perf, he might have some suggestions for everyone else to
follow.
If you aren't already, take a look at BlockingThreadPoolExecutorService, which
is used in S3A and Aliyun OSS as a way of throttling thread use for a single
operation (block writes, renames, etc.). This ensures that even with a fixed
thread pool, no single piece of work starves other threads interacting with
the same FS instance. Without such a throttle it's easy, in things like
parallel renaming, to queue unlimited amounts of file renames and then wait
for the results (example: the latest patch for HADOOP-15183 + its DynamoDB
interaction). That gives great results in microbenchmarks, but in production
systems it is too antisocial.
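The throttling idea behind BlockingThreadPoolExecutorService can be sketched with a semaphore wrapper around a shared executor (this is my own simplified illustration of the technique, not the actual Hadoop class): submit() blocks the caller once a bounded number of tasks are in flight, rather than queueing without limit.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.Semaphore;

public class BoundedSubmitter {
    private final ExecutorService pool;
    private final Semaphore permits;

    public BoundedSubmitter(ExecutorService pool, int maxInFlight) {
        this.pool = pool;
        this.permits = new Semaphore(maxInFlight);
    }

    /**
     * Submits a task, blocking the caller when maxInFlight tasks from this
     * submitter are already queued or running. One greedy operation can thus
     * never monopolise the shared pool.
     */
    public <T> Future<T> submit(Callable<T> task) throws InterruptedException {
        permits.acquire();                   // blocks when the bound is hit
        try {
            return pool.submit(() -> {
                try {
                    return task.call();
                } finally {
                    permits.release();       // free the slot on completion
                }
            });
        } catch (RejectedExecutionException e) {
            permits.release();               // task never ran; return permit
            throw e;
        }
    }
}
```

Each operation (e.g. one block upload or one rename batch) gets its own bound, while all operations share the single fixed thread pool underneath.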
> Incorporate Tencent Cloud COS File System Implementation
> --------------------------------------------------------
>
> Key: HADOOP-15616
> URL: https://issues.apache.org/jira/browse/HADOOP-15616
> Project: Hadoop Common
> Issue Type: New Feature
> Components: fs/cos
> Reporter: Junping Du
> Assignee: YangY
> Priority: Major
> Attachments: HADOOP-15616.001.patch, HADOOP-15616.002.patch,
> HADOOP-15616.003.patch, HADOOP-15616.004.patch, HADOOP-15616.005.patch,
> HADOOP-15616.006.patch, HADOOP-15616.007.patch, HADOOP-15616.008.patch,
> HADOOP-15616.009.patch, Junping Du.url, Tencent-COS-Integrated-v2.pdf,
> Tencent-COS-Integrated.pdf
>
>
> Tencent Cloud is a top-2 cloud vendor in the China market, and its object
> store COS ([https://intl.cloud.tencent.com/product/cos]) is widely used among
> China's cloud users, but it is currently hard for Hadoop users to access data
> stored on COS as there is no native support for COS in Hadoop.
> This work aims to integrate Tencent Cloud COS with Hadoop/Spark/Hive, just
> as was done before for S3, ADL, OSS, etc. With simple configuration, Hadoop
> applications can read/write data from COS without any code change.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)