[ https://issues.apache.org/jira/browse/HADOOP-15616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16829230#comment-16829230 ]

Steve Loughran commented on HADOOP-15616:
-----------------------------------------

I'm reluctant to look into the code internals right now, but I will comment on 
that thread pool issue:

* make sure you have an HTTP connection pool whose upper limit is >= that of 
the thread pool (maybe: drive them with the same option); there's a sketch of 
this after the list.
* threads which are IO-heavy do put load on the other work; threads blocking 
on server-side ops (copy) or higher-latency ops are lower cost.
* Java has historically overcounted the # of cores available in a Docker 
image, picking up the # of CPUs on the host rather than those of the 
container. This means that prior to Java 8u131 it's possible to ask for many 
more threads than are viable; that is fixed in that Java version and later.
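
As a rough illustration of that first point, here is how both limits could be 
driven off a single value through the S3A options; the COS connector would 
want a matching pair of its own options, so treat the key names and the value 
of 64 below as illustrative only, not a recommendation.

    import org.apache.hadoop.conf.Configuration;

    public class PoolSizing {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // pick one limit and apply it to both pools, so a worker thread can
        // always get an HTTP connection without blocking on the connection pool
        int maxThreads = 64;                                  // illustrative value only
        conf.setInt("fs.s3a.threads.max", maxThreads);        // worker thread pool size
        conf.setInt("fs.s3a.connection.maximum", maxThreads); // HTTP pool >= thread pool
      }
    }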

Putting it together: I have no idea what makes a good thread pool size. Some 
deployments (Impala + S3A) use thousands of threads because it's a single 
large process making many requests of the same store. Other things, such as 
Spark workers, need to have 1+ thread and HTTP connection per worker for best 
performance writing data, but if you have too many then they use up resources. 
Hive LLAP suffers here: it creates different FS instances for different users, 
and then destroys them after the work. Many FS instances == many threads, many 
HTTP connections. But the cost of negotiating HTTPS connections argues in 
favour of a big pool of those too.

I would really love to see some numbers here. Maybe, as [~DanielZhou] is 
benchmarking ABFS performance, he will have some suggestions for everyone else 
to follow.

If you aren't already, take a look at BlockingThreadPoolExecutorService, which 
is used in S3A and Aliyun OSS as a way of throttling thread use for a single 
operation (block write, renames etc). This ensures that even with a fixed 
thread pool, no single bit of work starves other threads interacting with the 
same FS instance. That starvation is easy to cause in things like parallel 
renaming, where I can queue unlimited numbers of file renames and then wait 
for the results (example: the latest patch for HADOOP-15183 + its DynamoDB 
interaction). That gives great results in microbenchmarks, but in production 
systems it is too antisocial.
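
To show the idea (this is not the actual BlockingThreadPoolExecutorService 
API, just a minimal sketch of the technique it implements): a semaphore bounds 
how many tasks one operation may have outstanding on a shared pool, so the 
submitting thread blocks instead of flooding the executor and starving 
everyone else.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.concurrent.Semaphore;

    // Sketch of a "blocking" submitter: at most maxOutstanding tasks from this
    // operation may be queued or running on the shared pool at any one time.
    public class ThrottledSubmitter {
      private final ExecutorService sharedPool;
      private final Semaphore permits;

      public ThrottledSubmitter(ExecutorService sharedPool, int maxOutstanding) {
        this.sharedPool = sharedPool;
        this.permits = new Semaphore(maxOutstanding);
      }

      // blocks the caller until a slot is free, then hands the task to the pool
      public Future<?> submit(Runnable task) throws InterruptedException {
        permits.acquire();
        try {
          return sharedPool.submit(() -> {
            try {
              task.run();
            } finally {
              permits.release();   // free the slot once the task completes
            }
          });
        } catch (RuntimeException e) {
          permits.release();       // submission rejected: don't leak the permit
          throw e;
        }
      }

      public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(16);
        // e.g. one rename operation may only use 4 of the 16 shared threads
        ThrottledSubmitter rename = new ThrottledSubmitter(pool, 4);
        for (int i = 0; i < 100; i++) {
          final int part = i;
          rename.submit(() -> System.out.println("copy part " + part));
        }
        pool.shutdown();
      }
    }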


> Incorporate Tencent Cloud COS File System Implementation
> --------------------------------------------------------
>
>                 Key: HADOOP-15616
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15616
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: fs/cos
>            Reporter: Junping Du
>            Assignee: YangY
>            Priority: Major
>         Attachments: HADOOP-15616.001.patch, HADOOP-15616.002.patch, 
> HADOOP-15616.003.patch, HADOOP-15616.004.patch, HADOOP-15616.005.patch, 
> HADOOP-15616.006.patch, HADOOP-15616.007.patch, HADOOP-15616.008.patch, 
> HADOOP-15616.009.patch, Junping Du.url, Tencent-COS-Integrated-v2.pdf, 
> Tencent-COS-Integrated.pdf
>
>
> Tencent Cloud is one of the top-2 cloud vendors in the China market, and its 
> object store COS ([https://intl.cloud.tencent.com/product/cos]) is widely 
> used among China's cloud users, but it is currently hard for Hadoop users to 
> access data stored on COS as there is no native support for COS in Hadoop.
> This work aims to integrate Tencent Cloud COS with Hadoop/Spark/Hive, just 
> as we did before for S3, ADL, OSS, etc. With simple configuration, 
> Hadoop applications can read/write data from COS without any code change.



