[jira] [Created] (HADOOP-18132) S3 exponential backoff

2022-02-18 Thread Holden Karau (Jira)
Holden Karau created HADOOP-18132:
-

 Summary: S3 exponential backoff
 Key: HADOOP-18132
 URL: https://issues.apache.org/jira/browse/HADOOP-18132
 Project: Hadoop Common
  Issue Type: Improvement
  Components: fs/s3
Reporter: Holden Karau


The S3 API has limits which we can exceed when using a large number of 
writers, readers, or listers. We should add randomized exponential back-off to 
the S3 client when it encounters:

 

com.amazonaws.services.s3.model.AmazonS3Exception: Please reduce your request 
rate. (Service: Amazon S3; Status Code: 503; Error Code: SlowDown; 
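
A minimal sketch of the proposed behaviour (full-jitter randomized exponential
backoff). The retry helper and the simulated flaky call below are illustrative,
not existing Hadoop or AWS SDK APIs, and real code would retry only on the
throttling exception rather than on every exception:

```java
import java.util.Random;
import java.util.concurrent.Callable;

public class Backoff {
    // Retry a call with full-jitter exponential backoff: the sleep before
    // each retry is uniform in [0, cap], where cap doubles per attempt up
    // to maxMillis.
    static <T> T withBackoff(Callable<T> call, int maxRetries,
                             long baseMillis, long maxMillis) throws Exception {
        Random rng = new Random();
        long cap = baseMillis;
        for (int attempt = 0; ; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                // Real code would catch only the throttling error
                // (503 SlowDown), not every exception.
                if (attempt >= maxRetries) {
                    throw e;
                }
                Thread.sleep((long) (rng.nextDouble() * cap));
                cap = Math.min(cap * 2, maxMillis);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Simulated flaky call: throttled twice, then succeeds.
        int[] calls = {0};
        String result = withBackoff(() -> {
            if (calls[0]++ < 2) {
                throw new RuntimeException("503 SlowDown");
            }
            return "ok";
        }, 5, 10, 1000);
        System.out.println(result + " after " + calls[0] + " calls");
        // prints: ok after 3 calls
    }
}
```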

 

 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: common-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-dev-h...@hadoop.apache.org



[jira] [Created] (HADOOP-18131) Upgrade maven enforcer plugin and relevant dependencies

2022-02-18 Thread Viraj Jasani (Jira)
Viraj Jasani created HADOOP-18131:
-

 Summary: Upgrade maven enforcer plugin and relevant dependencies
 Key: HADOOP-18131
 URL: https://issues.apache.org/jira/browse/HADOOP-18131
 Project: Hadoop Common
  Issue Type: Task
Reporter: Viraj Jasani
Assignee: Viraj Jasani


Maven enforcer plugin's latest version, 3.0.0, has some noticeable improvements 
(e.g. MENFORCER-350, MENFORCER-388, MENFORCER-353) and fixes for us to 
incorporate. Besides, some of the relevant enforcer dependencies (e.g. Extra 
Enforcer Rules and the Restrict Imports enforcer rule) also have good improvements.

We should upgrade maven enforcer plugin and the relevant dependencies.
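
A sketch of what the bump could look like in the root pom. The 3.0.0 plugin
version is from the summary above; the extra-enforcer-rules artifact version is
illustrative and should be checked against the latest release:

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-enforcer-plugin</artifactId>
  <version>3.0.0</version>
  <dependencies>
    <!-- enforcer rule dependencies bumped alongside the plugin -->
    <dependency>
      <groupId>org.codehaus.mojo</groupId>
      <artifactId>extra-enforcer-rules</artifactId>
      <!-- illustrative version; check the latest release -->
      <version>1.5.1</version>
    </dependency>
  </dependencies>
</plugin>
```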






Apache Hadoop qbt Report: branch-2.10+JDK7 on Linux/x86_64

2022-02-18 Thread Apache Jenkins Server
For more details, see 
https://ci-hadoop.apache.org/job/hadoop-qbt-branch-2.10-java7-linux-x86_64/577/

No changes




-1 overall


The following subsystems voted -1:
asflicense hadolint mvnsite pathlen unit


The following subsystems voted -1 but
were configured to be filtered/ignored:
cc checkstyle javac javadoc pylint shellcheck whitespace


The following subsystems are considered long running:
(runtime bigger than 1h  0m  0s)
unit


Specific tests:

Failed junit tests :

   hadoop.io.compress.snappy.TestSnappyCompressorDecompressor 
   hadoop.fs.TestFileUtil 
   hadoop.hdfs.qjournal.server.TestJournalNodeRespectsBindHostKeys 
   hadoop.hdfs.server.balancer.TestBalancer 
   hadoop.hdfs.server.blockmanagement.TestReplicationPolicyWithUpgradeDomain 
   hadoop.contrib.bkjournal.TestBookKeeperHACheckpoints 
   hadoop.contrib.bkjournal.TestBookKeeperHACheckpoints 
   hadoop.hdfs.server.federation.resolver.order.TestLocalResolver 
   hadoop.hdfs.server.federation.router.TestRouterQuota 
   hadoop.hdfs.server.federation.router.TestRouterNamenodeHeartbeat 
   hadoop.hdfs.server.federation.resolver.TestMultipleDestinationResolver 
   hadoop.yarn.server.resourcemanager.monitor.invariants.TestMetricsInvariantChecker 
   hadoop.yarn.server.resourcemanager.TestClientRMService 
   hadoop.mapreduce.lib.input.TestLineRecordReader 
   hadoop.mapreduce.jobhistory.TestHistoryViewerPrinter 
   hadoop.mapred.TestLineRecordReader 
   hadoop.mapreduce.v2.app.rm.TestRMContainerAllocator 
   hadoop.yarn.sls.TestSLSRunner 
   hadoop.resourceestimator.service.TestResourceEstimatorService 
   hadoop.resourceestimator.solver.impl.TestLpSolver 
  

   cc:

   
https://ci-hadoop.apache.org/job/hadoop-qbt-branch-2.10-java7-linux-x86_64/577/artifact/out/diff-compile-cc-root.txt
  [4.0K]

   javac:

   
https://ci-hadoop.apache.org/job/hadoop-qbt-branch-2.10-java7-linux-x86_64/577/artifact/out/diff-compile-javac-root.txt
  [476K]

   checkstyle:

   
https://ci-hadoop.apache.org/job/hadoop-qbt-branch-2.10-java7-linux-x86_64/577/artifact/out/diff-checkstyle-root.txt
  [14M]

   hadolint:

   
https://ci-hadoop.apache.org/job/hadoop-qbt-branch-2.10-java7-linux-x86_64/577/artifact/out/diff-patch-hadolint.txt
  [4.0K]

   mvnsite:

   
https://ci-hadoop.apache.org/job/hadoop-qbt-branch-2.10-java7-linux-x86_64/577/artifact/out/patch-mvnsite-root.txt
  [1.2M]

   pathlen:

   
https://ci-hadoop.apache.org/job/hadoop-qbt-branch-2.10-java7-linux-x86_64/577/artifact/out/pathlen.txt
  [12K]

   pylint:

   
https://ci-hadoop.apache.org/job/hadoop-qbt-branch-2.10-java7-linux-x86_64/577/artifact/out/diff-patch-pylint.txt
  [20K]

   shellcheck:

   
https://ci-hadoop.apache.org/job/hadoop-qbt-branch-2.10-java7-linux-x86_64/577/artifact/out/diff-patch-shellcheck.txt
  [72K]

   whitespace:

   
https://ci-hadoop.apache.org/job/hadoop-qbt-branch-2.10-java7-linux-x86_64/577/artifact/out/whitespace-eol.txt
  [12M]
   
https://ci-hadoop.apache.org/job/hadoop-qbt-branch-2.10-java7-linux-x86_64/577/artifact/out/whitespace-tabs.txt
  [1.3M]

   javadoc:

   
https://ci-hadoop.apache.org/job/hadoop-qbt-branch-2.10-java7-linux-x86_64/577/artifact/out/patch-javadoc-root.txt
  [40K]

   unit:

   
https://ci-hadoop.apache.org/job/hadoop-qbt-branch-2.10-java7-linux-x86_64/577/artifact/out/patch-unit-hadoop-common-project_hadoop-common.txt
  [224K]
   
https://ci-hadoop.apache.org/job/hadoop-qbt-branch-2.10-java7-linux-x86_64/577/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
  [436K]
   
https://ci-hadoop.apache.org/job/hadoop-qbt-branch-2.10-java7-linux-x86_64/577/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs_src_contrib_bkjournal.txt
  [12K]
   
https://ci-hadoop.apache.org/job/hadoop-qbt-branch-2.10-java7-linux-x86_64/577/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs-rbf.txt
  [36K]
   
https://ci-hadoop.apache.org/job/hadoop-qbt-branch-2.10-java7-linux-x86_64/577/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-common.txt
  [20K]
   
https://ci-hadoop.apache.org/job/hadoop-qbt-branch-2.10-java7-linux-x86_64/577/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
  [128K]
   
https://ci-hadoop.apache.org/job/hadoop-qbt-branch-2.10-java7-linux-x86_64/577/artifact/out/patch-unit-hadoop-mapreduce-project_hadoop-mapreduce-client_hadoop-mapreduce-client-core.txt
  [104K]
   
https://ci-hadoop.apache.org/job/hadoop-qbt-branch-2.10-java7-linux-x86_64/577/artifact/out/patch-unit-hadoop-mapreduce-project_hadoop-mapreduce-client_hadoop-mapreduce-client-app.txt
  [36K]
   
https://ci-hadoop.apache.org/job/hadoop-qbt-branch-2.10-java7-linux-x86_64/577/artifact/out/patch-unit-hadoop-tools_hadoop-azure.txt
  [20K]
   

[jira] [Resolved] (HADOOP-18117) Add an option to preserve root directory permissions

2022-02-18 Thread Hui Fei (Jira)


 [ 
https://issues.apache.org/jira/browse/HADOOP-18117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hui Fei resolved HADOOP-18117.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

> Add an option to preserve root directory permissions
> 
>
> Key: HADOOP-18117
> URL: https://issues.apache.org/jira/browse/HADOOP-18117
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Mohanad Elsafty
>Assignee: Mohanad Elsafty
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> As mentioned in https://issues.apache.org/jira/browse/HADOOP-15211
>  
> If *-update* or *-overwrite* is being passed when *distcp* used, the root 
> directory will be skipped in two occasions (CopyListing#doBuildListing & 
> CopyCommitter#preserveFileAttributesForDirectories), which will ignore root 
> directory's attributes.
>  
> We face the same issue when using distcp to copy huge data sets between 
> clusters, and it takes too much effort to update root directory attributes 
> manually.
>  
> From the earlier ticket it is obvious why this behaviour is there, but 
> sometimes we need to enforce a root directory update, hence I will add a new 
> option for distcp to enable someone (who understands the need for this and 
> knows what they are doing) to enforce the update of the root directory's 
> attributes (permissions, ownership, ...)
>  
> It should be a simple one, something like this:
> {code:java}
> $ hadoop distcp -p -update -updateRootDirectoryAttributes /a/b/c /a/b/d {code}
> This behaviour is optional and will be *false* by default. (It should not 
> affect existing *distcp* users.)






Too many S3 API calls for simple queries in hive and databricks spark like select and create external table

2022-02-18 Thread Anup Tiwari
Hi Team,

We are using Hive/Spark (Databricks) heavily for our ETL, where our data is
stored on S3, and we have seen a strange behaviour in the Hive/Spark &
S3 interaction in terms of S3 API calls, i.e. the actual number of API calls
for simple select statements was far higher than expected. *Since
Hive/Spark internally uses the hadoop-provided libraries
hadoop-aws*jar/aws-java-sdk*jar*, we are asking this question here.

Please let us know why it is behaving like this, because if we execute the same
select statement via Athena then the number of API calls is much lower.



*Background :-*
We are incurring some S3 API cost, and to understand each API call better,
we decided to do some simple testing.

1. We have a non-partitioned table containing a lot of objects in parquet
format on S3.

2. We copied one parquet file object (data) to a separate S3 bucket (target),
so now our target bucket contains one parquet file in the following
hierarchy on S3 :-
s3:///Test/00_0   (Size of object : 1218 Bytes)

3. After that, we executed the following 3 commands in Apache Hive 2.1.1,
managed by us on an EC2 cluster :-

(i) Create an external table on top of the above S3 location :-

CREATE EXTERNAL TABLE `anup.Test`(
  `id` int,
  `cname` varchar(45),
  `mef` decimal(10,3),
  `mlpr` int,
  `qperiod` int,
  `validity` int,
  `rpmult` decimal(10,3))
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3a:///Test' ;

(ii) msck repair table Test ; (just to test the behaviour)
(iii) Simple select statement :- select * from anup.Test ;




*Results :-*
Ideally, we were *expecting max 5-10 API calls*, with the below breakdown :-


1. Create External : max 2-3 API calls ; these could be GET.BUCKET,
HEAD.OBJECTS (to check whether Test exists or not) and then maybe PUT.OBJECTS
to create the "Test/" object.
2. msck repair : 1-2 API calls ; since we have a single object behind the table
3. select *  : 1-2 API calls ; since we have a single object behind the table


But the *actual number of total API calls was 37*; we fetched this from
S3 Access Logs via Athena. The breakdown of these calls is as follows :-

1. Create External : 9 API calls
2. msck repair : 3 API calls
3. select *  : 25 API calls
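
Part of the gap likely comes from request amplification inside the S3A
connector of that era: a single logical getFileStatus(path) could issue up to
three S3 requests (HEAD on the key, HEAD on the key + "/", then a LIST of the
prefix), and create/msck/select planning calls it repeatedly. A toy model of
that amplification (illustrative only, not Hadoop code):

```java
public class CallModel {
    // Toy model: in older S3A releases a single logical getFileStatus(path)
    // could cost up to three S3 requests: HEAD on the key, HEAD on the
    // key + "/" (directory marker), then a LIST of the prefix.
    static int requestsForGetFileStatus(boolean isFile, boolean isDirMarker) {
        if (isFile) return 1;       // first HEAD hits
        if (isDirMarker) return 2;  // HEAD misses, HEAD key + "/" hits
        return 3;                   // both HEADs miss, fall back to LIST
    }

    public static void main(String[] args) {
        // e.g. query planning probing 5 non-existent paths before reading
        // the table directory itself
        int total = 0;
        for (int i = 0; i < 5; i++) {
            total += requestsForGetFileStatus(false, false);
        }
        total += requestsForGetFileStatus(false, true);
        System.out.println("logical probes: 6, S3 requests: " + total);
        // prints: logical probes: 6, S3 requests: 17
    }
}
```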


Attaching the actual S3 Access Log results for the select command, along with
Hive DEBUG logs for the select statement.

Please let us know why so many API calls are happening for the Create External /
select statements, because if we execute the same select statement *via
Athena* then the number of API calls is much lower, i.e. *2*.




*Tools / S3 library details :-*
Apache Hive 2.1.1 / Apache Hadoop 2.8.0 / hadoop-aws-2.8.0.jar /
aws-java-sdk-s3-1.10.6.jar / aws-java-sdk-kms-1.10.6.jar /
aws-java-sdk-core-1.10.6.jar


Regards,
Anup Tiwari


Hive commands _ S3 Access Logs.xlsx
Description: MS-Excel 2007 spreadsheet
