Hi:
link_crtd_date is a string of format "yyyy-MM-dd" not timestamp.
select link_crtd_date from bsl12.email_edge_lyh_mth1 limit 10;
2018-01-01
2018-01-01
2018-01-01
2018-01-01
2018-01-01
2018-01-01
2018-01-01
2018-01-01
2018-01-01
2018-01-01
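For completeness, the declared column type can be confirmed from the table metadata as well; a quick check, assuming standard HiveQL:

```sql
-- Shows each column name alongside its declared type;
-- link_crtd_date should be listed as string here, not timestamp.
DESCRIBE bsl12.email_edge_lyh_mth1;
```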
Best Regards
Kelly Zhang
At 2019-08-23 15:57:24, "Roland Johann" <roland.joh...@phenetic.io> wrote:
It seems that column `link_crtd_date` is of type `timestamp` and you therefore
partition by date including time, which produces a huge amount of directories.
I assume your intent is to partition by date (partition_date=yyyy-MM-dd or
year=yyyy/month=MM/day=dd) so you need to format/split your timestamp
accordingly, for example:
-- partitioned by 'yyyy-MM-dd'
INSERT OVERWRITE TABLE bsl12.email_edge_lyh_partitioned2
PARTITION (partition_date)
SELECT
*,
date_format(link_crtd_date, 'yyyy-MM-dd') as partition_date
FROM bsl12.email_edge_lyh_mth1;
-- partitioned by year/month/day
INSERT OVERWRITE TABLE bsl12.email_edge_lyh_partitioned2
PARTITION (year, month, day)
SELECT
*,
year(link_crtd_date) as year,
month(link_crtd_date) as month,
day(link_crtd_date) as day
FROM bsl12.email_edge_lyh_mth1;
Best Regards
Roland Johann
Software Developer/Data Engineer
phenetic GmbH
Lütticher Straße 10, 50674 Köln, Germany
Mobile: +49 172 365 26 46
Mail: roland.joh...@phenetic.io
Web: phenetic.io
Commercial register: Amtsgericht Köln (HRB 92595)
Managing directors: Roland Johann, Uwe Reimann
On 23.08.2019 at 09:43, zhangliyun <kelly...@126.com> wrote:
Hi all:
When I use the Spark dynamic partition feature, I run into a problem with the HDFS
quota. I found that it is very easy to hit the quota limit (exceeding the maximum
quota of a directory).
I have generated an unpartitioned table 'bsl12.email_edge_lyh_mth1' which
contains 584M records, and I want to insert it into a partitioned table
"bsl12.email_edge_lyh_partitioned2":
--select count(*) from bsl12.email_edge_lyh_mth1; --584652128
--INSERT OVERWRITE TABLE bsl12.email_edge_lyh_partitioned2 PARTITION
(link_crtd_date) SELECT * FROM bsl12.email_edge_lyh_mth1;
When I viewed the temporary directory while the SQL was running, I saw multiple
files with link_crtd_date=2018-01-01***; I guess one temporary file per record. As
there are 584M records in the unpartitioned table, is there any parameter for us
to control the temporary file count and avoid the quota problem?
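(One knob that may help here, a sketch only and not tested on this data: clustering rows by the partition column before the insert, so that all records for a given date land in the same task and each task writes to only a few partitions instead of all of them.)

```sql
-- Sketch, assuming Spark SQL / HiveQL syntax: DISTRIBUTE BY shuffles rows
-- so that every record with the same link_crtd_date is handled by the same
-- task, which caps the number of simultaneous temporary files per task.
INSERT OVERWRITE TABLE bsl12.email_edge_lyh_partitioned2
PARTITION (link_crtd_date)
SELECT * FROM bsl12.email_edge_lyh_mth1
DISTRIBUTE BY link_crtd_date;
```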
```
133
hdfs://horton/apps/risk/ars/datamart/email_edge_lyh_partitioned2/.hive-staging_hive_2019-08-22_19-41-38_747_7237025592628396381-1/-ext-10000/_temporary/0/_temporary/attempt_20190822195048_0000_m_001404_0/link_crtd_date=2018-01-0112%3A35%3A29
137
hdfs://horton/apps/risk/ars/datamart/email_edge_lyh_partitioned2/.hive-staging_hive_2019-08-22_19-41-38_747_7237025592628396381-1/-ext-10000/_temporary/0/_temporary/attempt_20190822195048_0000_m_001404_0/link_crtd_date=2018-01-0112%3A35%3A47
136
hdfs://horton/apps/risk/ars/datamart/email_edge_lyh_partitioned2/.hive-staging_hive_2019-08-22_19-41-38_747_7237025592628396381-1/-ext-10000/_temporary/0/_temporary/attempt_20190822195048_0000_m_001404_0/link_crtd_date=2018-01-0112%3A38%3A23
132
hdfs://horton/apps/risk/ars/datamart/email_edge_lyh_partitioned2/.hive-staging_hive_2019-08-22_19-41-38_747_7237025592628396381-1/-ext-10000/_temporary/0/_temporary/attempt_20190822195048_0000_m_001404_0/link_crtd_date=2018-01-0112%3A38%3A54
536
hdfs://horton/apps/risk/ars/datamart/email_edge_lyh_partitioned2/.hive-staging_hive_2019-08-22_19-41-38_747_7237025592628396381-1/-ext-10000/_temporary/0/_temporary/attempt_20190822195048_0000_m_001404_0/link_crtd_date=2018-01-0112%3A40%3A01
```
Best Regards
Kelly Zhang