Hi,
   link_crtd_date is a string in the format "yyyy-MM-dd", not a timestamp.


select link_crtd_date from bsl12.email_edge_lyh_mth1 limit 10;

2018-01-01
2018-01-01
2018-01-01
2018-01-01
2018-01-01
2018-01-01
2018-01-01
2018-01-01
2018-01-01
2018-01-01

Best Regards
Kelly Zhang





At 2019-08-23 15:57:24, "Roland Johann" <roland.joh...@phenetic.io> wrote:
It seems that column `link_crtd_date` is of type `timestamp` and you therefore 
partition by date including time, which produces a huge amount of directories. 
I assume your intent is to partition by date (partition_date=yyyy-MM-dd or 
year=yyyy/month=MM/day=dd) so you need to format/split your timestamp 
accordingly, for example:


-- partitioned by 'yyyy-MM-dd'
-- (dynamic partition columns must come last in the SELECT list)
INSERT OVERWRITE TABLE bsl12.email_edge_lyh_partitioned2
PARTITION (partition_date)
SELECT
    *,
    date_format(link_crtd_date, 'yyyy-MM-dd') AS partition_date
FROM bsl12.email_edge_lyh_mth1;

-- partitioned by year/month/day
INSERT OVERWRITE TABLE bsl12.email_edge_lyh_partitioned2
PARTITION (year, month, day)
SELECT
    *,
    year(link_crtd_date) AS year,
    month(link_crtd_date) AS month,
    day(link_crtd_date) AS day
FROM bsl12.email_edge_lyh_mth1;

Best Regards

Roland Johann
Software Developer/Data Engineer

phenetic GmbH
Lütticher Straße 10, 50674 Köln, Germany

Mobil: +49 172 365 26 46
Mail: roland.joh...@phenetic.io
Web: phenetic.io

Handelsregister: Amtsgericht Köln (HRB 92595)
Geschäftsführer: Roland Johann, Uwe Reimann






On 23.08.2019 at 09:43, zhangliyun <kelly...@126.com> wrote:


Hi all,
  When I use Spark's dynamic partition feature, I run into a problem with the
HDFS quota. I find that it is very easy to hit the quota limit (exceeding the
maximum quota of the directory).


I have generated an unpartitioned table 'bsl12.email_edge_lyh_mth1', which
contains 584M records, and I want to insert it into a partitioned table
"bsl12.email_edge_lyh_partitioned2":
--select count(*) from bsl12.email_edge_lyh_mth1; --584652128
--INSERT OVERWRITE TABLE bsl12.email_edge_lyh_partitioned2 PARTITION (link_crtd_date) SELECT * FROM bsl12.email_edge_lyh_mth1;



When I looked at the temporary directory while the SQL was running, I saw many
files with link_crtd_date=2018-01-01***, so I guess there is one temporary file
per record. As there are 584M records in the unpartitioned table, is there any
parameter we can use to control the number of temporary files and avoid the
quota problem?

```

 

133    hdfs://horton/apps/risk/ars/datamart/email_edge_lyh_partitioned2/.hive-staging_hive_2019-08-22_19-41-38_747_7237025592628396381-1/-ext-10000/_temporary/0/_temporary/attempt_20190822195048_0000_m_001404_0/link_crtd_date=2018-01-01 12%3A35%3A29
137    hdfs://horton/apps/risk/ars/datamart/email_edge_lyh_partitioned2/.hive-staging_hive_2019-08-22_19-41-38_747_7237025592628396381-1/-ext-10000/_temporary/0/_temporary/attempt_20190822195048_0000_m_001404_0/link_crtd_date=2018-01-01 12%3A35%3A47
136    hdfs://horton/apps/risk/ars/datamart/email_edge_lyh_partitioned2/.hive-staging_hive_2019-08-22_19-41-38_747_7237025592628396381-1/-ext-10000/_temporary/0/_temporary/attempt_20190822195048_0000_m_001404_0/link_crtd_date=2018-01-01 12%3A38%3A23
132    hdfs://horton/apps/risk/ars/datamart/email_edge_lyh_partitioned2/.hive-staging_hive_2019-08-22_19-41-38_747_7237025592628396381-1/-ext-10000/_temporary/0/_temporary/attempt_20190822195048_0000_m_001404_0/link_crtd_date=2018-01-01 12%3A38%3A54
536    hdfs://horton/apps/risk/ars/datamart/email_edge_lyh_partitioned2/.hive-staging_hive_2019-08-22_19-41-38_747_7237025592628396381-1/-ext-10000/_temporary/0/_temporary/attempt_20190822195048_0000_m_001404_0/link_crtd_date=2018-01-01 12%3A40%3A01


```
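
The directory names in the listing are percent-encoded, so it may help to decode them and see what partition values were actually written. Below is a minimal Python sketch of that check. The five names are taken from the last path component of the listing, assuming the line wraps in the archive stand for a single space between date and time; `partition_value` is a hypothetical helper name, not part of Spark or Hive.

```python
from urllib.parse import unquote

# Last path component of each staging directory in the listing.
# Assumption: the wrapped lines stand for one space between date and time.
staging_dirs = [
    "link_crtd_date=2018-01-01 12%3A35%3A29",
    "link_crtd_date=2018-01-01 12%3A35%3A47",
    "link_crtd_date=2018-01-01 12%3A38%3A23",
    "link_crtd_date=2018-01-01 12%3A38%3A54",
    "link_crtd_date=2018-01-01 12%3A40%3A01",
]


def partition_value(dir_name: str) -> str:
    """Return the decoded value from a 'column=value' directory name."""
    _, _, raw_value = dir_name.partition("=")
    return unquote(raw_value)  # %3A decodes to ':'


values = [partition_value(d) for d in staging_dirs]
distinct_values = set(values)
# The first 10 characters of each value are the 'yyyy-MM-dd' part.
distinct_dates = {v[:10] for v in values}

print(len(distinct_values), "distinct partition values")
print(len(distinct_dates), "distinct calendar dates")
```

In this sample every directory carries a different time of day on the same calendar date, which suggests one partition per distinct date-and-time value rather than one per day.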


Best Regards


Kelly Zhang



 

