[jira] [Commented] (KYLIN-4343) Build Global Dict by MR/Hive, new config

ASF GitHub Bot (Jira) Tue, 16 Jun 2020 05:20:15 -0700


    [ 
https://issues.apache.org/jira/browse/KYLIN-4343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17136593#comment-17136593
 ]


ASF GitHub Bot commented on KYLIN-4343:
---------------------------------------

hit-lacus edited a comment on pull request #1267:
URL: https://github.com/apache/kylin/pull/1267#issuecomment-644727013


   ### Create table
   
   
   ```sh
   zookeeper lock path :/mr_dict_ephemeral_lock/UserActionCubeByHive_NO2, result is false
   zookeeper get lock costTime : 0 s
   Build Hive Global Dictionary by: hive -e "set mapreduce.job.name=Build Hive Global Dict - extract distinct value;
   USE LACUS;
   
   set hive.exec.compress.output=false;set hive.mapred.mode=unstrict;CREATE TABLE IF NOT EXISTS LACUS.UserActionCubeByHive_NO2_global_dict
    ( dict_key STRING COMMENT '', 
      dict_val INT COMMENT '' 
   ) 
   COMMENT 'Hive Global Dictionary' 
   PARTITIONED BY (dict_column string) 
   ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' 
   STORED AS TEXTFILE; 
   DROP TABLE IF EXISTS kylin_intermediate_useractioncubebyhive_no2_e99a9c08_3437_06d8_796f_807dd224a970__distinct_value;
   CREATE TABLE IF NOT EXISTS kylin_intermediate_useractioncubebyhive_no2_e99a9c08_3437_06d8_796f_807dd224a970__distinct_value
   ( 
      dict_key STRING COMMENT '' 
   ) 
   COMMENT '' 
   PARTITIONED BY (dict_column string) 
   STORED AS TEXTFILE 
   ;
   DROP TABLE IF EXISTS kylin_intermediate_useractioncubebyhive_no2_e99a9c08_3437_06d8_796f_807dd224a970_global_dict;
   CREATE TABLE IF NOT EXISTS kylin_intermediate_useractioncubebyhive_no2_e99a9c08_3437_06d8_796f_807dd224a970_global_dict
   ( 
      dict_key STRING COMMENT '' , 
     dict_val STRING COMMENT '' 
   ) 
   COMMENT '' 
   PARTITIONED BY (dict_column string) 
   ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' 
   STORED AS TEXTFILE 
   ;
   INSERT OVERWRITE TABLE kylin_intermediate_useractioncubebyhive_no2_e99a9c08_3437_06d8_796f_807dd224a970__distinct_value
   PARTITION (dict_column = 'USERACTIONLOGSAMPLE_PLAY_ID')
   SELECT a.DICT_KEY FROM (
     SELECT
   USERACTIONLOGSAMPLE_PLAY_ID as DICT_KEY
     FROM kylin_intermediate_useractioncubebyhive_no2_e99a9c08_3437_06d8_796f_807dd224a970
     GROUP BY USERACTIONLOGSAMPLE_PLAY_ID) a
       LEFT JOIN
     (SELECT DICT_KEY FROM LACUS.UserActionCubeByHive_NO2_global_dict    WHERE DICT_COLUMN = 'USERACTIONLOGSAMPLE_PLAY_ID' ) b
   ON a.DICT_KEY = b.DICT_KEY
   WHERE b.DICT_KEY IS NULL
   ;
   INSERT OVERWRITE TABLE kylin_intermediate_useractioncubebyhive_no2_e99a9c08_3437_06d8_796f_807dd224a970__distinct_value
   PARTITION (dict_column = 'USERACTIONLOGSAMPLE_PLAY_DURATION')
   SELECT a.DICT_KEY FROM (
     SELECT
   USERACTIONLOGSAMPLE_PLAY_DURATION as DICT_KEY
     FROM kylin_intermediate_useractioncubebyhive_no2_e99a9c08_3437_06d8_796f_807dd224a970
     GROUP BY USERACTIONLOGSAMPLE_PLAY_DURATION) a
       LEFT JOIN
     (SELECT DICT_KEY FROM LACUS.UserActionCubeByHive_NO2_global_dict    WHERE DICT_COLUMN = 'USERACTIONLOGSAMPLE_PLAY_DURATION' ) b
   ON a.DICT_KEY = b.DICT_KEY
   WHERE b.DICT_KEY IS NULL
   ;
   INSERT OVERWRITE TABLE  kylin_intermediate_useractioncubebyhive_no2_e99a9c08_3437_06d8_796f_807dd224a970__distinct_value PARTITION (DICT_COLUMN = 'KYLIN_MAX_DISTINCT_COUNT')
   SELECT CONCAT_WS(',', tc.dict_column, cast(tc.total_distinct_val AS String), if(tm.max_dict_val is null, '0', cast(max_dict_val as string)))
   FROM (
       SELECT dict_column, count(1) total_distinct_val
       FROM LACUS.kylin_intermediate_useractioncubebyhive_no2_e99a9c08_3437_06d8_796f_807dd224a970__distinct_value
       WHERE DICT_COLUMN != 'KYLIN_MAX_DISTINCT_COUNT'
       GROUP BY dict_column) tc
   LEFT JOIN (
   
       SELECT dict_column, if(max(dict_val) is null, 0, max(dict_val)) as max_dict_val
       FROM LACUS.UserActionCubeByHive_NO2_global_dict
       GROUP BY dict_column) tm
   ON tc.dict_column = tm.dict_column;
   " --hiveconf hive.merge.mapredfiles=false --hiveconf hive.auto.convert.join=true --hiveconf dfs.replication=2 --hiveconf hive.exec.compress.output=true --hiveconf hive.auto.convert.join.noconditionaltask=true --hiveconf mapreduce.job.split.metainfo.maxsize=-1 --hiveconf hive.merge.mapfiles=false --hiveconf hive.auto.convert.join.noconditionaltask.size=100000000 --hiveconf hive.stats.autogather=true
   ls: cannot access 
/root/lib/spark-2.3.3-bin-hadoop2.6/lib/spark-assembly-*.jar: No such file or 
directory
   Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512M; 
support was removed in 8.0
   Java HotSpot(TM) 64-Bit Server VM warning: Using incremental CMS is 
deprecated and will likely be removed in a future release
   Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512M; 
support was removed in 8.0
   
   Logging initialized using configuration in 
jar:file:/opt/cloudera/parcels/CDH-5.7.6-1.cdh5.7.6.p0.6/jars/hive-common-1.1.0-cdh5.7.6.jar!/hive-log4j.properties
   OK
   Time taken: 1.997 seconds
   OK
   Time taken: 0.45 seconds
   OK
   Time taken: 0.084 seconds
   OK
   Time taken: 0.165 seconds
   OK
   Time taken: 0.056 seconds
   OK
   Time taken: 0.175 seconds
   Query ID = root_20200616195151_5b911606-e6ed-4ff5-80a8-bc3091f6064b
   Total jobs = 3
   Launching Job 1 out of 3
   Number of reduce tasks not specified. Estimated from input data size: 1
   In order to change the average load for a reducer (in bytes):
     set hive.exec.reducers.bytes.per.reducer=<number>
   In order to limit the maximum number of reducers:
     set hive.exec.reducers.max=<number>
   In order to set a constant number of reducers:
     set mapreduce.job.reduces=<number>
   Starting Job = job_1589169585068_5803, Tracking URL = 
http://cdh-master:8088/proxy/application_1589169585068_5803/
   Kill Command = 
/opt/cloudera/parcels/CDH-5.7.6-1.cdh5.7.6.p0.6/bin/../lib/hadoop/bin/hadoop 
job  -kill job_1589169585068_5803
   Hadoop job information for Stage-1: number of mappers: 1; number of 
reducers: 1
   2020-06-16 19:51:38,155 Stage-1 map = 0%,  reduce = 0%
   2020-06-16 19:51:43,328 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 
2.23 sec
   2020-06-16 19:51:49,505 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 
4.77 sec
   MapReduce Total cumulative CPU time: 4 seconds 770 msec
   Ended Job = job_1589169585068_5803
   Stage-7 is selected by condition resolver.
   Stage-2 is filtered out by condition resolver.
   Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512M; 
support was removed in 8.0
   Execution log at: 
/tmp/root/root_20200616195151_5b911606-e6ed-4ff5-80a8-bc3091f6064b.log
   2020-06-16 07:51:53  Starting to launch local task to process map join;      
maximum memory = 1908932608
   2020-06-16 07:51:53  Dump the side-table for tag: 1 with group count: 0 into 
file: 
file:/tmp/root/f4ddf013-1d00-4e3e-9b46-c4ffbfa79e0e/hive_2020-06-16_19-51-30_297_4198334643994193254-1/-local-10003/HashTable-Stage-5/MapJoin-mapfile01--.hashtable
   2020-06-16 07:51:53  Uploaded 1 File to: 
file:/tmp/root/f4ddf013-1d00-4e3e-9b46-c4ffbfa79e0e/hive_2020-06-16_19-51-30_297_4198334643994193254-1/-local-10003/HashTable-Stage-5/MapJoin-mapfile01--.hashtable
 (260 bytes)
   2020-06-16 07:51:53  End of local task; Time Taken: 0.432 sec.
   Execution completed successfully
   MapredLocal task succeeded
   Launching Job 3 out of 3
   Number of reduce tasks is set to 0 since there's no reduce operator
   Starting Job = job_1589169585068_5804, Tracking URL = 
http://cdh-master:8088/proxy/application_1589169585068_5804/
   Kill Command = 
/opt/cloudera/parcels/CDH-5.7.6-1.cdh5.7.6.p0.6/bin/../lib/hadoop/bin/hadoop 
job  -kill job_1589169585068_5804
   Hadoop job information for Stage-5: number of mappers: 1; number of 
reducers: 0
   2020-06-16 19:51:59,633 Stage-5 map = 0%,  reduce = 0%
   2020-06-16 19:52:04,794 Stage-5 map = 100%,  reduce = 0%, Cumulative CPU 
3.58 sec
   MapReduce Total cumulative CPU time: 3 seconds 580 msec
   Ended Job = job_1589169585068_5804
   Loading data to table 
lacus.kylin_intermediate_useractioncubebyhive_no2_e99a9c08_3437_06d8_796f_807dd224a970__distinct_value
 partition (dict_column=USERACTIONLOGSAMPLE_PLAY_ID)
   Partition 
lacus.kylin_intermediate_useractioncubebyhive_no2_e99a9c08_3437_06d8_796f_807dd224a970__distinct_value{dict_column=USERACTIONLOGSAMPLE_PLAY_ID}
 stats: [numFiles=1, numRows=10000, totalSize=527979, rawDataSize=517979]
   MapReduce Jobs Launched: 
   Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 4.77 sec   HDFS Read: 
502687 HDFS Write: 704955 SUCCESS
   Stage-Stage-5: Map: 1   Cumulative CPU: 3.58 sec   HDFS Read: 710940 HDFS 
Write: 528186 SUCCESS
   Total MapReduce CPU Time Spent: 8 seconds 350 msec
   OK
   Time taken: 37.21 seconds
   Query ID = root_20200616195252_edb3c5ad-fa39-445c-a214-dc68d566e30e
   Total jobs = 3
   Launching Job 1 out of 3
   Number of reduce tasks not specified. Estimated from input data size: 1
   In order to change the average load for a reducer (in bytes):
     set hive.exec.reducers.bytes.per.reducer=<number>
   In order to limit the maximum number of reducers:
     set hive.exec.reducers.max=<number>
   In order to set a constant number of reducers:
     set mapreduce.job.reduces=<number>
   Starting Job = job_1589169585068_5805, Tracking URL = 
http://cdh-master:8088/proxy/application_1589169585068_5805/
   Kill Command = 
/opt/cloudera/parcels/CDH-5.7.6-1.cdh5.7.6.p0.6/bin/../lib/hadoop/bin/hadoop 
job  -kill job_1589169585068_5805
   Hadoop job information for Stage-1: number of mappers: 1; number of 
reducers: 1
   2020-06-16 19:52:13,708 Stage-1 map = 0%,  reduce = 0%
   2020-06-16 19:52:18,853 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 
2.45 sec
   2020-06-16 19:52:25,018 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 
5.24 sec
   MapReduce Total cumulative CPU time: 5 seconds 240 msec
   Ended Job = job_1589169585068_5805
   Stage-7 is selected by condition resolver.
   Stage-2 is filtered out by condition resolver.
   Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512M; 
support was removed in 8.0
   Execution log at: 
/tmp/root/root_20200616195252_edb3c5ad-fa39-445c-a214-dc68d566e30e.log
   2020-06-16 07:52:28  Starting to launch local task to process map join;      
maximum memory = 1908932608
   2020-06-16 07:52:29  Dump the side-table for tag: 1 with group count: 0 into 
file: 
file:/tmp/root/f4ddf013-1d00-4e3e-9b46-c4ffbfa79e0e/hive_2020-06-16_19-52-07_533_3381191652948885785-1/-local-10003/HashTable-Stage-5/MapJoin-mapfile11--.hashtable
   2020-06-16 07:52:29  Uploaded 1 File to: 
file:/tmp/root/f4ddf013-1d00-4e3e-9b46-c4ffbfa79e0e/hive_2020-06-16_19-52-07_533_3381191652948885785-1/-local-10003/HashTable-Stage-5/MapJoin-mapfile11--.hashtable
 (260 bytes)
   2020-06-16 07:52:29  End of local task; Time Taken: 0.574 sec.
   Execution completed successfully
   MapredLocal task succeeded
   Launching Job 3 out of 3
   Number of reduce tasks is set to 0 since there's no reduce operator
   Starting Job = job_1589169585068_5806, Tracking URL = 
http://cdh-master:8088/proxy/application_1589169585068_5806/
   Kill Command = 
/opt/cloudera/parcels/CDH-5.7.6-1.cdh5.7.6.p0.6/bin/../lib/hadoop/bin/hadoop 
job  -kill job_1589169585068_5806
   Hadoop job information for Stage-5: number of mappers: 1; number of 
reducers: 0
   2020-06-16 19:52:35,441 Stage-5 map = 0%,  reduce = 0%
   2020-06-16 19:52:41,604 Stage-5 map = 100%,  reduce = 0%, Cumulative CPU 
3.71 sec
   MapReduce Total cumulative CPU time: 3 seconds 710 msec
   Ended Job = job_1589169585068_5806
   Loading data to table 
lacus.kylin_intermediate_useractioncubebyhive_no2_e99a9c08_3437_06d8_796f_807dd224a970__distinct_value
 partition (dict_column=USERACTIONLOGSAMPLE_PLAY_DURATION)
   Partition 
lacus.kylin_intermediate_useractioncubebyhive_no2_e99a9c08_3437_06d8_796f_807dd224a970__distinct_value{dict_column=USERACTIONLOGSAMPLE_PLAY_DURATION}
 stats: [numFiles=1, numRows=4098, totalSize=29177, rawDataSize=25079]
   MapReduce Jobs Launched: 
   Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 5.24 sec   HDFS Read: 
502807 HDFS Write: 89619 SUCCESS
   Stage-Stage-5: Map: 1   Cumulative CPU: 3.71 sec   HDFS Read: 95878 HDFS 
Write: 29388 SUCCESS
   Total MapReduce CPU Time Spent: 8 seconds 950 msec
   OK
   Time taken: 35.619 seconds
   Query ID = root_20200616195252_5869795b-713e-4066-b062-48c0ff08d4a1
   Total jobs = 4
   Launching Job 1 out of 4
   Number of reduce tasks not specified. Estimated from input data size: 1
   In order to change the average load for a reducer (in bytes):
     set hive.exec.reducers.bytes.per.reducer=<number>
   In order to limit the maximum number of reducers:
     set hive.exec.reducers.max=<number>
   In order to set a constant number of reducers:
     set mapreduce.job.reduces=<number>
   Starting Job = job_1589169585068_5807, Tracking URL = 
http://cdh-master:8088/proxy/application_1589169585068_5807/
   Kill Command = 
/opt/cloudera/parcels/CDH-5.7.6-1.cdh5.7.6.p0.6/bin/../lib/hadoop/bin/hadoop 
job  -kill job_1589169585068_5807
   Hadoop job information for Stage-1: number of mappers: 1; number of 
reducers: 1
   2020-06-16 19:52:49,232 Stage-1 map = 0%,  reduce = 0%
   2020-06-16 19:52:54,366 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 
1.66 sec
   2020-06-16 19:53:00,516 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 
3.71 sec
   MapReduce Total cumulative CPU time: 3 seconds 710 msec
   Ended Job = job_1589169585068_5807
   Launching Job 2 out of 4
   Number of reduce tasks not specified. Estimated from input data size: 1
   In order to change the average load for a reducer (in bytes):
     set hive.exec.reducers.bytes.per.reducer=<number>
   In order to limit the maximum number of reducers:
     set hive.exec.reducers.max=<number>
   In order to set a constant number of reducers:
     set mapreduce.job.reduces=<number>
   Starting Job = job_1589169585068_5808, Tracking URL = 
http://cdh-master:8088/proxy/application_1589169585068_5808/
   Kill Command = 
/opt/cloudera/parcels/CDH-5.7.6-1.cdh5.7.6.p0.6/bin/../lib/hadoop/bin/hadoop 
job  -kill job_1589169585068_5808
   Hadoop job information for Stage-4: number of mappers: 1; number of 
reducers: 1
   2020-06-16 19:53:07,019 Stage-4 map = 0%,  reduce = 0%
   2020-06-16 19:53:12,151 Stage-4 map = 100%,  reduce = 0%, Cumulative CPU 
1.32 sec
   2020-06-16 19:53:18,296 Stage-4 map = 100%,  reduce = 100%, Cumulative CPU 
4.69 sec
   MapReduce Total cumulative CPU time: 4 seconds 690 msec
   Ended Job = job_1589169585068_5808
   Stage-7 is selected by condition resolver.
   Stage-2 is filtered out by condition resolver.
   Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512M; 
support was removed in 8.0
   Execution log at: 
/tmp/root/root_20200616195252_5869795b-713e-4066-b062-48c0ff08d4a1.log
   2020-06-16 07:53:21  Starting to launch local task to process map join;      
maximum memory = 1908932608
   2020-06-16 07:53:22  Dump the side-table for tag: 1 with group count: 0 into 
file: 
file:/tmp/root/f4ddf013-1d00-4e3e-9b46-c4ffbfa79e0e/hive_2020-06-16_19-52-43_179_1208261470870565365-1/-local-10004/HashTable-Stage-5/MapJoin-mapfile21--.hashtable
   2020-06-16 07:53:22  Uploaded 1 File to: 
file:/tmp/root/f4ddf013-1d00-4e3e-9b46-c4ffbfa79e0e/hive_2020-06-16_19-52-43_179_1208261470870565365-1/-local-10004/HashTable-Stage-5/MapJoin-mapfile21--.hashtable
 (260 bytes)
   2020-06-16 07:53:22  End of local task; Time Taken: 0.704 sec.
   Execution completed successfully
   MapredLocal task succeeded
   Launching Job 4 out of 4
   Number of reduce tasks is set to 0 since there's no reduce operator
   Starting Job = job_1589169585068_5809, Tracking URL = 
http://cdh-master:8088/proxy/application_1589169585068_5809/
   Kill Command = 
/opt/cloudera/parcels/CDH-5.7.6-1.cdh5.7.6.p0.6/bin/../lib/hadoop/bin/hadoop 
job  -kill job_1589169585068_5809
   Hadoop job information for Stage-5: number of mappers: 1; number of 
reducers: 0
   2020-06-16 19:53:28,514 Stage-5 map = 0%,  reduce = 0%
   2020-06-16 19:53:34,689 Stage-5 map = 100%,  reduce = 0%, Cumulative CPU 
3.19 sec
   MapReduce Total cumulative CPU time: 3 seconds 190 msec
   Ended Job = job_1589169585068_5809
   Loading data to table 
lacus.kylin_intermediate_useractioncubebyhive_no2_e99a9c08_3437_06d8_796f_807dd224a970__distinct_value
 partition (dict_column=KYLIN_MAX_DISTINCT_COUNT)
   Partition 
lacus.kylin_intermediate_useractioncubebyhive_no2_e99a9c08_3437_06d8_796f_807dd224a970__distinct_value{dict_column=KYLIN_MAX_DISTINCT_COUNT}
 stats: [numFiles=1, numRows=2, totalSize=77, rawDataSize=75]
   MapReduce Jobs Launched: 
   Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 3.71 sec   HDFS Read: 
565472 HDFS Write: 198 SUCCESS
   Stage-Stage-4: Map: 1  Reduce: 1   Cumulative CPU: 4.69 sec   HDFS Read: 
8598 HDFS Write: 96 SUCCESS
   Stage-Stage-5: Map: 1   Cumulative CPU: 3.19 sec   HDFS Read: 6382 HDFS 
Write: 274 SUCCESS
   Total MapReduce CPU Time Spent: 11 seconds 590 msec
   OK
   Time taken: 53.08 seconds
   ```
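
For readers skimming the log above, the two kinds of `INSERT OVERWRITE` statements can be mirrored in a few lines of Python. This is only an illustrative sketch of the logic visible in the SQL, not Kylin code: the function names `new_dict_keys` and `max_distinct_count_row` are hypothetical, the global dictionary is assumed to fit in an in-memory `dict`, and SQL `NULL` handling is ignored.

```python
def new_dict_keys(segment_values, global_dict):
    """Mirror of the LEFT JOIN ... WHERE b.DICT_KEY IS NULL statements:
    distinct values from the segment that have no global dictionary entry yet."""
    return sorted(set(segment_values) - set(global_dict))


def max_distinct_count_row(column, new_keys, global_dict):
    """Mirror of the KYLIN_MAX_DISTINCT_COUNT statement: one CSV row holding
    the column name, the number of new distinct values, and the current
    maximum dict_val ('0' when the column has no dictionary entries yet)."""
    max_val = max(global_dict.values(), default=0)
    return ",".join([column, str(len(new_keys)), str(max_val)])


# Hypothetical sample data: one value is already encoded, two are new.
existing = {"a": 1}
keys = new_dict_keys(["a", "b", "a", "c"], existing)
row = max_distinct_count_row("USERACTIONLOGSAMPLE_PLAY_ID", keys, existing)
```

In the log, the per-column statements populate one partition each of the `__distinct_value` table, and the final statement writes the summary rows into the `KYLIN_MAX_DISTINCT_COUNT` partition of the same table.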
   
   
   ### Clean up
   
   ```shell
   Hive table LACUS.kylin_intermediate_useractioncubebyhive_no2_e99a9c08_3437_06d8_796f_807dd224a970 is dropped.
   Hive table LACUS.kylin_intermediate_useractioncubebyhive_no2_e99a9c08_3437_06d8_796f_807dd224a970__distinct_value is dropped.
   Hive table LACUS.kylin_intermediate_useractioncubebyhive_no2_e99a9c08_3437_06d8_796f_807dd224a970_global_dict is dropped.
   Path [hdfs://cdh-master:8020/LACUS/LACUS/kylin-23742cdf-63f9-bfb0-a446-201795163dd1/kylin_intermediate_useractioncubebyhive_no2_e99a9c08_3437_06d8_796f_807dd224a970] is deleted.
   ```
   
   <img width="1130" alt="image" src="https://user-images.githubusercontent.com/14030549/84773097-7fe58380-b00e-11ea-914c-45164c43e0da.png">


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


> Build Global Dict by MR/Hive, new config
> ----------------------------------------
>
>                 Key: KYLIN-4343
>                 URL: https://issues.apache.org/jira/browse/KYLIN-4343
>             Project: Kylin
>          Issue Type: Sub-task
>            Reporter: wangxiaojing
>            Assignee: wangxiaojing
>            Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)
