[ https://issues.apache.org/jira/browse/HIVE-21164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17059280#comment-17059280 ]
Sungwoo commented on HIVE-21164:
--------------------------------
I have tested this commit (in the master branch) with TPC-DS datasets and found
that it fails to create ORC tables from text tables. Our testing proceeds as
follows. All the datasets reside on S3 simulated with MinIO, not on HDFS.
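(For reference, the S3A connector is pointed at MinIO with core-site.xml settings along the following lines; the endpoint and credentials shown here are placeholders, not our actual values.)
{code:xml}
<!-- Hypothetical core-site.xml fragment for S3A against MinIO -->
<property>
  <name>fs.s3a.endpoint</name>
  <value>http://minio-host:9000</value>
</property>
<property>
  <name>fs.s3a.path.style.access</name>
  <value>true</value>
</property>
<property>
  <name>fs.s3a.access.key</name>
  <value>PLACEHOLDER</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>PLACEHOLDER</value>
</property>
{code}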
1. Generate a TPC-DS dataset in text format.
2. Create an external table. Here is an example of the command (where some
lines are omitted):
{code:sql}
create external table store_sales
(
ss_sold_date_sk bigint,
...
ss_net_profit double
)
row format delimited fields terminated by '|'
location 's3a://tmp/tpcds-generate/2/store_sales';
{code}
3. Create a managed ORC table and load it from the text table. Here is an
example of the commands (where some lines are omitted):
{code:sql}
create table store_sales
(
ss_sold_time_sk bigint,
...
ss_net_profit double
)
partitioned by (ss_sold_date_sk bigint)
stored as orc
TBLPROPERTIES('transactional'='true', 'transactional_properties'='default');
from tpcds_text_2.store_sales ss
insert overwrite table store_sales partition (ss_sold_date_sk)
select
ss.ss_sold_time_sk,
...
ss.ss_net_profit,
ss.ss_sold_date_sk
where ss.ss_sold_date_sk is not null
insert overwrite table store_sales partition (ss_sold_date_sk)
select
ss.ss_sold_time_sk,
...
ss.ss_net_profit,
ss.ss_sold_date_sk
where ss.ss_sold_date_sk is null
sort by ss.ss_sold_date_sk
;
{code}
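For isolating the problem, a simpler single-insert variant (no multi-insert, no sort by) might also be worth trying; this is only a sketch, with the column list abbreviated as above:
{code:sql}
-- Hypothetical minimal reproduction: a single INSERT with dynamic partitioning
insert overwrite table store_sales partition (ss_sold_date_sk)
select
ss.ss_sold_time_sk,
...
ss.ss_net_profit,
ss.ss_sold_date_sk
from tpcds_text_2.store_sales ss
where ss.ss_sold_date_sk is not null;
{code}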
4. The result is that a new ORC table store_sales is created, but it contains
no rows. (The non-zero count(*) below is presumably answered from metastore
statistics rather than by scanning the files.)
{code:sh}
0: jdbc:hive2://orange1:9852/> select count(*) from store_sales;
...
+----------+
| _c0 |
+----------+
| 5501397 |
+----------+
1 row selected (0.766 seconds)
0: jdbc:hive2://orange1:9852/> select * from store_sales limit 100;
...
No rows selected (94.835 seconds)
{code}
The directory for the new table contains the expected subdirectories, but no
bucket files, e.g.:
{code:none}
store_sales/ss_sold_date_sk=2451465
store_sales/ss_sold_date_sk=2451465/base_0000001
store_sales/ss_sold_date_sk=2451465/base_0000001/_orc_acid_version
store_sales/ss_sold_date_sk=__HIVE_DEFAULT_PARTITION__
store_sales/ss_sold_date_sk=__HIVE_DEFAULT_PARTITION__/base_0000001
store_sales/ss_sold_date_sk=__HIVE_DEFAULT_PARTITION__/base_0000001/_orc_acid_version
{code}
5. What happens during the creation of the ORC table is:
1) The query completes successfully.
2) All intermediate data is created (over 300 megabytes), but is deleted at the
end.
The relevant parts of the HiveServer2 log look like:
{code:none}
2020-03-13T20:21:12,669 INFO [HiveServer2-Background-Pool: Thread-259]
FileOperations: Reading manifest
s3a://hivemr3/warehouse/tpcds_bin_partitioned_orc_2.db/store_sales/_tmp.base_0000001/000013_0.manifest
2020-03-13T20:21:13,062 INFO [HiveServer2-Background-Pool: Thread-259]
FileOperations: Expected level of nesting (2) is not present in
_tmp.base_0000001/000000_0.manifest (from
s3a://hivemr3/warehouse/tpcds_bin_partitioned_orc_2.db/store_sales/_tmp.base_0000001/000000_0.manifest)
...
2020-03-13T20:21:13,233 INFO [HiveServer2-Background-Pool: Thread-259]
FileOperations: Deleting manifest directory
s3a://hivemr3/warehouse/tpcds_bin_partitioned_orc_2.db/store_sales/_tmp.base_0000001
2020-03-13T20:21:13,438 INFO [HiveServer2-Background-Pool: Thread-259]
FileOperations: Deleting
s3a://hivemr3/warehouse/tpcds_bin_partitioned_orc_2.db/store_sales/ss_sold_date_sk=2451257/base_0000001/bucket_00003_0
that was not committed
{code}
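The "Expected level of nesting (2)" message suggests the commit step only keeps files that sit exactly two directory levels below the table location (partition directory + base directory), treating anything else as uncommitted and deleting it. A rough sketch of that kind of depth check, with hypothetical names (this is not the actual FileOperations code):
{code:python}
# Hypothetical sketch of a manifest nesting-depth check; not Hive's actual code.
def relative_depth(table_root: str, file_path: str) -> int:
    """Number of directory levels between the table root and the file."""
    rel = file_path[len(table_root):].strip("/")
    return rel.count("/")

table = "s3a://hivemr3/warehouse/tpcds_bin_partitioned_orc_2.db/store_sales"
# partition dir + base dir -> depth 2, the expected nesting for this table
good = table + "/ss_sold_date_sk=2451257/base_0000001/bucket_00003_0"
# a file missing the partition level -> depth 1, would be seen as uncommitted
bad = table + "/base_0000001/bucket_00003_0"
print(relative_depth(table, good))  # 2
print(relative_depth(table, bad))   # 1
{code}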
From the log file, it appears that all the bucket files are deleted at the end.
This behavior was first introduced by this commit: the previous commit in the
master branch of Hive, namely 'HIVE-22816 : QueryCache: Queries using views
can have them cached after CTE expansion' of February 20, does not show the
same behavior and produces the correct result.
If you think I am missing something in the setup of the experiment, please let
me know. If anybody updates the patch, I can quickly rerun the experiment and
report the result back. Unfortunately, I have not yet figured out why this
commit produces a different result.
> ACID: explore how we can avoid a move step during inserts/compaction
> --------------------------------------------------------------------
>
> Key: HIVE-21164
> URL: https://issues.apache.org/jira/browse/HIVE-21164
> Project: Hive
> Issue Type: Bug
> Components: Transactions
> Affects Versions: 3.1.1
> Reporter: Vaibhav Gumashta
> Assignee: Marta Kuczora
> Priority: Major
> Fix For: 4.0.0
>
> Attachments: HIVE-21164.1.patch, HIVE-21164.10.patch,
> HIVE-21164.11.patch, HIVE-21164.11.patch, HIVE-21164.12.patch,
> HIVE-21164.13.patch, HIVE-21164.14.patch, HIVE-21164.14.patch,
> HIVE-21164.15.patch, HIVE-21164.16.patch, HIVE-21164.17.patch,
> HIVE-21164.18.patch, HIVE-21164.19.patch, HIVE-21164.2.patch,
> HIVE-21164.20.patch, HIVE-21164.21.patch, HIVE-21164.22.patch,
> HIVE-21164.3.patch, HIVE-21164.4.patch, HIVE-21164.5.patch,
> HIVE-21164.6.patch, HIVE-21164.7.patch, HIVE-21164.8.patch, HIVE-21164.9.patch
>
>
> Currently, we write compacted data to a temporary location and then move the
> files to a final location, which is an expensive operation on some cloud file
> systems. Since HIVE-20823 is already in, it can control the visibility of
> compacted data for the readers. Therefore, we can perhaps avoid writing data
> to a temporary location and directly write compacted data to the intended
> final path.