BsoBird commented on issue #7406: URL: https://github.com/apache/iceberg/issues/7406#issuecomment-1526907138
@bharos Thank you very much for your reply. But I think the solution you mentioned has the following problems:

1. When I write a large dataset into a large number of partitions, each task has to write data to many partitions, so each task creates many FileWriter objects, which quickly leads to OOM in the job.
2. In that case we also open a large number of file handles on the server side, the server load rises rapidly, and the risk of downtime increases.
3. Even if the job can be completed, we end up with a pile of small files, which significantly degrades the performance of analysis jobs that read this table.

If I were using Hive, when writing into a partitioned table I would typically do the following:

```sql
-- target_table / source_table are placeholders
INSERT OVERWRITE TABLE target_table PARTITION (part)
SELECT id, name, part FROM source_table
DISTRIBUTE BY part;
-- or: DISTRIBUTE BY hash(part) % 1024
```

The data is bucketed before it is written into the partitioned table, so each task writes a relatively small number of partitions and the write load is quite acceptable. The statement tends to take longer to run because the data is redistributed. When `write.distribution-mode=HASH`, Iceberg works in the same way as the SQL above.

But the problem is that Iceberg uses hidden partitioning, so I cannot control the data distribution manually from the outside. And our jobs are basically SQL jobs; our users will not accept rewriting them against the Spark API. So the situation is rather awkward now.
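For reference, here is a minimal sketch of how the distribution mode can at least be enabled from plain SQL, without touching the Spark API (table and column names are placeholders; this assumes a Spark session configured with an Iceberg catalog):

```sql
-- Ask Iceberg to hash-distribute rows by the table's partition expressions
-- before writing, analogous to the Hive DISTRIBUTE BY above.
ALTER TABLE db.target_table
SET TBLPROPERTIES ('write.distribution-mode' = 'hash');

-- Plain SQL inserts then pick up the table-level setting.
INSERT INTO db.target_table
SELECT id, name, part FROM db.source_table;
```

This only sets the distribution mode; it does not by itself address the finer-grained manual bucketing (e.g. `hash(part) % 1024`) described above.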
