BsoBird commented on issue #7406:
URL: https://github.com/apache/iceberg/issues/7406#issuecomment-1526907138

   @bharos 
   Thank you very much for your reply. 
   However, I think the solution you mentioned has the following problems:
   1. When I write a large dataset that spans many partitions, each task has to write data into many partitions, so each task creates a large number of FileWriter objects, which quickly drives the job to OOM (a sketch of this failure mode follows this list).
   2. In the same situation we also open a large number of file handles on the server side, the load on the server rises rapidly, and the risk of downtime increases.
   3. Even if the job completes, we end up with a pile of small files, which causes a significant performance drop for any analysis task that reads this table.
   
   If I were using Hive, when writing to a partitioned table I would typically do something like the following (`target_table` is a placeholder name):
   ```sql
   insert overwrite table target_table partition(part)
   select id, name, part from source_table
   distribute by part  -- or: distribute by hash(part) % 1024
   ```
   This buckets the data before it is written to the partitioned table. Each task then writes only a small number of partitions, so the write behaves well.
   The statement does tend to take longer to run, because the data has to be redistributed first.
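   For reference, a minimal sketch of how the same distribution is requested on an Iceberg table through the documented `write.distribution-mode` table property (the table name `prod.db.target` is a placeholder):
   ```sql
   -- Ask Iceberg to hash-distribute rows by partition before writing,
   -- so each task receives the rows of only a few partitions.
   ALTER TABLE prod.db.target
   SET TBLPROPERTIES ('write.distribution-mode' = 'hash');
   ```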
   
   With `write.distribution-mode=hash`, Iceberg behaves the same way as the Hive statement above.
   The problem is that Iceberg has hidden partitioning, so I cannot control the data distribution manually from the outside.
   And our jobs are almost entirely SQL jobs; our users will not accept rewriting them against the Spark API.
   So the situation is rather awkward.
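   To make the hidden-partitioning point concrete, here is a minimal sketch (table and column names are hypothetical) using Iceberg's documented Spark DDL:
   ```sql
   -- The table is partitioned by a transform of ts, not by a physical column.
   CREATE TABLE prod.db.events (
     id      bigint,
     ts      timestamp,
     payload string
   )
   USING iceberg
   PARTITIONED BY (days(ts));

   -- There is no `days(ts)` column to reference in a query, so plain SQL like
   --   INSERT INTO prod.db.events SELECT ... DISTRIBUTE BY <partition>
   -- has nothing to distribute by; the clustering has to come from the
   -- table's own write.distribution-mode setting instead.
   ```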
   

