fengguangyuan commented on issue #3340:
URL: https://github.com/apache/iceberg/issues/3340#issuecomment-1434622518

   > > # Why the problem?
   > > Consider the following case: each `AppendFiles` instance may hold a stale table metadata instance (referenced by `base`, a member variable of `SnapshotProducer`), because new snapshots have been committed by other threads or tasks:
   > > ```java
   > > AppendFiles af1 = table.newAppend().appendFile(fileFromThread1);
   > > AppendFiles af2 = table.newAppend().appendFile(fileFromThread2);
   > > AppendFiles af3 = table.newAppend().appendFile(fileFromThread3);
   > > ...
   > > ```
   > > 
   > > With so many `AppendFiles` instances alive, the stale `TableMetadata` instances they reference won't be reclaimed by the GC in time. And since the size of a `TableMetadata` instance grows with the number of snapshots, GC issues follow, most commonly a `GC overhead limit exceeded` error.
   > 
   > Hi @fengguangyuan! Firstly, thank you for your contributions.
   > 
   > I wanted to ask about your comment here on the `GC overhead limit exceeded` error and your concern with having too many snapshots (which is definitely valid in general, though there are configurations to rewrite snapshots after a certain number, and table maintenance operations to keep snapshots at a healthy amount for your needs). I'm hoping you can help me understand your practices better: how are you using the library to achieve this additional parallelism (is this via `.par` to make a Scala parallel collection with Spark, do you have custom code, or is it just a configuration property you've raised)? Also, what catalog are you using, and what filestore are you writing to? And (important, but I understand if it's not easy to answer right away), at what rate are you accumulating additional snapshots based on your incoming data (roughly how many files do you have per snapshot in general)? Finally, how often are you calling `AppendFiles` per commit (you mentioned `AppendFiles`, so I'm wondering if you're using the library a bit more directly)?
   > 
   > I know this isn't necessarily a small ask, but you've given a very 
thorough description here, and I'd really like to better understand the 
problems that you're seeing arise as well as your usage of the library. I think 
it would be really valuable.
   > 
   > Anything you can provide, starting with the basics of:
   > 
   > * system you're using for writing (e.g., Flink, Spark, Trino, etc.),
   > * catalog you're using
   > * iceberg version you're using
   > * any non-default configuration values you're setting, particularly any that would affect commit rate and snapshot production
   > 
   > Also, ideally:
   > 
   > * How you're achieving this added parallelism (a config value, high-level code in your job, code you've written using the Iceberg library, etc.)
   > * If you've written your own code using the Java API, the relationship between commits and `AppendFiles`
   > 
   > Anything you can share would greatly help me understand your use case and the problems you're facing. And there might be learnings to be had from your usage. 🙂
   > 
   > Thanks! Best regards, Kyle :)
   
   So sorry it took so long to reply. Thanks for your recommendations; it's a lesson I've taken to heart!
   
   After learning more about the code, I see it's indeed a bad practice to share a table across different threads.
   It's the caller's responsibility NOT to do that, to avoid generating a large number of stale `metadata` objects through Hive operations in a short time, which puts pressure on the JVM.
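   
   For anyone landing here with the same problem, a minimal sketch of the safer pattern (class and variable names like `SerialAppends` and `dataFiles` are hypothetical, just for illustration): commit each append before building the next one, so that at most one pending operation, and the `TableMetadata` it references, is alive at a time:
   
   ```java
   import java.util.List;
   
   import org.apache.iceberg.DataFile;
   import org.apache.iceberg.Table;
   
   public class SerialAppends {
     // Commit each append before creating the next one, so at most one
     // pending operation (and the TableMetadata it references) is alive
     // at a time, instead of piling up stale metadata across threads.
     static void appendSerially(Table table, List<DataFile> dataFiles) {
       for (DataFile dataFile : dataFiles) {
         table.newAppend()          // fresh AppendFiles against the current metadata
             .appendFile(dataFile)
             .commit();             // releases the metadata held by this operation
       }
     }
   }
   ```
   
   If parallel writers are unavoidable, each thread should at least work with its own `Table` instance loaded from the catalog and keep the window between `newAppend()` and `commit()` as short as possible.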


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

