LucasRoesler commented on issue #7657:
URL: https://github.com/apache/iceberg/issues/7657#issuecomment-1554741515

   We suspected the `rewrite-datafiles` because it is the only change that we 
know happened around the time that we stopped loading data. 
   
   Regarding 
   >  If you are confident it does not need to be shuffled you can always set 
the mode to none.
   
   We are currently just grabbing a batch of messages from Kafka and then we set 
the loaded time to the current timestamp. I suspect we don't need a shuffle 
here; all of the data should just go to the end of the current partition. I am 
not 100% sure, though.
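
   For reference, a minimal sketch of what setting the mode to none could look 
like, assuming the table is altered through Spark SQL. The table name 
`db.events` and the helper name are hypothetical; `write.distribution-mode` is 
the table property discussed in this thread, and the helper only builds the 
statement rather than running it.

```python
# Distribution modes accepted by Iceberg's write.distribution-mode property.
VALID_MODES = {"none", "hash", "range"}


def distribution_mode_sql(table: str, mode: str = "none") -> str:
    """Build a Spark SQL statement that sets write.distribution-mode on a table.

    `table` is a hypothetical identifier such as "db.events".
    """
    if mode not in VALID_MODES:
        raise ValueError(f"unsupported distribution mode: {mode!r}")
    return (
        f"ALTER TABLE {table} "
        f"SET TBLPROPERTIES ('write.distribution-mode' = '{mode}')"
    )


# With an active SparkSession named `spark` (assumed, not created here):
#   spark.sql(distribution_mode_sql("db.events"))
print(distribution_mode_sql("db.events"))
```

Whether `none` is actually safe depends on the incoming batches already being 
clustered by partition, which is the part I am unsure about.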
   
   
   Regarding the version, we did update the Iceberg version while testing 
various fixes for this, from 1.2.0 to 1.2.1. But it had been running 1.2.0 for 
a couple of weeks or longer. If I read the changelog correctly, 1.2.0 already 
had this change to the `write.distribution-mode`.
   
   
   
   Part of the reason we suspected that it was trying to sort the full table is 
that the full table is the only thing with as much data as the job claimed to 
have loaded. At one point it had sorted through 69 million rows. The Kafka topic 
simply didn't have that much data in it for the time period specified.
   
   For example, this is one of the screenshots I can find in our Slack:
   <details>
   
   
![image](https://github.com/apache/iceberg/assets/891889/79fc2e95-843b-4713-a358-3135d05d5080)
   
   </details>
   
   For comparison, this is the same screen for the currently running stream; I 
can't find any that show more rows loaded:
   <details>
   
   
![image](https://github.com/apache/iceberg/assets/891889/41a3d4da-1019-40cd-a011-fc5282db17ea)
   
   </details>
   
   
   Additionally, a single partition doesn't really have enough data to account 
for the size of that `ExistingRDD`.
   
   This is the metadata for the last handful of partitions in the new table:
   <details>
   
   
![image](https://github.com/apache/iceberg/assets/891889/b43f679d-7157-4aca-8bc2-b662727c41df)
   
   </details>
   
   For comparison, the same partition data for the original table:
   
   <details>
   
   
![image](https://github.com/apache/iceberg/assets/891889/84a56164-fd03-4bd7-a377-024475b32f2a)
   
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

