steveloughran commented on issue #25795: [WIP][SPARK-29037][Core] Spark gives 
duplicate result when an application was killed
URL: https://github.com/apache/spark/pull/25795#issuecomment-535492981
 
 
   @hkkolodner - thanks for the clarification.
   
   FWIW, in https://github.com/apache/hadoop/pull/1442 I am actually cutting
back on enumerating all files in the _SUCCESS manifest I've been creating (for
test validation only), because at a sufficiently large terasort and TPC-DS
scale you end up with memory issues in job commits, which makes my colleagues
who are trying to run these things unhappy. I may be overreacting (it's
happening at an earlier stage), but I'm just being thorough in reviewing the
data structures built in a commit. And even if there's enough heap, I don't
want to force a many-MB upload at the end of every query, as that becomes a
bottleneck of its own.
   
   As I said, the post-directory-listing table layouts are the inevitable
future. The directory tree has been great: tool neutral, easy to navigate by
hand, etc., but it hits scale limits even in HDFS and serious perf limits in
higher-latency stores, and the commit-by-rename mechanism is running out of
steam.
   
   I don't have enough experience with any of the alternatives to have strong
opinions, except to conclude that yes, they are inevitable.

