Github user steveloughran commented on a diff in the pull request:
https://github.com/apache/spark/pull/6935#discussion_r48368082
--- Diff: docs/monitoring.md ---
@@ -69,36 +83,53 @@ follows:
</tr>
</table>
+### Spark configuration options
+
<table class="table">
<tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
<tr>
<td>spark.history.provider</td>
- <td>org.apache.spark.deploy.history.FsHistoryProvider</td>
+ <td><code>org.apache.spark.deploy.history.FsHistoryProvider</code></td>
     <td>Name of the class implementing the application history backend.
     Currently there is only one implementation, provided by Spark, which
     looks for application logs stored in the file system.</td>
</tr>
<tr>
+ <td>spark.history.retainedApplications</td>
+ <td>50</td>
+ <td>
+      The number of application UIs to retain. If this cap is exceeded, then
+      the oldest applications will be removed.
+ </td>
+ </tr>
+ <tr>
<td>spark.history.fs.logDirectory</td>
<td>file:/tmp/spark-events</td>
<td>
-      Directory that contains application event logs to be loaded by the history server
+      For the filesystem history provider, the URL to the directory containing
+      application event logs to load. This can be a local <code>file://</code> path,
+      an HDFS path <code>hdfs://namenode/shared/spark-logs</code>,
+      or that of an alternative filesystem supported by the Hadoop APIs.
</td>
</tr>
<tr>
<td>spark.history.fs.update.interval</td>
<td>10s</td>
<td>
-      The period at which information displayed by this history server is updated.
-      Each update checks for any changes made to the event logs in persisted storage.
+      The period at which the filesystem history provider checks for new or
+      updated logs in the log directory. A shorter interval detects new
+      applications faster, at the expense of more server load re-reading
+      updated applications. As soon as an update has completed, listings of
+      the completed and incomplete applications will reflect the changes.
+      For performance reasons, the web UIs of applications are only updated
+      at a slower interval, defined in <code>spark.history.cache.window</code>.
--- End diff --
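For context, a minimal sketch of the properties documented above, assuming they are read from the history server's `SparkConf` (normally they would live in `spark-defaults.conf`); the values are illustrative:

```scala
import org.apache.spark.SparkConf

// Illustrative values only; in practice these go in the history server's
// spark-defaults.conf rather than being set programmatically.
val conf = new SparkConf()
  .set("spark.history.provider", "org.apache.spark.deploy.history.FsHistoryProvider")
  .set("spark.history.fs.logDirectory", "hdfs://namenode/shared/spark-logs")
  .set("spark.history.fs.update.interval", "10s")
  .set("spark.history.retainedApplications", "50")
```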
There are three costs in the system: listing cost, probe cost and replay cost.
* listing cost: pretty expensive in the history server, as it replays the entire history just to get a few flags which could be cached alongside (completed flag, etc.). That's why it can be slow to start up. After startup the async replay is only done on changed data. Load on HDFS: negligible.
* probe cost: simply checking the internal state of things updated in the update thread; ~0.
* replay cost: expensive, O(events), so essentially O(filesize). Again, HDFS doesn't notice.
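To make the listing point concrete, here's a sketch (hypothetical names, not the actual `FsHistoryProvider` code) of caching those flags alongside, so that a listing becomes a lookup rather than a replay:

```scala
// Per-application summary holding just the flags a listing needs.
case class AppSummary(appId: String, completed: Boolean, lastUpdated: Long)

class ListingCache {
  private val summaries = scala.collection.mutable.Map.empty[String, AppSummary]

  // Listing cost: O(number of applications), no event replay needed.
  def listing: Seq[AppSummary] = summaries.values.toSeq

  // Only the async update thread pays replay costs, and only for logs
  // whose data has actually changed.
  def record(summary: AppSummary): Unit = summaries(summary.appId) = summary
}
```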
The rationale for having a probe interval is not so much the probe cost, but the replay costs: having a 15s probe interval would mean "a user clicking through the UI of a busy app could trigger a reload every 15s". I don't have the stats to decide how good or bad that is, but a longer interval worries me less.
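As a sketch of that rationale (all names hypothetical), the interval effectively rate-limits replays per cached UI:

```scala
// A UI hit can trigger at most one replay per interval: the probe itself is
// just a timestamp check, and only a stale, changed app pays the replay cost.
class ProbeGate(intervalMs: Long) {
  @volatile private var lastProbe = 0L

  def onUiRequest(isUpdated: () => Boolean)(reload: () => Unit): Unit = {
    val now = System.currentTimeMillis()
    if (now - lastProbe >= intervalMs) {
      lastProbe = now           // probe cost: ~0, just a state check
      if (isUpdated()) reload() // replay cost: O(events) in the log
    }
  }
}
```

With a 15s interval that bounds a busy app at four reloads a minute; a longer interval lowers the bound further.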
FWIW, the YARN timeline provider costs are:
* listing cost: less expensive than for the FS history provider, but it does move some of the load into the timeline server (search of database, serialization of result).
* probe cost: ~0 again.
* replay cost: the same replay costs as for the FS provider, but now with JSON serialization and transmission over HTTP to add.
I suspect there you'd want a longer interval for probes, just to keep those replays down.
Again: more data is needed here. I've added the metrics to the cache as a start on that: add metrics publishing to the history server and this code is ready to be hooked up, so as to show the numbers on cache reload operations.
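A sketch of what hooking it up could look like, using the Dropwizard metrics library that Spark's metrics system builds on (metric names here are illustrative, not the ones in this patch):

```scala
import com.codahale.metrics.{MetricRegistry, Timer}

// Publishes how often cache reloads happen and how long they take.
class CacheReloadMetrics(registry: MetricRegistry) {
  private val reloadCount = registry.counter("history.appcache.reload.count")
  private val reloadTimer: Timer = registry.timer("history.appcache.reload.time")

  // Wrap a cache reload so its frequency and duration get recorded.
  def timedReload[T](reload: => T): T = {
    reloadCount.inc()
    val ctx = reloadTimer.time()
    try reload finally ctx.stop()
  }
}
```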