[ https://issues.apache.org/jira/browse/APEXMALHAR-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15468563#comment-15468563 ]

Chandni Singh commented on APEXMALHAR-2223:
-------------------------------------------

A possible approach to address this:
- Have a property in ManagedState called {code}writeBufferThreshold{code}. When a bucket's size crosses this threshold, the bucket becomes eligible for writing.
- Eligible buckets are written to the WAL at the end of every application window, in the {code}endWindow(){code} callback.

This approach requires fewer changes, because data is still divided into windows when it is written to the WAL.

> Managed state should parallelize WAL writes
> -------------------------------------------
>
>                 Key: APEXMALHAR-2223
>                 URL: https://issues.apache.org/jira/browse/APEXMALHAR-2223
>             Project: Apache Apex Malhar
>          Issue Type: Improvement
>    Affects Versions: 3.4.0
>            Reporter: Thomas Weise
>            Assignee: Chandni Singh
>
> Currently, data is accumulated in memory and written to the WAL on checkpoint
> only. This causes a write spike on checkpoint and does not utilize the HDFS
> write pipeline. The other extreme is writing to the WAL as soon as data
> arrives and then flushing only in beforeCheckpoint. The downside of this is that
> when the same key is written many times, all duplicates will be in the WAL.
> We need to find a balanced approach that the user can potentially fine-tune.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
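
To illustrate the threshold-based approach proposed in the comment above, here is a minimal, self-contained sketch. It is not the Malhar ManagedState implementation. The class {code}WriteBufferSketch{code}, its in-memory list standing in for the WAL, and the size accounting (key length plus value length) are all assumptions made for illustration. Only the names {code}writeBufferThreshold{code} and {code}endWindow(){code} come from the comment.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: buffers key/value pairs per bucket and writes a bucket
// to a stand-in "WAL" at endWindow() once its size crosses the threshold.
public class WriteBufferSketch {
    private final long writeBufferThreshold;
    private final Map<Long, Map<String, String>> buckets = new HashMap<>();
    private final Map<Long, Long> bucketSizes = new HashMap<>();
    // Assumption: a plain list stands in for the real write-ahead log.
    private final List<String> wal = new ArrayList<>();

    public WriteBufferSketch(long writeBufferThreshold) {
        this.writeBufferThreshold = writeBufferThreshold;
    }

    // Buffer a value in memory. Re-writing a key replaces the old value, so
    // duplicates of the same key never reach the WAL.
    public void put(long bucketId, String key, String value) {
        Map<String, String> bucket = buckets.computeIfAbsent(bucketId, id -> new HashMap<>());
        String old = bucket.put(key, value);
        // Assumed size accounting: characters of key plus value.
        long delta = value.length() + (old == null ? key.length() : -old.length());
        bucketSizes.merge(bucketId, delta, Long::sum);
    }

    // At the end of each application window, write only the eligible buckets
    // (size above the threshold) and drop them from the in-memory buffer.
    public void endWindow(long windowId) {
        Iterator<Map.Entry<Long, Map<String, String>>> it = buckets.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<Long, Map<String, String>> entry = it.next();
            long bucketId = entry.getKey();
            if (bucketSizes.get(bucketId) > writeBufferThreshold) {
                for (Map.Entry<String, String> kv : entry.getValue().entrySet()) {
                    // Records stay tagged with the window in which they were written.
                    wal.add(windowId + ":" + bucketId + ":" + kv.getKey() + "=" + kv.getValue());
                }
                it.remove();
                bucketSizes.remove(bucketId);
            }
        }
    }

    public List<String> walRecords() {
        return wal;
    }

    public int bufferedBuckets() {
        return buckets.size();
    }
}
```

In this sketch, buckets below the threshold stay in memory across windows, so a real implementation would still need to flush them at checkpoint. A larger threshold collapses more duplicate keys before writing, while a smaller one spreads writes more evenly across windows.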