[ 
https://issues.apache.org/jira/browse/APEXMALHAR-2063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15313606#comment-15313606
 ] 

Chandni Singh edited comment on APEXMALHAR-2063 at 6/3/16 6:24 AM:
-------------------------------------------------------------------

- A single WAL part file may consist of the state of several windows (one 
state per finished window). So the writer and the readers may access the same 
part file, but readers will always be behind the writer, and there will never 
be more than one writer. 

- It is necessary to persist the meta information at the end of every window 
because WindowDataManager needs to replay the data of every finished window, 
which is not necessarily checkpointed.

There are 2 ways of doing this:
1. Persist the meta data with the actual data in the WAL files. However, this 
makes reading the data of a particular window inefficient: we have to start 
from the beginning of the WAL and read until we find the window.

2. Keep a separate meta file. This is a single meta file which contains 
pointers per window and is updated every window. Reading is efficient here. 
I think the goal was to avoid creating a new small file every window (having 
multiple small files on HDFS causes issues, as highlighted in 
http://blog.cloudera.com/blog/2009/02/the-small-files-problem/). This approach 
will not create multiple small files.
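
The trade-off between the two options can be sketched roughly as below. This 
is only an illustration in Python, not the FileSystemWAL code: the record 
layout (window id, payload length, payload), the in-memory buffer standing in 
for a WAL part file, and the function names are all assumptions made for the 
example.

```python
import io
import struct

# Hypothetical record layout: 8-byte window id, 4-byte payload length, payload.
HEADER = struct.Struct(">qi")

def append_window(wal, window_id, payload):
    """Append one window's state; return its (start, end) byte offsets."""
    start = wal.seek(0, io.SEEK_END)
    wal.write(HEADER.pack(window_id, len(payload)))
    wal.write(payload)
    return start, wal.tell()

def read_by_scan(wal, window_id):
    """Option 1: meta data lives in the WAL, so scan from the beginning."""
    wal.seek(0)
    while True:
        header = wal.read(HEADER.size)
        if len(header) < HEADER.size:
            return None  # reached the end without finding the window
        wid, length = HEADER.unpack(header)
        payload = wal.read(length)
        if wid == window_id:
            return payload

def read_by_index(wal, index, window_id):
    """Option 2: a separate index maps window id -> (start, end) pointers."""
    start, end = index[window_id]
    wal.seek(start + HEADER.size)
    return wal.read(end - start - HEADER.size)

wal = io.BytesIO()   # stands in for a WAL part file
index = {}           # stands in for the contents of the single meta file
for wid, data in [(1, b"a"), (2, b"bb"), (3, b"ccc")]:
    index[wid] = append_window(wal, wid, data)
```

With the index, reading a window costs one seek, while the scan has to read 
every earlier record first.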




was (Author: csingh):

- A single WAL part file may consist of the state of several windows (one 
state per finished window). So the writer and the readers may access the same 
part file, but readers will always be behind the writer, and there will never 
be more than one writer. 

- It is necessary to persist the meta information at the end of every window 
because WindowDataManager needs to replay the data of every finished window, 
which is not necessarily checkpointed.

There are 2 ways of doing this:
1. Persist the meta data with the actual data in the WAL files. However, this 
makes reading the data of a particular window inefficient: we have to start 
from the beginning of the WAL and read until we find the window.

2. Keep a separate meta file. This is a single meta file which contains 
pointers per window and is updated every window. Reading is efficient here. 
I think the goal was to avoid creating a new small file every window (having 
multiple small files on HDFS causes issues, as highlighted in 
http://blog.cloudera.com/blog/2009/02/the-small-files-problem/). Here we aren't 
creating a new file, just re-writing a single meta file. 



> Integrate WAL to FS WindowDataManager
> -------------------------------------
>
>                 Key: APEXMALHAR-2063
>                 URL: https://issues.apache.org/jira/browse/APEXMALHAR-2063
>             Project: Apache Apex Malhar
>          Issue Type: Improvement
>            Reporter: Chandni Singh
>            Assignee: Chandni Singh
>
> FS Window Data Manager is used to save meta-data that helps in replaying 
> the tuples of every completed application window after a failure. For this 
> it saves meta-data in a file per window. Having multiple small files on HDFS 
> causes issues, as highlighted here:
> http://blog.cloudera.com/blog/2009/02/the-small-files-problem/
> Instead, FS Window Data Manager can utilize the WAL to write data and 
> maintain a mapping of how much data was flushed to the WAL each window. 
> In order to use FileSystemWAL for replaying the data of a finished window, 
> a few changes are made to FileSystemWAL, for the following reasons:
> 1. WindowDataManager needs to replay the data of every finished window. This 
> window may not be checkpointed. 
> FileSystemWAL truncates the WAL file to the checkpointed point after 
> recovery, so this poses a problem. 
> WindowDataManager should be able to control recovery of FileSystemWAL.
> 2. FileSystemWAL writes to temporary files. The mapping of temp files to 
> actual files is part of its state, which is checkpointed. Since 
> WindowDataManager replays the data of a window not yet checkpointed, it 
> needs to know the actual temporary file the data is being persisted to.
> At a high level, WindowDataManager will persist meta information on the file 
> system, which includes the following details for every window: 
> - start wal pointer
> - end wal pointer
> - wal file path
> This is a single file which is updated every end-window along with the actual 
> data in the WAL file.
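
As a rough illustration of the single meta file described above, the sketch 
below rewrites one file at every end-window with the three per-window fields. 
It is written in Python and is not the WindowDataManager implementation: the 
JSON encoding, the integer pointers, the function names, and the 
write-then-rename step are assumptions chosen for the example, and the rename 
behaves as shown on a local file system, which may differ from HDFS.

```python
import json
import os
import tempfile

def load_meta(meta_path):
    """Return the per-window entries, or an empty dict if no file exists yet."""
    if not os.path.exists(meta_path):
        return {}
    with open(meta_path) as f:
        return json.load(f)

def save_meta(meta_path, meta):
    """Rewrite the single meta file: write a temp file, then swap it in."""
    directory = os.path.dirname(os.path.abspath(meta_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as tmp:
        json.dump(meta, tmp)
    os.replace(tmp_path, meta_path)  # the old file is replaced in one step

def end_window(meta_path, window_id, start, end, wal_path):
    """Record start pointer, end pointer and WAL file path for one window."""
    meta = load_meta(meta_path)
    meta[str(window_id)] = {"start": start, "end": end, "wal_path": wal_path}
    save_meta(meta_path, meta)
```

Calling `end_window` once per window keeps exactly one meta file on disk, no 
matter how many windows have finished.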



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
