Agreed, there will be applications that write to many files, and not all of those files can remain open forever.
Can you provide an example of how to modify the append behavior depending on the Hadoop FileSystem implementation? https://malhar.atlassian.net/browse/MLHR-1888

On Tue, Nov 3, 2015 at 9:35 AM, Chandni Singh <[email protected]> wrote:
> Hi,
>
> Please look at the latest changes to this operator.
> These changes enable overriding stream opening and closing. Implementations
> can control how they want to achieve append(), if at all.
>
> From its conception, this operator has been based on a cache of open
> streams with a maximum size; whenever that limit is reached, the cache
> evicts entries (closes streams). Another setting is the expiry time, which
> evicts and closes a stream when it hasn't been accessed in the cache for a
> while.
>
> If the user wants to never actually close a stream, they can initialize
> both of these settings to their maximum values. But in a realistic
> scenario the user needs to know when a file will eventually be closed
> (never written to again); using that information they can configure these
> settings, or again initialize them to their maximums and close the streams
> explicitly.
>
> Suppose we didn't have this cache and were writing to multiple files.
> Then multiple streams would hang around in memory all the time, even when
> they weren't being accessed. In my opinion that is a problematic design
> which will regularly cause bigger issues, like running out of memory.
>
> Chandni
>
>
> On Tue, Nov 3, 2015 at 7:58 AM, Thomas Weise <[email protected]>
> wrote:
>
> > Append is used to continue writing to files that were closed and left in
> > a consistent state before. When append is not available, would we need
> > to disable the optimization of closing and reopening files?
> >
> >
> > On Tue, Nov 3, 2015 at 6:14 AM, Munagala Ramanath <[email protected]>
> > wrote:
> >
> > > Shouldn't "append" be a user-configurable property which, if false,
> > > causes the file to be overwritten?
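The cache of open streams described above can be sketched in plain Java. This is a hypothetical illustration, not the operator's actual code: a size-bounded, access-ordered map that closes the least recently used stream when the limit is exceeded (the operator's time-based expiry setting is omitted here for brevity).

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of a cache of open streams with a maximum size.
// When the size limit is exceeded, the least recently accessed stream
// is closed and evicted, so idle streams don't accumulate in memory.
class StreamCache<K, V extends Closeable> extends LinkedHashMap<K, V> {
  private final int maxOpenStreams;

  StreamCache(int maxOpenStreams) {
    super(16, 0.75f, true); // access-order: get() refreshes an entry's recency
    this.maxOpenStreams = maxOpenStreams;
  }

  @Override
  protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
    if (size() > maxOpenStreams) {
      try {
        eldest.getValue().close(); // close the stream before evicting it
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
      return true; // evict the eldest entry
    }
    return false;
  }
}
```

Setting `maxOpenStreams` to `Integer.MAX_VALUE` would correspond to the "never close" configuration mentioned above, with the memory-growth caveat that goes with it.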
> > >
> > > Ram
> > >
> > > On Mon, Nov 2, 2015 at 10:51 PM, Priyanka Gugale
> > > <[email protected]> wrote:
> > > > Hi,
> > > >
> > > > AbstractFileOutputOperator is used to write output files. The
> > > > operator has a method "getFSInstance", which initializes the file
> > > > system. One can override this method to initialize a desired file
> > > > system that extends Hadoop's FileSystem. In our implementation we
> > > > have overridden "getFSInstance" to initialize FTPFileSystem.
> > > >
> > > > The file-loading code in the setup method of
> > > > AbstractFileOutputOperator opens a file in append mode when the file
> > > > is already present. The issue is that FTPFileSystem doesn't support
> > > > the append function.
> > > >
> > > > Possible solutions to the problem:
> > > > 1. Override the append method in FTPFileSystem.
> > > >    - This would be tricky, as the file system doesn't support the
> > > >    operation. And there are other file systems, like S3, which also
> > > >    don't support append.
> > > > 2. Avoid using functions like "append" that are not supported by
> > > > some implementations of Hadoop FileSystem.
> > > > 3. Move the file-loading logic (currently in the setup method) into
> > > > methods that a subclass can override, so the subclass can avoid
> > > > calls like append that are not supported by the user's chosen file
> > > > system.
> > > >
> > > > -Priyanka
