GitHub user amberarrow commented on a diff in the pull request:

    https://github.com/apache/apex-malhar/pull/368#discussion_r74611442
  
    --- Diff: docs/operators/fsInputOperator.md ---
    @@ -0,0 +1,101 @@
    +File Input Operator
    +=============
    +
    +## Operator Objective
    +This operator scans a directory for files. Files are then read and split 
into tuples, which are emitted. The default implementation scans a single 
directory. The operator is fault tolerant. It tracks previously read files and 
current offset as part of checkpoint state. In case of failure the operator 
will skip files that were already processed and fast forward to the offset of 
the current file. Supports partitioning and changes to number of partitions. 
The directory scanner is responsible to only accept the files that belong to a 
partition.
    +
    +File Input Operator is **idempotent**, **fault-tolerant** and 
**partitionable**.
    +
    +## Operator Usecase
    +1. Read all files of a directory and then keep scanning it for newly added 
files.
    +
    +## Operator Information
    +1. Operator location: ***malhar-library***
    +2. Available since: ***1.0.2***
    +3. Operator state: ***Stable***
    +3. Java Packages:
    +    * Operator: 
***[com.datatorrent.lib.io.fs.AbstractFileInputOperator](https://www.datatorrent.com/docs/apidocs/com/datatorrent/lib/io/fs/AbstractFileInputOperator.html)***
    +
    +### AbstractFileInputOperator
    +This is the abstract implementation that serves as base class for scanning 
a directory for files and read the files one by one. This class doesn’t have 
any ports.
    --- End diff --
    
    Add an overview here of what the operator does and which parts need to be 
implemented by concrete subclasses, and show code fragments, for example from 
the LineByLine operator. Describe what the DirectoryScanner does. Some obvious 
questions that will come up in the reader's mind should be answered, for 
example:
    1. What happens if a file that has already been processed has new data 
appended to it?
    2. What happens if a processed file is deleted?
    3. What happens if a separate actor is in the process of writing a new file 
in the monitored directory? How will this operator know when the file is ready 
to be read?
    4. What if the number of files in the scanned directory grows very large 
over time? What impact, if any, does that have on this operator?
    5. Can this operator be used to monitor multiple directories? (Point 
people to the fileIO-multidir example.)
    6. This class already implements the Partitioner interface; can a custom 
partitioner be set on this operator? More generally, since we are saying this 
operator supports dynamic partitioning, it is useful to describe (with a code 
fragment if possible) how to trigger it.
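
    On the DirectoryScanner description and point 6: the doc says the scanner 
only accepts the files that belong to its partition. Below is a minimal, 
self-contained sketch of one way that can work, assuming files are assigned by 
hashing the path modulo the partition count. The class 
`PartitionedScannerSketch` and its methods are hypothetical names used for 
illustration only; this is not the Malhar API, and the real scanner may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a per-partition directory scanner.
// Assumption: a file belongs to the partition whose index equals
// hash(path) mod partitionCount. Not taken from the Malhar source.
class PartitionedScannerSketch {
    private final int partitionIndex;
    private final int partitionCount;

    PartitionedScannerSketch(int partitionIndex, int partitionCount) {
        if (partitionCount < 1 || partitionIndex < 0 || partitionIndex >= partitionCount) {
            throw new IllegalArgumentException("invalid partition settings");
        }
        this.partitionIndex = partitionIndex;
        this.partitionCount = partitionCount;
    }

    // Returns true only if this partition owns the given path.
    boolean acceptFile(String path) {
        // floorMod keeps the result in [0, partitionCount) even when
        // String.hashCode() is negative.
        return Math.floorMod(path.hashCode(), partitionCount) == partitionIndex;
    }

    // Keeps only the paths owned by this partition, preserving order.
    List<String> filter(List<String> paths) {
        List<String> accepted = new ArrayList<>();
        for (String p : paths) {
            if (acceptFile(p)) {
                accepted.add(p);
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList(
            "/data/in/a.txt", "/data/in/b.txt", "/data/in/c.txt", "/data/in/d.txt");
        int count = 2;
        // Each scanner instance sees the same listing but keeps a disjoint subset.
        for (int i = 0; i < count; i++) {
            PartitionedScannerSketch scanner = new PartitionedScannerSketch(i, count);
            System.out.println("partition " + i + " -> " + scanner.filter(files));
        }
    }
}
```

    Under this scheme every file is accepted by exactly one partition, but 
changing the partition count reassigns files to different partitions, so the 
doc should explain how already-processed files are handled when that happens.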



