Just saw that there is an *HDFSFileSplitter* in the library as well. It sets
*ignoreFilePatternRegularExp* to ".*._COPYING_" and *unsupportedChar* to ":".
IMO this class should be removed as well.

Chandni

On Fri, May 6, 2016 at 4:16 PM, Chandni Singh <[email protected]> wrote:

> Hi,
>
> Recently FSFileSplitter was added to the Malhar library.
> I have created https://issues.apache.org/jira/browse/APEXMALHAR-2081 to
> remove this operator and add its functionality to FileSplitterInput.
>
> The reason is that this extension adds just 3 trivial features, and having
> a separate operator for them makes it difficult for the user to know which
> operator to use. It adds more classes which essentially do the same thing.
>
> This operator adds 3 properties to FileSplitterInput.
>
> 1. ignoreFilePatternRegularExp: regular expression that specifies which
> files to ignore.
> This is useful to have in FileSplitterInput.
>
> 2. unsupportedChar: first of all, this is a String. A file having this
> String will be ignored.
> IMO this is redundant. #1 can be used to accomplish this.
> I think this should be removed.
>
> 3. sequentialFileReader: when this property is set, the block metadata of
> the same file all have the same hashcode. I think this may have been done
> so that all the block metadata of a particular file go to the same block
> reader.
>
> IMO this is a hacky way of accomplishing this. If an application needs
> this, then it should be done using a StreamCodec.
>
> I think this should be removed.
>
> Thanks,
> Chandni
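
To illustrate point 2 of the quoted mail (the ignore regex can already cover what unsupportedChar does), here is a minimal sketch. It assumes the pattern is matched against the whole file name; I have not checked whether the operator matches the name or the full path. The class and method names are hypothetical and not part of Malhar.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not a Malhar class: one ignore regex covering both cases.
class IgnorePatternSketch {
    // Assumption: the regex is applied to the whole file name.
    // First alternative: the ".*._COPYING_" value mentioned in this thread.
    // Second alternative: any name containing ":" (the unsupportedChar value).
    static final Pattern IGNORE = Pattern.compile(".*._COPYING_|.*:.*");

    // Returns true when the file should be skipped by the splitter.
    static boolean shouldIgnore(String fileName) {
        return IGNORE.matcher(fileName).matches();
    }

    public static void main(String[] args) {
        String[] names = {"part-0000._COPYING_", "log:2016-05-06.txt", "data.csv"};
        for (String name : names) {
            System.out.println(name + " ignored=" + shouldIgnore(name));
        }
    }
}
```

With a single property like this, a user who wants to skip files containing ":" only has to extend the regex, so a separate unsupportedChar setting is not needed.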

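
On point 3, the idea behind using a StreamCodec is that the routing decision lives in the stream's partitioning logic instead of in the tuple's hashcode. The sketch below is plain Java and does not use the Apex API: BlockMeta, partitionKey and readerFor are hypothetical stand-ins, and the modulo routing is only an assumption to make the example runnable. In a real application the key computation would go into the codec's partitioning method.

```java
// Plain-Java sketch; none of these names are Malhar or Apex classes.
class FilePartitionSketch {
    // Hypothetical stand-in for the block metadata emitted by the splitter.
    static final class BlockMeta {
        final String filePath;
        final long blockId;

        BlockMeta(String filePath, long blockId) {
            this.filePath = filePath;
            this.blockId = blockId;
        }
    }

    // The key depends only on the file path, so every block of a file shares it.
    static int partitionKey(BlockMeta block) {
        return block.filePath.hashCode();
    }

    // Assumption for illustration: readers are chosen by key modulo reader count.
    static int readerFor(BlockMeta block, int readerCount) {
        return Math.floorMod(partitionKey(block), readerCount);
    }

    public static void main(String[] args) {
        BlockMeta first = new BlockMeta("/data/a.log", 0L);
        BlockMeta second = new BlockMeta("/data/a.log", 1L);
        // Both blocks of /data/a.log are routed to the same reader index.
        System.out.println(readerFor(first, 4) == readerFor(second, 4));
    }
}
```

This keeps the block metadata's own hashCode untouched, and an application that does not need per-file ordering simply does not set the codec.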