----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/52570/#review163011 -----------------------------------------------------------
docs/learn/documentation/versioned/hdfs/consumer.md (line 22) <https://reviews.apache.org/r/52570/#comment234385> Can re-word docs for clarity. 1. Remove `Now` here: s/Now// - "You can configure your Samza jobs to read data from HDFS files". 2. No need to mention that it's implemented in `samza-hdfs`. 3. Substitute: "Implemented in `samza-hdfs`, the `HdfsSystemConsumer` will read from HDFS files (`avro` records only for now. /->> "Only avro encoded records are supported for now." 4. "very easy to extend to plain text, csv, json, etc. in the future" -> How is this supported? A hint on how to do this is useful. If not, we can nuke this. 5. Reword: "Deliver messages to your tasks" can be left. docs/learn/documentation/versioned/hdfs/consumer.md (line 26) <https://reviews.apache.org/r/52570/#comment234404> Thanks for calling this out explicitly. docs/learn/documentation/versioned/hdfs/consumer.md (line 30) <https://reviews.apache.org/r/52570/#comment234406> Do you think it can be more concise: - "The way `HDFSSystemConsumer` does partitioning is basically to treat each HDFS file as a partition. Using Kafka as an analogy, a HDFS directory is close to the concept of a Kafka topic, and all the files within the directory are like topic partitions in the sense of Kafka. " Reword it to be: (something on the lines of.) "Partitioning works at the level of individual HDFS files. Each file is treated as a stream partition." - This line can be removed completely. The subsequent examples expands on the idea. docs/learn/documentation/versioned/hdfs/consumer.md (line 34) <https://reviews.apache.org/r/52570/#comment234407> Reword: You can configure upto 10 Samza containers to process these partitions. docs/learn/documentation/versioned/hdfs/consumer.md (line 40) <https://reviews.apache.org/r/52570/#comment234412> 1. Not sure we have a noun called `payload` anywhere else in the Samza docs. 2. Can be more precise in wording this. The received IncomingMessageEnvelope contains three significant fields. - The key which is empty. - The message which is set to the avro GenericRecord. - The stream partition which is set to the name of the HDFS file. 3. Please link the IncomingMessageEnvelope API. docs/learn/documentation/versioned/hdfs/consumer.md (line 42) <https://reviews.apache.org/r/52570/#comment234414> 1. Please link the javadocs of SingleFileHDFSReader 2. Suggest re-word: 2.1 one can implement/ you can implement (Since, the rest of the documentation is directed at the user) 2.2 The second line on "few methods need to be implemented" can be ignored. docs/learn/documentation/versioned/hdfs/consumer.md (line 43) <https://reviews.apache.org/r/52570/#comment234416> It's not super-obvious how the implementation of the `SingleFileHDFSReader` ties to the underlying consumer. Am I missing something? docs/learn/documentation/versioned/hdfs/consumer.md (line 46) <https://reviews.apache.org/r/52570/#comment234417> 1. Not sure we want to introduce a new noun called `data stream` in the docs. Suggest re-word to be precise: While a kafka topic has an unbounded stream of messages, HDFS files are bounded and have a notion of EOF. docs/learn/documentation/versioned/hdfs/consumer.md (line 48) <https://reviews.apache.org/r/52570/#comment234428> Reword docs to be concise: You can choose to implement `EndOfStreamListenerTask` to receive a callback when all partitions are at end of stream. When all partitions being processed by the task are at end of stream (ie. EOF has been reached for all files), the Samza job exits automatically. Remove `What happens when we finish reading all data from the HDFS files? The behavior is that the Samza job will shut itself down once the job is done with all files.` docs/learn/documentation/versioned/hdfs/consumer.md (line 67) <https://reviews.apache.org/r/52570/#comment234430> The relationship between whitelist and blacklist was not very obvious to me. Is the behavior that the whitelist is applied first, and the blacklist is applied to the matched files later? (to determine which files are to be ignored). docs/learn/documentation/versioned/hdfs/consumer.md (line 75) <https://reviews.apache.org/r/52570/#comment234433> 1.fix typo: required 2.For HDFS clusters that have kerberos enabled,... docs/learn/documentation/versioned/hdfs/consumer.md (line 77) <https://reviews.apache.org/r/52570/#comment234436> Suggestions on docs: 1. Not needed to mention the jira. 2. Not needed to mention the class name. The following additional configs are required when accessing HDFS clusters that have kerberos enabled. docs/learn/documentation/versioned/hdfs/consumer.md (line 93) <https://reviews.apache.org/r/52570/#comment234437> Might be worth presenting an abridged version of what advanced partitioning is supported? (or atleast, how it can be configured). docs/learn/documentation/versioned/hdfs/consumer.md (line 97) <https://reviews.apache.org/r/52570/#comment234439> Not clear to me how this differs from the whitelist (*.avro which specifies what files to process). docs/learn/documentation/versioned/hdfs/consumer.md (line 99) <https://reviews.apache.org/r/52570/#comment234441> 1. Please Fix typo: retries 2. Re-word to be precise: maximum number of retries (per-partition) before the container fails. docs/learn/documentation/versioned/hdfs/consumer.md (line 105) <https://reviews.apache.org/r/52570/#comment234443> We probably don't want to reference Jiras here in the docs. docs/learn/documentation/versioned/jobs/configuration-table.html (line 1819) <https://reviews.apache.org/r/52570/#comment234444> What's this configuration? Is this the number of messages? What are the implications of this? I'm not in favor of exposing this tunable if this is not super-significant. docs/learn/documentation/versioned/jobs/configuration-table.html (line 1824) <https://reviews.apache.org/r/52570/#comment234446> The number of retry attempts when there is a failure to fetch messages from HDFS. Also, should document the behavior after failures after numRetries? docs/learn/documentation/versioned/jobs/configuration-table.html (line 1829) <https://reviews.apache.org/r/52570/#comment234448> Docs for whitelist and blacklist are the same? Are both referring to `unwanted` files? Do we want to re-word them? docs/learn/documentation/versioned/jobs/configuration-table.html (line 1839) <https://reviews.apache.org/r/52570/#comment234452> Not clear how this is useful? This needs more docs. docs/learn/documentation/versioned/jobs/configuration-table.html (line 1842) <https://reviews.apache.org/r/52570/#comment234453> What are the implications of this config? It's not obvious to me from the docs. docs/learn/documentation/versioned/jobs/configuration-table.html (line 1847) <https://reviews.apache.org/r/52570/#comment234454> If this is purely internal, why do we need this config? If there is no need for this to be set, what's the default value? docs/learn/documentation/versioned/jobs/configuration-table.html (line 1849) <https://reviews.apache.org/r/52570/#comment234456> Also, the config-key convention is slightly different from the rest of the Samza configs: For example, there's camelCase for configs used in the consumer but the producer has a different config style. (producer.hdfs.compression.type versus producer.hdfsCompressionType) - Jagadish Venkatraman On Jan. 24, 2017, 2:07 a.m., Hai Lu wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/52570/ > ----------------------------------------------------------- > > (Updated Jan. 24, 2017, 2:07 a.m.) > > > Review request for samza. > > > Bugs: SAMZA-1025 > https://issues.apache.org/jira/browse/SAMZA-1025 > > > Repository: samza > > > Description > ------- > > documentation for hdfs system consumer > > > Diffs > ----- > > docs/learn/documentation/versioned/hdfs/consumer.md PRE-CREATION > docs/learn/documentation/versioned/hdfs/producer.md > b0e936f5b0a9c945ea7f02bfc2536ef50f017bf6 > docs/learn/documentation/versioned/index.html > d0b14ece94341e2cb937cf32db480e69f93303c2 > docs/learn/documentation/versioned/jobs/configuration-table.html > ba5ebbc54b5c64f82f35ed781dad7023a8f920e1 > > Diff: https://reviews.apache.org/r/52570/diff/ > > > Testing > ------- > > N/A > > > Thanks, > > Hai Lu > >