Re: Review Request 52570: SAMZA-1025: documentation for hdfs system consumer

Jagadish Venkatraman Wed, 25 Jan 2017 14:36:54 -0800

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/52570/#review163011
-----------------------------------------------------------





docs/learn/documentation/versioned/hdfs/consumer.md (line 22)
<https://reviews.apache.org/r/52570/#comment234385>

    Can re-word docs for clarity.
    1. Remove `Now` here: s/Now// - "You can configure your Samza jobs to read 
data from HDFS files". 
    2. No need to mention that it's implemented in `samza-hdfs`.
    3. Substitute: "Implemented in `samza-hdfs`, the `HdfsSystemConsumer` will 
read from HDFS files (`avro` records only for now. /->> "Only avro encoded 
records are supported for now."
    4. "very easy to extend to plain text, csv, json, etc. in the future" -> 
How is this supported? A hint on how to do this is useful. If not, we can nuke 
this.
    5. Reword: "Deliver messages to your tasks" can be left.



docs/learn/documentation/versioned/hdfs/consumer.md (line 26)
<https://reviews.apache.org/r/52570/#comment234404>

    Thanks for calling this out explicitly.



docs/learn/documentation/versioned/hdfs/consumer.md (line 30)
<https://reviews.apache.org/r/52570/#comment234406>

    Do you think it can be more concise:
    - "The way `HDFSSystemConsumer` does partitioning is basically to treat 
each HDFS file as a partition. Using Kafka as an analogy, a HDFS directory is 
close to the concept of a Kafka topic, and all the files within the directory 
are like topic partitions in the sense of Kafka. "
    
    Reword it to be: (something on the lines of.)
    "Partitioning works at the level of individual HDFS files. Each file is 
treated as a stream partition."
    
    - This line can be removed completely. The subsequent examples expands on 
the idea.



docs/learn/documentation/versioned/hdfs/consumer.md (line 34)
<https://reviews.apache.org/r/52570/#comment234407>

    Reword:
    
    You can configure upto 10 Samza containers to process these partitions.



docs/learn/documentation/versioned/hdfs/consumer.md (line 40)
<https://reviews.apache.org/r/52570/#comment234412>

    1. Not sure we have a noun called `payload` anywhere else in the Samza docs.
    
    2. Can be more precise in wording this.
    The received IncomingMessageEnvelope contains three significant fields.
    - The key which is empty.
    - The message which is set to the avro GenericRecord.
    - The stream partition which is set to the name of the HDFS file.
    
    3. Please link the IncomingMessageEnvelope API.



docs/learn/documentation/versioned/hdfs/consumer.md (line 42)
<https://reviews.apache.org/r/52570/#comment234414>

    1. Please link the javadocs of SingleFileHDFSReader
    2. Suggest re-word:
    2.1 one can implement/ you can implement (Since, the rest of the 
documentation is directed at the user)
    2.2 The second line on "few methods need to be implemented" can be ignored.



docs/learn/documentation/versioned/hdfs/consumer.md (line 43)
<https://reviews.apache.org/r/52570/#comment234416>

    It's not super-obvious how the implementation of the `SingleFileHDFSReader` 
ties to the underlying consumer. Am I missing something?



docs/learn/documentation/versioned/hdfs/consumer.md (line 46)
<https://reviews.apache.org/r/52570/#comment234417>

    1. Not sure we want to introduce a new noun called `data stream` in the 
docs.
    
    Suggest re-word to be precise:
    
    While a kafka topic has an unbounded stream of messages, HDFS files are 
bounded and have a notion of EOF.



docs/learn/documentation/versioned/hdfs/consumer.md (line 48)
<https://reviews.apache.org/r/52570/#comment234428>

    Reword docs to be concise:
    
    You can choose to implement `EndOfStreamListenerTask` to receive a callback 
when all partitions are at end of stream. When all partitions being processed 
by the task are at end of stream (ie. EOF has been reached for all files), the 
Samza job exits automatically.
    
    Remove `What happens when we finish reading all data from the HDFS files? 
The behavior is that the Samza job will shut itself down once the job is done 
with all files.`



docs/learn/documentation/versioned/hdfs/consumer.md (line 67)
<https://reviews.apache.org/r/52570/#comment234430>

    The relationship between whitelist and blacklist was not very obvious to me.
    
    Is the behavior that the whitelist is applied first, and the blacklist is 
applied to the matched files later? (to determine which files are to be 
ignored).



docs/learn/documentation/versioned/hdfs/consumer.md (line 75)
<https://reviews.apache.org/r/52570/#comment234433>

    1.fix typo: required
    2.For HDFS clusters that have kerberos enabled,...



docs/learn/documentation/versioned/hdfs/consumer.md (line 77)
<https://reviews.apache.org/r/52570/#comment234436>

    Suggestions on docs:
    
    1. Not needed to mention the jira.
    2. Not needed to mention the class name.
    
    The following additional configs are required when accessing HDFS clusters 
that have kerberos enabled.



docs/learn/documentation/versioned/hdfs/consumer.md (line 93)
<https://reviews.apache.org/r/52570/#comment234437>

    Might be worth presenting an abridged version of what advanced partitioning 
is supported? (or atleast, how it can be configured).



docs/learn/documentation/versioned/hdfs/consumer.md (line 97)
<https://reviews.apache.org/r/52570/#comment234439>

    Not clear to me how this differs from the whitelist (*.avro which specifies 
what files to process).



docs/learn/documentation/versioned/hdfs/consumer.md (line 99)
<https://reviews.apache.org/r/52570/#comment234441>

    1. Please Fix typo: retries
    2. Re-word to be precise: maximum number of retries (per-partition) before 
the container fails.



docs/learn/documentation/versioned/hdfs/consumer.md (line 105)
<https://reviews.apache.org/r/52570/#comment234443>

    We probably don't want to reference Jiras here in the docs.



docs/learn/documentation/versioned/jobs/configuration-table.html (line 1819)
<https://reviews.apache.org/r/52570/#comment234444>

    What's this configuration? Is this the number of messages? What are the 
implications of this? 
    
    I'm not in favor of exposing this tunable if this is not super-significant.



docs/learn/documentation/versioned/jobs/configuration-table.html (line 1824)
<https://reviews.apache.org/r/52570/#comment234446>

    The number of retry attempts when there is a failure to fetch messages from 
HDFS.
    
    Also, should document the behavior after failures after numRetries?



docs/learn/documentation/versioned/jobs/configuration-table.html (line 1829)
<https://reviews.apache.org/r/52570/#comment234448>

    Docs for whitelist and blacklist are the same? Are both referring to 
`unwanted` files? Do we want to re-word them?



docs/learn/documentation/versioned/jobs/configuration-table.html (line 1839)
<https://reviews.apache.org/r/52570/#comment234452>

    Not clear how this is useful? This needs more docs.



docs/learn/documentation/versioned/jobs/configuration-table.html (line 1842)
<https://reviews.apache.org/r/52570/#comment234453>

    What are the implications of this config? It's not obvious to me from the 
docs.



docs/learn/documentation/versioned/jobs/configuration-table.html (line 1847)
<https://reviews.apache.org/r/52570/#comment234454>

    If this is purely internal, why do we need this config?
    
    If there is no need for this to be set, what's the default value?



docs/learn/documentation/versioned/jobs/configuration-table.html (line 1849)
<https://reviews.apache.org/r/52570/#comment234456>

    Also, the config-key convention is slightly different from the rest of the 
Samza configs:
    
    For example, there's camelCase for configs used in the consumer but the 
producer has a different config style.
    
    (producer.hdfs.compression.type versus producer.hdfsCompressionType)


- Jagadish Venkatraman


On Jan. 24, 2017, 2:07 a.m., Hai Lu wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/52570/
> -----------------------------------------------------------
> 
> (Updated Jan. 24, 2017, 2:07 a.m.)
> 
> 
> Review request for samza.
> 
> 
> Bugs: SAMZA-1025
>     https://issues.apache.org/jira/browse/SAMZA-1025
> 
> 
> Repository: samza
> 
> 
> Description
> -------
> 
> documentation for hdfs system consumer
> 
> 
> Diffs
> -----
> 
>   docs/learn/documentation/versioned/hdfs/consumer.md PRE-CREATION 
>   docs/learn/documentation/versioned/hdfs/producer.md 
> b0e936f5b0a9c945ea7f02bfc2536ef50f017bf6 
>   docs/learn/documentation/versioned/index.html 
> d0b14ece94341e2cb937cf32db480e69f93303c2 
>   docs/learn/documentation/versioned/jobs/configuration-table.html 
> ba5ebbc54b5c64f82f35ed781dad7023a8f920e1 
> 
> Diff: https://reviews.apache.org/r/52570/diff/
> 
> 
> Testing
> -------
> 
> N/A
> 
> 
> Thanks,
> 
> Hai Lu
> 
>

Re: Review Request 52570: SAMZA-1025: documentation for hdfs system consumer

Reply via email to