[
https://issues.apache.org/jira/browse/SAMZA-967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15447060#comment-15447060
]
Navina Ramesh commented on SAMZA-967:
-------------------------------------
[~lhaiesp] Sorry for the delay in my review. I strongly urge you to post a
[DISCUSS] or [RFC] email to the dev mailing list to get more eyes on your
work and, potentially, more feedback.
Overall, the design document looks good. I have a couple of questions:
* Is the “End of Stream” feature a prerequisite for the HDFS consumer? If yes,
please link the corresponding JIRA and design document. A high-level
description of how that feature will be leveraged to solve this problem will
lay more groundwork for readers who are not familiar with it.
* One of the goals and one of the non-goals overlap slightly: "(Goal) The system
consumer should support a variety of folder structures and filename
conventions" and "(Non-Goal) Support ALL kinds of HDFS folder structures and
filename formats". Can you specifically call out which structures and
conventions you are supporting, or which ones you are not supporting?
That would add clarity to the document.
* Along with the 3rd point under Assumptions, you can call out "write-once,
read-many" as the underlying usage pattern.
* What do the whitelist and blacklist here consist of? Why do we need both?
Can you provide an example of what this config will look like?
* In the case of the repartitioner, multiple Samza tasks cannot write to the
same file. Hence, each task writes to a separate file within the partition
directory. What defines the ordering among these files when the downstream job
is consuming them? Is it based on timestamp?
* When does the HDFSSystemAdmin write the PartitionDescriptor to HDFS? Is it
done by the job coordinator or by each container?
* Is the PartitionDescriptor file expected to follow any convention? Or is it
simply going to contain a map?
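To make the whitelist/blacklist and file-ordering questions above concrete,
here is a minimal sketch of one possible interpretation. It is not taken from
the design document: the function name `select_files`, the use of regular
expressions for both lists, and ordering by modification time are all
assumptions made purely for illustration.

```python
import re
from typing import Iterable, List, Optional, Tuple


def select_files(
    files: Iterable[Tuple[str, int]],
    whitelist: str = ".*",
    blacklist: Optional[str] = None,
) -> List[str]:
    """Hypothetical helper: filter (path, mtime) pairs and order them.

    A path is kept when it fully matches the whitelist regex and does not
    fully match the blacklist regex. Kept paths are returned in ascending
    modification-time order, with ties broken by path name.
    """
    allow = re.compile(whitelist)
    deny = re.compile(blacklist) if blacklist else None
    kept = [
        (mtime, path)
        for path, mtime in files
        if allow.fullmatch(path) and not (deny and deny.fullmatch(path))
    ]
    return [path for _, path in sorted(kept)]


# Example listing of one partition directory (paths and times are made up).
listing = [
    ("/data/p0/part-2.avro", 200),
    ("/data/p0/part-1.avro", 100),
    ("/data/p0/_SUCCESS", 300),
    ("/data/p0/part-3.avro.tmp", 50),
]
selected = select_files(listing, whitelist=r".*/part-.*", blacklist=r".*\.tmp")
```

Under this reading the blacklist is only needed to carve exceptions out of a
broad whitelist (here, in-progress `.tmp` files), which is the kind of
rationale it would help to see spelled out in the document.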
Cheers!
PS: I am looking at your RB now :)
> Add HDFS system consumer to Samza
> ---------------------------------
>
> Key: SAMZA-967
> URL: https://issues.apache.org/jira/browse/SAMZA-967
> Project: Samza
> Issue Type: Sub-task
> Reporter: Hai
> Assignee: Hai
> Fix For: 0.12.0
>
> Attachments: HDFSSystemConsumer.pdf
>
>