Re: Making hello-samza easier to get started with

Martin Kleppmann Wed, 05 Feb 2014 06:23:44 -0800

Hi Chris,

On 4 Feb 2014, at 19:05, Chris Riccomini <[email protected]> wrote:
>> [...]
>> * YARN is very heavyweight (100MB download). Could we avoid using YARN in
>> hello-samza, in favour of LocalJobFactory? Does Kafka have a local mode
>> for development that doesn't require Zookeeper? The fewer dependencies
>> the better.
> 
> On the one hand, I agree with you that it's annoying to have so many
> dependencies get pulled in. On the other hand, these systems are
> non-trivial to install, and getting them up and running, and showing the
> full power of Samza is a big deal. When I wrote hello-samza, I originally
> was just going to use LocalJobFactory, and not even use Kafka. This would
> have eliminated all dependencies. I opted against this because I felt like
> it gave a much poorer feel of what Samza was, and how it worked in the
> real world. For example, having the AM dashboard is really helpful, and
> allows us to illustrate what containers are, etc.


I agree that it's good to show the full power of Samza, and make it easy to get 
started with YARN etc. But that raises the question: who is hello-samza 
intended for?

- Is it for somebody who just saw a link to the Samza website in a tweet, but 
who hasn't read the documentation yet, and who just wants to quickly decide 
whether to invest more time into finding out about Samza? (The 
"2-minute-quickly-playing-around" use case)

- Or is it for somebody who has already decided to try Samza, and wants a 
reference project as a starting point for their own project? (The 
"1-hour-experimentation" use case)

Both are valid use cases. The fact that "Hello Samza" appears as the very first 
item in the website navigation suggests that it's intended for the first case, 
whereas the full-on YARN install is more appropriate to the second case.

In that light, I'd like to suggest the following:

- We move both the Vagrant setup and bin/grid into a separate repository (call 
it "samza-instant-grid" or something like that). Since the Vagrant setup 
depends on bin/grid, it makes sense for the two to be in the same repository. 
That repo doesn't contain a particular Samza job -- it's focused on the purpose 
of getting to a working YARN+Kafka+ZK setup as quickly as possible, either on 
the local OS or inside a VM.

- We change hello-samza to use LocalJobFactory by default, for instant 
gratification of people who are completely new to Samza. And at the end of the 
hello-samza instructions we say something like: "Congratulations, you've run 
your first Samza job! But it was running in local mode, which is only for 
development, and doesn't have the resource isolation or fault tolerance 
features of a real Samza deployment. Check out [samza-instant-grid](LINK) to 
set up a miniature Samza cluster on your machine in 10 minutes. You can then 
deploy samza-job-package/target/samza-job-package-0.7.0-dist.tar.gz to your 
local cluster, and see the same job running in a YARN container."

That would allow hello-samza to satisfy both the 
2-minute-quickly-playing-around use case and the 1-hour-experimentation use 
case. And it would have the side benefit of showing how to set up a project to 
use both local mode for development (which I think is genuinely useful) and 
also generate an artifact that is deployable to YARN.

Does that make sense?

>> * I somehow got my setup into a bad state (where YARN was running but its
>> web UI wouldn't load); I think it happened because I ran `vagrant up` at
>> the same time as `bin/grid bootstrap` outside of the VM, and the two
>> processes trampled on each other. Deleting the 'deploy' directory and
>> starting from a clean slate fixed it. Can we isolate Vagrant and local-OS
>> bootstrap from each other?
> 
> Yea, we really need to think this through. Originally, we only had local
> bin/grid (no Vagrant). Now, we have two different ways to run hello-samza,
> which is really confusing (especially since the README only talks about
> Vagrant, and the Samza website only talks about local mode). Jakob and I
> were talking about this as well. It seems like a good thing to move the
> Vagrant stuff somewhere else, and be clear about the two different ways of
> bootstrapping. Not quite sure about the best way to do this, but Jakob had
> some thoughts.

Jakob, I'd be interested to hear what you think.

>> * Can we make task logs go to stdout by default? Logs provide reassurance
>> that something is happening, and at the moment you have to dig around
>> somewhere in the deploy directory to find the log files.
> 
> Not quite sure what you mean here. You mean the ZK/YARN/Kafka logs?

The run-job.sh commands currently give no visual feedback as to what is 
happening -- you just start it, but then the job disappears into a 'black 
hole'. You can start the kafka-console-consumer to see the output of a job, or 
you can find it on the YARN web UI, but a more immediate form of feedback would 
be for the job's startup logs to appear on stdout.

I noticed a file deploy/samza/undefined-samza-container-name.log, which 
included some info from the Samza job starting up, such as the MOTD sent by the 
Wikipedia IRC gateway after connecting. That's the kind of output I was 
thinking of.

Showing logs on stdout probably makes most sense when a job is run through 
LocalJobFactory. If a job is deployed to YARN, it's understandable that the 
logs are not shown (because they are generated in a different process, 
potentially on a different machine).

>> * Can we shorten the commands? Having to unpack the .tar.gz file and then
>> copy/paste a scary long run-job.sh line makes the process feel arcane,
>> and obscures what is really happening. Perhaps just a shell script
>> wrapper for run-job.sh or a maven goal would do it.
> 
> Regarding the mkdir and .tar.gz unpacking, we should just do this as part
> of `mvn package`. If you want to make that change, I'm all for it.
> 
> As for hiding the run-job.sh, I'm not as convinced of getting rid of it. I
> kind of like exposing how Samza actually works to the developer, so they
> know. Hiding it behind some one-off script doesn't really help them
> understand Samza (of course the same argument could be made for hiding
> YARN/ZK/Kafka behind bin/grid). Perhaps we just need more documentation in
> the walkthrough about what this command does and what the parameters are?

If run-job.sh is part of samza-instant-grid, I think it's ok to keep it as-is, 
and document it.

For the 2-minute-quickly-playing-around use case, I fear that a long command 
mentioning factories is more confusing than enlightening. Am I right in 
thinking that when using LocalJobFactory, run-job.sh is not needed?

>> * Would it be possible to have maven download the dependencies, rather
>> than bin/grid calling curl on random URLs? Somehow it feels weird to have
>> a script download and run random code off the internet (although of
>> course that's what every package manager does, it's irrational). It would
>> also avoid re-downloading everything in case you decide to blow away the
>> deploy directory.
> 
> Not sure about this. All of this stuff is up in Apache's HTTP servers, but
> I'm not sure if the release packages for these projects are published into
> Maven central (I'm nearly 100% certain that Kafka isn't). If they're not,
> then having Maven download the packages is no different than having the
> shell script do it.
> 
> One alternative would be to have the bin/grid script cache the files
> locally somewhere, so that blowing away the deploy directory doesn't
> trigger a re-download of YARN/ZK/Kafka again.

Ok, having the shell script cache the files in another directory sounds good. 
I'm happy to make that change.
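
To make the caching idea concrete, here is a rough sketch of what such a helper
could look like. It is not taken from the current bin/grid: the function name
fetch_cached, the CACHE_DIR variable and its default location are all
assumptions for illustration. The only external tool it relies on is curl,
which bin/grid already calls.

```shell
#!/bin/sh
# Hypothetical helper for bin/grid: download a file once, then reuse the copy.
# CACHE_DIR and fetch_cached are illustrative names, not existing script parts.
CACHE_DIR="${CACHE_DIR:-$HOME/.samza/download}"

fetch_cached() {
  url="$1"        # where to download the package from
  dest_dir="$2"   # where the package is needed, e.g. the deploy directory
  file="$(basename "$url")"

  mkdir -p "$CACHE_DIR" "$dest_dir"
  if [ ! -f "$CACHE_DIR/$file" ]; then
    # Download under a temporary name first, so that an interrupted transfer
    # is never mistaken for a complete cached file on the next run.
    curl -fsSL -o "$CACHE_DIR/$file.tmp" "$url" || {
      rm -f "$CACHE_DIR/$file.tmp"
      return 1
    }
    mv "$CACHE_DIR/$file.tmp" "$CACHE_DIR/$file"
  fi
  # Copy from the cache; deleting dest_dir later leaves the cache untouched.
  cp "$CACHE_DIR/$file" "$dest_dir/$file"
}
```

Because the cache lives outside the deploy directory, blowing away 'deploy'
and bootstrapping again would only cost a local copy rather than a fresh
download.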

Cheers,
Martin
