Hi, Connie,

Let me clarify a bit further:
1. ThreadJobFactory and ProcessJobFactory do not work w/ any cluster
management systems (i.e. YARN). Hence, when you say that you are
starting/stopping the grid and using ThreadJobFactory configuration, that
will not work. The only job factory that works w/ YARN is YarnJobFactory.
2. The rat error is a Apache license check. Make sure that the source files
in your project has the Apache license header at the beginning
3. Samza process the messages in a single-threaded mode. Two messages from
two different system stream partitions will be delivered to StreamTask in
sequence. Join usually means that you will need to buffer some amount of
messages in each stream in the KV-store. When a new message from stream A
comes, join task will need to lookup the relevant messages in the buffer of
other streams to yield the result.

Best,

-Yi

On Sun, Aug 30, 2015 at 10:11 PM, Connie Chen <spiritgr...@gmail.com> wrote:

> Hi Yi,
>
> Thanks for your response. A few more questions:
>
> 1) What are the semantics of the kafka topics in hello-world? (I am using
> the offline version that just produces updates in a loop). As in, if I do
> "bin/grid stop all" will the jobs re-read the same messages from Kafka or
> are they reading new ones? I get an error like:
>
> 2015-08-30 21:55:27 KafkaSystemConsumer [WARN] While refreshing brokers for
> > [__samza_coordinator_wikipedia-stats_1,0]:
> org.apache.samza.SamzaException:
> > Already consuming TopicPartition
> [__samza_coordinator_wikipedia-stats_1,0].
> > Retrying.
>
>
> when I restart the grid and wondering if the behavior after the restart
> should be the same every time.
>
> 2) I am using ThreadJobFactory as you suggested for "job.factory.class" in
> the stats task, but getting this error:
>
> [ERROR] Failed to execute goal org.apache.rat:apache-rat-plugin:0.9:check
> > (default) on project hello-samza: Too many files with unapproved license:
>
>
> when I try to mvn package.
>
> 3) If I have two input streams, is the # of partitions configured from
> Kafka/ where can I set what partition the message goes to? Also, how would
> I get the two messages from two streams in StreamTask? (seems like
> IncomingMessageEnvelope that process() provides only represents one message
> from one stream, where would it take in the two messages sent to the same
> partition?)
>
> Thank you!
>
> Connie
>
> On Sun, Aug 30, 2015 at 8:34 PM, Yi Pan <nickpa...@gmail.com> wrote:
>
> > Hi, Connie,
> >
> > If I understand your trial example, you were trying to manually launch
> > Samza container via run-container.sh script. Unfortunately, this is only
> > possible via ProcessJobFactory and ThreadJobFactory right now. Using
> YARN,
> > you will have to start the job via run-job.sh on the RM. Then, the
> > SamzaAppMaster will start the containers automatically.
> >
> > As for the join example that you are looking for, you should be able to
> > configure a job that takes two streams w/ the same number of partitions
> and
> > use the default job.systemstreampartition.grouper.factory.
> > Then, each of your task will take in the messages from two streams from
> the
> > same partition and your code implementing StreamTask should perform the
> > join logic.
> >
> > Hope that explains the procedure at high level. Feel free to ask if you
> > have further questions.
> >
> > Thanks!
> >
> > -Yi
> >
> > On Sun, Aug 30, 2015 at 5:57 PM, Connie Chen <spiritgr...@gmail.com>
> > wrote:
> >
> > > I am relatively new to Samza as well as YARN/Zookeeper/Kafka, I went
> > > through the hello-samza
> > > <http://samza.apache.org/startup/hello-samza/latest/>example and was
> > > wondering if there was more documentation/tutorials about
> SamzaContainer.
> > >
> > > There are some files under samza.examples.wikipedia.system in
> > hello-samza,
> > > it looks like they are using the container API (from reading here
> > > <
> > >
> >
> http://samza.apache.org/learn/documentation/latest/container/samza-container.html
> > > >
> > >  and here
> > > <
> >
> http://samza.apache.org/learn/documentation/latest/container/streams.html
> > > >)
> > > and I can't figure out how to run them.
> > >
> > > I tried bin/run-container.sh but I get this number format exception:
> > >
> > > java.lang.NumberFormatException: null
> > > > at java.lang.Integer.parseInt(Integer.java:454)
> > > > at java.lang.Integer.parseInt(Integer.java:527)
> > > > at
> > > scala.collection.immutable.StringLike$class.toInt(StringLike.scala:229)
> > > > at scala.collection.immutable.StringOps.toInt(StringOps.scala:31)
> > > > at
> > > >
> > >
> >
> org.apache.samza.container.SamzaContainer$.safeMain(SamzaContainer.scala:82)
> > > > at
> > >
> org.apache.samza.container.SamzaContainer$.main(SamzaContainer.scala:69)
> > > > at
> org.apache.samza.container.SamzaContainer.main(SamzaContainer.scala)
> > >
> > >
> > > Anyone have tips or suggestions for how I can play around with this
> more?
> > > Mainly, I would like to try partitioning and see an example of joining
> > two
> > > streams with the embedded k-v store.
> > >
> > > Thanks in advance!
> > >
> > > Connie
> > >
> >
>

Reply via email to