Re: Updating samza-sql branch to Java 1.7

2015-04-15 Thread Milinda Pathirage
Thanks everyone.

Milinda

On Tue, Apr 14, 2015 at 6:06 PM, Yi Pan nickpa...@gmail.com wrote:

 Merged master to samza-sql.

 On Tue, Apr 14, 2015 at 2:57 PM, Jakob Homan jgho...@gmail.com wrote:

  Yes, I removed the tests for JDK6 yesterday.  We're 1.7 or above now
  for development.
 
  On 14 April 2015 at 12:47, Milinda Pathirage mpath...@umail.iu.edu
  wrote:
   Hi Devs,
  
   Calcite dropped support for Java 1.6 in 1.1.0-incubating. I want to use
   Calcite 1.2.0-incubating-SNAPSHOT in the samza-sql branch. Is it okay to
  update
   the samza-sql branch to Java 1.7?
  
   Thanks
   Milinda
  
   --
   Milinda Pathirage
  
   PhD Student | Research Assistant
   School of Informatics and Computing | Data to Insight Center
   Indiana University
  
   twitter: milindalakmal
   skype: milinda.pathirage
   blog: http://milinda.pathirage.org
 




-- 
Milinda Pathirage

PhD Student | Research Assistant
School of Informatics and Computing | Data to Insight Center
Indiana University

twitter: milindalakmal
skype: milinda.pathirage
blog: http://milinda.pathirage.org


Re: How to configure the Resource Manager endpoint for YARN?

2015-04-15 Thread Roger Hoover
I'll try that.  Thanks, Chris.

On Wed, Apr 15, 2015 at 9:37 AM, Chris Riccomini criccom...@apache.org
wrote:

 Hey Roger,

 Not sure if this makes a difference, but have you tried using:

   export YARN_CONF_DIR=...

 Instead? This is what we use.

 Cheers,
 Chris

 On Wed, Apr 15, 2015 at 9:33 AM, Roger Hoover roger.hoo...@gmail.com
 wrote:

  Hi,
 
  I'm trying to deploy a job to a small YARN cluster.  How do I tell the
  launcher script where to find the Resource Manager?  I tried creating a
  yarn-site.xml and setting the HADOOP_CONF_DIR environment variable, but it
  doesn't find my config.
 
  2015-04-14 22:02:45 ClientHelper [INFO] trying to connect to RM
  0.0.0.0:8032
  2015-04-14 22:02:45 RMProxy [INFO] Connecting to ResourceManager at /
  0.0.0.0:8032
 
  Thanks,
 
  Roger
 



How to configure the Resource Manager endpoint for YARN?

2015-04-15 Thread Roger Hoover
Hi,

I'm trying to deploy a job to a small YARN cluster.  How do I tell the
launcher script where to find the Resource Manager?  I tried creating a
yarn-site.xml and setting the HADOOP_CONF_DIR environment variable, but it
doesn't find my config.

2015-04-14 22:02:45 ClientHelper [INFO] trying to connect to RM 0.0.0.0:8032
2015-04-14 22:02:45 RMProxy [INFO] Connecting to ResourceManager at /
0.0.0.0:8032

Thanks,

Roger
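
For reference, the ResourceManager endpoint is normally picked up from a
yarn-site.xml in the directory that HADOOP_CONF_DIR points to; the
0.0.0.0:8032 in the log above is YARN's built-in default when no such config
is found. A minimal sketch, with rm-host standing in for the actual
ResourceManager hostname:

    <!-- yarn-site.xml, placed in the directory HADOOP_CONF_DIR points to;
         rm-host is a placeholder for your ResourceManager host. -->
    <configuration>
      <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>rm-host</value>
      </property>
      <property>
        <name>yarn.resourcemanager.address</name>
        <value>rm-host:8032</value>
      </property>
    </configuration>

The launcher then needs that directory exported in its environment, e.g.
export HADOOP_CONF_DIR=/path/to/conf (pointing at the directory, not the
file itself).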


Re: Extra Systems and other extensions.

2015-04-15 Thread Chinmay Soman
+1! I was going to do this for my use case as well. Would love to have
this!

On Wed, Apr 15, 2015 at 9:24 AM, Roger Hoover roger.hoo...@gmail.com
wrote:

 Dan,

 This is great.  Would love to have a common ElasticSearch system producer.

 Cheers,

 Roger

 On Tue, Apr 14, 2015 at 1:34 PM, Dan danharve...@gmail.com wrote:

  Thanks Jakob, I agree they'll be better maintained and tested if they're in
  the main repo, so that's great.
 
  I'll sort out JIRAs and get some patches of what we've got working now
 out
  for review.
 
   - Dan
 
 
  On 14 April 2015 at 19:56, Jakob Homan jgho...@gmail.com wrote:
 
   Hey Dan-
   I'd love for the Elasticsearch stuff to be added to the main code, as
   a separate module.  Keeping these in the main source code makes them
   more likely to stay maintained and correct.
  
   The EnvironmentConfigRewriter can likely go in the same place as the
   ConfigRewriter interface, since it doesn't depend on Kafka as the
   current RegexTopicRewriter does.
  
   If you could open JIRAs for these, that would be great.  Happy to
   shepherd the code in.
   -Jakob
  
   On 14 April 2015 at 11:46, Dan danharve...@gmail.com wrote:
Hey,
   
At state.com we've started to write some generic extensions to Samza
   that
 we think would be more generally useful. We've got
 an ElasticsearchSystemProducer/Factory to output to an Elasticsearch
  index
 and an EnvironmentConfigRewriter to modify config from environment
  variables.
   
What's the best way for us to add things like this? Do you want more
modules in the main project or should we just create some separate
   projects
on github?
   
 It would be good to get core extensions like these shared so they can be
  tested
   and
 used by more people.
   
Thanks,
Dan
  
 




-- 
Thanks and regards

Chinmay Soman
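
As a rough illustration of the EnvironmentConfigRewriter idea discussed above
(not Dan's actual implementation), here is a minimal sketch against Samza's
ConfigRewriter interface. The SAMZA_ prefix convention used to map environment
variables onto config keys is an assumption made up for this example:

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.samza.config.Config;
    import org.apache.samza.config.ConfigRewriter;
    import org.apache.samza.config.MapConfig;

    public class EnvironmentConfigRewriter implements ConfigRewriter {
      @Override
      public Config rewrite(String name, Config config) {
        // Start from the existing job config (Config behaves as a Map<String, String>).
        Map<String, String> merged = new HashMap<String, String>(config);

        // Hypothetical convention: SAMZA_TASK_INPUTS overrides task.inputs, etc.
        for (Map.Entry<String, String> env : System.getenv().entrySet()) {
          if (env.getKey().startsWith("SAMZA_")) {
            String key = env.getKey().substring("SAMZA_".length())
                .toLowerCase().replace('_', '.');
            merged.put(key, env.getValue());
          }
        }
        return new MapConfig(merged);
      }
    }

A rewriter like this would be wired in the same way as the existing
RegexTopicRewriter, via the job.config.rewriters and
job.config.rewriter.<name>.class properties.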


Maximum number of jobs

2015-04-15 Thread jeremy p
What's the maximum number of Samza jobs I can run simultaneously on a
single cluster?  Let's say these jobs are very lightweight -- they require
little memory or processing power.  However, I need a lot of them -- let's
say I need to have 1,000,000 running at any given time.  Is this reasonable
or even possible?


Re: How to configure the Resource Manager endpoint for YARN?

2015-04-15 Thread Chris Riccomini
Hey Roger,

Hmm, that's good to know, lol. Wonder how ours is working. :) I'll poke
around.

Cheers,
Chris

On Wed, Apr 15, 2015 at 11:17 AM, Roger Hoover roger.hoo...@gmail.com
wrote:

 Turns out that HADOOP_CONF_DIR is the right env var (YARN_CONF_DIR did not
 work).  I had just messed up the directory path.  Doh!

 Sent from my iPhone

  On Apr 15, 2015, at 9:41 AM, Roger Hoover roger.hoo...@gmail.com
 wrote:
 
  I'll try that.  Thanks, Chris.
 
  On Wed, Apr 15, 2015 at 9:37 AM, Chris Riccomini criccom...@apache.org
 wrote:
  Hey Roger,
 
  Not sure if this makes a difference, but have you tried using:
 
export YARN_CONF_DIR=...
 
  Instead? This is what we use.
 
  Cheers,
  Chris
 
  On Wed, Apr 15, 2015 at 9:33 AM, Roger Hoover roger.hoo...@gmail.com
  wrote:
 
   Hi,
  
   I'm trying to deploy a job to a small YARN cluster.  How do I tell the
   launcher script where to find the Resource Manager?  I tried creating
 a
   yarn-site.xml and setting the HADOOP_CONF_DIR environment variable, but it
   doesn't find my config.
  
   2015-04-14 22:02:45 ClientHelper [INFO] trying to connect to RM
   0.0.0.0:8032
   2015-04-14 22:02:45 RMProxy [INFO] Connecting to ResourceManager at /
   0.0.0.0:8032
  
   Thanks,
  
   Roger
  
 



Re: Review Request 33219: [SAMZA-649] Create samza-sql-calcite module for Calcite SQL front end

2015-04-15 Thread Yi Pan (Data Infrastructure)


 On April 15, 2015, 6:20 p.m., Yi Pan (Data Infrastructure) wrote:
  samza-sql-calcite/src/main/java/org/apache/samza/sql/calcite/schema/AvroSchemaConverter.java,
   line 37
  https://reviews.apache.org/r/33219/diff/1/?file=930371#file930371line37
 
  I assume that this class is used to convert the data schemas/types in the
  samza-sql-core model to Calcite's RelDataType? In that case, can we use the
  generic Schema class in samza-sql-core instead of an implementation specific
  to Avro?

Please discard this comment. I just realized that this converter is probably
used in query validation, which will need to convert schemas from an available
Avro schema repo to the Calcite data model.


- Yi


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33219/#review80228
---


On April 15, 2015, 2:49 p.m., Milinda Pathirage wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/33219/
 ---
 
 (Updated April 15, 2015, 2:49 p.m.)
 
 
 Review request for samza, Chris Riccomini and Yi Pan (Data Infrastructure).
 
 
 Bugs: SAMZA-649
 https://issues.apache.org/jira/browse/SAMZA-649
 
 
 Repository: samza
 
 
 Description
 ---
 
 Moved the Calcite-based front end to the samza-sql-calcite module.
 
 
 Diffs
 -
 
   build.gradle a1c7133 
   
 samza-sql-calcite/src/main/java/org/apache/samza/sql/calcite/planner/QueryPlanner.java
  PRE-CREATION 
   
 samza-sql-calcite/src/main/java/org/apache/samza/sql/calcite/planner/SamzaCalciteConnection.java
  PRE-CREATION 
   
 samza-sql-calcite/src/main/java/org/apache/samza/sql/calcite/planner/SamzaQueryPreparingStatement.java
  PRE-CREATION 
   
 samza-sql-calcite/src/main/java/org/apache/samza/sql/calcite/planner/SamzaSqlValidator.java
  PRE-CREATION 
   
 samza-sql-calcite/src/main/java/org/apache/samza/sql/calcite/schema/AvroSchemaConverter.java
  PRE-CREATION 
   
 samza-sql-calcite/src/test/java/org/apache/samza/sql/calcite/planner/SamzaStreamTableFactory.java
  PRE-CREATION 
   
 samza-sql-calcite/src/test/java/org/apache/samza/sql/calcite/planner/TestQueryPlanner.java
  PRE-CREATION 
   
 samza-sql-calcite/src/test/java/org/apache/samza/sql/calcite/schema/TestAvroSchemaConverter.java
  PRE-CREATION 
   
 samza-sql-core/src/main/java/org/apache/samza/sql/metadata/AvroSchemaConverter.java
  3dad046 
   samza-sql-core/src/main/java/org/apache/samza/sql/planner/QueryPlanner.java 
 1dfb262 
   
 samza-sql-core/src/main/java/org/apache/samza/sql/planner/SamzaCalciteConnection.java
  63b1da5 
   
 samza-sql-core/src/main/java/org/apache/samza/sql/planner/SamzaQueryPreparingStatement.java
  0721573 
   
 samza-sql-core/src/main/java/org/apache/samza/sql/planner/SamzaSqlValidator.java
  f46c1f0 
   
 samza-sql-core/src/test/java/org/apache/samza/sql/planner/QueryPlannerTest.java
  022116e 
   
 samza-sql-core/src/test/java/org/apache/samza/sql/planner/SamzaStreamTableFactory.java
  f757d8f 
   
 samza-sql-core/src/test/java/org/apache/samza/sql/test/metadata/TestAvroSchemaConverter.java
  b4ac5f5 
   settings.gradle 5cbb755 
 
 Diff: https://reviews.apache.org/r/33219/diff/
 
 
 Testing
 ---
 
 ./bin/check-all.sh passed.
 
 
 Thanks,
 
 Milinda Pathirage
 




Re: Review Request 33219: [SAMZA-649] Create samza-sql-calcite module for Calcite SQL front end

2015-04-15 Thread Milinda Pathirage


 On April 15, 2015, 6:20 p.m., Yi Pan (Data Infrastructure) wrote:
  samza-sql-calcite/src/main/java/org/apache/samza/sql/calcite/planner/QueryPlanner.java,
   line 61
  https://reviews.apache.org/r/33219/diff/1/?file=930367#file930367line61
 
  One quick question: do we need to implement all those rules? Or are we 
  mainly re-using the rules implemented in Calcite?

We don't have to implement all those rules. We're just reusing the existing
rules from Calcite.


 On April 15, 2015, 6:20 p.m., Yi Pan (Data Infrastructure) wrote:
  samza-sql-calcite/src/main/java/org/apache/samza/sql/calcite/planner/SamzaSqlValidator.java,
   line 29
  https://reviews.apache.org/r/33219/diff/1/?file=930370#file930370line29
 
  Add Java doc here.

Will add Javadocs.


- Milinda


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33219/#review80228
---





Re: Review Request 33219: [SAMZA-649] Create samza-sql-calcite module for Calcite SQL front end

2015-04-15 Thread Yi Pan (Data Infrastructure)

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33219/#review80237
---

Ship it!


+1

- Yi Pan (Data Infrastructure)





Re: How to deal with bootstrapping

2015-04-15 Thread Yan Fang
Hi Jeremy,

If my understanding is correct, whenever you add a new rule, you want to
apply this rule to the historical data. Right?

If you do not care about duplication, you can create a new task that
contains existing rules and new rules. Configure bootstrap. This will apply
all the rules from the beginning of the input stream. The shortcoming is
you will get duplicated results for old rules.

If you can not tolerate the shortcoming, 1) get the offset of the
latest-processed message of old rules. 2) In your new task, ignore messages
before that offset for the old rules. 3) bootstrap.

Hope this helps. Maybe your use case is more complicated?

Thanks,

Fang, Yan
yanfang...@gmail.com

On Wed, Apr 15, 2015 at 11:19 AM, jeremy p athomewithagroove...@gmail.com
wrote:

 So, I want to use Samza for a project I'm working on, but I keep
 running into a problem with bootstrapping.

 Let's say there's a Kafka topic called "Numbers" that I want to consume with
 Samza.  Let's say each message has a single integer in it, and I want to
 classify it as even or odd.  So I have two topics that I'm using for
 output, one called "Even" and one called "Odd".  I write a simple stream task
 called "Classifier" that consumes the Numbers topic, examines each incoming
 integer and writes it back out to Even or Odd.

 Now, let's say I want to be able to add classifications dynamically, like:
 "divisible by three", "divisible by four", or "numbers that appear in my
 date of birth".  And let's say I have an API I can query that gives me all
 the assignment rules, such as "when a number is divisible by 3, write it out
 to a topic called 'divisible_by_three'", or "when a number appears in the
 string 12/12/1981, write it to the 'my_birthday' topic".  So now I rewrite
 my stream task to query this API for assignment rules.  It reads integers
 from the Numbers topic and writes them back out to one or more output
 topics, according to the assignment rules.

 Now, let's make this even more complicated.  When I add a new
 classification, I want to go back to the very beginning of the Numbers
 topic and classify them accordingly.  Once we've consumed all the old
 historical integers, I want to apply this classification to new integers as
 they come in.

 And this is where I get stuck.

 One thing I can do: when I want to add a new classification, I can create
 a bootstrap job by setting the
 systems.kafka.streams.numbers.samza.offset.default property to "oldest".
 And that's great, but the problem is, once I've caught up, I'd like to
 kill the bootstrap job and just let the Classifier handle this new
 assignment.  So, I'd want to do some kind of handover from the bootstrap
 job to the Classifier job.  But how to do this?

 So, the question I must ask is this: Is Samza even an appropriate way to
 solve this problem?  Has this problem ever come up for anybody else?  How
 have they solved it?  I would really like to use Samza because it seems
 like an appropriate technology, and I'd really really really really like to
 avoid re-inventing the wheel.

 A couple solutions I came up with :

 1) The simple solution.  Have a separate Samza job for each
 classification.  If I want to add a new classification, I create a new job
 and set it up as a bootstrap job.  This would solve the problem.  However,
 we may want to have many, many classifications.  It could be as many as
 1,000,000, which would mean up to 1,000,000 simultaneously running jobs.
 This could create a lot of overhead for YARN and Kafka.

 2) My overly-complicated workaround solution.  Each assignment rule has an
 "isnew" flag.  If it's a new classification that hasn't fully bootstrapped
 yet, the "isnew" flag is set to TRUE.  When my classifier queries the API
 for assignment rules, it ignores any rule whose "isnew" flag is set.  When I
 want to add a new classification, I create a new bootstrap job for that
 classification.  Every so often, maybe every few days or so, if all of my
 bootstrap jobs have caught up, I kill all of the bootstrap jobs and
 classifier jobs.  I set all the "isnew" flags to FALSE.  Then I restart the
 classifier job.  This is kind of an ugly solution, and I'm not even sure it
 would work.  For one thing, I'd need some way of knowing if a bootstrap job
 has caught up.  Secondly, I'd essentially be restarting the classifier
 job periodically, which just seems like an ugly solution.  I don't like it.

 3) Some other kind of really complicated solution I haven't thought of yet,
 probably involving locks, transactions, concurrency, and interprocess
 communication.

 Thanks for reading this whole thing.  Please let me know if you have any
 suggestions.
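
For concreteness, the bootstrap setup Yan describes above maps to job
configuration along these lines, using the kafka system and numbers stream
from this thread's example (property names are from the Samza configuration
table; adjust the system and stream names to your own job):

    # Read the numbers stream from the beginning and treat it as a bootstrap
    # stream, i.e. catch up on it before processing any other input.
    systems.kafka.streams.numbers.samza.offset.default=oldest
    systems.kafka.streams.numbers.samza.bootstrap=true
    # Optionally ignore any checkpointed offset so the stream really restarts
    # from the oldest offset.
    systems.kafka.streams.numbers.samza.reset.offset=true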



Re: How to deal with bootstrapping

2015-04-15 Thread jeremy p
Hello Yan,

Thank you for the suggestion!  I think your solution would work; however, I
am afraid it would create a performance problem for our users.

Let's say we kill the Classifier task, and create a new Classifier task
with both the existing rules and new rules. We get the offset of the
latest-processed message for the old rules.  Let's call this offset
Last-Processed-Old-Rules.  We ignore messages
before Last-Processed-Old-Rules for the old rules.  We configure the new
Classifier task to be a bootstrap task.

Let's say we have users who are watching the output topics, and they are
expecting near-realtime updates.  They won't see any updates for the old
rules until our task has passed the Last-Processed-Old-Rules offset.  If we
have a lot of messages in that topic, that could take a long time.  This is
why I was hoping there would be a way to bootstrap the new rules while
we're still processing the old rules.  Do you think there is a way to do
that?

On Wed, Apr 15, 2015 at 2:56 PM, Yan Fang yanfang...@gmail.com wrote:

 Hi Jeremy,

 If my understanding is correct, whenever you add a new rule, you want to
 apply this rule to the historical data. Right?

 If you do not care about duplication, you can create a new task that
 contains existing rules and new rules. Configure bootstrap. This will apply
 all the rules from the beginning of the input stream. The shortcoming is
 you will get duplicated results for old rules.

 If you can not tolerate the shortcoming, 1) get the offset of the
 latest-processed message of old rules. 2) In your new task, ignore messages
 before that offset for the old rules. 3) bootstrap.

 Hope this helps. Maybe your use case is more complicated?

 Thanks,

 Fang, Yan
 yanfang...@gmail.com

 On Wed, Apr 15, 2015 at 11:19 AM, jeremy p athomewithagroove...@gmail.com
 
 wrote:

  So, I'm wanting to use Samza for a project I'm working on, but I keep
  running into a problem with bootstrapping.
 
  Let's say there's a Kafka topic called Numbers that I want to consume
 with
  Samza.  Let's say each message has a single integer in it, and I want to
  classify it as even or odd.  So I have two topics that I'm using for
  output, one called Even and one called Odd.  I write a simple stream task
  called Classifier that consumes the Numbers topic, examines each incoming
  integer and writes it back out to Even or Odd.
 
  Now, let's say I want to be able to add classifications dynamically,
 like :
  divisible by three, divisible by four, or numbers that appear in my
  date of birth.  And let's say I have an API I can query that gives me
 all
   the assignment rules, such as when a number is divisible by 3, write it
 out
  to a topic called 'divisible_by_three', or when a number appears in the
  string 12/12/1981, write it to the 'my_birthday' topic.  So now I
 rewrite
  my stream task to query this API for assignment rules.  It reads integers
  from the Numbers topic and writes them back out to one or more output
  topics, according to the assignment rules.
 
  Now, let's make this even more complicated.  When I add a new
  classification, I want to go back to the very beginning of the Numbers
  topic and classify them accordingly.  Once we've consumed all the old
   historical integers, I want to apply this classification to new integers
 as
  they come in.
 
  And this is where I get stuck.
 
  One thing I can do : when I want to add a new classification, I can
 create
  a bootstrap job by setting the
  systems.kafka.streams.numbers.samza.offset.default property to
 oldest.
  And that's great, but the problem is, once I've caught up, I'd like to
  kill the bootstrap job and just let the Classifier handle this new
  assignment.  So, I'd want to do some kind of handover from the bootstrap
  job to the Classifier job.  But how to do this?
 
  So, the question I must ask is this: Is Samza even an appropriate way to
  solve this problem?  Has this problem ever come up for anybody else?  How
  have they solved it?  I would really like to use Samza because it seems
  like an appropriate technology, and I'd really really really really like
 to
  avoid re-inventing the wheel.
 
  A couple solutions I came up with :
 
  1) The simple solution.  Have a separate Samza job for each
  classification.  If I want to add a new classification, I create a new
 job
  and set it up as a bootstrap job.  This would solve the problem.
 However,
  we may want to have many, many classifications.  It could be as many as
  1,000,000, which would mean up to 1,000,000 simultaneously running jobs.
  This could create a lot of overhead for YARN and Kafka.
 
  2) My overly-complicated workaround solution.  Each assignment rule has
 an
  isnew flag.  If it's a new classification that hasn't fully
 bootstrapped
  yet, the isnew flag is set to TRUE.  When my classifier queries the API
  for assignment rules, it ignores any rule with an isnew flag.  When I
  want to add a new classification, I create a new bootstrap job for 

Re: Review Request 33219: [SAMZA-649] Create samza-sql-calcite module for Calcite SQL front end

2015-04-15 Thread Milinda Pathirage

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33219/
---

(Updated April 15, 2015, 6:56 p.m.)


Review request for samza, Chris Riccomini and Yi Pan (Data Infrastructure).


Changes
---

Added a java doc comment to SamzaSqlValidator class.


Bugs: SAMZA-649
https://issues.apache.org/jira/browse/SAMZA-649


Repository: samza


Description
---

Moved the Calcite-based front end to the samza-sql-calcite module.


Diffs (updated)
-

  build.gradle a1c7133 
  
samza-sql-calcite/src/main/java/org/apache/samza/sql/calcite/planner/QueryPlanner.java
 PRE-CREATION 
  
samza-sql-calcite/src/main/java/org/apache/samza/sql/calcite/planner/SamzaCalciteConnection.java
 PRE-CREATION 
  
samza-sql-calcite/src/main/java/org/apache/samza/sql/calcite/planner/SamzaQueryPreparingStatement.java
 PRE-CREATION 
  
samza-sql-calcite/src/main/java/org/apache/samza/sql/calcite/planner/SamzaSqlValidator.java
 PRE-CREATION 
  
samza-sql-calcite/src/main/java/org/apache/samza/sql/calcite/schema/AvroSchemaConverter.java
 PRE-CREATION 
  
samza-sql-calcite/src/test/java/org/apache/samza/sql/calcite/planner/SamzaStreamTableFactory.java
 PRE-CREATION 
  
samza-sql-calcite/src/test/java/org/apache/samza/sql/calcite/planner/TestQueryPlanner.java
 PRE-CREATION 
  
samza-sql-calcite/src/test/java/org/apache/samza/sql/calcite/schema/TestAvroSchemaConverter.java
 PRE-CREATION 
  
samza-sql-core/src/main/java/org/apache/samza/sql/metadata/AvroSchemaConverter.java
 3dad046 
  samza-sql-core/src/main/java/org/apache/samza/sql/planner/QueryPlanner.java 
1dfb262 
  
samza-sql-core/src/main/java/org/apache/samza/sql/planner/SamzaCalciteConnection.java
 63b1da5 
  
samza-sql-core/src/main/java/org/apache/samza/sql/planner/SamzaQueryPreparingStatement.java
 0721573 
  
samza-sql-core/src/main/java/org/apache/samza/sql/planner/SamzaSqlValidator.java
 f46c1f0 
  
samza-sql-core/src/test/java/org/apache/samza/sql/planner/QueryPlannerTest.java 
022116e 
  
samza-sql-core/src/test/java/org/apache/samza/sql/planner/SamzaStreamTableFactory.java
 f757d8f 
  
samza-sql-core/src/test/java/org/apache/samza/sql/test/metadata/TestAvroSchemaConverter.java
 b4ac5f5 
  settings.gradle 5cbb755 

Diff: https://reviews.apache.org/r/33219/diff/


Testing
---

./bin/check-all.sh passed.


Thanks,

Milinda Pathirage



Re: How to deal with bootstrapping

2015-04-15 Thread Yan Fang
Hi Jeremy,

In order to reach this goal, we have to assume that the job with the new rules
can always catch up with the one with the old rules. Otherwise, I think we
have no choice but to run a lot of jobs simultaneously.

Under this assumption, we have job1 running with the old rules, and we now
add job2, which combines the old rules and the new rules. Job2 frequently
checks the Last-Processed-Old-Rules offset from job1 (because job1 is still
running), and it applies only the new rules to the data until it catches up
with the Last-Processed-Old-Rules offset. Then it sends a signal to shut
job1 down and applies all rules to the stream (a rough sketch follows at the
end of this thread).

In terms of how to shut down job1, here is one solution
http://mail-archives.apache.org/mod_mbox/samza-dev/201407.mbox/%3ccfe93d17.2d24b%25criccom...@linkedin.com%3E
provided by Chris; e.g. you can have a control stream that tells job1 to
shut down. Samza will provide this kind of stream after SAMZA-348
https://issues.apache.org/jira/browse/SAMZA-348, which is under active
development.

Thanks,

Fang, Yan
yanfang...@gmail.com

On Wed, Apr 15, 2015 at 12:17 PM, jeremy p athomewithagroove...@gmail.com
wrote:

 Hello Yan,

 Thank you for the suggestion!  I think your solution would work, however, I
 am afraid it would create a performance problem for our users.

 Let's say we kill the Classifier task, and create a new Classifier task
 with both the existing rules and new rules. We get the offset of the
 latest-processed message for the old rules.  Let's call this offset
 Last-Processed-Old-Rules.  We ignore messages
 before Last-Processed-Old-Rules for the old rules.  We configure the new
 Classifier task to be a bootstrap task.

 Let's say we have users who are watching the output topics, and they are
 expecting near-realtime updates.  They won't see any updates for the old
 rules until our task has passed the Last-Processed-Old-Rules offset.  If we
 have a lot of messages in that topic, that could take a long time.  This is
 why I was hoping there would be a way to bootstrap the new rules while
 we're still processing the old rules.  Do you think there is a way to do
 that?

 On Wed, Apr 15, 2015 at 2:56 PM, Yan Fang yanfang...@gmail.com wrote:

  Hi Jeremy,
 
  If my understanding is correct, whenever you add a new rule, you want to
  apply this rule to the historical data. Right?
 
  If you do not care about duplication, you can create a new task that
  contains existing rules and new rules. Configure bootstrap. This will
 apply
  all the rules from the beginning of the input stream. The shortcoming is
  you will get duplicated results for old rules.
 
  If you can not tolerate the shortcoming, 1) get the offset of the
  latest-processed message of old rules. 2) In your new task, ignore
 messages
  before that offset for the old rules. 3) bootstrap.
 
  Hope this helps. Maybe your use case is more complicated?
 
  Thanks,
 
  Fang, Yan
  yanfang...@gmail.com
 
  On Wed, Apr 15, 2015 at 11:19 AM, jeremy p 
 athomewithagroove...@gmail.com
  
  wrote:
 
   So, I'm wanting to use Samza for a project I'm working on, but I keep
   running into a problem with bootstrapping.
  
   Let's say there's a Kafka topic called Numbers that I want to consume
  with
   Samza.  Let's say each message has a single integer in it, and I want
 to
   classify it as even or odd.  So I have two topics that I'm using for
   output, one called Even and one called Odd.  I write a simple stream
 task
   called Classifier that consumes the Numbers topic, examines each
 incoming
   integer and writes it back out to Even or Odd.
  
   Now, let's say I want to be able to add classifications dynamically,
  like :
   divisible by three, divisible by four, or numbers that appear in
 my
   date of birth.  And let's say I have an API I can query that gives me
  all
    the assignment rules, such as when a number is divisible by 3, write it
  out
   to a topic called 'divisible_by_three', or when a number appears in
 the
   string 12/12/1981, write it to the 'my_birthday' topic.  So now I
  rewrite
   my stream task to query this API for assignment rules.  It reads
 integers
   from the Numbers topic and writes them back out to one or more output
   topics, according to the assignment rules.
  
   Now, let's make this even more complicated.  When I add a new
   classification, I want to go back to the very beginning of the Numbers
   topic and classify them accordingly.  Once we've consumed all the old
    historical integers, I want to apply this classification to new integers
  as
   they come in.
  
   And this is where I get stuck.
  
   One thing I can do : when I want to add a new classification, I can
  create
   a bootstrap job by setting the
   systems.kafka.streams.numbers.samza.offset.default property to
  oldest.
   And that's great, but the problem is, once I've caught up, I'd like
 to
   kill the bootstrap job and just let the Classifier handle this new
   assignment.  So, I'd want to do some kind of handover from 
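
As a rough sketch of the handover Yan describes above, the new job's task
could skip the old rules until it passes the offset job1 already covered.
The class name, the config key, the rule methods, and the way the
Last-Processed-Old-Rules offset is obtained from job1 are all made up for
illustration:

    import org.apache.samza.config.Config;
    import org.apache.samza.system.IncomingMessageEnvelope;
    import org.apache.samza.system.OutgoingMessageEnvelope;
    import org.apache.samza.system.SystemStream;
    import org.apache.samza.task.InitableTask;
    import org.apache.samza.task.MessageCollector;
    import org.apache.samza.task.StreamTask;
    import org.apache.samza.task.TaskContext;
    import org.apache.samza.task.TaskCoordinator;

    public class HandoverClassifierTask implements StreamTask, InitableTask {
      private long lastProcessedOldRules;

      @Override
      public void init(Config config, TaskContext context) {
        // Made-up config key; how job1 publishes this offset is left open.
        lastProcessedOldRules =
            config.getLong("classifier.last.processed.old.rules.offset", -1L);
      }

      @Override
      public void process(IncomingMessageEnvelope envelope,
          MessageCollector collector, TaskCoordinator coordinator) {
        long offset = Long.parseLong(envelope.getOffset()); // Kafka offsets parse as longs
        int number = (Integer) envelope.getMessage();

        // New rules are applied to everything, including the bootstrapped history.
        applyNewRules(number, collector);

        // Old rules are skipped until we pass what job1 has already processed.
        if (offset > lastProcessedOldRules) {
          applyOldRules(number, collector);
        }
      }

      private void applyNewRules(int number, MessageCollector collector) {
        if (number % 3 == 0) {
          collector.send(new OutgoingMessageEnvelope(
              new SystemStream("kafka", "divisible_by_three"), number));
        }
      }

      private void applyOldRules(int number, MessageCollector collector) {
        collector.send(new OutgoingMessageEnvelope(
            new SystemStream("kafka", number % 2 == 0 ? "even" : "odd"), number));
      }
    }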

Re: Samza Unit Test Instructions

2015-04-15 Thread Yan Fang
Hi Yuanchi,

Samza does not provide out-of-the-box unit tests, but there are a few
options:

1) If you only want to test the logic in the Task class, normal unit tests
will work. You can create a unit test that exercises init(), process(), etc.
(a sketch follows at the end of this thread).

2) Create mock systems by implementing SystemAdmin, SystemConsumer, etc.
You can refer to TestSystemStreamPartitionIterator:
https://github.com/apache/samza/blob/master/samza-api/src/test/java/org/apache/samza/system/TestSystemStreamPartitionIterator.java

3) Bring up ZooKeeper and Kafka in the unit test. Here is a reference,
TestStatefulTask:
https://github.com/apache/samza/blob/master/samza-test/src/test/scala/org/apache/samza/test/integration/TestStatefulTask.scala

4) Run the job locally
(http://samza.apache.org/learn/documentation/0.9/jobs/job-runner.html) by
setting the job.factory.class; see
http://samza.apache.org/learn/documentation/0.9/jobs/configuration-table.html

Usually, I think 1) and 4) are sufficient. Hope this helps.

Thanks,

Fang, Yan
yanfang...@gmail.com

On Wed, Apr 15, 2015 at 1:47 PM, Yuanchi Ning yuan...@uber.com wrote:

 Hello Samza Team,

 This is Yuanchi Ning from Uber Data Engineering, Realtime Metrics,
 Streaming Platform team.

 We are planning to use Samza to process the realtime data we have, and
 thanks for developing such an awesome open source project.

 While I am building our streaming service using Samza, I am wondering: is
 there any way to do unit tests for each Samza application Task instead of an
 integration test? Say, set up a mock environment for feeding data into a
 MessageEnvelope or something similar?

 Thanks for your assistance.

 Best,
 Yuanchi
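
To make option 1) above concrete, here is a minimal sketch of a plain JUnit
test that drives process() directly with an in-memory MessageCollector.
EvenOddTask is a hypothetical stand-in for your own task, and TaskCoordinator
is passed as null because this particular task never uses it:

    import static org.junit.Assert.assertEquals;

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.samza.Partition;
    import org.apache.samza.system.IncomingMessageEnvelope;
    import org.apache.samza.system.OutgoingMessageEnvelope;
    import org.apache.samza.system.SystemStream;
    import org.apache.samza.system.SystemStreamPartition;
    import org.apache.samza.task.MessageCollector;
    import org.apache.samza.task.StreamTask;
    import org.apache.samza.task.TaskCoordinator;
    import org.junit.Test;

    public class TestEvenOddTask {

      /** Hypothetical task under test: routes even numbers to "even", odd ones to "odd". */
      public static class EvenOddTask implements StreamTask {
        @Override
        public void process(IncomingMessageEnvelope envelope,
            MessageCollector collector, TaskCoordinator coordinator) {
          int n = (Integer) envelope.getMessage();
          collector.send(new OutgoingMessageEnvelope(
              new SystemStream("kafka", n % 2 == 0 ? "even" : "odd"), n));
        }
      }

      @Test
      public void processRoutesEvenNumbersToEvenTopic() throws Exception {
        // Collect outgoing envelopes in memory instead of sending them anywhere.
        final List<OutgoingMessageEnvelope> sent = new ArrayList<OutgoingMessageEnvelope>();
        MessageCollector collector = new MessageCollector() {
          @Override
          public void send(OutgoingMessageEnvelope envelope) {
            sent.add(envelope);
          }
        };

        IncomingMessageEnvelope envelope = new IncomingMessageEnvelope(
            new SystemStreamPartition("kafka", "numbers", new Partition(0)),
            "42", null, 4);

        new EvenOddTask().process(envelope, collector, null);

        assertEquals(1, sent.size());
        assertEquals("even", sent.get(0).getSystemStream().getStream());
      }
    }

Keeping the collector hand-rolled keeps the test free of any Kafka or YARN
dependency; a mocking library would work just as well.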



Samza Unit Test Instructions

2015-04-15 Thread Yuanchi Ning
Hello Samza Team,

This is Yuanchi Ning from Uber Data Engineering, Realtime Metrics,
Streaming Platform team.

We are planning to use Samza to process the realtime data we have, and
thanks for developing such an awesome open source project.

While I am building our streaming service using Samza, I am wondering: is
there any way to do unit tests for each Samza application Task instead of an
integration test? Say, set up a mock environment for feeding data into a
MessageEnvelope or something similar?

Thanks for your assistance.

Best,
Yuanchi