Re: Updating samza-sql branch to Java 1.7
Thanks everyone.

Milinda

On Tue, Apr 14, 2015 at 6:06 PM, Yi Pan nickpa...@gmail.com wrote:
Merged master to samza-sql.

On Tue, Apr 14, 2015 at 2:57 PM, Jakob Homan jgho...@gmail.com wrote:
Yes, I removed the tests for JDK6 yesterday. We're 1.7 or above now for development.

On 14 April 2015 at 12:47, Milinda Pathirage mpath...@umail.iu.edu wrote:
Hi Devs,

Calcite dropped support for Java 1.6 in 1.1.0-incubating. I want to use Calcite 1.2.0-incubating-SNAPSHOT in the samza-sql branch. Is it okay to update the samza-sql branch to Java 1.7?

Thanks
Milinda

--
Milinda Pathirage
PhD Student | Research Assistant
School of Informatics and Computing | Data to Insight Center
Indiana University
twitter: milindalakmal
skype: milinda.pathirage
blog: http://milinda.pathirage.org
Re: How to configure the Resource Manager endpoint for YARN?
I'll try that. Thanks, Chris.

On Wed, Apr 15, 2015 at 9:37 AM, Chris Riccomini criccom...@apache.org wrote:
Hey Roger,

Not sure if this makes a difference, but have you tried using:

export YARN_CONF_DIR=...

instead? This is what we use.

Cheers,
Chris
How to configure the Resource Manager endpoint for YARN?
Hi,

I'm trying to deploy a job to a small YARN cluster. How do I tell the launcher script where to find the Resource Manager? I tried creating a yarn-site.xml and setting the HADOOP_CONF_DIR environment variable, but it doesn't find my config:

2015-04-14 22:02:45 ClientHelper [INFO] trying to connect to RM 0.0.0.0:8032
2015-04-14 22:02:45 RMProxy [INFO] Connecting to ResourceManager at /0.0.0.0:8032

Thanks,
Roger
Re: Extra Systems and other extensions.
+1! I was going to do this for my use case as well. Would love to have this!

On Wed, Apr 15, 2015 at 9:24 AM, Roger Hoover roger.hoo...@gmail.com wrote:
Dan,

This is great. Would love to have a common Elasticsearch system producer.

Cheers,
Roger

On Tue, Apr 14, 2015 at 1:34 PM, Dan danharve...@gmail.com wrote:
Thanks Jakob, I agree they'll be more maintained and tested if they're in the main repo, so that's great. I'll sort out JIRAs and get some patches of what we've got working now out for review.

- Dan

On 14 April 2015 at 19:56, Jakob Homan jgho...@gmail.com wrote:
Hey Dan-

I'd love for the Elasticsearch stuff to be added to the main code, as a separate module. Keeping these in the main source code keeps them more likely to be maintained and correct. The EnvironmentConfigRewriter can likely go in the same place as the ConfigRewriter interface, since it doesn't depend on Kafka as the current RegexTopicRewriter does. If you could open JIRAs for these, that would be great. Happy to shepherd the code in.

-Jakob

On 14 April 2015 at 11:46, Dan danharve...@gmail.com wrote:
Hey,

At state.com we've started to write some generic extensions to Samza that we think would be more generally useful. We've got an ElasticsearchSystemProducer/Factory to output to an Elasticsearch index and an EnvironmentConfigRewriter to modify config from environment variables. What's the best way for us to add things like this? Do you want more modules in the main project, or should we just create some separate projects on GitHub? It would be good to get core extensions like these shared, to be tested and used by more people.

Thanks,
Dan

--
Thanks and regards
Chinmay Soman
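For a flavor of what an environment-based config rewriter could do, here is a hypothetical sketch in plain Java. It is not the actual Samza ConfigRewriter interface and not Dan's implementation; the SAMZA_ prefix convention and all names here are assumptions for illustration only.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch of an environment-variable config overlay:
 * variables of the form SAMZA_TASK_CLASS are mapped onto config
 * keys like task.class and merged over a base config map.
 * This is illustrative only, not the Samza ConfigRewriter API.
 */
public class EnvConfigOverlay {

    static final String PREFIX = "SAMZA_";

    /** Convert SAMZA_TASK_CLASS -> task.class and merge over the base config. */
    public static Map<String, String> rewrite(Map<String, String> base,
                                              Map<String, String> env) {
        Map<String, String> out = new HashMap<>(base);
        for (Map.Entry<String, String> e : env.entrySet()) {
            if (e.getKey().startsWith(PREFIX)) {
                // Strip the prefix, lowercase, and turn underscores into dots.
                String key = e.getKey().substring(PREFIX.length())
                        .toLowerCase().replace('_', '.');
                out.put(key, e.getValue());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> base = new HashMap<>();
        base.put("task.class", "com.example.MyTask");
        // In a real rewriter this would come from System.getenv().
        Map<String, String> env = new HashMap<>();
        env.put("SAMZA_TASK_CLASS", "com.example.OtherTask");
        System.out.println(rewrite(base, env));
    }
}
```

An actual implementation would plug into Samza's ConfigRewriter hook and read System.getenv(); this version takes the environment as a parameter so the mapping is easy to test.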
Maximum number of jobs
What's the maximum number of Samza jobs I can run simultaneously on a single cluster? Let's say these jobs are very lightweight -- they require little memory or processing power. However, I need a lot of them -- let's say I need to have 1,000,000 running at any given time. Is this reasonable or even possible?
Re: How to configure the Resource Manager endpoint for YARN?
Hey Roger,

Hmm, that's good to know, lol. Wonder how ours is working. :) I'll poke around.

Cheers,
Chris

On Wed, Apr 15, 2015 at 11:17 AM, Roger Hoover roger.hoo...@gmail.com wrote:
Turns out that HADOOP_CONF_DIR is the right env var (YARN_CONF_DIR did not work). I had just messed up the directory path. Doh!

Sent from my iPhone
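For anyone hitting the same 0.0.0.0:8032 symptom, a minimal sketch of the setup the thread converges on. The directory path and Resource Manager host below are illustrative, not taken from the thread:

```
# Point the Samza launcher at the directory that holds yarn-site.xml
export HADOOP_CONF_DIR=/etc/hadoop/conf

# /etc/hadoop/conf/yarn-site.xml (illustrative RM address)
<configuration>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>rm-host.example.com:8032</value>
  </property>
</configuration>
```

If the RM log still shows 0.0.0.0:8032, the launcher is not seeing this directory at all, which is exactly the mistaken-path failure mode Roger describes.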
Re: Review Request 33219: [SAMZA-649] Create samza-sql-calcite module for Calcite SQL front end
On April 15, 2015, 6:20 p.m., Yi Pan (Data Infrastructure) wrote:
samza-sql-calcite/src/main/java/org/apache/samza/sql/calcite/schema/AvroSchemaConverter.java, line 37
https://reviews.apache.org/r/33219/diff/1/?file=930371#file930371line37

I assume that this class is used to convert the data schema/types in the samza-sql-core model to Calcite's RelDataType? In that case, can we use the generic Schema class in samza-sql-core instead of one specific to Avro?

Please discard this comment. I just realized that this converter is probably used in query validation, which will need to convert schemas in an available Avro schema repo to Calcite's data model.

- Yi

---
This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/33219/#review80228
---

On April 15, 2015, 2:49 p.m., Milinda Pathirage wrote:
(Updated April 15, 2015, 2:49 p.m.)

Review request for samza, Chris Riccomini and Yi Pan (Data Infrastructure).

Bugs: SAMZA-649
https://issues.apache.org/jira/browse/SAMZA-649

Repository: samza

Description
---
Moved Calcite based front-end to samza-sql-calcite module.
Diffs
---
build.gradle a1c7133
samza-sql-calcite/src/main/java/org/apache/samza/sql/calcite/planner/QueryPlanner.java PRE-CREATION
samza-sql-calcite/src/main/java/org/apache/samza/sql/calcite/planner/SamzaCalciteConnection.java PRE-CREATION
samza-sql-calcite/src/main/java/org/apache/samza/sql/calcite/planner/SamzaQueryPreparingStatement.java PRE-CREATION
samza-sql-calcite/src/main/java/org/apache/samza/sql/calcite/planner/SamzaSqlValidator.java PRE-CREATION
samza-sql-calcite/src/main/java/org/apache/samza/sql/calcite/schema/AvroSchemaConverter.java PRE-CREATION
samza-sql-calcite/src/test/java/org/apache/samza/sql/calcite/planner/SamzaStreamTableFactory.java PRE-CREATION
samza-sql-calcite/src/test/java/org/apache/samza/sql/calcite/planner/TestQueryPlanner.java PRE-CREATION
samza-sql-calcite/src/test/java/org/apache/samza/sql/calcite/schema/TestAvroSchemaConverter.java PRE-CREATION
samza-sql-core/src/main/java/org/apache/samza/sql/metadata/AvroSchemaConverter.java 3dad046
samza-sql-core/src/main/java/org/apache/samza/sql/planner/QueryPlanner.java 1dfb262
samza-sql-core/src/main/java/org/apache/samza/sql/planner/SamzaCalciteConnection.java 63b1da5
samza-sql-core/src/main/java/org/apache/samza/sql/planner/SamzaQueryPreparingStatement.java 0721573
samza-sql-core/src/main/java/org/apache/samza/sql/planner/SamzaSqlValidator.java f46c1f0
samza-sql-core/src/test/java/org/apache/samza/sql/planner/QueryPlannerTest.java 022116e
samza-sql-core/src/test/java/org/apache/samza/sql/planner/SamzaStreamTableFactory.java f757d8f
samza-sql-core/src/test/java/org/apache/samza/sql/test/metadata/TestAvroSchemaConverter.java b4ac5f5
settings.gradle 5cbb755

Diff: https://reviews.apache.org/r/33219/diff/

Testing
---
./bin/check-all.sh passed.

Thanks,
Milinda Pathirage
Re: Review Request 33219: [SAMZA-649] Create samza-sql-calcite module for Calcite SQL front end
On April 15, 2015, 6:20 p.m., Yi Pan (Data Infrastructure) wrote:
samza-sql-calcite/src/main/java/org/apache/samza/sql/calcite/planner/QueryPlanner.java, line 61
https://reviews.apache.org/r/33219/diff/1/?file=930367#file930367line61

One quick question: do we need to implement all those rules? Or are we mainly re-using the rules implemented in Calcite?

We don't have to implement all those rules; we're just reusing existing rules from Calcite.

On April 15, 2015, 6:20 p.m., Yi Pan (Data Infrastructure) wrote:
samza-sql-calcite/src/main/java/org/apache/samza/sql/calcite/planner/SamzaSqlValidator.java, line 29
https://reviews.apache.org/r/33219/diff/1/?file=930370#file930370line29

Add Java doc here.

Will add Java docs.

- Milinda

---
This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/33219/#review80228
---

On April 15, 2015, 2:49 p.m., Milinda Pathirage wrote:
(Updated April 15, 2015, 2:49 p.m.)

Review request for samza, Chris Riccomini and Yi Pan (Data Infrastructure).

Bugs: SAMZA-649
https://issues.apache.org/jira/browse/SAMZA-649

Repository: samza

Description
---
Moved Calcite based front-end to samza-sql-calcite module.
Re: Review Request 33219: [SAMZA-649] Create samza-sql-calcite module for Calcite SQL front end
---
This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/33219/#review80237
---

Ship it!

+1

- Yi Pan (Data Infrastructure)

On April 15, 2015, 2:49 p.m., Milinda Pathirage wrote:
(Updated April 15, 2015, 2:49 p.m.)

Review request for samza, Chris Riccomini and Yi Pan (Data Infrastructure).

Bugs: SAMZA-649
https://issues.apache.org/jira/browse/SAMZA-649

Repository: samza

Description
---
Moved Calcite based front-end to samza-sql-calcite module.
Re: How to deal with bootstrapping
Hi Jeremy,

If my understanding is correct, whenever you add a new rule, you want to apply this rule to the historical data. Right?

If you do not care about duplication, you can create a new task that contains the existing rules and the new rules, and configure bootstrap. This will apply all the rules from the beginning of the input stream. The shortcoming is that you will get duplicated results for the old rules.

If you cannot tolerate that shortcoming: 1) get the offset of the latest-processed message for the old rules; 2) in your new task, ignore messages before that offset for the old rules; 3) bootstrap.

Hope this helps. Maybe your use case is more complicated?

Thanks,
Fang, Yan
yanfang...@gmail.com

On Wed, Apr 15, 2015 at 11:19 AM, jeremy p athomewithagroove...@gmail.com wrote:
So, I'm wanting to use Samza for a project I'm working on, but I keep running into a problem with bootstrapping.

Let's say there's a Kafka topic called Numbers that I want to consume with Samza. Let's say each message has a single integer in it, and I want to classify it as even or odd. So I have two topics that I'm using for output, one called Even and one called Odd. I write a simple stream task called Classifier that consumes the Numbers topic, examines each incoming integer, and writes it back out to Even or Odd.

Now, let's say I want to be able to add classifications dynamically, like: divisible by three, divisible by four, or numbers that appear in my date of birth. And let's say I have an API I can query that gives me all the assignment rules, such as "when a number is divisible by 3, write it out to a topic called 'divisible_by_three'", or "when a number appears in the string 12/12/1981, write it to the 'my_birthday' topic". So now I rewrite my stream task to query this API for assignment rules. It reads integers from the Numbers topic and writes them back out to one or more output topics, according to the assignment rules.

Now, let's make this even more complicated. When I add a new classification, I want to go back to the very beginning of the Numbers topic and classify the historical integers accordingly. Once we've consumed all the old historical integers, I want to apply this classification to new integers as they come in. And this is where I get stuck.

One thing I can do: when I want to add a new classification, I can create a bootstrap job by setting the systems.kafka.streams.numbers.samza.offset.default property to oldest. And that's great, but the problem is, once I've caught up, I'd like to kill the bootstrap job and just let the Classifier handle this new assignment. So, I'd want to do some kind of handover from the bootstrap job to the Classifier job. But how to do this?

So, the question I must ask is this: Is Samza even an appropriate way to solve this problem? Has this problem ever come up for anybody else? How have they solved it? I would really like to use Samza because it seems like an appropriate technology, and I'd really, really like to avoid re-inventing the wheel.

A couple of solutions I came up with:

1) The simple solution. Have a separate Samza job for each classification. If I want to add a new classification, I create a new job and set it up as a bootstrap job. This would solve the problem. However, we may want to have many, many classifications. It could be as many as 1,000,000, which would mean up to 1,000,000 simultaneously running jobs. This could create a lot of overhead for YARN and Kafka.

2) My overly-complicated workaround solution. Each assignment rule has an isnew flag. If it's a new classification that hasn't fully bootstrapped yet, the isnew flag is set to TRUE. When my classifier queries the API for assignment rules, it ignores any rule with an isnew flag. When I want to add a new classification, I create a new bootstrap job for that classification. Every so often, maybe every few days, if all of my bootstrap jobs have caught up, I kill all of the bootstrap jobs and classifier jobs. I set all the isnew flags to FALSE. Then I restart the classifier job. This is kind of an ugly solution, and I'm not even sure it would work. For one thing, I'd need some way of knowing if a bootstrap job has caught up. Secondly, I'd essentially be restarting the classifier job periodically, which just seems ugly. I don't like it.

3) Some other kind of really complicated solution I haven't thought of yet, probably involving locks, transactions, concurrency, and interprocess communication.

Thanks for reading this whole thing. Please let me know if you have any suggestions.
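For reference, the bootstrap setup discussed in this thread is configured per input stream. A hedged sketch using the system and stream names from Jeremy's example (assuming a Kafka system named "kafka" and an input stream named "numbers"):

```
# Re-read the numbers stream from the beginning of the log...
systems.kafka.streams.numbers.samza.offset.default=oldest

# ...and treat it as a bootstrap stream, so Samza processes it to
# the head before giving priority to other inputs
systems.kafka.streams.numbers.samza.bootstrap=true
```

With samza.bootstrap=true the job catches up on the historical messages first, which matches the "go back to the very beginning of the Numbers topic" requirement; the handover question of when to stop the dedicated bootstrap job is the part the config alone does not solve.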
Re: How to deal with bootstrapping
Hello Yan,

Thank you for the suggestion! I think your solution would work; however, I am afraid it would create a performance problem for our users.

Let's say we kill the Classifier task and create a new Classifier task with both the existing rules and the new rules. We get the offset of the latest-processed message for the old rules; let's call this offset Last-Processed-Old-Rules. We ignore messages before Last-Processed-Old-Rules for the old rules, and we configure the new Classifier task to be a bootstrap task.

Now, let's say we have users who are watching the output topics and expecting near-realtime updates. They won't see any updates for the old rules until our task has passed the Last-Processed-Old-Rules offset. If we have a lot of messages in that topic, that could take a long time. This is why I was hoping there would be a way to bootstrap the new rules while we're still processing the old rules. Do you think there is a way to do that?
Re: Review Request 33219: [SAMZA-649] Create samza-sql-calcite module for Calcite SQL front end
---
This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/33219/
---

(Updated April 15, 2015, 6:56 p.m.)

Review request for samza, Chris Riccomini and Yi Pan (Data Infrastructure).

Changes
---
Added a java doc comment to SamzaSqlValidator class.

Bugs: SAMZA-649
https://issues.apache.org/jira/browse/SAMZA-649

Repository: samza

Description
---
Moved Calcite based front-end to samza-sql-calcite module.

Diff: https://reviews.apache.org/r/33219/diff/

Testing
---
./bin/check-all.sh passed.

Thanks,
Milinda Pathirage
Re: How to deal with bootstrapping
Hi Jeremy,

In order to reach this goal, we have to assume that the job with the new rules can always catch up with the one with the old rules. Otherwise, I think we have no choice but to run a lot of jobs simultaneously.

Under this assumption, we have job1 with the old rules running, and now add job2, which integrates the old rules and the new rules. Job2 frequently checks the Last-Processed-Old-Rules offset from job1 (because job1 is still running), and it applies only the new rules to the data until it catches up with the Last-Processed-Old-Rules offset. Then it signals job1 to shut down, and applies all the rules to the stream.

In terms of how to shut down job1, here is one solution http://mail-archives.apache.org/mod_mbox/samza-dev/201407.mbox/%3ccfe93d17.2d24b%25criccom...@linkedin.com%3E provided by Chris - e.g. you can have a control stream to get job1 shut down. Samza will provide this kind of stream after SAMZA-348 https://issues.apache.org/jira/browse/SAMZA-348, which is under active development.

Thanks,
Fang, Yan
yanfang...@gmail.com
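To make the handover concrete, here is a small self-contained sketch of the gate job2 would apply per message. This is plain Java, not Samza API; the method name, the rule-set labels, and the way the Last-Processed-Old-Rules offset is obtained are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Illustrative sketch of the handover described above: job2 applies
 * only the new rules while it is still behind job1's last-processed
 * offset, and switches to applying every rule once it has passed it
 * (at which point job1 can be signaled to shut down).
 * Not Samza API; names are hypothetical.
 */
public class HandoverSketch {

    /** Which rule sets job2 should apply at a given input offset. */
    public static List<String> rulesToApply(long offset, long lastProcessedOldRules) {
        List<String> rules = new ArrayList<>();
        // The new rules always run: this is the bootstrap pass.
        rules.add("new-rules");
        if (offset > lastProcessedOldRules) {
            // Past the handover point: take over the old rules too.
            rules.add("old-rules");
        }
        return rules;
    }

    public static void main(String[] args) {
        long handover = 100; // Last-Processed-Old-Rules, fetched from job1 in reality
        System.out.println(rulesToApply(50, handover));   // still catching up
        System.out.println(rulesToApply(101, handover));  // caught up: all rules
    }
}
```

In a real job, lastProcessedOldRules would be read periodically from wherever job1 checkpoints (e.g. its checkpoint topic), and the "caught up" transition is where the control-stream shutdown signal to job1 would be sent.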
Re: Samza Unit Test Instructions
Hi Yuanchi,

There are no out-of-the-box unit tests provided by Samza, but there are some options:

1) If you only want to test the logic in the Task class, normal unit tests will work. You can create a unit test that tests init(), process(), etc.
2) Create mock systems by implementing SystemAdmin, SystemConsumer, etc. You can refer to TestSystemStreamPartitionIterator https://github.com/apache/samza/blob/master/samza-api/src/test/java/org/apache/samza/system/TestSystemStreamPartitionIterator.java
3) Bring up ZooKeeper and Kafka in the unit test. Here is a reference: TestStatefulTask https://github.com/apache/samza/blob/master/samza-test/src/test/scala/org/apache/samza/test/integration/TestStatefulTask.scala
4) Run the job locally http://samza.apache.org/learn/documentation/0.9/jobs/job-runner.html by setting job.factory.class http://samza.apache.org/learn/documentation/0.9/jobs/configuration-table.html

Usually, I think 1) and 4) are sufficient. Hope this helps.

Thanks,
Fang, Yan
yanfang...@gmail.com
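As an illustration of option 1), keeping the per-message logic in a plain method makes it testable without any Samza runtime. The toy even/odd classifier below is a hypothetical example, not Samza API; a real StreamTask's process() would call the pure method and send the result to a SystemStream.

```java
/**
 * Hypothetical example of testable task logic: the classification
 * decision lives in a pure static method, so a plain unit test can
 * exercise it with no mock systems, ZooKeeper, or Kafka.
 */
public class ClassifierLogic {

    /** Pure function: which output topic an integer belongs to. */
    public static String topicFor(int n) {
        return (n % 2 == 0) ? "Even" : "Odd";
    }

    public static void main(String[] args) {
        // In a real job, process() would do something like:
        //   collector.send(new OutgoingMessageEnvelope(
        //       new SystemStream("kafka", topicFor(n)), n));
        // Here we just exercise the logic directly.
        System.out.println(6 + " -> " + topicFor(6));
        System.out.println(7 + " -> " + topicFor(7));
    }
}
```

Options 2) and 3) come into play only when the behavior under test actually depends on the system interfaces (ordering, offsets, state), which is why 1) and 4) cover most day-to-day cases.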
Samza Unit Test Instructions
Hello Samza Team,

This is Yuanchi Ning from the Uber Data Engineering, Realtime Metrics, Streaming Platform team. We are planning to use Samza to process the realtime data we have, and thanks for developing such an awesome open source project.

While I am building our streaming service using Samza, I am wondering: is there any way to do unit tests for each Samza application Task instead of an integration test? Say, set up a mock environment for feeding data into a MessageEnvelope or something similar?

Thanks for your assistance.

Best,
Yuanchi