Hey Alex,

Yeah, it looks like this could be documented better.
> trying to dump one with the script doesn't work with hello-samza.

My guess is that you tried to use wikipedia-feed.properties. This doesn't have checkpointing enabled, because IRC isn't replayable, and that's where we're getting the wikipedia feed from. If you run the wikipedia-parser.properties job, the checkpoint tool works:

  deploy/samza/bin/checkpoint-tool.sh --config-path=file:!/Code/incubator-samza-hello-samza/samza-job-package/src/main/config/wikipedia-parser.properties

You'll see:

  2014-11-10 12:20:24 CheckpointTool [INFO] Current checkpoint: systems.kafka.streams.wikipedia-raw.partitions.0 = 1533

The format for the files is:

  systems.kafka.streams.wikipedia-raw.partitions.0=1533
  systems.<system name>.streams.<stream name>.partitions.<partition number>=<offset>

Cheers,
Chris

PS: you'll need to tweak the log4j.xml file to look like this:

  <!DOCTYPE log4j:configuration SYSTEM "log4j.dtd">
  <log4j:configuration xmlns:log4j="http://jakarta.apache.org/log4j/">
    <appender name="console" class="org.apache.log4j.ConsoleAppender">
      <param name="Target" value="System.out" />
      <layout class="org.apache.log4j.PatternLayout">
        <param name="ConversionPattern" value="%d{yyyy-MM-dd HH:mm:ss} %c{1} [%p] %m%n" />
      </layout>
    </appender>
    <root>
      <priority value="info" />
      <appender-ref ref="console"/>
    </root>
  </log4j:configuration>

By default, the log4j file outputs to disk. You'll want it to output to console. This is fixed in 0.8.0.

On 11/10/14 10:33 AM, "Alexander Taggart" <[email protected]> wrote:

>Thanks, Chris.
>
>It's not clear to me from the documentation whether the checkpoint tool
>can be used to control the starting offset for a job that has never been
>run, and if so, how the properties file would need to be crafted. The
>checkpoint doc page doesn't show what the properties file looks like, and
>trying to dump one with the script doesn't work with hello-samza.
>
>On Mon, Nov 10, 2014 at 11:36 AM, Chris Riccomini <[email protected]> wrote:
>
>> Hey Alexander,
>>
>> We have a checkpoint offset tool
>> (./samza-shell/src/main/bash/checkpoint-tool.sh), which allows you to
>> read and write offsets for all input partitions. This tool will allow
>> you to arbitrarily set offsets before a job starts.
>>
>> We also support the samza.offset.default and samza.reset.offset
>> configurations:
>>
>> http://samza.incubator.apache.org/learn/documentation/0.7.0/jobs/configuration-table.html#streams
>>
>> These allow you to specify whether a job should read from the head or
>> tail of an input stream when the job first starts.
>>
>> We don't currently support a way to change offsets once a job has
>> already started. If you can get more specific about your use case, that
>> would help.
>>
>> Cheers,
>> Chris
>>
>> On 11/10/14 6:53 AM, "Alexander Taggart" <[email protected]> wrote:
>>
>>> We're investigating using Samza, and one aspect of our usage would
>>> require being able to start a job such that it begins reading from a
>>> specified Kafka offset. If I understand correctly, each job being
>>> bound to a specific partition would need to be provided with a
>>> specific offset. Is there any facility for providing such values,
>>> either via config or via API? If not, what might be a good approach to
>>> implementing it (e.g., a custom kafka consumer)?
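To go the other direction and *write* offsets back before starting a job, the same properties format the tool prints can be fed back in. A minimal sketch, assuming a CheckpointTool build that supports the --new-offsets option (check your Samza version), with made-up paths and a hypothetical offset value:

  # new-offsets.properties (hypothetical file; one line per input partition)
  systems.kafka.streams.wikipedia-raw.partitions.0=1533

  deploy/samza/bin/checkpoint-tool.sh \
    --config-path=file:///path/to/wikipedia-parser.properties \
    --new-offsets=file:///path/to/new-offsets.properties

The job must not be running when you rewrite its checkpoint; it picks up the new offsets on its next start.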

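For reference, the samza.offset.default and samza.reset.offset settings Chris mentions are scoped per system and per stream in the job's properties file. A sketch, assuming a Kafka system named "kafka" and a stream named "wikipedia-raw" (substitute your own names):

  # When no checkpoint exists, start from the oldest available offset
  # (the default is "upcoming", i.e. only new messages)
  systems.kafka.samza.offset.default=oldest

  # Ignore the stored checkpoint for one stream on the next restart,
  # falling back to the default offset above
  systems.kafka.streams.wikipedia-raw.samza.reset.offset=true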