All, thanks again for your feedback. I just consolidated some of these learnings with some code samples here.
http://www.mammothdatallc.com/blog/accumulo-in-depth-look-at-filters-combiners-iterators-against-complex-values/

Best,

-Mike

On Fri, Jul 18, 2014 at 11:54 AM, William Slacum <wilhelm.von.cl...@accumulo.net> wrote:

> Oh wow, I have totally read your problem incorrectly then. I thought you
> wanted a total count across rows for some reason (when you mentioned you
> had versioning turned off, things clicked).
>
> You can use a combiner, but I'd write an iterator that strips out the
> count field for each value (like we did the other iterator), and then place
> that lower in the iterator stack. This way you can get around your original
> issue with the combiner only taking a single input/output type.
>
> On Tue, Jul 15, 2014 at 2:25 PM, Adam Fuchs <afu...@apache.org> wrote:
>
>> Mike,
>>
>> The way we usually aggregate by row is to check the source's top key
>> within the next function to see if it breaks the row boundary. If your
>> source starts giving you data in the next row then break out of the loop in
>> the next function. You'll also need to construct a row key to return from
>> your iterator and then handle the reseeking case (automatic seeking to
>> second key in row). See the RowEncodingIterator for hints on
>> implementation. You might actually want to subclass RowEncodingIterator to
>> implement your counter.
>>
>> Cheers,
>> Adam
>>
>> Cool. I'll write something up and share.
>>
>> I'm curious how to get my Counter (WrappingIterator) implementation to
>> aggregate by row (which, for some reason, I assumed was default?)
>>
>> Let's say I have rows (and CF="", CQ="" and versioningiterator off):
>> 1 (Value1, Value 2...Value N)
>> 2
>> 3
>>
>> How can my iterator return?
>> 1 (Count of values 1..N)
>> 2 (Count of values 1..N)
>> 3 ...
>>
>> I tried scan -b "1" -e "1" and it counts an individual row.
>> But if I don't specify anything, it returns,
>> 3 (Count of all values across all rows)
>>
>> Code:
>> http://pastebin.com/8xFNLHFS
>>
>> Example:
>> root@dev pe> listiter -scan -t pojo
>> -
>> - Iterator counter, scan scope options:
>> - iteratorPriority = 10
>> - iteratorClassName = iterators.Counter
>> -
>> root@dev pe> scan -b "1_1_20140101" -e "1_1_20140101"
>> 1_1_20140101 : [public] 65
>>
>> root@dev pe> scan -b "1_1_20140101" -e "3_9_20140727"
>> 3_9_20140727 : [public] 100000
>>
>> root@dev pe> scan
>> 3_9_20140727 : [public] 100000
>>
>> Thanks.
>>
>> -Mike
>>
>> On Tue, Jul 15, 2014 at 12:29 PM, Josh Elser <josh.el...@gmail.com> wrote:
>>
>>> There's been some mention about a desire to rethink the Iterator
>>> interface as it has some deficiencies (notably the lack of a "cleanup"
>>> before the iterators are torn down), but no one has stated that they're
>>> actively working on this.
>>>
>>> Getting better documentation wrt conventions: let us know where the
>>> Accumulo documentation falls short (and give us patches to fix the
>>> documentation :D). Additionally, write up your own findings from problems
>>> that you've run into. It's the entire community (users specifically) that
>>> we need to help encourage to grow.
>>>
>>> Even things as simple as "how do I count entries in an iterator" are big
>>> as you are now an "expert" on the subject :)
>>>
>>> On 7/15/14, 12:17 PM, Michael Moss wrote:
>>>
>>>> That worked ;) - Thanks!
>>>>
>>>> What a journey...
>>>>
>>>> I like Accumulo's architecture and promise, but the difficulty in
>>>> querying it (lack of documentation, conventions) is a major concern and
>>>> I'd imagine has to have an impact on adoption. I'm curious if there have
>>>> been any conversations around changing the interface around iterators
>>>> which are still confusing to me. Let me know how I can help!
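
As a rough illustration of the row-boundary approach Adam describes above, here is a minimal stand-alone sketch in plain Java. It is not Accumulo code: `RowCountSketch`, `Entry`, and `countPerRow` are hypothetical names, and a sorted in-memory list stands in for the iterator's source. A real iterator would additionally have to build a row key to return and handle re-seeking, as Adam notes.

```java
import java.util.ArrayList;
import java.util.List;

/** Stand-alone model of per-row counting; plain Java, not the Accumulo API. */
final class RowCountSketch {

    /** Hypothetical stand-in for a sorted key/value entry: a row ID and a value. */
    static final class Entry {
        final String row;
        final String value;

        Entry(String row, String value) {
            this.row = row;
            this.value = value;
        }
    }

    /**
     * Walks entries that are already sorted by row and emits "row=count"
     * each time the row changes, mirroring the idea of breaking out of the
     * loop once the source's top key crosses the row boundary.
     */
    static List<String> countPerRow(List<Entry> sorted) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < sorted.size()) {
            String currentRow = sorted.get(i).row;
            long count = 0;
            // Consume entries until the next one belongs to a different row.
            while (i < sorted.size() && sorted.get(i).row.equals(currentRow)) {
                count++;
                i++;
            }
            out.add(currentRow + "=" + count);
        }
        return out;
    }
}
```

The inner loop stops at the first entry of the next row instead of draining the whole source, which is the difference between one count per row and a single count across all rows.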
>>>>
>>>> On Tue, Jul 15, 2014 at 12:03 PM, William Slacum <wilhelm.von.cl...@accumulo.net> wrote:
>>>>
>>>> Herp... serves me right for not setting up a proper test case.
>>>>
>>>> I think you need to override seek as well:
>>>>
>>>> @Override
>>>> public void seek(...) throws IOException {
>>>>     super.seek(...);
>>>>     next();
>>>> }
>>>>
>>>> I think I just realized the wrapping iterator could use some clean
>>>> up, because this isn't obvious. Basically after the wrapping
>>>> iterator's seek is called, it never calls the implementor's next()
>>>> to actually set up the first top key and value.
>>>>
>>>> On Tue, Jul 15, 2014 at 9:50 AM, Michael Moss <michael.m...@gmail.com> wrote:
>>>>
>>>> I set up debugging and am rethrowing the exception. What's
>>>> strange is it appears that despite the iterator instance being
>>>> properly set to iterator.Counter (my implementation), my
>>>> breakpoints aren't being hit, only in the parent classes
>>>> (Wrapping Iterator) and (SortedKeyValueIterator).
>>>>
>>>> I have two rows in the table, when I scan with no iterator:
>>>> 2014-07-15 06:46:26,577 [Audit ] INFO : operation: permitted; user: root; action: scan; targetTable: pojo; authorizations: public,; range: (-inf,+inf); columns: []; iterators: []; iteratorOptions: {};
>>>> 2014-07-15 06:46:26,589 [tserver.TabletServer] DEBUG: ScanSess tid 10.0.2.15:45073 8 *2 entries* in 0.01 secs, nbTimes = [7 7 7.00 1]
>>>>
>>>> When I scan with the iterator (0 entries?):
>>>> 2014-07-15 06:45:58,036 [Audit ] INFO : operation: permitted; user: root; action: scan; targetTable: pojo; authorizations: public,; range: (-inf,+inf); columns: []; iterators: []; iteratorOptions: {};
>>>> 2014-07-15 06:45:58,047 [tserver.TabletServer] DEBUG: ScanSess tid 10.0.2.15:44992 8 *0 entries* in 0.01 secs, nbTimes = [6 6 6.00 1]
>>>>
>>>> No exceptions otherwise. Really appreciate all the ongoing help.
>>>>
>>>> Best,
>>>>
>>>> -Mike
>>>>
>>>> On Mon, Jul 14, 2014 at 6:40 PM, William Slacum <wilhelm.von.cl...@accumulo.net> wrote:
>>>>
>>>> Anything in your Tserver log? I think you should just
>>>> rethrow that IOException on your source's next() method,
>>>> since they're usually not recoverable (i.e., just make
>>>> Counter#next throw IOException).
>>>>
>>>> On Mon, Jul 14, 2014 at 5:48 PM, Josh Elser <josh.el...@gmail.com> wrote:
>>>>
>>>> A quick sanity check is to make sure you have data in
>>>> the table and that you can read the data without your
>>>> iterator (I've thought I had a bug because I didn't have
>>>> proper visibilities more times than I'd like to admit).
>>>>
>>>> Alternatively, you can also enable remote-debugging via
>>>> Eclipse into the TabletServer which might help you
>>>> understand more of what's going on.
>>>>
>>>> Lots of articles on how to set this up [1].
>>>> In short, add
>>>> -Xdebug -Xrunjdwp:transport=dt_socket,server=y,address=8000
>>>> to ACCUMULO_TSERVER_OPTS in accumulo-env.sh, restart the
>>>> tserver, connect eclipse to 8000 via the Debug
>>>> configuration menu, set a breakpoint in your init, seek
>>>> and next methods, and `scan` in the shell.
>>>>
>>>> [1] http://javarevisited.blogspot.com/2011/02/how-to-setup-remote-debugging-in.html
>>>>
>>>> On 7/14/14, 5:33 PM, Michael Moss wrote:
>>>>
>>>> Hmm...Still doesn't return anything from the shell.
>>>>
>>>> http://pastebin.com/ndRhspf8
>>>>
>>>> Any thoughts? What's the best way to debug these?
>>>>
>>>> On Mon, Jul 14, 2014 at 5:14 PM, William Slacum <wilhelm.von.cloud@accumulo.net> wrote:
>>>>
>>>> Ah, an artifact of me just willy nilly writing an iterator :) Any
>>>> reference to `this.source` should be replaced with
>>>> `this.getSource()`. In `next()`, your workaround ends up calling
>>>> `this.hasTop()` as the while loop condition. It will always return
>>>> false because two lines up we set `top_key` to null. We need to make
>>>> sure that the source iterator has a top, because we want to read
>>>> data from it. We'll have to change the loop condition to
>>>> `while(this.getSource().hasTop())`. On line 38 of your code we'll
>>>> need to call `this.getSource().next()` instead of `this.next()`.
>>>>
>>>> The iterator interface is documented, but there hasn't been a
>>>> definitive go-to for making one. I've been drafting a blog post, but
>>>> since it doesn't exist yet, hopefully the following will suffice.
>>>>
>>>> The lifetime of an iterator is (usually) as follows:
>>>>
>>>> (1) A new instance is created via Class.newInstance (so a no-args
>>>> constructor is needed).
>>>> (2) init() is called. This allows users to configure the iterator, set
>>>> its source, and possibly check the environment. We can also call
>>>> `deepCopy` on the source if we want to have multiple sources (we'd
>>>> do this if we wanted to do a merge read out of multiple column
>>>> families within a row).
>>>> (3) seek() is called. This gets our readers to the correct positions
>>>> in the data that are within the scan range the user requested, as
>>>> well as turning column families on or off. The name should be
>>>> reminiscent of seeking to some key on disk.
>>>> (4) hasTop() is called. If true, that means we have data, and the
>>>> iterator has a key/value pair that can be retrieved by calling
>>>> getTopKey() and getTopValue(). If false, we're done because there's
>>>> no data to return.
>>>> (5) next() is called. This will attempt to find a new top key and
>>>> value. We go back to (4) to see if next was successful in finding a
>>>> new top key/value and will repeat until the client is satisfied or
>>>> hasTop() returns false.
>>>>
>>>> You can kind of make a state machine out of those steps where we
>>>> loop between (4) and (5) until there's no data. There are more
>>>> advanced workflows where next() can be reading from multiple
>>>> sources, as well as seeking them to different positions in the tablet.
>>>>
>>>> On Mon, Jul 14, 2014 at 4:51 PM, Michael Moss <michael.m...@gmail.com> wrote:
>>>>
>>>> Thanks, William.
>>>> I was just hitting you up for an example :)
>>>>
>>>> I adapted your pseudocode (http://pastebin.com/ufPJq0g3), but
>>>> noticed that "this.source" in your example didn't have
>>>> visibility. Did I work around it correctly?
>>>>
>>>> When I add my iterator to my table and run scan from the shell,
>>>> it returns nothing - what should I expect here? In general I've
>>>> found the iterator interface pretty confusing and haven't spent
>>>> the time wrapping my head around it yet. Any documentation or
>>>> examples (beyond what I could find on the site or in the code)
>>>> appreciated!
>>>>
>>>> root@dev> table pojo
>>>> root@dev pojo> listiter -scan -t pojo
>>>> -
>>>> - Iterator counter, scan scope options:
>>>> - iteratorPriority = 10
>>>> - iteratorClassName = iterators.Counter
>>>> -
>>>> root@dev pojo> scan
>>>> root@dev pojo>
>>>>
>>>> Best,
>>>>
>>>> -Mike
>>>>
>>>> On Mon, Jul 14, 2014 at 4:07 PM, William Slacum <wilhelm.von.cloud@accumulo.net> wrote:
>>>>
>>>> For a bit of pseudocode, I'd probably make a class that did
>>>> something akin to: http://pastebin.com/pKqAeeCR
>>>>
>>>> I wrote that up real quick in a text editor-- it won't
>>>> compile or anything, but should point you in the right
>>>> direction.
>>>>
>>>> On Mon, Jul 14, 2014 at 3:44 PM, William Slacum <wilhelm.von.cloud@accumulo.net> wrote:
>>>>
>>>> Hi Mike!
>>>>
>>>> The Combiner interface is only for aggregating keys within a single row.
>>>> You can probably get away with implementing your combining logic
>>>> in a WrappingIterator that reads across all the rows in a
>>>> given tablet.
>>>>
>>>> To do some combine/fold/reduce operation, Accumulo needs
>>>> the input type to be the same as the output type. The
>>>> combiner doesn't have a notion of a "present" type (as
>>>> you'd see in something like Algebird's Groups), but you
>>>> can use another iterator to perform your transformation.
>>>>
>>>> If you wanted to extract the "count" field from your
>>>> Avro object, you could write a new Iterator that took
>>>> your Avro object, extracted the desired field, and
>>>> returned it as its top value. You can then set this
>>>> iterator as the source of the aggregator, either
>>>> programmatically or by wrapping the source object
>>>> passed to the aggregator in its
>>>> SortedKeyValueIterator#init call.
>>>>
>>>> This is a bit inefficient as you'd have to serialize to
>>>> a Value and then immediately deserialize it in the
>>>> iterator above it. You could mitigate this by exposing a
>>>> method that would get the extracted value before
>>>> serializing it.
>>>>
>>>> This kind of counting also requires client side logic to
>>>> do a final combine operation, since the aggregations
>>>> from all the tservers are partial results.
>>>>
>>>> I believe that CountingIterator is not meant for user
>>>> consumption, but I do not know if it's related to your
>>>> issue in trying to use it from the shell. Iterators set
>>>> through the shell, in previous versions of Accumulo,
>>>> have a requirement to implement OptionDescriber. Many
>>>> default iterators do not implement this, and thus can't
>>>> be set in the shell.
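
To make the three steps William describes a little more concrete (extract the field, aggregate on the server, finish the combine on the client), here is a minimal stand-alone sketch in plain Java. It does not use the Accumulo or Avro APIs: `PartialCountSketch`, `BigValue`, and the three method names are hypothetical, and in-memory lists stand in for what each tablet server would see.

```java
import java.util.List;

/** Stand-alone model of extract / partial sum / final combine; not the Accumulo API. */
final class PartialCountSketch {

    /** Hypothetical stand-in for the large Avro value; only the count field matters here. */
    static final class BigValue {
        final String payload;
        final int count;

        BigValue(String payload, int count) {
            this.payload = payload;
            this.count = count;
        }
    }

    /** Step 1 (the extracting iterator): reduce a complex value to its count field. */
    static long extractCount(BigValue v) {
        return v.count;
    }

    /** Step 2 (the aggregating iterator): sum the extracted counts seen by one server. */
    static long partialSum(List<BigValue> valuesOnOneServer) {
        long sum = 0;
        for (BigValue v : valuesOnOneServer) {
            sum += extractCount(v);
        }
        return sum;
    }

    /** Step 3 (client side): combine the partial sums returned by each server. */
    static long finalCombine(List<Long> partials) {
        long total = 0;
        for (long p : partials) {
            total += p;
        }
        return total;
    }
}
```

Because the aggregating step only ever sees plain numbers, its input and output types match, which is the constraint the combiner imposes; the complex type is confined to the extraction step.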
>>>>
>>>> On Mon, Jul 14, 2014 at 2:44 PM, Michael Moss <michael.m...@gmail.com> wrote:
>>>>
>>>> Hi, All.
>>>>
>>>> I'm curious what the best practices are around
>>>> persisting complex types/data in Accumulo (and
>>>> aggregating on fields within them).
>>>>
>>>> Let's say I have (row, column family, column qualifier, value):
>>>> "A" "foo" "" MyHugeAvroObject(count=2)
>>>> "A" "foo" "" MyHugeAvroObject(count=3)
>>>>
>>>> Let's say MyHugeAvroObject has a field "Integer
>>>> count" with the values above.
>>>>
>>>> What is the best way to aggregate on row, column
>>>> family, column qualifier by count? In my above example:
>>>> "A" "foo" "" 5
>>>>
>>>> The TypedValueCombiner.typedReduce method can
>>>> deserialize any "V", in my case MyHugeAvroObject,
>>>> but it needs to return a value of type "V". What are
>>>> the best practices for deeply nested/complex
>>>> objects? It's not always straightforward to map a
>>>> complex Avro type into Row -> Column Family ->
>>>> Column Qualifier.
>>>>
>>>> Rather than using a TypedCombiner, I looked into
>>>> using an Aggregator (which appears deprecated as of
>>>> 1.4), which appears to let me return arbitrary
>>>> values, but despite running setiter, my aggregator
>>>> doesn't seem to do anything.
>>>>
>>>> I also tried looking at implementing a
>>>> WrappingIterator, which also appears to allow me to
>>>> return arbitrary values (such as Accumulo's
>>>> CountingIterator), but I get cryptic errors when
>>>> trying to setiter. I'm on Accumulo 1.6:
>>>>
>>>> root@dev kyt> setiter -t kyt -scan -p 10 -n countingIter -class org.apache.accumulo.core.iterators.system.CountingIterator
>>>> 2014-07-14 11:12:55,623 [shell.Shell] ERROR: java.lang.IllegalArgumentException: org.apache.accumulo.core.iterators.system.CountingIterator
>>>>
>>>> This is odd because other included implementations
>>>> of WrappingIterator seem to work (perhaps the
>>>> implementation of CountingIterator is dated):
>>>>
>>>> root@dev kyt> setiter -t kyt -scan -p 10 -n deletingIterator -class org.apache.accumulo.core.iterators.system.DeletingIterator
>>>> The iterator class does not implement
>>>> OptionDescriber. Consider this for better iterator
>>>> configuration using this setiter command.
>>>> Name for iterator (enter to skip):
>>>>
>>>> All in all, how can I aggregate simple values, like
>>>> counters from rows with complex Avro objects as
>>>> Values without having to add aggregation fields to
>>>> these Value objects?
>>>>
>>>> Thanks!
>>>>
>>>> -Mike
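
Pulling together the lifecycle William lays out in this thread (seek, then loop between hasTop and next) and the two fixes that got the counter working, here is a minimal stand-alone model in plain Java. It is a sketch under stated assumptions, not Accumulo code: `SimpleIter` is a hypothetical, much-reduced stand-in for the real iterator interface (no init, ranges, or column families), and `ListSource`, `Counter`, and `scan` are illustrative names.

```java
import java.util.ArrayList;
import java.util.List;

/** Stand-alone model of the seek / hasTop / next cycle; not the Accumulo API. */
final class LifecycleSketch {

    /** Hypothetical, simplified iterator contract. */
    interface SimpleIter {
        void seek(int start);
        boolean hasTop();
        String getTop();
        void next();
    }

    /** A source backed by an in-memory list. */
    static final class ListSource implements SimpleIter {
        private final List<String> data;
        private int pos;

        ListSource(List<String> data) { this.data = data; }

        @Override public void seek(int start) { pos = start; }
        @Override public boolean hasTop() { return pos < data.size(); }
        @Override public String getTop() { return data.get(pos); }
        @Override public void next() { pos++; }
    }

    /** A counting wrapper that returns one entry: the number of source entries. */
    static final class Counter implements SimpleIter {
        private final SimpleIter source;
        private String top; // null means "no data to return"

        Counter(SimpleIter source) { this.source = source; }

        @Override public void seek(int start) {
            source.seek(start);
            next(); // prime the first top; otherwise hasTop() is false right after seek
        }

        @Override public boolean hasTop() { return top != null; }
        @Override public String getTop() { return top; }

        @Override public void next() {
            top = null;
            long count = 0;
            // Loop on the source's hasTop(), not this.hasTop(), and advance the source.
            while (source.hasTop()) {
                count++;
                source.next();
            }
            if (count > 0) {
                top = Long.toString(count);
            }
        }
    }

    /** Drives an iterator the way a scan would: seek, then loop hasTop/getTop/next. */
    static List<String> scan(SimpleIter it) {
        List<String> out = new ArrayList<>();
        it.seek(0);
        while (it.hasTop()) {
            out.add(it.getTop());
            it.next();
        }
        return out;
    }
}
```

Removing the `next()` call from `Counter.seek` in this model reproduces the symptom from the thread: the scan loop sees `hasTop()` return false immediately and yields nothing.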