Re: Iterating/Aggregating/Combining Complex (Java POJO/Avro) Values

2014-09-07 Thread Michael Moss
All, thanks again for your feedback. I just consolidated some of these
learnings with some code samples here.

http://www.mammothdatallc.com/blog/accumulo-in-depth-look-at-filters-combiners-iterators-against-complex-values/

Best,

-Mike

On Fri, Jul 18, 2014 at 11:54 AM, William Slacum 
wilhelm.von.cl...@accumulo.net wrote:

 Oh wow, I have totally read your problem incorrectly then. I thought you
 wanted a total count across rows for some reason (when you mentioned you
 had versioning turned off, things clicked).

 You can use a combiner, but I'd write an iterator that strips out the
 count field for each value (like we did the other iterator), and then place
 that lower in the iterator stack. This way you can get around your original
 issue with the combiner only taking a single input/output type.


 On Tue, Jul 15, 2014 at 2:25 PM, Adam Fuchs afu...@apache.org wrote:

 Mike,

 The way we usually aggregate by row is to check the source's top key
 within the next function to see if it breaks the row boundary. If your
 source starts giving you data in the next row then break out of the loop in
 the next function. You'll also need to construct a row key to return from
 your iterator and then handle the reseeking case (automatic seeking to
 second key in row). See the RowEncodingIterator for hints on
 implementation. You might actually want to subclass RowEncodingIterator to
 implement your counter.
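
The row-boundary check described above can be sketched without any Accumulo dependency. This is only a simplified model: `Source` is a hypothetical stand-in for the wrapped source iterator, rows are plain strings, and the reseeking case is not handled. A real implementation would compare the row portion of the source's top Key, as RowEncodingIterator does.

```java
import java.util.ArrayList;
import java.util.List;

/** Dependency-free model of per-row counting; not Accumulo's real API. */
class RowCountSketch {
  /** Hypothetical stand-in for a sorted source of (row, value) entries. */
  static final class Source {
    private final List<String[]> entries; // each element: {row, value}
    private int pos = 0;

    Source(List<String[]> entries) { this.entries = entries; }

    boolean hasTop() { return pos < entries.size(); }
    String topRow() { return entries.get(pos)[0]; }
    void next() { pos++; }
  }

  /** Emits one "row=count" string per row by stopping at each row boundary. */
  static List<String> countPerRow(Source source) {
    List<String> out = new ArrayList<>();
    while (source.hasTop()) {
      String row = source.topRow();
      long count = 0;
      // Consume entries only while the source's top is still in this row.
      while (source.hasTop() && source.topRow().equals(row)) {
        count++;
        source.next();
      }
      out.add(row + "=" + count);
    }
    return out;
  }

  public static void main(String[] args) {
    List<String[]> data = new ArrayList<>();
    data.add(new String[] {"1", "a"});
    data.add(new String[] {"1", "b"});
    data.add(new String[] {"2", "c"});
    System.out.println(countPerRow(new Source(data))); // prints [1=2, 2=1]
  }
}
```

The inner loop is the part that matters: it stops as soon as the source's top moves to a different row, so each emitted count covers exactly one row.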

 Cheers,
 Adam
  Cool. I'll write something up and share.

 I'm curious how to get my Counter (WrappingIterator) implementation to
 aggregate by row (which, for some reason, I assumed was default?)

 Let's say I have rows (and CF=, CQ= and versioningiterator off):
 1 (Value1, Value 2...Value N)
 2
 3

 How can my iterator return?
 1 (Count of values 1..N)
 2 (Count of values 1..N)
 3 ...

 I tried scan -b 1 -e 1 and it counts an individual row. But if I
 don't specify anything, it returns,
 3 (Count of all values across all rows)

 Code:
 http://pastebin.com/8xFNLHFS

 Example:
 root@dev pe listiter -scan -t pojo
 -
 -Iterator counter, scan scope options:
 -iteratorPriority = 10
 -iteratorClassName = iterators.Counter
 -
 root@dev pe scan -b 1_1_20140101 -e 1_1_20140101
 1_1_20140101 : [public]65

 root@dev pe scan -b 1_1_20140101 -e 3_9_20140727
 3_9_20140727 : [public]10

 root@dev pe scan
 3_9_20140727 : [public]10


 Thanks.

 -Mike




Re: Iterating/Aggregating/Combining Complex (Java POJO/Avro) Values

2014-07-15 Thread Michael Moss
I set up debugging and am rethrowing the exception. What's strange is it
appears that despite the iterator instance being properly set to
iterator.Counter (my implementation), my breakpoints aren't being hit, only
in the parent classes (Wrapping Iterator) and (SortedKeyValueIterator).

I have two rows in the table, when I scan with no iterator:
2014-07-15 06:46:26,577 [Audit   ] INFO : operation: permitted; user: root;
action: scan; targetTable: pojo; authorizations: public,; range:
(-inf,+inf); columns: []; iterators: []; iteratorOptions: {};
2014-07-15 06:46:26,589 [tserver.TabletServer] DEBUG: ScanSess tid
10.0.2.15:45073 8 *2 entries* in 0.01 secs, nbTimes = [7 7 7.00 1]

When I scan with the iterator (0 entries?):
2014-07-15 06:45:58,036 [Audit   ] INFO : operation: permitted; user: root;
action: scan; targetTable: pojo; authorizations: public,; range:
(-inf,+inf); columns: []; iterators: []; iteratorOptions: {};
2014-07-15 06:45:58,047 [tserver.TabletServer] DEBUG: ScanSess tid
10.0.2.15:44992 8 *0 entries* in 0.01 secs, nbTimes = [6 6 6.00 1]

No exceptions otherwise. Really appreciate all the ongoing help.

Best,

-Mike


On Mon, Jul 14, 2014 at 6:40 PM, William Slacum 
wilhelm.von.cl...@accumulo.net wrote:

 Anything in your Tserver log? I think you should just rethrow that
 IOException from your source's next() method, since they're usually not
 recoverable (i.e., just make Counter#next throw IOException).


 On Mon, Jul 14, 2014 at 5:48 PM, Josh Elser josh.el...@gmail.com wrote:

 A quick sanity check is to make sure you have data in the table and that
 you can read the data without your iterator (I've thought I had a bug
 because I didn't have proper visibilities more times than I'd like to
 admit).

 Alternatively, you can also enable remote-debugging via Eclipse into the
 TabletServer which might help you understand more of what's going on.

 Lots of articles on how to set this up [1]. In short, add -Xdebug
 -Xrunjdwp:transport=dt_socket,server=y,address=8000 to
 ACCUMULO_TSERVER_OPTS in accumulo-env.sh, restart the tserver, connect
 Eclipse to port 8000 via the Debug configuration menu, set breakpoints in
 your init, seek and next methods, and run `scan` in the shell.


 [1] http://javarevisited.blogspot.com/2011/02/how-to-setup-remote-debugging-in.html


 On 7/14/14, 5:33 PM, Michael Moss wrote:

 Hmm...Still doesn't return anything from the shell.

 http://pastebin.com/ndRhspf8

 Any thoughts? What's the best way to debug these?


 On Mon, Jul 14, 2014 at 5:14 PM, William Slacum
 wilhelm.von.cl...@accumulo.net wrote:

 Ah, an artifact of me just willy nilly writing an iterator :) Any
 reference to `this.source` should be replaced with
 `this.getSource()`. In `next()`, your workaround ends up calling
 `this.hasTop()` as the while loop condition. It will always return
 false because two lines up we set `top_key` to null. We need to make
 sure that the source iterator has a top, because we want to read
 data from it. We'll have to change the loop condition to
 `while(this.getSource().hasTop())`. On line 38 of your code we'll
 need to call `this.getSource().next()` instead of `this.next()`.

 The iterator interface is documented, but there hasn't been a
 definitive go-to for making one. I've been drafting a blog post, but
 since it doesn't exist yet, hopefully the following will suffice.

 The lifetime of an iterator is (usually) as follows:

 (1) A new instance is created via Class.newInstance (so a no-args
 constructor is needed).
 (2) init() is called. This allows users to configure the iterator, set
 its source, and possibly check the environment. We can also call
 `deepCopy` on the source if we want to have multiple sources (we'd
 do this if we wanted to do a merge read out of multiple column
 families within a row).
 (3) seek() is called. This gets our readers to the correct positions
 in the data that are within the scan range the user requested, as
 well as turning column families on or off. The name should be
 reminiscent of seeking to some key on disk.
 (4) hasTop() is called. If true, that means we have data, and the
 iterator has a key/value pair that can be retrieved by calling
 getTopKey() and getTopValue(). If false, we're done because there's
 no data to return.
 (5) next() is called. This will attempt to find a new top key and
 value. We go back to (4) to see if next() was successful in finding a
 new top key/value and will repeat until the client is satisfied or
 hasTop() returns false.

 You can kind of make a state machine out of those steps where we
 loop between (4) and (5) until there's no data. There are more
 advanced workflows where next() can be reading from multiple
 sources, as well as seeking them to different positions in the
 tablet.
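
The five steps above can be modelled as a small driver loop. This is a sketch under simplifying assumptions, not Accumulo's actual interface: `MiniIterator` and `ListBacked` are hypothetical names, keys are plain strings, and seek() takes a single start key instead of a range plus column families.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Dependency-free model of the iterator lifecycle; not Accumulo's real API. */
class LifecycleSketch {
  /** Hypothetical, simplified stand-in for a sorted key iterator. */
  interface MiniIterator {
    void init(List<String> sortedKeys); // step 2
    void seek(String startKey);         // step 3
    boolean hasTop();                   // step 4
    String getTopKey();
    void next();                        // step 5
  }

  static final class ListBacked implements MiniIterator {
    private List<String> keys;
    private int pos;

    public void init(List<String> sortedKeys) {
      keys = sortedKeys;
      pos = sortedKeys.size(); // no top until seek() is called
    }

    public void seek(String startKey) {
      pos = 0;
      // Skip everything that sorts before the requested start key.
      while (pos < keys.size() && keys.get(pos).compareTo(startKey) < 0) {
        pos++;
      }
    }

    public boolean hasTop() { return pos < keys.size(); }
    public String getTopKey() { return keys.get(pos); }
    public void next() { pos++; }
  }

  /** Drives the lifecycle the way a scan would and collects the keys seen. */
  static List<String> scan(List<String> sortedKeys, String startKey) {
    MiniIterator it = new ListBacked(); // (1) no-args construction
    it.init(sortedKeys);                // (2) configure and set the source
    it.seek(startKey);                  // (3) position at the scan start
    List<String> out = new ArrayList<>();
    while (it.hasTop()) {               // (4) is there a top key?
      out.add(it.getTopKey());
      it.next();                        // (5) advance, then back to (4)
    }
    return out;
  }

  public static void main(String[] args) {
    System.out.println(scan(Arrays.asList("a", "b", "c"), "b")); // prints [b, c]
  }
}
```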



Re: Iterating/Aggregating/Combining Complex (Java POJO/Avro) Values

2014-07-15 Thread William Slacum
Herp... serves me right for not setting up a proper test case.

I think you need to override seek as well:

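@Override
public void seek(Range range, Collection<ByteSequence> columnFamilies,
    boolean inclusive) throws IOException {
  super.seek(range, columnFamilies, inclusive); // position the source first
  next(); // then compute the first top key and value
}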

I think I just realized the wrapping iterator could use some clean up,
because this isn't obvious. Basically after the wrapping iterator's seek is
called, it never calls the implementor's next() to actually set up the
first top key and value.
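
To illustrate why the extra next() call matters, here is a dependency-free model of a counting wrapper. It is only a sketch: `Source` and `Counter` are hypothetical stand-ins rather than Accumulo classes, and the `primeOnSeek` flag exists purely to compare the two behaviours. The loop in next() tests the source's hasTop(), not the wrapper's own.

```java
import java.util.Arrays;
import java.util.List;

/** Dependency-free model of a counting wrapper; not Accumulo's real API. */
class SeekPrimingSketch {
  /** Hypothetical stand-in for the wrapped source iterator. */
  static final class Source {
    private final List<String> values;
    private int pos;

    Source(List<String> values) {
      this.values = values;
      this.pos = values.size(); // no top until seek() is called
    }

    void seek() { pos = 0; }
    boolean hasTop() { return pos < values.size(); }
    void next() { pos++; }
  }

  /** Counts every entry in the source and exposes the total as its one top value. */
  static final class Counter {
    private final Source source;
    private final boolean primeOnSeek;
    private Long top; // null means "no top"

    Counter(Source source, boolean primeOnSeek) {
      this.source = source;
      this.primeOnSeek = primeOnSeek;
    }

    void seek() {
      source.seek();
      top = null;
      if (primeOnSeek) {
        next(); // without this call, hasTop() stays false and the scan is empty
      }
    }

    void next() {
      top = null;
      if (!source.hasTop()) {
        return; // source exhausted: nothing more to report
      }
      long count = 0;
      while (source.hasTop()) { // loop on the source, not on this.hasTop()
        count++;
        source.next();
      }
      top = count;
    }

    boolean hasTop() { return top != null; }
    long getTopValue() { return top; }
  }

  public static void main(String[] args) {
    Counter primed = new Counter(new Source(Arrays.asList("x", "y")), true);
    primed.seek();
    System.out.println(primed.hasTop() + " " + primed.getTopValue()); // prints true 2

    Counter unprimed = new Counter(new Source(Arrays.asList("x", "y")), false);
    unprimed.seek();
    System.out.println(unprimed.hasTop()); // prints false
  }
}
```

The unprimed case mirrors the "0 entries" symptom from the tserver log: the caller checks hasTop() right after seek(), sees false, and never calls next().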




Re: Iterating/Aggregating/Combining Complex (Java POJO/Avro) Values

2014-07-15 Thread Michael Moss
That worked ;) - Thanks!

What a journey...

I like Accumulo's architecture and promise, but the difficulty in querying
it (lack of documentation, conventions) is a major concern and I'd imagine
has to have an impact on adoption. I'm curious if there have been any
conversations around changing the interface around iterators which are
still confusing to me. Let me know how I can help!



Re: Iterating/Aggregating/Combining Complex (Java POJO/Avro) Values

2014-07-15 Thread Josh Elser
There's been some mention about a desire to rethink the Iterator 
interface as it has some deficiencies (notably the lack of a cleanup 
before the iterators are torn down), but no one has stated that they're 
actively working on this.


Getting better documentation wrt conventions: let us know where the
Accumulo documentation falls short (and give us patches to fix the
documentation :D). Additionally, write up your own findings from
problems that you've run into. It's the entire community (users
specifically) that we need to help encourage to grow.


Even things as simple as "how do I count entries in an iterator" are big,
as you are now an expert on the subject :)



Iterating/Aggregating/Combining Complex (Java POJO/Avro) Values

2014-07-14 Thread Michael Moss
Hi, All.

I'm curious what the best practices are around persisting complex
types/data in Accumulo (and aggregating on fields within them).

Let's say I have (row, column family, column qualifier, value):
A foo  MyHugeAvroObject(count=2)
A foo  MyHugeAvroObject(count=3)

Let's say MyHugeAvroObject has a field Integer count with the values
above.

What is the best way to aggregate on row, column family, column qualifier
by count? In my above example:
A foo  5

The TypedValueCombiner.typedReduce method can deserialize any V, in my
case MyHugeAvroObject, but it needs to return a value of type V. What are
the best practices for deeply nested/complex objects? It's not always
straightforward to map a complex Avro type into Row - Column Family -
Column Qualifier.

Rather than using a TypedValueCombiner, I looked into using an Aggregator
(which appears deprecated as of 1.4), which appears to let me return
arbitrary values, but despite running setiter, my aggregator doesn't seem
to do anything.

I also tried looking at implementing a WrappingIterator, which also appears
to allow me to return arbitrary values (such as Accumulo's
CountingIterator), but I get cryptic errors when trying to setiter (I'm on
Accumulo 1.6):

root@dev kyt setiter -t kyt -scan -p 10 -n countingIter -class
org.apache.accumulo.core.iterators.system.CountingIterator
2014-07-14 11:12:55,623 [shell.Shell] ERROR:
java.lang.IllegalArgumentException:
org.apache.accumulo.core.iterators.system.CountingIterator

This is odd because other included implementations of WrappingIterator seem
to work (perhaps the implementation of CountingIterator is dated):
root@dev kyt setiter -t kyt -scan -p 10 -n deletingIterator -class
org.apache.accumulo.core.iterators.system.DeletingIterator
The iterator class does not implement OptionDescriber. Consider this for
better iterator configuration using this setiter command.
Name for iterator (enter to skip):

All in all, how can I aggregate simple values, like counters, from rows with
complex Avro objects as Values without having to add aggregation fields to
these Value objects?

Thanks!

-Mike


Re: Iterating/Aggregating/Combining Complex (Java POJO/Avro) Values

2014-07-14 Thread William Slacum
Hi Mike!

The Combiner only aggregates values within a single row: it combines the
versions of a key that share a row, column family, column qualifier, and
visibility. You can probably get away with implementing your combining
logic in a WrappingIterator that reads across all the rows in a given tablet.

To do some combine/fold/reduce operation, Accumulo needs the input type to
be the same as the output type. The combiner doesn't have a notion of a
present type (as you'd see in something like Algebird's Groups), but you
can use another iterator to perform your transformation.

If you wanted to extract the count field from your Avro object, you could
write a new Iterator that took your Avro object, extracted the desired
field, and returned it as its top value. You can then set this iterator as
the source of the aggregator, either programmatically or by wrapping
the source object passed to the aggregator in its
SortedKeyValueIterator#init call.
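
As a rough illustration of that two-stage idea, the sketch below first extracts the count from each record and then sums the extracted values, so the combining step has the same input and output type. It deliberately avoids Accumulo and Avro: `BigRecord` is a hypothetical stand-in for MyHugeAvroObject, and plain lists stand in for the iterator stack.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Dependency-free model of "extract, then combine"; not Accumulo's real API. */
class ExtractThenCombineSketch {
  /** Hypothetical stand-in for a large Avro record with a count field. */
  static final class BigRecord {
    final String payload;
    final int count;

    BigRecord(String payload, int count) {
      this.payload = payload;
      this.count = count;
    }
  }

  /** Stage 1 (lower in the stack): turn each complex value into a plain count. */
  static List<Long> extractCounts(List<BigRecord> records) {
    List<Long> counts = new ArrayList<>();
    for (BigRecord record : records) {
      counts.add((long) record.count);
    }
    return counts;
  }

  /** Stage 2 (the combiner): input and output are both Long, so types line up. */
  static long combine(List<Long> counts) {
    long total = 0;
    for (long count : counts) {
      total += count;
    }
    return total;
  }

  public static void main(String[] args) {
    List<BigRecord> valuesForKey = Arrays.asList(
        new BigRecord("first large payload", 2),
        new BigRecord("second large payload", 3));
    System.out.println(combine(extractCounts(valuesForKey))); // prints 5
  }
}
```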

This is a bit inefficient as you'd have to serialize to a Value and then
immediately deserialize it in the iterator above it. You could mitigate
this by exposing a method that would get the extracted value before
serializing it.

This kind of counting also requires client side logic to do a final combine
operation, since the aggregations from all the tservers are partial results.
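
The client-side step can be as small as merging the partial results returned by each server. The sketch below assumes, purely for illustration, that each partial result is a map from a grouping key to a count; `mergePartials` and the map shape are hypothetical and not part of any Accumulo client API.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Dependency-free model of a client-side final combine. */
class FinalCombineSketch {
  /** Sums the per-server partial counts into one total per key. */
  static Map<String, Long> mergePartials(List<Map<String, Long>> partials) {
    Map<String, Long> totals = new TreeMap<>();
    for (Map<String, Long> partial : partials) {
      for (Map.Entry<String, Long> entry : partial.entrySet()) {
        // Add this server's count to the running total for the key.
        totals.merge(entry.getKey(), entry.getValue(), Long::sum);
      }
    }
    return totals;
  }

  public static void main(String[] args) {
    Map<String, Long> fromServerOne = new HashMap<>();
    fromServerOne.put("foo", 2L);
    Map<String, Long> fromServerTwo = new HashMap<>();
    fromServerTwo.put("foo", 3L);
    fromServerTwo.put("bar", 1L);
    System.out.println(mergePartials(Arrays.asList(fromServerOne, fromServerTwo)));
    // prints {bar=1, foo=5}
  }
}
```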

I believe that CountingIterator is not meant for user consumption, but I do
not know if it's related to your issue in trying to use it from the shell.
Iterators set through the shell, in previous versions of Accumulo, have a
requirement to implement OptionDescriber. Many default iterators do not
implement this, and thus can't be set in the shell.






Re: Iterating/Aggregating/Combining Complex (Java POJO/Avro) Values

2014-07-14 Thread William Slacum
For a bit of pseudocode, I'd probably make a class that did something akin
to: http://pastebin.com/pKqAeeCR

I wrote that up real quick in a text editor; it won't compile or anything,
but it should point you in the right direction.


On Mon, Jul 14, 2014 at 3:44 PM, William Slacum 
wilhelm.von.cl...@accumulo.net wrote:

 Hi Mike!

 The Combiner interface is only for aggregating keys within a single row.
 You can probably get away with implementing your combining logic in a
 WrappingIterator that reads across all the rows in a given tablet.

 To do some combine/fold/reduce operation, Accumulo needs the input type to
 be the same as the output type. The combiner doesn't have a notion of a
 present type (as you'd see in something like Algebird's Groups), but you
 can use another iterator to perform your transformation.

 If you wanted to extract the count field from your Avro object, you
 could write a new Iterator that took your Avro object, extracted the
 desired field, and returned it as its top value. You can then set this
 iterator as the source of the aggregator, either programmatically or via by
 wrapping the source object passed to the aggregator in its
 SortedKeyValueIterator#init call.

 This is a bit inefficient as you'd have to serialize to a Value and then
 immediately deserialize it in the iterator above it. You could mitigate
 this by exposing a method that would get the extracted value before
 serializing it.

 This kind of counting also requires client side logic to do a final
 combine operation, since the aggregations from all the tservers are partial
 results.
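
As a sketch of that final client-side step, assume each partial count has
already been read from the scanner and decoded to a long. The decoding and
the class name ClientSideCombine are assumptions for illustration, not
Accumulo API.

```java
import java.util.Arrays;
import java.util.List;

class ClientSideCombine {
    // Each element is one partial count produced by a server-side iterator.
    static long total(Iterable<Long> partialCounts) {
        long sum = 0;
        for (long partial : partialCounts) {
            sum += partial;
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Long> partials = Arrays.asList(2L, 3L, 7L);
        System.out.println(total(partials)); // prints 12
    }
}
```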

 I believe that CountingIterator is not meant for user consumption, but I
 do not know if that's related to your issue in trying to use it from the
 shell. In previous versions of Accumulo, iterators set through the shell
 were required to implement OptionDescriber. Many default iterators do
 not implement this, and thus can't be set in the shell.



 On Mon, Jul 14, 2014 at 2:44 PM, Michael Moss michael.m...@gmail.com
 wrote:

 Hi, All.

 I'm curious what the best practices are around persisting complex
 types/data in Accumulo (and aggregating on fields within them).

 Let's say I have (row, column family, column qualifier, value):
 A foo  MyHugeAvroObject(count=2)
 A foo  MyHugeAvroObject(count=3)

 Let's say MyHugeAvroObject has a field Integer count with the values
 above.

 What is the best way to aggregate on row, column family, column qualifier
 by count? In my above example:
 A foo  5

 The TypedValueCombiner.typedReduce method can deserialize any V, in my
 case MyHugeAvroObject, but it needs to return a value of type V. What are
 the best practices for deeply nested/complex objects? It's not always
 straightforward to map a complex Avro type into Row - Column Family -
 Column Qualifier.

 Rather than using a TypedValueCombiner, I looked into using an Aggregator
 (which appears to be deprecated as of 1.4), which appears to let me return
 arbitrary values, but despite running setiter, my aggregator doesn't seem
 to do anything.

 I also tried implementing a WrappingIterator, which also appears to allow
 me to return arbitrary values (as Accumulo's CountingIterator does), but I
 get cryptic errors when running setiter (I'm on Accumulo 1.6):

 root@dev kyt setiter -t kyt -scan -p 10 -n countingIter -class
 org.apache.accumulo.core.iterators.system.CountingIterator
 2014-07-14 11:12:55,623 [shell.Shell] ERROR:
 java.lang.IllegalArgumentException:
 org.apache.accumulo.core.iterators.system.CountingIterator

 This is odd because other included implementations of WrappingIterator
 seem to work (perhaps the implementation of CountingIterator is dated):
 root@dev kyt setiter -t kyt -scan -p 10 -n deletingIterator -class
 org.apache.accumulo.core.iterators.system.DeletingIterator
 The iterator class does not implement OptionDescriber. Consider this for
 better iterator configuration using this setiter command.
 Name for iterator (enter to skip):

 All in all, how can I aggregate simple values, like counters, from rows
 with complex Avro objects as Values without having to add aggregation
 fields to these Value objects?

 Thanks!

 -Mike





Re: Iterating/Aggregating/Combining Complex (Java POJO/Avro) Values

2014-07-14 Thread Michael Moss
Thanks, William. I was just hitting you up for an example :)

I adapted your pseudocode (http://pastebin.com/ufPJq0g3), but noticed that
this.source in your example isn't accessible from my subclass. Did I work
around it correctly?

When I add my iterator to my table and run scan from the shell, it returns
nothing. What should I expect here? In general I've found the iterator
interface pretty confusing and haven't spent the time wrapping my head
around it yet. Any documentation or examples (beyond what I could find on
the site or in the code) would be appreciated!

root@dev table pojo
root@dev pojo listiter -scan -t pojo
-
-Iterator counter, scan scope options:
-iteratorPriority = 10
-iteratorClassName = iterators.Counter
-
root@dev pojo scan
root@dev pojo

Best,

-Mike





Re: Iterating/Aggregating/Combining Complex (Java POJO/Avro) Values

2014-07-14 Thread William Slacum
Ah, an artifact of me writing an iterator willy-nilly :) Any reference
to `this.source` should be replaced with `this.getSource()`. In `next()`,
your workaround ends up using `this.hasTop()` as the while loop
condition. It will always return false because two lines up we set
`top_key` to null. We need to check that the source iterator has a top,
because that's what we want to read data from, so the loop condition
should be `while(this.getSource().hasTop())`. Likewise, on line 38 of your
code we'll need to call `this.getSource().next()` instead of `this.next()`.

The iterator interface is documented, but there hasn't been a definitive
go-to guide for writing one. I've been drafting a blog post, but since it
doesn't exist yet, hopefully the following will suffice.
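
A minimal plain-Java sketch of the corrected loop follows. It is not the
pastebin code and does not use Accumulo classes: FakeSource and CounterSketch
are hypothetical stand-ins for the wrapped source and the Counter iterator,
and the choice to report the last key seen as the top key is an assumption.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

// Stand-in for whatever getSource() returns in a WrappingIterator subclass.
class FakeSource {
    private final Deque<String> keys;
    FakeSource(String... keys) { this.keys = new ArrayDeque<>(Arrays.asList(keys)); }
    boolean hasTop() { return !keys.isEmpty(); }
    String getTopKey() { return keys.peekFirst(); }
    void next() { keys.pollFirst(); }
}

class CounterSketch {
    private final FakeSource source;
    private String topKey;
    private long topValue;

    CounterSketch(FakeSource source) { this.source = source; }

    FakeSource getSource() { return source; }
    boolean hasTop() { return topKey != null; }
    String getTopKey() { return topKey; }
    long getTopValue() { return topValue; }

    void next() {
        topKey = null;
        long count = 0;
        String last = null;
        // Loop on the SOURCE's hasTop(); this.hasTop() is false here
        // because topKey was just cleared.
        while (getSource().hasTop()) {
            last = getSource().getTopKey();
            count++;
            getSource().next(); // advance the source, not this iterator
        }
        if (last != null) {
            topKey = last;
            topValue = count;
        }
    }
}
```

Calling next() once on a source holding three keys leaves a top value of 3;
calling it again leaves hasTop() false, which ends the scan.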

The lifetime of an iterator is (usually) as follows:

(1) A new instance is created via Class.newInstance (so a no-args
constructor is needed).
(2) init() is called. This allows users to configure the iterator, set its
source, and possibly check the environment. We can also call `deepCopy` on
the source if we want to have multiple sources (we'd do this if we wanted
to do a merge read out of multiple column families within a row).
(3) seek() is called. This gets our readers to the correct positions in the
data that are within the scan range the user requested, as well as turning
column families on or off. The name should be reminiscent of seeking to
some key on disk.
(4) hasTop() is called. If true, that means we have data, and the iterator
has a key/value pair that can be retrieved by calling getTopKey() and
getTopValue(). If false, we're done because there's no data to return.
(5) next() is called. This will attempt to find a new top key and value. We
go back to (4) to see if next() was successful in finding a new top
key/value and will repeat until the client is satisfied or hasTop() returns
false.

You can kind of make a state machine out of those steps where we loop
between (4) and (5) until there's no data. There are more advanced
workflows where next() can be reading from multiple sources, as well as
seeking them to different positions in the tablet.
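
The loop between (4) and (5) can be sketched in plain Java as below. KvSource
is a simplified stand-in for the shape of an iterator, not the real
SortedKeyValueIterator interface (it has no init options, Range, or column
families), and MapSource and LifecycleDemo are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Simplified stand-in for an iterator's seek/hasTop/getTop*/next contract.
interface KvSource {
    void seek(String startRow); // position at the first key >= startRow
    boolean hasTop();
    String getTopKey();
    long getTopValue();
    void next();
}

// Hypothetical in-memory source backed by a sorted map.
class MapSource implements KvSource {
    private final TreeMap<String, Long> data;
    private Iterator<Map.Entry<String, Long>> it;
    private Map.Entry<String, Long> top;

    MapSource(TreeMap<String, Long> data) { this.data = data; }

    public void seek(String startRow) {
        it = data.tailMap(startRow, true).entrySet().iterator();
        next(); // prime the first top entry
    }
    public boolean hasTop() { return top != null; }
    public String getTopKey() { return top.getKey(); }
    public long getTopValue() { return top.getValue(); }
    public void next() { top = it.hasNext() ? it.next() : null; }
}

class LifecycleDemo {
    // Steps (3) to (5): seek once, then loop hasTop() / getTop*() / next().
    static List<String> drain(KvSource iter, String startRow) {
        List<String> out = new ArrayList<>();
        iter.seek(startRow);
        while (iter.hasTop()) {
            out.add(iter.getTopKey() + "=" + iter.getTopValue());
            iter.next();
        }
        return out;
    }

    public static void main(String[] args) {
        TreeMap<String, Long> data = new TreeMap<>();
        data.put("a", 2L);
        data.put("b", 3L);
        data.put("c", 5L);
        System.out.println(drain(new MapSource(data), "b")); // prints [b=3, c=5]
    }
}
```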



Re: Iterating/Aggregating/Combining Complex (Java POJO/Avro) Values

2014-07-14 Thread Michael Moss
Hmm...Still doesn't return anything from the shell.

http://pastebin.com/ndRhspf8

Any thoughts? What's the best way to debug these?



Re: Iterating/Aggregating/Combining Complex (Java POJO/Avro) Values

2014-07-14 Thread Josh Elser
A quick sanity check is to make sure you have data in the table and that 
you can read the data without your iterator (I've thought I had a bug 
because I didn't have proper visibilities more times than I'd like to 
admit).


Alternatively, you can also enable remote-debugging via Eclipse into the 
TabletServer which might help you understand more of what's going on.


There are lots of articles on how to set this up [1]. In short, add -Xdebug 
-Xrunjdwp:transport=dt_socket,server=y,address=8000 to 
ACCUMULO_TSERVER_OPTS in accumulo-env.sh, restart the tserver, connect 
Eclipse to port 8000 via the Debug configuration menu, set a breakpoint in 
your init, seek and next methods, and `scan` in the shell.



[1] 
http://javarevisited.blogspot.com/2011/02/how-to-setup-remote-debugging-in.html



Re: Iterating/Aggregating/Combining Complex (Java POJO/Avro) Values

2014-07-14 Thread William Slacum
Anything in your tserver log? I think you should just rethrow that
IOException from your source's next() call, since they're usually not
recoverable (i.e., just make Counter#next throw IOException).
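
A small plain-Java sketch of that suggestion: ThrowingSource and CounterNext
are hypothetical stand-ins (not Accumulo classes) that show next() declaring
IOException so a source failure propagates instead of being swallowed.

```java
import java.io.IOException;

// Stand-in for a source whose next() can fail with an IOException.
interface ThrowingSource {
    void next() throws IOException;
}

class CounterNext {
    private final ThrowingSource source;

    CounterNext(ThrowingSource source) { this.source = source; }

    // Declaring "throws IOException" lets the failure reach the caller
    // rather than turning into a silently empty result.
    void next() throws IOException {
        source.next();
    }
}
```

With the exception propagated, a failing source should surface as an error
for the scan rather than as a scan that returns nothing.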

