There's been some talk of rethinking the Iterator interface, since it has some deficiencies (notably the lack of a "cleanup" hook before the iterators are torn down), but no one has said that they're actively working on this.

Getting better documentation with respect to conventions: let us know where the Accumulo documentation falls short (and give us patches to fix the documentation :D). Additionally, write up your own findings from problems that you've run into. It's the entire community (users specifically) that we need to encourage to grow.

Even things as simple as "how do I count entries in an iterator" are big as you are now an "expert" on the subject :)

On 7/15/14, 12:17 PM, Michael Moss wrote:
That worked ;) - Thanks!

What a journey...

I like Accumulo's architecture and promise, but the difficulty in
querying it (lack of documentation, conventions) is a major concern and
I'd imagine has to have an impact on adoption. I'm curious if there have
been any conversations around changing the interface around iterators
which are still confusing to me. Let me know how I can help!


On Tue, Jul 15, 2014 at 12:03 PM, William Slacum
<wilhelm.von.cl...@accumulo.net <mailto:wilhelm.von.cl...@accumulo.net>>
wrote:

    Herp... serves me right for not setting up a proper test case.

    I think you need to override seek as well:

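    @Override
    public void seek(Range range, Collection<ByteSequence> columnFamilies,
        boolean inclusive) throws IOException {
       super.seek(range, columnFamilies, inclusive);
       next(); // prime the first top key and value
    }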

    I think I just realized the wrapping iterator could use some clean
    up, because this isn't obvious. Basically after the wrapping
    iterator's seek is called, it never calls the implementor's next()
    to actually set up the first top key and value.
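
To make the point above concrete, here is a minimal, self-contained sketch of why the first top has to be computed inside seek(). It deliberately does not use Accumulo's classes: `MiniIterator`, `ListSource`, `CountingWrapper`, and `SeekPrimingDemo` are hypothetical stand-ins for the seek/hasTop/next contract, and the `primeOnSeek` flag exists only to compare the two behaviours side by side.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the seek/hasTop/next contract.
// This is NOT Accumulo's SortedKeyValueIterator.
interface MiniIterator {
    void seek();        // reposition at the start of the data
    boolean hasTop();   // is there a current key/value?
    String getTopKey();
    long getTopValue();
    void next();        // advance to the next key/value
}

// A source that walks an in-memory list of (key, value) pairs.
class ListSource implements MiniIterator {
    private final List<Map.Entry<String, Long>> data;
    private int pos;

    ListSource(List<Map.Entry<String, Long>> data) {
        this.data = data;
        this.pos = data.size(); // no top until seek() is called
    }

    public void seek() { pos = 0; }
    public boolean hasTop() { return pos < data.size(); }
    public String getTopKey() { return data.get(pos).getKey(); }
    public long getTopValue() { return data.get(pos).getValue(); }
    public void next() { pos++; }
}

// Consumes its whole source and exposes a single (lastKey, entryCount) pair.
class CountingWrapper implements MiniIterator {
    private final MiniIterator source;
    private final boolean primeOnSeek;
    private String topKey;
    private long topValue;

    CountingWrapper(MiniIterator source, boolean primeOnSeek) {
        this.source = source;
        this.primeOnSeek = primeOnSeek;
    }

    public void seek() {
        source.seek();
        topKey = null;
        if (primeOnSeek) {
            next(); // the fix: compute the first top right after seeking
        }
    }

    public boolean hasTop() { return topKey != null; }
    public String getTopKey() { return topKey; }
    public long getTopValue() { return topValue; }

    public void next() {
        topKey = null;
        long count = 0;
        String lastKey = null;
        while (source.hasTop()) {   // loop on the source, not on this.hasTop()
            lastKey = source.getTopKey();
            count++;
            source.next();          // advance the source, not this iterator
        }
        if (lastKey != null) {
            topKey = lastKey;
            topValue = count;
        }
    }
}

class SeekPrimingDemo {
    // Drives an iterator the way a scan would: seek, then hasTop/next.
    static List<String> scan(MiniIterator it) {
        List<String> out = new ArrayList<>();
        it.seek();
        while (it.hasTop()) {
            out.add(it.getTopKey() + " -> " + it.getTopValue());
            it.next();
        }
        return out;
    }

    static List<Map.Entry<String, Long>> twoRows() {
        List<Map.Entry<String, Long>> rows = new ArrayList<>();
        rows.add(new SimpleImmutableEntry<>("row1", 1L));
        rows.add(new SimpleImmutableEntry<>("row2", 1L));
        return rows;
    }

    public static void main(String[] args) {
        System.out.println(scan(new CountingWrapper(new ListSource(twoRows()), false)));
        System.out.println(scan(new CountingWrapper(new ListSource(twoRows()), true)));
    }
}
```

In this model the unprimed wrapper yields no entries for a two-row source, which mirrors the "0 entries" scan reported earlier in the thread, while the primed wrapper yields one entry carrying the count. How closely this matches the real WrappingIterator depends on the Accumulo version in use.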



    On Tue, Jul 15, 2014 at 9:50 AM, Michael Moss
    <michael.m...@gmail.com <mailto:michael.m...@gmail.com>> wrote:

        I set up debugging and am rethrowing the exception. What's
        strange is it appears that despite the iterator instance being
        properly set to iterator.Counter (my implementation), my
        breakpoints aren't being hit, only in the parent classes
        (Wrapping Iterator) and (SortedKeyValueIterator).

        I have two rows in the table, when I scan with no iterator:
        2014-07-15 06:46:26,577 [Audit   ] INFO : operation: permitted;
        user: root; action: scan; targetTable: pojo; authorizations:
        public,; range: (-inf,+inf); columns: []; iterators: [];
        iteratorOptions: {};
        2014-07-15 06:46:26,589 [tserver.TabletServer] DEBUG: ScanSess
        tid 10.0.2.15:45073 8 *2 entries* in
        0.01 secs, nbTimes = [7 7 7.00 1]

        When I scan with the iterator (0 entries?):
        2014-07-15 06:45:58,036 [Audit   ] INFO : operation: permitted;
        user: root; action: scan; targetTable: pojo; authorizations:
        public,; range: (-inf,+inf); columns: []; iterators: [];
        iteratorOptions: {};
        2014-07-15 06:45:58,047 [tserver.TabletServer] DEBUG: ScanSess
        tid 10.0.2.15:44992 8 *0 entries* in
        0.01 secs, nbTimes = [6 6 6.00 1]

        No exceptions otherwise. Really appreciate all the ongoing help.

        Best,

        -Mike


        On Mon, Jul 14, 2014 at 6:40 PM, William Slacum
        <wilhelm.von.cl...@accumulo.net
        <mailto:wilhelm.von.cl...@accumulo.net>> wrote:

            Anything in your Tserver log? I think you should just
            rethrow that IOException in your source's next() method,
            since they're usually not recoverable (i.e., just make
            Counter#next throw IOException).


            On Mon, Jul 14, 2014 at 5:48 PM, Josh Elser
            <josh.el...@gmail.com <mailto:josh.el...@gmail.com>> wrote:

                A quick sanity check is to make sure you have data in
                the table and that you can read the data without your
                iterator (I've thought I had a bug because I didn't have
                proper visibilities more times than I'd like to admit).

                Alternatively, you can also enable remote-debugging via
                Eclipse into the TabletServer which might help you
                understand more of what's going on.

                Lots of articles on how to set this up [1]. In short,
                add -Xdebug
                -Xrunjdwp:transport=dt_socket,server=y,address=8000 to
                ACCUMULO_TSERVER_OPTS in accumulo-env.sh, restart the
                tserver, connect Eclipse to port 8000 via the Debug
                configuration menu, set breakpoints in your init, seek
                and next methods, and `scan` in the shell.


                [1] http://javarevisited.blogspot.com/2011/02/how-to-setup-remote-debugging-in.html


                On 7/14/14, 5:33 PM, Michael Moss wrote:

                    Hmm...Still doesn't return anything from the shell.

                    http://pastebin.com/ndRhspf8

                    Any thoughts? What's the best way to debug these?


                    On Mon, Jul 14, 2014 at 5:14 PM, William Slacum
                    <wilhelm.von.cloud@accumulo.net> wrote:

                         Ah, an artifact of me just willy-nilly writing an
                         iterator :) Any reference to `this.source` should be
                         replaced with `this.getSource()`. In `next()`, your
                         workaround ends up calling `this.hasTop()` as the
                         while loop condition. It will always return false
                         because two lines up we set `top_key` to null. We
                         need to make sure that the source iterator has a
                         top, because we want to read data from it. We'll
                         have to change the loop condition to
                         `while(this.getSource().hasTop())`. On line 38 of
                         your code we'll need to call
                         `this.getSource().next()` instead of `this.next()`.

                         The iterator interface is documented, but there
                         hasn't been a definitive go-to for making one. I've
                         been drafting a blog post, but since it doesn't
                         exist yet, hopefully the following will suffice.

                         The lifetime of an iterator is (usually) as follows:

                         (1) A new instance is created via Class.newInstance
                         (so a no-args constructor is needed).
                         (2) init() is called. This allows users to configure
                         the iterator, set its source, and possibly check the
                         environment. We can also call `deepCopy` on the
                         source if we want to have multiple sources (we'd do
                         this if we wanted to do a merge read out of multiple
                         column families within a row).
                         (3) seek() is called. This gets our readers to the
                         correct positions in the data that are within the
                         scan range the user requested, as well as turning
                         column families on or off. The name should be
                         reminiscent of seeking to some key on disk.
                         (4) hasTop() is called. If true, that means we have
                         data, and the iterator has a key/value pair that can
                         be retrieved by calling getTopKey() and
                         getTopValue(). If false, we're done because there's
                         no data to return.
                         (5) next() is called. This will attempt to find a
                         new top key and value. We go back to (4) to see if
                         next() was successful in finding a new top key/value
                         and will repeat until the client is satisfied or
                         hasTop() returns false.

                         You can kind of make a state machine out of those
                         steps where we loop between (4) and (5) until
                         there's no data. There are more advanced workflows
                         where next() can be reading from multiple sources,
                         as well as seeking them to different positions in
                         the tablet.
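
The five lifecycle steps can be sketched as a small driver loop. This is only an illustration under stated assumptions: `LifecycleIterator`, `RangeReader`, `LifecycleDemo`, and the `valuePrefix` option are hypothetical names, keys and values are plain strings, and the source is an in-memory sorted map rather than anything from Accumulo. The sketch instantiates through `getDeclaredConstructor().newInstance()`, the non-deprecated equivalent of Class.newInstance on current JDKs.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical, simplified lifecycle contract; not Accumulo's real interface.
interface LifecycleIterator {
    void init(NavigableMap<String, String> source, Map<String, String> options);
    void seek(String startInclusive, String endExclusive);
    boolean hasTop();
    String getTopKey();
    String getTopValue();
    void next();
}

class RangeReader implements LifecycleIterator {
    private NavigableMap<String, String> source;
    private String prefix = "";
    private Iterator<Map.Entry<String, String>> cursor;
    private Map.Entry<String, String> top;

    public RangeReader() { }  // step (1) needs a no-args constructor

    public void init(NavigableMap<String, String> source, Map<String, String> options) {
        this.source = source;
        String configured = options.get("valuePrefix"); // illustrative option name
        this.prefix = configured == null ? "" : configured;
    }

    public void seek(String startInclusive, String endExclusive) {
        cursor = source.subMap(startInclusive, true, endExclusive, false)
                       .entrySet().iterator();
        next(); // establish the first top key/value
    }

    public boolean hasTop() { return top != null; }
    public String getTopKey() { return top.getKey(); }
    public String getTopValue() { return prefix + top.getValue(); }
    public void next() { top = cursor.hasNext() ? cursor.next() : null; }
}

class LifecycleDemo {
    static List<String> run(Class<? extends LifecycleIterator> type,
                            NavigableMap<String, String> data,
                            Map<String, String> options,
                            String start, String end) {
        LifecycleIterator it;
        try {
            it = type.getDeclaredConstructor().newInstance();   // (1) instantiate
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("iterator needs a no-args constructor", e);
        }
        it.init(data, options);                                  // (2) configure
        it.seek(start, end);                                     // (3) position
        List<String> out = new ArrayList<>();
        while (it.hasTop()) {                                    // (4) data left?
            out.add(it.getTopKey() + "=" + it.getTopValue());
            it.next();                                           // (5) advance
        }
        return out;
    }

    public static void main(String[] args) {
        NavigableMap<String, String> data = new TreeMap<>();
        data.put("a", "1");
        data.put("b", "2");
        data.put("c", "3");
        System.out.println(run(RangeReader.class, data,
            Collections.singletonMap("valuePrefix", "v"), "a", "c"));
    }
}
```

The loop in `run` is the (4)/(5) state machine: it keeps alternating between hasTop() and next() until the iterator reports no more data for the requested range.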


                         On Mon, Jul 14, 2014 at 4:51 PM, Michael Moss
                         <michael.m...@gmail.com> wrote:

                             Thanks, William. I was just hitting you up for
                             an example :)

                             I adapted your pseudocode
                             (http://pastebin.com/ufPJq0g3), but noticed that
                             "this.source" in your example didn't have
                             visibility. Did I work around it correctly?

                             When I add my iterator to my table and run scan
                             from the shell, it returns nothing - what should
                             I expect here? In general I've found the
                             iterator interface pretty confusing and haven't
                             spent the time wrapping my head around it yet.
                             Any documentation or examples (beyond what I
                             could find on the site or in the code)
                             appreciated!

                             root@dev> table pojo
                             root@dev pojo> listiter -scan -t pojo
                             -
                             -    Iterator counter, scan scope options:
                             -        iteratorPriority = 10
                             -        iteratorClassName = iterators.Counter
                             -
                             root@dev pojo> scan
                             root@dev pojo>


                             Best,

                             -Mike




                             On Mon, Jul 14, 2014 at 4:07 PM, William Slacum
                             <wilhelm.von.cloud@accumulo.net> wrote:

                                 For a bit of pseudocode, I'd probably make a
                                 class that did something akin to:
                                 http://pastebin.com/pKqAeeCR

                                 I wrote that up real quick in a text editor;
                                 it won't compile or anything, but should
                                 point you in the right direction.


                                 On Mon, Jul 14, 2014 at 3:44 PM, William Slacum
                                 <wilhelm.von.cloud@accumulo.net> wrote:

                                     Hi Mike!

                                     The Combiner interface is only for
                                     aggregating keys within a single row. You
                                     can probably get away with implementing
                                     your combining logic in a
                                     WrappingIterator that reads across all
                                     the rows in a given tablet.

                                     To do some combine/fold/reduce operation,
                                     Accumulo needs the input type to be the
                                     same as the output type. The combiner
                                     doesn't have a notion of a "present" type
                                     (as you'd see in something like Algebird's
                                     Groups), but you can use another iterator
                                     to perform your transformation.

                                     If you wanted to extract the "count" field
                                     from your Avro object, you could write a
                                     new Iterator that took your Avro object,
                                     extracted the desired field, and returned
                                     it as its top value. You can then set this
                                     iterator as the source of the aggregator,
                                     either programmatically or by wrapping the
                                     source object passed to the aggregator in
                                     its SortedKeyValueIterator#init call.
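
The stacking idea (an extraction stage feeding an aggregation stage) can be sketched without any Accumulo or Avro dependency. Everything named here is an assumption made for illustration: `HugeRecord` stands in for the large Avro object, plain `java.util.Iterator` stands in for the iterator stack, and `FieldExtractor` and `LongSummer` are hypothetical names.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.ToLongFunction;

// Hypothetical stand-in for a large Avro-generated object.
final class HugeRecord {
    final String payload;
    final int count;

    HugeRecord(String payload, int count) {
        this.payload = payload;
        this.count = count;
    }
}

// Stage 1: pulls one numeric field out of each record from its source.
final class FieldExtractor<T> implements Iterator<Long> {
    private final Iterator<T> source;
    private final ToLongFunction<T> field;

    FieldExtractor(Iterator<T> source, ToLongFunction<T> field) {
        this.source = source;
        this.field = field;
    }

    @Override
    public boolean hasNext() { return source.hasNext(); }

    @Override
    public Long next() { return field.applyAsLong(source.next()); }
}

// Stage 2: sums whatever numbers its source yields; it never sees a record,
// so its input type and output type are the same.
final class LongSummer {
    static long sum(Iterator<Long> source) {
        long total = 0;
        while (source.hasNext()) {
            total += source.next();
        }
        return total;
    }
}

class ExtractThenSumDemo {
    public static void main(String[] args) {
        List<HugeRecord> rows = Arrays.asList(
            new HugeRecord("first", 2),
            new HugeRecord("second", 3));
        long total = LongSummer.sum(
            new FieldExtractor<HugeRecord>(rows.iterator(), r -> r.count));
        System.out.println(total); // 2 + 3
    }
}
```

In a real iterator stack the extraction stage would hand its result upward as a serialized value, which this in-memory sketch does not model.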

                                     This is a bit inefficient as you'd have to
                                     serialize to a Value and then immediately
                                     deserialize it in the iterator above it.
                                     You could mitigate this by exposing a
                                     method that would get the extracted value
                                     before serializing it.

                                     This kind of counting also requires
                                     client-side logic to do a final combine
                                     operation, since the aggregations from all
                                     the tservers are partial results.
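
A minimal sketch of that final client-side combine, assuming the partial results have already been read from the scan and decoded into (group label, partial count) pairs. The `PartialCountMerger` class and the "foo"/"bar" labels are hypothetical; in a real client the pairs would come from decoding the entries a scanner returns.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class PartialCountMerger {
    // Sums partial counts that share a group label into one total per label.
    static Map<String, Long> merge(List<Map.Entry<String, Long>> partials) {
        Map<String, Long> totals = new TreeMap<>();
        for (Map.Entry<String, Long> partial : partials) {
            totals.merge(partial.getKey(), partial.getValue(), Long::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        // Pretend three tablets each reported a partial count.
        List<Map.Entry<String, Long>> partials = new ArrayList<>();
        partials.add(new SimpleImmutableEntry<>("foo", 2L));
        partials.add(new SimpleImmutableEntry<>("foo", 3L));
        partials.add(new SimpleImmutableEntry<>("bar", 1L));
        System.out.println(merge(partials));
    }
}
```

Because addition is associative and commutative, the order in which partial results arrive does not affect the totals.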

                                     I believe that CountingIterator is not
                                     meant for user consumption, but I do not
                                     know if it's related to your issue in
                                     trying to use it from the shell. Iterators
                                     set through the shell, in previous versions
                                     of Accumulo, have a requirement to
                                     implement OptionDescriber. Many default
                                     iterators do not implement this, and thus
                                     can't be set in the shell.



                                     On Mon, Jul 14, 2014 at 2:44 PM, Michael Moss
                                     <michael.m...@gmail.com> wrote:

                                         Hi, All.

                                         I'm curious what the best practices are
                                         around persisting complex types/data in
                                         Accumulo (and aggregating on fields within
                                         them).

                                         Let's say I have (row, column family,
                                         column qualifier, value):
                                         "A" "foo" "" MyHugeAvroObject(count=2)
                                         "A" "foo" "" MyHugeAvroObject(count=3)

                                         Let's say MyHugeAvroObject has a field
                                         "Integer count" with the values above.

                                         What is the best way to aggregate on row,
                                         column family, column qualifier by count?
                                         In my above example:
                                         "A" "foo" "" 5

                                         The TypedValueCombiner.typedReduce method
                                         can deserialize any "V", in my case
                                         MyHugeAvroObject, but it needs to return a
                                         value of type "V". What are the best
                                         practices for deeply nested/complex
                                         objects? It's not always straightforward to
                                         map a complex Avro type into Row -> Column
                                         Family -> Column Qualifier.

                                         Rather than using a TypedCombiner, I looked
                                         into using an Aggregator (which appears
                                         deprecated as of 1.4), which appears to let
                                         me return arbitrary values, but despite
                                         running setiter, my aggregator doesn't seem
                                         to do anything.

                                         I also tried looking at implementing a
                                         WrappingIterator, which also appears to
                                         allow me to return arbitrary values (such
                                         as Accumulo's CountingIterator), but I get
                                         cryptic errors when trying to setiter. I'm
                                         on Accumulo 1.6:

                                         root@dev kyt> setiter -t kyt -scan -p 10 -n countingIter -class org.apache.accumulo.core.iterators.system.CountingIterator
                                         2014-07-14 11:12:55,623 [shell.Shell] ERROR: java.lang.IllegalArgumentException: org.apache.accumulo.core.iterators.system.CountingIterator

                                         This is odd because other included
                                         implementations of WrappingIterator seem to
                                         work (perhaps the implementation of
                                         CountingIterator is dated):
                                         root@dev kyt> setiter -t kyt -scan -p 10 -n deletingIterator -class org.apache.accumulo.core.iterators.system.DeletingIterator
                                         The iterator class does not implement OptionDescriber. Consider this for better iterator configuration using this setiter command.
                                         Name for iterator (enter to skip):

                                         All in all, how can I aggregate simple
                                         values, like counters, from rows with
                                         complex Avro objects as Values without
                                         having to add aggregation fields to these
                                         Value objects?

                                         Thanks!

                                         -Mike









