On Thu, May 14, 2020 at 12:23 PM Andrew Purtell <[email protected]> wrote:

> HBase doesn't care about the cross HFile "row" concept in the same way that
> Phoenix does.
>
> As discussed earlier in this thread, during compaction we would call the
> sketch update function while processing cells in the HFile, and store the
> result into the HFile trailer. That's it.
>
>
Esteban opened this a good while back:
https://issues.apache.org/jira/browse/HBASE-17756. I like his idea of
dumping the stats with a pretty printer, per hfile. Let me give it a go...
S
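To make the per-Region aggregation idea concrete, here is a minimal, self-contained illustration that computes exact quantiles over merged per-Region row-size samples. This is what the streaming quantile sketches (e.g. the DataSketches DoublesSketch mentioned below) approximate with bounded memory; the class and method names here (RowSizeQuantiles, mergeRegions) are illustrative, not part of any HBase or DataSketches API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustration only: exact table-level quantiles from per-Region row-size
// samples. A streaming sketch gives the same answers approximately without
// holding all samples in memory.
public class RowSizeQuantiles {

    // Merge per-Region samples into one table-level list, mirroring how
    // Region sketches can be merged into a table sketch.
    static List<Long> mergeRegions(List<List<Long>> regionSamples) {
        List<Long> all = new ArrayList<>();
        for (List<Long> region : regionSamples) {
            all.addAll(region);
        }
        return all;
    }

    // Exact quantile: value at rank ceil(q * n) - 1 of the sorted samples.
    static long quantile(List<Long> samples, double q) {
        List<Long> sorted = new ArrayList<>(samples);
        Collections.sort(sorted);
        int idx = (int) Math.ceil(q * sorted.size()) - 1;
        return sorted.get(Math.max(idx, 0));
    }

    public static void main(String[] args) {
        // Two hypothetical Regions' per-row sizes in bytes.
        List<List<Long>> regions = List.of(
            List.of(120L, 340L, 560L, 780L, 900L),
            List.of(150L, 410L, 620L, 830L, 10_000L));
        List<Long> table = mergeRegions(regions);
        System.out.println("p90=" + quantile(table, 0.90));
        System.out.println("max=" + quantile(table, 1.0));
    }
}
```

The one outlier row (10,000 bytes) shows why max alongside p90 is useful: p90 stays near the typical row size while max flags the room left before large rows become a problem.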




> For Phoenix, because it cares about rows, something has to collect all
> cells for a given row across all files in the store and hand the complete
> row to the sketch update function. Storing sketches in HFiles no longer
> makes sense then, because the sketch is of data that lives in rows spanning
> multiple HFiles. So you should take this to dev@phoenix, probably.
>
>
>
>
>
>
> On Thu, May 14, 2020 at 10:35 AM Sukumar Maddineni
> <[email protected]> wrote:
>
> > Hi Stack,
> >
> > Thanks for that pointer, I am not aware of sketches(one more concept to
> > learn :)). I will explore and see if this helps.
> >
> > Hi Andrew,
> >
> > Yes, this is needed for a Phoenix table, but there are two asks. The
> > first is from the customer side: they want to know the size of their
> > actual rows, i.e. the sum of the sizes of the latest version of all
> > columns (there might be extra versions or delete markers, which customers
> > are probably not interested in since they don't read that data). The
> > second ask is from the service owner's point of view: we want to know the
> > size of the full row including all cells. This is needed for internal
> > operations like backups, migrations, growth analysis, and stats. If we
> > have something at the HBase level, then coming up with a similar one for
> > a Phoenix table seems like not that big a job (I might be wrong).
> >
> >
> > Thanks
> > Sukumar
> >
> >
> >
> > On Thu, May 14, 2020 at 10:11 AM Andrew Purtell <[email protected]>
> > wrote:
> >
> > > > I keep thinking about inlining this stuff at flush/compaction time and
> > > appending the sketch to an hfile. After the fact you could read the
> > > sketches in the tail of the hfiles for some counts on a Region basis but
> > > it wouldn't be row-based.
> > >
> > > There should be an issue for this if not one already (I've heard it
> > > mentioned before). It would be a very nice to have. Wasn't the sketch
> > > stuff from Yahoo incubated? ... Yes: https://datasketches.apache.org/ ,
> > > https://incubator.apache.org/clutch/datasketches.html . There's
> > > something in the family to try, so to speak.
> > >
> > > The row vs cell distinction is an important one. If you are looking to
> > > add or use something provided by HBase, the view of the data will be
> > > cell based. That might be what you need, it might not be. Table level
> > > statistics (aggregated from Region sketches, as Stack suggests) would
> > > roll up either cells or rows, so could work if that's the granularity
> > > you need.
> > >
> > > If the ask is for row based statistics for Phoenix, this is a question
> > > better asked on dev@phoenix.
> > >
> > >
> > > On Thu, May 14, 2020 at 9:19 AM Stack <[email protected]> wrote:
> > >
> > > > On Wed, May 13, 2020 at 10:38 PM Sukumar Maddineni
> > > > <[email protected]> wrote:
> > > >
> > > > > Hello everyone,
> > > > >
> > > > > Is there any existing tool we can use to understand the size of the
> > > > > rows in a table? For example, we want to know the p90 and max row
> > > > > size in a given table, to understand the usage pattern and see how
> > > > > much room we have before we start having large rows.
> > > > >
> > > > > I was thinking of something similar to RowCounter, with a reducer to
> > > > > consolidate the info.
> > > > >
> > > > >
> > > > I've had some success scanning rows on a per-Region basis, dumping a
> > > > report per Region. I was passing the per-row Results through something
> > > > like the below:
> > > >
> > > >    static void processRowResult(Result result, Sketches sketches) {
> > > >      long rowSize = 0;
> > > >      int columnCount = 0;
> > > >      for (Cell cell : result.rawCells()) {
> > > >        // Estimate the size of each cell and count columns in the row.
> > > >        rowSize += estimatedSizeOfCell(cell);
> > > >        columnCount += 1;
> > > >      }
> > > >      // Feed the per-row totals into streaming quantile sketches.
> > > >      sketches.rowSizeSketch.update(rowSize);
> > > >      sketches.columnCountSketch.update(columnCount);
> > > >    }
> > > >
> > > > ... where the sketches are variants of
> > > > com.yahoo.sketches.quantiles.*Sketch. The latter are nice in that the
> > > > sketches can be aggregated, so you can after-the-fact make table
> > > > sketches by summing all of the Region sketches. I used 100 quantiles,
> > > > so I could do 95% or 96%, etc. The bins to use for, say, data size take
> > > > a bit of tuning, but you can make a decent guess for the first
> > > > go-round and see how you do.
> > > >
> > > > I keep thinking about inlining this stuff at flush/compaction time and
> > > > appending the sketch to an hfile. After the fact you could read the
> > > > sketches in the tail of the hfiles for some counts on a Region basis,
> > > > but it wouldn't be row-based. For row-based, you'd have to read Rows
> > > > (hfiles are buckets of Cells, not rows).
> > > >
> > > > S
> > > >
> > > >
> > > >
> > > > >
> > > > > --
> > > > > Sukumar
> > > > >
> > > > >
> > > >
> > >
> > >
> > > --
> > > Best regards,
> > > Andrew
> > >
> > > Words like orphans lost among the crosstalk, meaning torn from truth's
> > > decrepit hands
> > >    - A23, Crosstalk
> > >
> >
> >
> > --
> >
> >
>
>
> --
> Best regards,
> Andrew
>
> Words like orphans lost among the crosstalk, meaning torn from truth's
> decrepit hands
>    - A23, Crosstalk
>