> I keep thinking about inlining this stuff at flush/compaction time and
> appending the sketch to an hfile. After the fact you could read the
> sketches in the tail of the hfiles for some counts on a Region basis but it
> wouldn't be row-based.

There should be an issue for this if there isn't one already (I've heard it
mentioned before). It would be very nice to have. Wasn't the sketch stuff
from Yahoo incubated? ... Yes: https://datasketches.apache.org/ ,
https://incubator.apache.org/clutch/datasketches.html . There's something
in the family to try, so to speak.

The row vs cell distinction is an important one. If you are looking to add
or use something provided by HBase, the view of the data will be
cell-based. That might be what you need, or it might not be. Table-level
statistics (aggregated from region sketches, as Stack suggests) can roll up
either cells or rows, so they could work if that's the granularity you need.
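
To make the roll-up idea concrete, here is a minimal sketch in plain Java.
It is not the DataSketches API and not anything HBase ships: it swaps in a
simple fixed-bin histogram as a stand-in for a real quantiles sketch, and
the class name RowSizeHistogram and the bin bounds are hypothetical. The
point is only that per-region summaries built with the same bins can be
summed into a table-level summary after the fact.

```java
import java.util.Arrays;

/** Fixed-bin histogram of sizes; a hypothetical stand-in for a quantiles sketch. */
final class RowSizeHistogram {
    private final long[] upperBounds; // inclusive upper bound of each bin, ascending
    private final long[] counts;      // one slot per bin, plus a final overflow slot
    private long total;
    private long max;

    RowSizeHistogram(long[] upperBounds) {
        this.upperBounds = upperBounds.clone();
        this.counts = new long[upperBounds.length + 1];
    }

    /** Record one observation (a row size, or a cell size for a cell-based view). */
    void update(long size) {
        int i = 0;
        while (i < upperBounds.length && size > upperBounds[i]) {
            i++;
        }
        counts[i]++;
        total++;
        max = Math.max(max, size);
    }

    /** Fold another region's histogram into this one; bins must match exactly. */
    void merge(RowSizeHistogram other) {
        if (!Arrays.equals(upperBounds, other.upperBounds)) {
            throw new IllegalArgumentException("bin bounds differ");
        }
        for (int i = 0; i < counts.length; i++) {
            counts[i] += other.counts[i];
        }
        total += other.total;
        max = Math.max(max, other.max);
    }

    /** Upper bound of the bin holding the q-quantile (0 < q <= 1), capped at the max seen. */
    long quantileUpperBound(double q) {
        if (total == 0) {
            throw new IllegalStateException("no data");
        }
        long rank = (long) Math.ceil(q * total);
        long seen = 0;
        for (int i = 0; i < counts.length; i++) {
            seen += counts[i];
            if (seen >= rank) {
                return i < upperBounds.length ? Math.min(upperBounds[i], max) : max;
            }
        }
        return max;
    }

    long max() { return max; }

    long count() { return total; }
}
```

You would keep one instance per region, call update() once per row (or per
cell), then merge() the region instances into a single table-level instance
and ask it for quantileUpperBound(0.9) and max(). The answer is only as
fine as the bins, which is why a real quantiles sketch is the better tool.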

If the ask is for row-based statistics for Phoenix, that question is
better asked on dev@phoenix.


On Thu, May 14, 2020 at 9:19 AM Stack <[email protected]> wrote:

> On Wed, May 13, 2020 at 10:38 PM Sukumar Maddineni
> <[email protected]> wrote:
>
> > Hello everyone,
> >
> > Is there any existing tool which we can use to understand the size of the
> > rows in a table.  Like we want to know what is p90, max row size of rows in
> > a given table to understand the usage pattern and see how much room we have
> > before having large rows.
> >
> > I was thinking similar to RowCounter with reducer to consolidate the info.
> >
> >
> I've had some success scanning rows on a per-Region basis dumping a report
> per Region. I was passing the per row Results via something like the below:
>
>    static void processRowResult(Result result, Sketches sketches) {
>      // System.out.println(result.toString());
>      long rowSize = 0;
>      int columnCount = 0;
>      for (Cell cell : result.rawCells()) {
>        rowSize += estimatedSizeOfCell(cell);
>        columnCount += 1;
>      }
>      sketches.rowSizeSketch.update(rowSize);
>      sketches.columnCountSketch.update(columnCount);
>    }
>
> ... where the sketches are variants of
> com.yahoo.sketches.quantiles.*Sketch. The latter are nice in that the
> sketches can be aggregated so you can after-the-fact make table sketches by
> summing all of the Region sketches. I had a 100 quantiles so could do 95%
> or 96%, etc. The bins to use for say data size take a bit of tuning but can
> make a decent guess for first go round and see how you do.
>
> I keep thinking about inlining this stuff at flush/compaction time and
> appending the sketch to an hfile. After the fact you could read the
> sketches in the tail of the hfiles for some counts on a Region basis but it
> wouldn't be row-based. For row-based, you'd have to read Rows (hfiles are
> buckets of Cells, not rows).
>
> S
>
>
>
> >
> > --
> > Sukumar
> >
> > <https://smart.salesforce.com/sig/smaddineni//us_mb/default/link.html>
> >
>


-- 
Best regards,
Andrew

Words like orphans lost among the crosstalk, meaning torn from truth's
decrepit hands
   - A23, Crosstalk
