HBase doesn't care about the cross-HFile "row" concept the way Phoenix does.
As discussed earlier in this thread, during compaction we would call the sketch update function while processing cells in the HFile, and store the result into the HFile trailer. That's it.

For Phoenix, because it cares about rows, something has to collect all cells for a given row across all files in the store, and hand the complete row to the sketch update function, and storing sketches in HFiles no longer makes sense, because the sketch is of data that lives in rows that span multiple HFiles.

So you should take this to dev@phoenix, probably.

On Thu, May 14, 2020 at 10:35 AM Sukumar Maddineni <[email protected]> wrote:

> Hi Stack,
>
> Thanks for that pointer, I am not aware of sketches(one more concept to
> learn :)). I will explore and see if this helps.
>
> Hi Andrew,
>
> Yes, this is needed for a Phoenix table but there are two asks. one is
> from customer side who wants to know the size of their actual rows which
> is equal to the sum of the size of all columns latest version(there might
> extra versions or delete markers which might not be something customer
> interested since they don't read that data) and second ask is from service
> owner point of view where we want to know the size of full row including
> all cells, this is needed for internal operations like backups,
> migrations, growth analysis, stats. If we have something at HBase level
> then coming up with a similar one for Phoenix table seems to be not that
> of a big job(I might be wrong).
>
> Thanks
> Sukumar
>
> On Thu, May 14, 2020 at 10:11 AM Andrew Purtell <[email protected]> wrote:
>
> > I keep thinking about inlining this stuff at flush/compaction time and
> > appending the sketch to an hfile. After the fact you could read the
> > sketches in the tail of the hfiles for some counts on a Region basis but
> > it wouldn't be row-based.
> >
> > There should be an issue for this if not one already (I've heard it
> > mentioned before). It would be a very nice to have. Wasn't the sketch
> > stuff from Yahoo incubated? ... Yes: https://datasketches.apache.org/ ,
> > https://incubator.apache.org/clutch/datasketches.html . There's
> > something in the family to try, so to speak.
> >
> > The row vs cell distinction is an important one. If you are looking to
> > add or use something provided by HBase, the view of the data will be
> > cell based. That might be what you need, it might not be. Table level
> > statistics (aggregated from region sketches as stack suggests) would
> > roll up either cells or rows so could work if that's the granularity
> > you need.
> >
> > If the ask is for row based statistics for Phoenix, this is a question
> > better asked on dev@phoenix.
> >
> > On Thu, May 14, 2020 at 9:19 AM Stack <[email protected]> wrote:
> >
> > > On Wed, May 13, 2020 at 10:38 PM Sukumar Maddineni
> > > <[email protected]> wrote:
> > >
> > > > Hello everyone,
> > > >
> > > > Is there any existing tool which we can use to understand the size
> > > > of the rows in a table. Like we want to know what is p90, max row
> > > > size of rows in a given table to understand the usage pattern and
> > > > see how much room we have before having large rows.
> > > >
> > > > I was thinking similar to RowCounter with reducer to consolidate
> > > > the info.
> > >
> > > I've had some success scanning rows on a per-Region basis dumping a
> > > report per Region. I was passing the per row Results via something
> > > like the below:
> > >
> > > static void processRowResult(Result result, Sketches sketches) {
> > >   // System.out.println(result.toString());
> > >   long rowSize = 0;
> > >   int columnCount = 0;
> > >   for (Cell cell : result.rawCells()) {
> > >     rowSize += estimatedSizeOfCell(cell);
> > >     columnCount += 1;
> > >   }
> > >   sketches.rowSizeSketch.update(rowSize);
> > >   sketches.columnCountSketch.update(columnCount);
> > > }
> > >
> > > ... where the sketches are variants of
> > > com.yahoo.sketches.quantiles.*Sketch. The latter are nice in that the
> > > sketches can be aggregated so you can after-the-fact make table
> > > sketches by summing all of the Region sketches. I had a 100 quantiles
> > > so could do 95% or 96%, etc. The bins to use for say data size take a
> > > bit of tuning but can make a decent guess for first go round and see
> > > how you do.
> > >
> > > I keep thinking about inlining this stuff at flush/compaction time
> > > and appending the sketch to an hfile. After the fact you could read
> > > the sketches in the tail of the hfiles for some counts on a Region
> > > basis but it wouldn't be row-based. For row-based, you'd have to read
> > > Rows (hfiles are buckets of Cells, not rows).
> > >
> > > S
> > >
> > > > --
> > > > Sukumar
> > > >
> > > > <https://smart.salesforce.com/sig/smaddineni//us_mb/default/link.html>
> >
> > --
> > Best regards,
> > Andrew
> >
> > Words like orphans lost among the crosstalk, meaning torn from truth's
> > decrepit hands
> >    - A23, Crosstalk
>
> --
>
> <https://smart.salesforce.com/sig/smaddineni//us_mb/default/link.html>

--
Best regards,
Andrew

Words like orphans lost among the crosstalk, meaning torn from truth's decrepit hands
   - A23, Crosstalk
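
To make the cell vs row distinction in this thread concrete, here is a minimal, self-contained sketch in plain Java. Everything in it is hypothetical: `SimpleCell`, `totalCellBytes`, `rowSizes`, and `sampleStore` are stand-ins invented for illustration, not HBase or Phoenix classes, and versions and delete markers are not modeled. It only shows why a per-file pass can summarize cells but cannot, on its own, know the full size of a row whose cells live in more than one file.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: hypothetical stand-ins, not HBase or Phoenix types.
class RowVsCellSizes {

    // A simplified cell: the row key it belongs to and its size in bytes.
    static final class SimpleCell {
        final String row;
        final long sizeBytes;

        SimpleCell(String row, long sizeBytes) {
            this.row = row;
            this.sizeBytes = sizeBytes;
        }
    }

    // Cell-based view: a single pass over one file, with no knowledge of
    // any other file. This is the kind of figure a per-file pass could produce.
    static long totalCellBytes(List<SimpleCell> hfile) {
        long total = 0;
        for (SimpleCell cell : hfile) {
            total += cell.sizeBytes;
        }
        return total;
    }

    // Row-based view: cells for a row are gathered from every file passed in
    // before the row's size is known.
    static Map<String, Long> rowSizes(List<List<SimpleCell>> files) {
        Map<String, Long> sizes = new TreeMap<>();
        for (List<SimpleCell> hfile : files) {
            for (SimpleCell cell : hfile) {
                sizes.merge(cell.row, cell.sizeBytes, Long::sum);
            }
        }
        return sizes;
    }

    // Two made-up files; "row1" has cells in both of them.
    static List<List<SimpleCell>> sampleStore() {
        List<SimpleCell> fileA = List.of(new SimpleCell("row1", 10), new SimpleCell("row2", 5));
        List<SimpleCell> fileB = List.of(new SimpleCell("row1", 20));
        return List.of(fileA, fileB);
    }

    public static void main(String[] args) {
        List<List<SimpleCell>> store = sampleStore();
        System.out.println("cell bytes in file A: " + totalCellBytes(store.get(0)));
        System.out.println("row sizes seen from file A alone: " + rowSizes(List.of(store.get(0))));
        System.out.println("row sizes across the whole store: " + rowSizes(store));
    }
}
```

Looking at the first file alone undercounts `row1`, because the rest of that row sits in the second file. A row-size sketch would need to be fed from something like `rowSizes` over the whole store, whereas a cell-level sketch could be fed from each file independently and merged afterwards.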
