On Thu, May 14, 2020 at 12:23 PM Andrew Purtell <[email protected]> wrote:
> HBase doesn't care about the cross-HFile "row" concept in the same way
> that Phoenix does.
>
> As discussed earlier in this thread, during compaction we would call the
> sketch update function while processing cells in the HFile, and store the
> result into the HFile trailer. That's it.
>

Esteban opened this a good while back:
https://issues.apache.org/jira/browse/HBASE-17756. I like his idea of
dumping the stats with a pretty-printer per hfile. Let me give it a go...

S

> For Phoenix, because it cares about rows, something has to collect all
> cells for a given row across all files in the store and hand the complete
> row to the sketch update function, and storing sketches in HFiles no
> longer makes sense, because the sketch is of data that lives in rows that
> span multiple HFiles. So you should take this to dev@phoenix, probably.
>
>
> On Thu, May 14, 2020 at 10:35 AM Sukumar Maddineni
> <[email protected]> wrote:
>
> > Hi Stack,
> >
> > Thanks for that pointer; I was not aware of sketches (one more concept
> > to learn :)). I will explore them and see if this helps.
> >
> > Hi Andrew,
> >
> > Yes, this is needed for a Phoenix table, but there are two asks. One is
> > from the customer side: they want to know the size of their actual
> > rows, which is equal to the sum of the sizes of the latest versions of
> > all columns (there might be extra versions or delete markers, which
> > might not be something the customer is interested in, since they don't
> > read that data). The second ask is from the service-owner point of
> > view: we want to know the size of the full row including all cells.
> > This is needed for internal operations like backups, migrations, growth
> > analysis, and stats. If we have something at the HBase level, then
> > coming up with a similar one for a Phoenix table seems like not that
> > big of a job (I might be wrong).
> >
> > Thanks,
> > Sukumar
> >
> >
> > On Thu, May 14, 2020 at 10:11 AM Andrew Purtell <[email protected]>
> > wrote:
> >
> > > I keep thinking about inlining this stuff at flush/compaction time
> > > and appending the sketch to an hfile. After the fact you could read
> > > the sketches in the tail of the hfiles for some counts on a Region
> > > basis, but it wouldn't be row-based.
> > >
> > > There should be an issue for this if not one already (I've heard it
> > > mentioned before). It would be a very nice-to-have. Wasn't the sketch
> > > stuff from Yahoo incubated? ... Yes: https://datasketches.apache.org/ ,
> > > https://incubator.apache.org/clutch/datasketches.html . There's
> > > something in the family to try, so to speak.
> > >
> > > The row vs. cell distinction is an important one. If you are looking
> > > to add or use something provided by HBase, the view of the data will
> > > be cell-based. That might be what you need; it might not be.
> > > Table-level statistics (aggregated from Region sketches as Stack
> > > suggests) would roll up either cells or rows, so they could work if
> > > that's the granularity you need.
> > >
> > > If the ask is for row-based statistics for Phoenix, this is a
> > > question better asked on dev@phoenix.
> > >
> > >
> > > On Thu, May 14, 2020 at 9:19 AM Stack <[email protected]> wrote:
> > >
> > > > On Wed, May 13, 2020 at 10:38 PM Sukumar Maddineni
> > > > <[email protected]> wrote:
> > > >
> > > > > Hello everyone,
> > > > >
> > > > > Is there any existing tool which we can use to understand the
> > > > > size of the rows in a table? We want to know the p90 and max row
> > > > > size of rows in a given table, to understand the usage pattern
> > > > > and see how much room we have before having large rows.
> > > > >
> > > > > I was thinking of something similar to RowCounter, with a
> > > > > reducer to consolidate the info.
> > > >
> > > > I've had some success scanning rows on a per-Region basis, dumping
> > > > a report per Region. I was passing the per-row Results via
> > > > something like the below:
> > > >
> > > > static void processRowResult(Result result, Sketches sketches) {
> > > >   long rowSize = 0;
> > > >   int columnCount = 0;
> > > >   for (Cell cell : result.rawCells()) {
> > > >     rowSize += estimatedSizeOfCell(cell);
> > > >     columnCount += 1;
> > > >   }
> > > >   sketches.rowSizeSketch.update(rowSize);
> > > >   sketches.columnCountSketch.update(columnCount);
> > > > }
> > > >
> > > > ... where the sketches are variants of
> > > > com.yahoo.sketches.quantiles.*Sketch. The latter are nice in that
> > > > the sketches can be aggregated, so you can after the fact make
> > > > table sketches by summing all of the Region sketches. I had 100
> > > > quantiles, so I could do 95% or 96%, etc. The bins to use for,
> > > > say, data size take a bit of tuning, but you can make a decent
> > > > guess for the first go-round and see how you do.
> > > >
> > > > I keep thinking about inlining this stuff at flush/compaction time
> > > > and appending the sketch to an hfile. After the fact you could
> > > > read the sketches in the tail of the hfiles for some counts on a
> > > > Region basis, but it wouldn't be row-based. For row-based, you'd
> > > > have to read Rows (hfiles are buckets of Cells, not rows).
> > > >
> > > > S
> > > >
> > > > > --
> > > > > Sukumar
> > >
> > > --
> > > Best regards,
> > > Andrew
> > >
> > > Words like orphans lost among the crosstalk, meaning torn from
> > > truth's decrepit hands
> > >    - A23, Crosstalk
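
[Editor's note: for anyone wanting to experiment with the merge-then-query
pattern Stack describes above (per-Region sketches summed after the fact
into a table-level view), here is a minimal, self-contained sketch of the
idea. It deliberately substitutes exact per-Region lists of row sizes for
the real mergeable DataSketches quantile sketches, so `RegionRowSizeStats`,
`merge`, and `quantile` are illustrative names only, not HBase or
DataSketches API.]

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative stand-in for a per-Region quantile sketch. A real
// implementation would use a mergeable DataSketches quantiles sketch;
// here we keep exact lists so merge-then-query is easy to follow.
public class RegionRowSizeStats {
  private final List<Long> rowSizes = new ArrayList<>();

  // Called once per row while scanning a Region, analogous to
  // sketch.update(rowSize) in the snippet above.
  void update(long rowSize) {
    rowSizes.add(rowSize);
  }

  // Folds another Region's stats into this one -- the "summing all of
  // the Region sketches" step that yields a table-level view.
  void merge(RegionRowSizeStats other) {
    rowSizes.addAll(other.rowSizes);
  }

  // Exact quantile over everything seen so far; q = 0.90 gives the p90.
  long quantile(double q) {
    List<Long> sorted = new ArrayList<>(rowSizes);
    Collections.sort(sorted);
    int idx = (int) Math.ceil(q * sorted.size()) - 1;
    return sorted.get(Math.max(idx, 0));
  }

  public static void main(String[] args) {
    RegionRowSizeStats regionA = new RegionRowSizeStats();
    RegionRowSizeStats regionB = new RegionRowSizeStats();
    for (long size = 1; size <= 5; size++) regionA.update(size);
    for (long size = 6; size <= 10; size++) regionB.update(size);
    regionA.merge(regionB); // table-level view of both Regions
    System.out.println("p90=" + regionA.quantile(0.90)); // p90=9
    System.out.println("max=" + regionA.quantile(1.0));  // max=10
  }
}
```

The exact-list approach obviously does not scale the way a sketch does;
the point of the real sketches is that they bound memory while still
supporting the same update/merge/quantile operations shown here.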
