Re: -warmup

2023-08-30 Thread Werner LEMBERG

> Something else: I think that the 'TOTAL' line doesn't make sense
> right now.  Please separate this line slightly from the rest of the
> table and print the *cumulated timing* (in 's', not 'µs') of all
> tests, something like
> 
>   Total duration for all tests: 25.3s
> 
> and
> 
>   Total duration of tests for 'Roboto_subset.ttf': 3.4s

I've just noticed other minor issues with the HTML page:

* The links for `Baseline` and `Benchmark` to the `.txt` files are
  absolute and thus not portable.  They must be relative.

* I wouldn't use '**' – this looks like a footnote, and people will
  start searching for where it belongs.  Instead, please use a simple
  itemization.

* I suggest renaming column 'N' to 'Iterations' or '#Iterations' (or
  something similar).

* The sentence about '(X | Y)' is cryptic.  Maybe simply start with

If two values are given in the 'N' column, ...

  and please add an explanation *why* there are two values at all.


Werner


Re: -warmup

2023-08-29 Thread Werner LEMBERG

>> I still think that for such cases the number of iterations of the
>> affected tests should be increased to get more precise values.
>
> the times are for a single iteration (chunk median / chunk size).

Yes, but the number of iterations is the same regardless of whether a
test takes 10µs or 1000µs – I suspect that it is necessary to either
increase the number of iterations for the former case or to do bulk
tests.  With 'bulk test' I mean that you don't time single iterations
but do timings for 10 iterations in a group, say, thus avoiding issues
with the granularity of the OS timing functions.
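A minimal sketch of this 'bulk test' idea, assuming a hypothetical
stand-in operation `bench_op` and a chunk size of 10 (neither is taken
from `ftbench.c`):

```
#include <stdio.h>
#include <time.h>

#define CHUNK  10   /* iterations timed as one group */

/* dummy stand-in for the benchmarked FreeType call */
static volatile int  sink;

static void
bench_op( void )
{
  sink++;
}

/* current time in microseconds */
static double
now_us( void )
{
  struct timespec  ts;

  clock_gettime( CLOCK_MONOTONIC, &ts );
  return ts.tv_sec * 1e6 + ts.tv_nsec / 1e3;
}

int
main( void )
{
  double  samples[100];
  int     g, i;

  /* one timer pair per group of CHUNK iterations, so every sample */
  /* covers a span well above the granularity of the OS timer      */
  for ( g = 0; g < 100; g++ )
  {
    double  start = now_us();

    for ( i = 0; i < CHUNK; i++ )
      bench_op();

    samples[g] = ( now_us() - start ) / CHUNK;   /* per-iteration time */
  }

  printf( "first sample: %g us per iteration\n", samples[0] );
  return 0;
}
```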

I refuse to believe that we have to live with timing differences of
more than 50%.

BTW, have you checked whether replacing `CLOCK_REALTIME` with
`CLOCK_MONOTONIC` gives better results?

What platform do you actually develop on?  What #if clause in
function `get_time` of `ftbench.c` is run on your computer?
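For reference, a sketch of what the monotonic-clock variant could look
like – this is not the actual `get_time` code from `ftbench.c`, only an
illustration of the substitution being asked about:

```
#include <time.h>

/* CLOCK_MONOTONIC cannot jump backwards when the system time is */
/* adjusted (NTP, manual changes), unlike CLOCK_REALTIME.        */
static double
get_time_us( void )
{
#ifdef CLOCK_MONOTONIC
  struct timespec  ts;

  clock_gettime( CLOCK_MONOTONIC, &ts );
  return ts.tv_sec * 1e6 + ts.tv_nsec / 1e3;
#else
  /* fallback for platforms without a monotonic clock_gettime */
  return clock() * 1e6 / (double)CLOCKS_PER_SEC;
#endif
}
```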


Werner


Re: -warmup

2023-08-29 Thread Ahmet Göksu
> I still think that for such cases the number of iterations of the affected 
> tests should be increased to get more precise values.
the times are for a single iteration (chunk median / chunk size).
> Please separate this line slightly from the rest of the table
> and print the *cumulated timing* (in 's', not 'µs') of all tests,
> something like
i will asap!

Best,
Goksu
goksu.in
On 29 Aug 2023 22:09 +0300, Werner LEMBERG wrote:
>
> > here are the results with chris’ suggestion. (thanks chris)
>
> Much better, thanks!
>
> > still a bit of noise on only load and load_advances. are results
> > acceptable?
>
> As far as I can see, the biggest differences occur if the 'Baseline'
> and 'Benchmark' columns contain very small values. I still think that
> for such cases the number of iterations of the affected tests should
> be increased to get more precise values. Please try that.
>
> Something else: I think that the 'TOTAL' line doesn't make sense right
> now. Please separate this line slightly from the rest of the table
> and print the *cumulated timing* (in 's', not 'µs') of all tests,
> something like
>
> Total duration for all tests: 25.3s
>
> and
>
> Total duration of tests for 'Roboto_subset.ttf': 3.4s
>
>
> Werner


Re: -warmup

2023-08-29 Thread Werner LEMBERG

> here are the results with chris’ suggestion. (thanks chris)

Much better, thanks!

> still a bit of noise on only load and load_advances.  are results
> acceptable?

As far as I can see, the biggest differences occur if the 'Baseline'
and 'Benchmark' columns contain very small values.  I still think that
for such cases the number of iterations of the affected tests should
be increased to get more precise values.  Please try that.

Something else: I think that the 'TOTAL' line doesn't make sense right
now.  Please separate this line slightly from the rest of the table
and print the *cumulated timing* (in 's', not 'µs') of all tests,
something like

  Total duration for all tests: 25.3s

and

  Total duration of tests for 'Roboto_subset.ttf': 3.4s


 Werner


Re: -warmup

2023-08-29 Thread Ahmet Göksu
hi,
here are the results with chris’ suggestion. (thanks chris)
i will check hyperfine.
still a bit of noise on only load and load_advances.
are results acceptable?

Best,
Goksu
goksu.in
On 28 Aug 2023 21:19 +0300, Werner LEMBERG wrote:
>
> code



Freetype Benchmark Results
Warning: Baseline and Benchmark have the same commit ID!
Info

Info         Baseline                    Benchmark
Parameters   -c 750 -w 50                -c 750 -w 50
Commit ID    f3dfede6                    f3dfede6
Commit Date  2023-08-18 17:42:53 +0300   2023-08-18 17:42:53 +0300
Branch       GSoC-2023-Ahmet             GSoC-2023-Ahmet
*  Average time for single iteration.  Smaller values are better.
** An N count in (x | y) format shows the baseline and benchmark N counts
   separately when they differ.

Total Results

Test                      N                  Baseline (µs)  Benchmark (µs)  Difference (%)
Load                      375000             3535.9         3622.3          -2.4
Load_Advances (Normal)    375000             3057.8         3133.8          -2.5
Load_Advances (Fast)      375000             10.9           12.5            -14.7
Load_Advances (Unscaled)  375000             10.0           10.6            -6.0
Render                    375000             4772.2         4796.7          -0.5
Get_Glyph                 375000             3418.7         3415.2          0.1
Get_Char_Index            352500             9.1            9.1             0.0
Iterate CMap              3750               6.5            6.5             0.0
New_Face                  3750               241.5          239.9           0.7
Embolden                  375000             3798.9         3780.3          0.5
Stroke                    292750 | 288000    18049.3        18067.9         -0.1
Get_BBox                  375000             2826.3         2764.2          2.2
Get_CBox                  375000             3473.4         3421.9          1.5
New_Face & load glyph(s)  375000             555.6          585.6           -5.4
TOTAL                     4402750 | 4398000  43766.1        43866.5         -0.2

Results for Roboto_subset.ttf

Test                      N        * Baseline (µs)  * Benchmark (µs)  Difference (%)
Load                      9        583.3            575.1             1.4
Load_Advances (Normal)    9        487.7            503.3             -3.2
Load_Advances (Fast)      9        2.6              4.1               -57.7
Load_Advances (Unscaled)  9        2.4              2.4               0.0
Render                    9        934.1            941.8             -0.8
Get_Glyph                 9        582.7            597.7             -2.6
Get_Char_Index            70500    1.8              1.8               0.0
Iterate CMap              750      1.3              1.3               0.0
New_Face                  750      42.5             41.4              2.6
Embolden                  9        642.9            655.9             -2.0
Stroke                    6        4008.7           4043.7            -0.9
Get_BBox                  9        167.6            139.3             16.9
Get_CBox                  9        592.1            602.3             -1.7
New_Face & load glyph(s)  9        105.0            105.0             0.0
TOTAL                     2064000  8154.7           8215.1            0.7

Results for Arial_subset.ttf

Test                      N        * Baseline (µs)  * Benchmark (µs)  Difference (%)
Load                      71250    812.8            934.1             -14.9
Load_Advances (Normal)    71250    733.1            793.3             -8.2
Load_Advances (Fast)      71250    2.0              2.1               -5.0
Load_Advances (Unscaled)  71250    1.9              1.9               0.0
Render                    71250    1083.0           1067.4            1.4
Get_Glyph                 71250    791.0            789.8             0.2
Get_Char_Index            70500    1.8              1.8               0.0
Iterate CMap              750      1.3              1.3               0.0
New_Face                  750      51.8             50.0              3.5
Embolden                  71250    874.9            855.0             2.3
Stroke                    57000    3480.9           3439.5            1.2
Get_BBox                  71250    761.9            725.2             4.8
Get_CBox                  71250    815.3            787.5             3.4
New_Face & load glyph(s)  71250    110.7            109.3             1.3
TOTAL                     1683000  9522.4           9558.2            0.4

Results for TimesNewRoman_subset.ttf

Test                      N        * Baseline (µs)  * Benchmark (µs)  Difference (%)
Load                      71250    973.9            944.3             3.0
Load_Advances (Normal)    71250    876.2            859.0             2.0
Load_Advances (Fast)      71250    2.1              2.1               0.0
Load_Advances (Unscaled)  71250    1.9              1.9               0.0
Render                    71250    1202.7           1191.1            1.0
Get_Glyph                 71250    945.6            922.5             2.4
Get_Char_Index            70500    1.8              1.8               0.0
Iterate CMap              750      1.3              1.3               0.0
New_Face                  750      53.6             54.1              -0.9
Embolden                  71250    1059.5           1037.2            2.1
Stroke                    42750    4599.4           4579.5            0.4
Get_BBox                  71250    895.5            887.5             0.9
Get_CBox                  71250    966.3            917.1             5.1
New_Face & load glyph(s)  71250    135.2            168.5             -24.6
TOTAL                     1654500  11715.0          11567.9           -1.3

Results for Tahoma_subset.ttf

Test                      N        * Baseline (µs)  * Benchmark (µs)  Difference (%)
Load                      71250    572.0            570.3             0.3
Load_Advances (Normal)    71250    469.6            525.8             -12.0
Load_Advances (Fast)      71250    2.1              2.1               0.0
Load_Advances (Unscaled)  71250    1.9              2.5               -31.6
Render                    71250    785.6            801.5             -2.0
Get_Glyph                 71250    560.0            564.4             -0.8
Get_Char_Index            70500    1.8              1.8               0.0
Iterate CMap              750      1.3              1.3               0.0
New_Face                  750      47.1             47.1              0.0

Re: -warmup

2023-08-28 Thread Werner LEMBERG

>> Should I proceed to detect outliers?  Since we do not get the same
>> error rate consistently, I think we will not find the target we
>> expected by outliers.
> 
> Why do you think so?  Please explain your reasoning.  Just remember
> that backup processes (like cleaning up the hard disk, running some
> cron jobs, etc.) can pop up anytime, thus influencing the result.
> Such spontaneous events have to be eliminated.
> 
> Have you actually tried something along the method I suggested?

BTW, here is another repository that provides a framework for
benchmarks:

  https://github.com/sharkdp/hyperfine

It looks quite nice, AFAICS – maybe you can check its code for
identifying outliers and the like.


Werner


Re: -warmup

2023-08-21 Thread Werner LEMBERG

Ahmet,


> I have edited the code aligning with Hin-Tak’s suggestion.  Here
> are the two results pages, also pushed to gitlab.

Thanks.  It seems to me we are getting nearer.  However, there are
still large differences.

* Chris mentioned a potential problem with `clock_gettime` in the
  code of `ftbench.c`.  Please have a look.

* As mentioned a few times already in previous e-mails I think we need
  some code to increase the run time for individual tests.  For
  example, the line

  ```
  Load_Advances (Fast)   47500   202   284   -40.6
  ```

  indicates that 47500 iterations only took 202µs vs. 284µs – due to
  the 'CPU noise' I think this interval is far too short to be
  meaningful.

  For example, if the cumulative time for test X is less than a
  certain threshold, increase N so that the cumulative time comes very
  near to the threshold.  This value, as determined by the 'baseline'
  run, should be stored in a configuration file so that the 'benchmark'
  stage can extract this information and use exactly the same N value
  for test X.

  Please work on that; a sketch of one possible shape follows after
  this list.

* It would be great if you could use a statistics program of your
  choice and prepare some diagrams of the most problematic cases that
  show the actual timing distributions graphically.  Right now, we
  only see the final cumulative value; however, it would be most
  interesting to see more details of the timing slots.

  For example, I can imagine that you add some `printf` calls
  (printing to a buffer, which gets dumped to stdout or whatever
  *after* the tests); then GNU plot or something similar can generate
  diagrams.
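Regarding the threshold-driven calibration suggested in the second item
above, here is a hypothetical sketch; the threshold value, the helper
names, and the one-line 'test-name N' file format are all made up for
illustration (test keys are assumed to contain no spaces), and none of
this is existing ftbench.c code:

```
#include <stdio.h>
#include <string.h>

#define TARGET_US  100000.0   /* desired cumulative time per test (0.1s) */

/* baseline stage: scale N so the cumulative time reaches the target */
static long
calibrate_iterations( double  probe_cumulative_us, long  probe_n )
{
  double  per_iter = probe_cumulative_us / probe_n;

  if ( probe_cumulative_us >= TARGET_US || per_iter <= 0.0 )
    return probe_n;                        /* already long enough       */

  return (long)( TARGET_US / per_iter );   /* grow N towards the target */
}

/* baseline stage: record N for test `name` in a simple text file */
static void
save_iterations( FILE*  cfg, const char*  name, long  n )
{
  fprintf( cfg, "%s %ld\n", name, n );
}

/* benchmark stage: reuse exactly the N recorded by the baseline */
static long
load_iterations( FILE*  cfg, const char*  name )
{
  char  test[64];
  long  n;

  rewind( cfg );
  while ( fscanf( cfg, "%63s %ld", test, &n ) == 2 )
    if ( strcmp( test, name ) == 0 )
      return n;

  return -1;   /* not found: caller falls back to the default N */
}
```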


Werner


Re: -warmup

2023-08-21 Thread Werner LEMBERG


>> To summarize: Benchmark comparisons only work if there is a sound
>> mathematical foundation to reduce the noise.
> 
> I am probably not qualified, but I am following the discussion for
> some time.  And I think there is a problem with the benchmarking
> itself.  If I understand correctly the nice tables show the same
> code on the same machine so 40% difference or so is not ok.
> 
> I had a quick look at ftbench.c and I have the impression that the
> timer is read with clock_gettime twice for every single iteration.
> I had expected to do the N iterations with a single clock_gettime
> before and after the N iterations.  If the benchmarked code is short,
> this will accumulate errors that cannot be removed afterwards.  But I
> may be wrong...

Alexei?


Werner



Re: -warmup

2023-08-18 Thread Ahmet Göksu
Hi,
I have edited the code aligning with Hin-Tak’s suggestion. Here are the two
results pages, also pushed to gitlab.

Best,
Goksu
goksu.in
On 18 Aug 2023 14:02 +0300, Werner LEMBERG wrote:
> > > What happens if you use, say, `-c 10', just running the
> > > `Get_Char_Index` test? Are the percental timing differences then
> > > still that large?
> > Actually Get_Char_Index, on the three pages I have sent in the
> > prev. mail, is higher than 6% only 4 times out of 15 total. (which is
> > seen on other tests as well).
>
> Well, the thing is that IMHO the difference should be *much* smaller –
> your HTML pages show the execution of identical code on an identical
> machine, right?
>
> > about outliers, i split every test into chunks of size 100. Made IQR
> > calculations and calculated the average time on valid chunks. you can
> > find the result in the attachment, also pushed to gitlab.
>
> Thanks. Hin-Tak gave additional suggestions how to possibly improve
> the removal of outliers.
>
> > also, since statistics and benchmarking are sciences in themselves, i
> > am struggling a bit while approaching the problem, which also feels
> > like it is out of the gsoc project scope.
>
> Indeed, the focus lately shifted from a representational aspect to a
> more thorough approach how to handle benchmarks in general. You are
> done with the first part, more or less, and it looks fine. The
> latter, however, is definitely part of the GSoC project, too, and I'm
> surprised that you think this might not be so: What are benchmark
> timings good for if the returned values are completely meaningless?
>
> In most cases, a small performance optimization in FreeType might
> yield, say, an improvement of 1%. Right now, such a change would not
> be detectable at all if using the framework you are working on – it
> would be completely hidden by noise.
>
> To summarize: Benchmark comparisons only work if there is a sound
> mathematical foundation to reduce the noise. I don't ask you to
> reinvent the wheel, but please do some more internet research and
> check existing code how to tackle such problems. I'm 100% sure that
> such code already exists (for example, the Google benchmark stuff
> mentioned in a previous e-mail, scientific papers on arXiv, etc.,
> etc.) and can be easily used, adapted, and simplified for our
> purposes.
>
>
> Werner



Freetype Benchmark Results
Warning: Baseline and Benchmark have the same commit ID!
Info

Info         Baseline                    Benchmark
Parameters   -c 550 -w 50                -c 550 -w 50
Commit ID    35531481                    35531481
Commit Date  2023-08-18 02:04:38 +0300   2023-08-18 02:04:38 +0300
Branch       GSoC-2023-Ahmet             GSoC-2023-Ahmet

*  Average time for all iterations.  Smaller values are better.
** An N count in (x | y) format shows the baseline and benchmark N counts
   separately when they differ.

Total Results

TestNBaseline (µs)Benchmark (µs)Difference (%)
Load25178180190899-7.1
Load_Advances (Normal)251616541594301.4
Load_Advances (Fast)2510471130-7.9
Load_Advances (Unscaled)259961006-1.0
Render252784102766440.6
Get_Glyph252135642093142.0
Get_Char_Index23500010129921.9
Iterate CMap2500873908-4.0
New_Face25001286213004-1.1
Embolden252290422261561.3
Stroke259561529556160.1
Get_BBox251809221771082.1
Get_CBox252118552114930.2
New_Face & load glyph(s)2529090287781.1
TOTAL299245565824524790.1

Results for Roboto_subset.ttf

TestN* Baseline (µs)* Benchmark (µs)Difference (%)
Load62996431568-5.3
Load_Advances (Normal)629942273428.7
Load_Advances (Fast)62332320.6
Load_Advances (Unscaled)62202200.0
Render65317953842-1.2
Get_Glyph63882039532-1.8
Get_Char_Index47000197198-0.3
Iterate CMap500185194-4.6
New_Face500230822671.8
Embolden64110941912-2.0
Stroke6213674213932-0.1
Get_BBox617216160356.9
Get_CBox63897640102-2.9
New_Face & load glyph(s)6556455540.2
TOTAL14160004715874729300.3

Results for Arial_subset.ttf

TestN* Baseline (µs)* Benchmark (µs)Difference (%)
Load475003783443165-14.1
Load_Advances (Normal)4750037650352156.5
Load_Advances (Fast)47500202284-40.6
Load_Advances (Unscaled)47500192198-3.1
Render47500   

Re: -warmup

2023-08-18 Thread chris
On Fri, 18 Aug 2023 11:02:49 + (UTC), Werner LEMBERG wrote:
> To summarize: Benchmark comparisons only work if there is a sound
> mathematical foundation to reduce the noise.

I am probably not qualified, but I am following the discussion for some 
time. And I think there is a problem with the benchmarking itself. If I 
understand correctly the nice tables show the same code on the same 
machine so 40% difference or so is not ok.

I had a quick look at ftbench.c and I have the impression that the
timer is read with clock_gettime twice for every single iteration. I
had expected to do the N iterations with a single clock_gettime before
and after the N iterations. If the benchmarked code is short, this will
accumulate errors that cannot be removed afterwards. But I may be
wrong...
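A sketch of the difference described here, timing the whole loop with a
single pair of clock_gettime calls; the function-pointer interface is
only for illustration and is not taken from ftbench.c:

```
#include <time.h>

/* One timer pair around all `n` iterations of `op`: the resolution */
/* and overhead of clock_gettime enter the result only once instead */
/* of 2*n times.                                                     */
static double
time_whole_loop( void  (*op)( void ), long  n )
{
  struct timespec  start, stop;
  long             i;

  clock_gettime( CLOCK_MONOTONIC, &start );
  for ( i = 0; i < n; i++ )
    op();
  clock_gettime( CLOCK_MONOTONIC, &stop );

  return ( ( stop.tv_sec  - start.tv_sec  ) * 1e6 +
           ( stop.tv_nsec - start.tv_nsec ) / 1e3 ) / n;  /* µs per iteration */
}
```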

Greetings, chris



Re: -warmup

2023-08-18 Thread Werner LEMBERG
>> What happens if you use, say, `-c 10', just running the
>> `Get_Char_Index` test? Are the percental timing differences then
>> still that large?
> Actually Get_Char_Index, on the three pages I have sent in the
> prev. mail, is higher than 6% only 4 times out of 15 total. (which is
> seen on other tests as well).

Well, the thing is that IMHO the difference should be *much* smaller –
your HTML pages show the execution of identical code on an identical
machine, right?

> about outliers, i split every test into chunks of size 100.  Made IQR
> calculations and calculated the average time on valid chunks.  you can
> find the result in the attachment, also pushed to gitlab.

Thanks.  Hin-Tak gave additional suggestions how to possibly improve
the removal of outliers.

> also, since statistics and benchmarking are sciences in themselves, i
> am struggling a bit while approaching the problem, which also feels
> like it is out of the gsoc project scope.

Indeed, the focus lately shifted from a representational aspect to a
more thorough approach how to handle benchmarks in general.  You are
done with the first part, more or less, and it looks fine.  The
latter, however, is definitely part of the GSoC project, too, and I'm
surprised that you think this might not be so: What are benchmark
timings good for if the returned values are completely meaningless?

In most cases, a small performance optimization in FreeType might
yield, say, an improvement of 1%.  Right now, such a change would not
be detectable at all if using the framework you are working on – it
would be completely hidden by noise.

To summarize: Benchmark comparisons only work if there is a sound
mathematical foundation to reduce the noise.  I don't ask you to
reinvent the wheel, but please do some more internet research and
check existing code how to tackle such problems.  I'm 100% sure that
such code already exists (for example, the Google benchmark stuff
mentioned in a previous e-mail, scientific papers on arXiv, etc.,
etc.) and can be easily used, adapted, and simplified for our
purposes.


Werner


Re: -warmup

2023-08-18 Thread Ahmet Göksu
Hi,
The approach we initially took was, in fact, based on the principle of the 
interquartile range (IQR) – a method that excludes outliers by determining the 
range between the first and third quartiles. However, I understand from your 
feedback that directly focusing on the median and quantiles offers a clearer 
representation. I will adapt the code to align with your suggestion.
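For what it's worth, a small sketch of such an IQR filter (crude
index-based quartiles, a reasonably large sample count assumed; this is
an illustration, not the code that was actually committed):

```
#include <stdlib.h>

static int
cmp_double( const void*  a, const void*  b )
{
  double  x = *(const double*)a;
  double  y = *(const double*)b;

  return ( x > y ) - ( x < y );
}

/* mean of the samples inside the classic 1.5*IQR fence;      */
/* `samples` is sorted in place, `n` should be fairly large   */
static double
iqr_filtered_mean( double*  samples, int  n )
{
  double  q1, q3, iqr, lo, hi, sum = 0.0;
  int     i, kept = 0;

  qsort( samples, n, sizeof ( double ), cmp_double );

  q1  = samples[n / 4];
  q3  = samples[( 3 * n ) / 4];
  iqr = q3 - q1;
  lo  = q1 - 1.5 * iqr;
  hi  = q3 + 1.5 * iqr;

  for ( i = 0; i < n; i++ )
    if ( samples[i] >= lo && samples[i] <= hi )
    {
      sum += samples[i];
      kept++;
    }

  return kept > 0 ? sum / kept : 0.0;
}
```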

Best,
Goksu
goksu.in
On 18 Aug 2023 1:04 PM +0300, Hin-Tak Leung wrote:
>
>
> On Friday, 18 August 2023 at 00:21:41 BST, Ahmet Göksu wrote:
> >
> > about outliers, i split every test into chunks of size 100. Made IQR
> > calculations and calculated the average time on valid chunks. you can find
> > the result in the attachment, also pushed to gitlab.
>
> > also, since statistics and benchmarking are sciences in themselves, i am
> > struggling a bit while approaching the problem, which also feels like it is
> > out of the gsoc project scope. I would like to share this with your
> > indulgence. yet, of course I will move in accordance with your instructions.
>
> Hmm, this is lacking basic maths skills... cutting into chunks and
> recombining them isn’t going to deal with outliers. Read about "median" and
> "quantile" on Wikipedia/Google. Anyway, you want to calculate the "median"
> time. E.g. sort 100 numbers by size and take the average of the 50th and
> 51st; your error is the difference between the 91st and the 10th quantile
> (the 10th and the 91st values when you sort them in order of size). If you
> can do that for the entire set, do it for the whole set; if not, use a
> running median – i.e. the median of every chunk of 100 – and then combine
> the running medians.
>
> This way, the top 9 and bottom 9 values of each 100 make no contribution at
> all to your outcome. This is dealing with outliers.
>
>


Re: -warmup

2023-08-18 Thread Hin-Tak Leung
 

On Friday, 18 August 2023 at 00:21:41 BST, Ahmet Göksu wrote:


> about outliers, i split every test into chunks of size 100. Made IQR
> calculations and calculated the average time on valid chunks. you can find
> the result in the attachment, also pushed to gitlab.

> also, since statistics and benchmarking are sciences in themselves, i am
> struggling a bit while approaching the problem, which also feels like it is
> out of the gsoc project scope. I would like to share this with your
> indulgence. yet, of course I will move in accordance with your instructions.
Hmm, this is lacking basic maths skills... cutting into chunks and recombining
them isn’t going to deal with outliers. Read about "median" and "quantile" on
Wikipedia/Google. Anyway, you want to calculate the "median" time. E.g. sort
100 numbers by size and take the average of the 50th and 51st; your error is
the difference between the 91st and the 10th quantile (the 10th and the 91st
values when you sort them in order of size). If you can do that for the entire
set, do it for the whole set; if not, use a running median – i.e. the median of
every chunk of 100 – and then combine the running medians.
This way, the top 9 and bottom 9 values of each 100 make no contribution at all
to your outcome. This is dealing with outliers.
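In code, that recipe could look roughly like this – a sketch for one
chunk of 100 sorted samples, where indices 49/50 are the 50th/51st
values and indices 9/90 the 10th/91st:

```
#include <stdlib.h>

static int
cmp_double( const void*  a, const void*  b )
{
  double  x = *(const double*)a;
  double  y = *(const double*)b;

  return ( x > y ) - ( x < y );
}

/* median of one chunk of 100 samples, with the 10th-to-91st    */
/* spread as the error estimate; the chunk is sorted in place   */
static void
chunk_median( double  chunk[100], double*  median, double*  error )
{
  qsort( chunk, 100, sizeof ( double ), cmp_double );

  *median = ( chunk[49] + chunk[50] ) / 2.0;   /* average of 50th and 51st */
  *error  = chunk[90] - chunk[9];              /* 91st minus 10th value    */

  /* the top 9 and bottom 9 values never contribute to the median */
}
```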

  

Re: -warmup

2023-08-17 Thread Ahmet Göksu
> What happens if you use, say, `-c 10', just running the
> `Get_Char_Index` test? Are the percental timing differences then
> still that large?
Actually Get_Char_Index, on the three pages I have sent in the prev. mail, is
higher than 6% only 4 times out of 15 total. (which is seen on other tests as
well).
> Why do you think so? Please explain your reasoning. Just remember
> that backup processes (like cleaning up the hard disk, running some
> cron jobs, etc.) can pop up anytime, thus influencing the result.
> Such spontaneous events have to be eliminated.
yes, right, i didn't think of the spontaneous events.

about outliers, i split every test into chunks of size 100. Made IQR
calculations and calculated the average time on valid chunks. you can find the
result in the attachment, also pushed to gitlab.

also, since statistics and benchmarking are sciences in themselves, i am
struggling a bit while approaching the problem, which also feels like it is out
of the gsoc project scope. I would like to share this with your indulgence.
yet, of course I will move in accordance with your instructions.

Best,
Goksu
goksu.in
On 18 Aug 2023 00:02 +0300, Werner LEMBERG wrote:
>
> > I have added the total table that you suggested.
>
> Thanks.
>
> > I think Get_Char_Index is not the problem, the results vary all
> > the time.
>
> As far as I can see, there is a direct relationship between the total
> cumulated time of a test and the timing variation: The smaller the
> cumulated time, the larger the variation.
>
> What happens if you use, say, `-c 10', just running the
> `Get_Char_Index` test? Are the percental timing differences then
> still that large?
>
> > Should I proceed to detect outliers? Since we do not get the same
> > error rate consistently, I think we will not find the target we
> > expected by outliers.
>
> Why do you think so? Please explain your reasoning. Just remember
> that backup processes (like cleaning up the hard disk, running some
> cron jobs, etc.) can pop up anytime, thus influencing the result.
> Such spontaneous events have to be eliminated.
>
> Have you actually tried something along the method I suggested?
>
>
> Werner



Freetype Benchmark Results
Warning: Baseline and Benchmark have the same commit ID!
Info

Info         Baseline                    Benchmark
Parameters   -c 550 -w 50                -c 550 -w 50
Commit ID    35531481                    35531481
Commit Date  2023-08-18 02:04:38 +0300   2023-08-18 02:04:38 +0300
Branch       GSoC-2023-Ahmet             GSoC-2023-Ahmet

*  Average time for all iterations.  Smaller values are better.
** An N count in (x | y) format shows the baseline and benchmark N counts
   separately when they differ.

Total Results

TestNBaseline (µs)Benchmark (µs)Difference (%)
Load54029304021700.2
Load_Advances (Normal)5327190357270-9.2
Load_Advances (Fast)513881554-12.0
Load_Advances (Unscaled)5153413988.9
Render5199593213029-6.7
Get_Glyph58504488477-4.0
Get_Char_Index4700013161365-3.8
Iterate CMap50010231039-1.6
New_Face5002582728613-10.8
Embolden5120237125197-4.1
Stroke515491651561868-0.8
Get_BBox55394854843-1.7
Get_CBox53899339678-1.8
New_Face & load glyph(s)56126663766-4.1
TOTAL59800028694532940267-2.5

Results for Roboto_subset.ttf

TestN* Baseline (µs)* Benchmark (µs)Difference (%)
Load1200069078678471.8
Load_Advances (Normal)120005681764449-13.4
Load_Advances (Fast)12000311358-15.4
Load_Advances (Unscaled)12000290340-17.4
Render120004814651436-6.8
Get_Glyph120001976020782-5.2
Get_Char_Index9400284306-7.9
Iterate CMap1002402323.5
New_Face10045146317-40.0
Embolden120002471328423-15.0
Stroke120003632683526572.9
Get_BBox1200011871114883.2
Get_CBox12000947688186.9
New_Face & load glyph(s)1200011584110974.2
TOTAL2832006203506245510.7

Results for Arial_subset.ttf

TestN* Baseline (µs)* Benchmark (µs)Difference (%)
Load95009213494812-2.9
Load_Advances (Normal)95007716390204-16.9
Load_Advances (Fast)9500285370-29.9
Load_Advances (Unscaled)950033329112.8
Render95004099043614-6.4
Get_Glyph95001753117640-0.6
Get_Char_Index94002622494.8
Iterate CMap100

Re: -warmup

2023-08-17 Thread Werner LEMBERG


> Just remember that backup processes (like cleaning up the hard disk,
> running some cron jobs, etc.) can pop up anytime, thus influencing
> the result.

s/backup/background/



Re: -warmup

2023-08-17 Thread Werner LEMBERG


> I have added the total table that you suggested.

Thanks.

> I think Get_Char_Index is not the problem, the results vary all
> the time.

As far as I can see, there is a direct relationship between the total
cumulated time of a test and the timing variation: The smaller the
cumulated time, the larger the variation.

What happens if you use, say, `-c 10', just running the
`Get_Char_Index` test?  Are the percental timing differences then
still that large?

> Should I proceed to detect outliers?  Since we do not get the same
> error rate consistently, I think we will not find the target we
> expected by outliers.

Why do you think so?  Please explain your reasoning.  Just remember
that backup processes (like cleaning up the hard disk, running some
cron jobs, etc.) can pop up anytime, thus influencing the result.
Such spontaneous events have to be eliminated.

Have you actually tried something along the method I suggested?


Werner



Re: -warmup

2023-08-16 Thread Ahmet Göksu
Hi,
I have added the total table that you suggested.

I think Get_Char_Index is not the problem, the results vary all the time.
Here are the three results that i had in the same minute (one has different 
flags).

Should I proceed to detect outliers?

Since we do not get the same error rate consistently,  I think we will not find 
the target we expected by outliers.

Best,
Goksu
goksu.in
On 7 Aug 2023 15:57 +0300, Werner LEMBERG wrote:
>
> > > What exactly means 'Baseline (ms)'? Is the shown number the time
> > >  for one loop? For all loops together? Please clarify and mention
> > >  this on the HTML page.
> >
> > Clarified that the times are milliseconds for the cumulative time
> > for all iterations.
>
> Thanks. The sentence is not easily comprehensible. Perhaps change it
> to something like
>
> ```
> Cumulative time for all iterations. Smaller values mean better.
> ```
>
> BTW, in column 'N' I see stuff like '68160 | 65880'. What does this
> mean? Please add an explanatory comment to the HTML page.
>
> Another thing: Please mention on the HTML page the completion time for
> each test, and the total execution time of all tests together.
>
> > > Looking at the 'Load_Advances (Unscaled)' row, I think that 100%
> > >  difference between 0.001 and 0.002 doesn't make any sense. How do
> > >  you compute the percentage? Is this based on the cumulative time
> > > of  all loops? If so, and you really get such small numbers, there
> > > must  be some fine-tuning for high-speed tests (for example,
> > > increasing N  for this particular test by a factor of 10, say) to
> > > get meaningful  timing values.
> >
> it was cumulative time in milliseconds, but i converted it to microseconds
> as it was before, and it seems to have gotten better.
>
> We are getting nearer, again :-)
>
> What worries me, though, is that we still have such enormous
> differences. For `Get_Char_Index` I think it's lack of precision.
> Please try to fix this – if the ratio
>
> cumulative_time / N
>
> is smaller than a given threshold, N must be increased a lot. In
> other words, for `Roboto_subset.ttf`, N should be set to, say, 10*N.
>
> For the other large differences I think we need some statistical
> analysis to get better results – simple cumulation is not good enough.
> In particular, outliers should be removed (at least this is my
> hypothesis). Maybe you can look up the internet to find some simple
> code to handle them.
>
> An idea to identify outliers could be to split the cumulation time
> into, say, 100 smaller intervals. You can then discard the too-large
> values and compute the mean of the remaining data. My reasoning is
> that other CPU activity happens in parallel, but only for short
> amounts of time.
>
> Have you actually done a statistical analysis of, say, 'Load_Advances
> (Normal)' for `Arial_subset.ttf`? For example, printing all timings
> of the datapoints as histograms for runs A and B? *Are* there
> outliers? Maybe there is another statistical mean value that gives
> more meaningful results.
>
>
> Werner



Freetype Benchmark Results
Warning: Baseline and Benchmark have the same commit ID!
Info

Info         Baseline                    Benchmark
Parameters   -c 1000 -w 50               -c 1000 -w 50
Commit ID    4bcd9711                    4bcd9711
Commit Date  2023-08-07 15:11:28 +0300   2023-08-07 15:11:28 +0300
Branch       GSoC-2023-Ahmet             GSoC-2023-Ahmet

*  Cumulative time for all iterations.  Smaller values are better.
** An N count in (x | y) format shows the baseline and benchmark N counts
   separately when they differ.

Total Results

TestNBaseline (µs)Benchmark (µs)Difference (%)
Load5034823593519568-1.1
Load_Advances (Normal)5029222573129046-7.1
Load_Advances (Fast)501367115627-14.3
Load_Advances (Unscaled)501245915309-22.9
Render5018271231878593-2.8
Get_Glyph507681997543151.8
Get_Char_Index4713396127994.5
Iterate CMap5000986093784.9
New_Face50002487682405913.3
Embolden5011077861109611-0.2
Stroke294205 | 289365780418077762470.4
Get_BBox50491174496942-1.2
Get_CBox50355009355822-0.2
New_Face & load glyph(s)505711725470084.2
TOTAL5774205 | 57693651962741319860856-1.2

Results for Roboto_subset.ttf

TestN* Baseline (µs)* Benchmark (µs)Difference (%)
Load126119135767105.8
Load_Advances (Normal)12498485538696-8.1
Load_Advances (Fast)1229644976-67.9
Load_Advances (Unscaled)1227905476-96.3
Render12419155437405-4.4
Get_Glyph121738201661314.4
Get_Char_Index94000270126412.2
Iterate CMap1000   

Re: -warmup

2023-08-07 Thread Werner LEMBERG

>> What exactly means 'Baseline (ms)'? Is the shown number the time
>>  for one loop? For all loops together? Please clarify and mention
>>  this on the HTML page.
>
> Clarified that the times are milliseconds for the cumulative time
> for all iterations.

Thanks.  The sentence is not easily comprehensible.  Perhaps change it
to something like

```
Cumulative time for all iterations.  Smaller values mean better.
```

BTW, in column 'N' I see stuff like '68160 | 65880'.  What does this
mean?  Please add an explanatory comment to the HTML page.

Another thing: Please mention on the HTML page the completion time for
each test, and the total execution time of all tests together.

>> Looking at the 'Load_Advances (Unscaled)' row, I think that 100%
>>  difference between 0.001 and 0.002 doesn't make any sense. How do
>>  you compute the percentage? Is this based on the cumulative time
>> of  all loops? If so, and you really get such small numbers, there
>> must  be some fine-tuning for high-speed tests (for example,
>> increasing N  for this particular test by a factor of 10, say) to
>> get meaningful  timing values.
>
> it was cumulative time in milliseconds, but i converted it to microseconds
> as it was before, and it seems to have gotten better.

We are getting nearer, again :-)

What worries me, though, is that we still have such enormous
differences.  For `Get_Char_Index` I think it's lack of precision.
Please try to fix this – if the ratio

   cumulative_time / N

is smaller than a given threshold, N must be increased a lot.  In
other words, for `Roboto_subset.ttf`, N should be set to, say, 10*N.

For the other large differences I think we need some statistical
analysis to get better results – simple cumulation is not good enough.
In particular, outliers should be removed (at least this is my
hypothesis).  Maybe you can look up the internet to find some simple
code to handle them.

An idea to identify outliers could be to split the cumulation time
into, say, 100 smaller intervals.  You can then discard the too-large
values and compute the mean of the remaining data.  My reasoning is
that other CPU activity happens in parallel, but only for short
amounts of time.
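A sketch of that idea; the 'too large' cut-off (anything above twice the
smallest sub-interval) is an arbitrary choice made only for this
illustration:

```
/* mean of 100 sub-interval timings after discarding values that  */
/* are 'too large', here: more than twice the smallest sample     */
static double
mean_without_spikes( const double  slots[100] )
{
  double  min = slots[0], sum = 0.0;
  int     i, kept = 0;

  for ( i = 1; i < 100; i++ )
    if ( slots[i] < min )
      min = slots[i];

  for ( i = 0; i < 100; i++ )
    if ( slots[i] <= 2.0 * min )
    {
      sum += slots[i];
      kept++;
    }

  return sum / kept;   /* kept >= 1, since the minimum always passes */
}
```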

Have you actually done a statistical analysis of, say, 'Load_Advances
(Normal)' for `Arial_subset.ttf`?  For example, printing all timings
of the datapoints as histograms for runs A and B?  *Are* there
outliers?  Maybe there is another statistical mean value that gives
more meaningful results.
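One simple way to get such histograms, sketched with hypothetical file
names: dump one timing sample per line for each run and feed the files
to gnuplot (or any plotting tool):

```
#include <stdio.h>

/* write one timing sample per line for external plotting */
static void
dump_samples( const char*  filename, const double*  samples, int  n )
{
  FILE*  out = fopen( filename, "w" );
  int    i;

  if ( !out )
    return;

  for ( i = 0; i < n; i++ )
    fprintf( out, "%g\n", samples[i] );

  fclose( out );
}

/* gnuplot histogram recipe (5µs bin width chosen arbitrarily):
 *   binwidth = 5
 *   bin(x)   = binwidth * floor( x / binwidth )
 *   plot "run_A.dat" using (bin($1)):(1.0) smooth freq with boxes, \
 *        "run_B.dat" using (bin($1)):(1.0) smooth freq with boxes
 */
```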


Werner


Re: -warmup

2023-08-07 Thread Ahmet Göksu
Hi!
I changed code to warmup with number of iterations.
> What exactly means 'Baseline (ms)'? Is the shown number the time
>  for one loop? For all loops together? Please clarify and mention
>  this on the HTML page.
Clarified that the times are milliseconds for the cumulative time for all 
iterations.
> There seems to be a fundamental math problem in calculating the
>  percentage numbers. For example, looking at the 'TOTAL' field, the
>  percental difference between 2.788 and 2.740 is not -6.1% but -1.7%!
it was the average of all the percentages, but you are right. I have changed it
to the percentage change of the total time.
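For the record, a tiny sketch of the corrected computation, using the
two totals discussed in this exchange (2.788 ms vs. 2.740 ms); it prints
-1.7, the value Werner quotes:

```
#include <stdio.h>

/* relative change of the benchmark total w.r.t. the baseline total, */
/* instead of averaging the per-test percentages                     */
static double
total_difference_percent( double  baseline_total, double  benchmark_total )
{
  return ( benchmark_total - baseline_total ) / baseline_total * 100.0;
}

int
main( void )
{
  printf( "%.1f\n", total_difference_percent( 2.788, 2.740 ) );   /* -1.7 */
  return 0;
}
```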
> Looking at the 'Load_Advances (Unscaled)' row, I think that 100%
>  difference between 0.001 and 0.002 doesn't make any sense. How do
>  you compute the percentage? Is this based on the cumulative time of
>  all loops? If so, and you really get such small numbers, there must
>  be some fine-tuning for high-speed tests (for example, increasing N
>  for this particular test by a factor of 10, say) to get meaningful
>  timing values.
it was cumulative time in milliseconds, but i converted it to microseconds as
it was before, and it seems to have gotten better. If any fine-tuning is needed
from now on, i will do it.

Looking forward to your reply.


Best,
Goksu
goksu.in
On 3 Aug 2023 19:50 +0300, Werner LEMBERG wrote:
>
> > It warms up for the number of seconds given with the -w flag before
> > every benchmark test.
> >
> > There are still differences like 100%. Also, a 1 sec warmup means
> > (test count)*(font count) = 70 secs for the results.
>
> Mhmm, I'm not sure whether a warmup *time span* makes sense. I would
> rather have thought that every test would get a certain number of
> warmup *loops*. For example, '--warmup 100' means that for a value of
> N=5, the first 100 loops of each test are not taken into account
> for timing so that effects of the various processor and memory caches,
> the operating system's memory page swapping, etc., etc., doesn't have
> too much influence. This should be just a very small fraction of
> time, not 70s.
>
> > I am thinking of what else can be done and waiting for your test.
>
> Just looking at your most recent HTML page I see some peculiarities.
>
> * What exactly means 'Baseline (ms)'? Is the shown number the time
> for one loop? For all loops together? Please clarify and mention
> this on the HTML page.
>
> * There seems to be a fundamental math problem in calculating the
> percentage numbers. For example, looking at the 'TOTAL' field, the
> percental difference between 2.788 and 2.740 is not -6.1% but -1.7%!
> What am I missing?
>
> * Looking at the 'Load_Advances (Unscaled)' row, I think that 100%
> difference between 0.001 and 0.002 doesn't make any sense. How do
> you compute the percentage? Is this based on the cumulative time of
> all loops? If so, and you really get such small numbers, there must
> be some fine-tuning for high-speed tests (for example, increasing N
> for this particular test by a factor of 10, say) to get meaningful
> timing values.
>
>
> Werner



Freetype Benchmark Results
Warning: Baseline and Benchmark have the same commit ID!
Info

Info         Baseline                    Benchmark
Parameters   -c 1000 -w 100              -c 1000 -w 100
Commit ID    d7371720                    d7371720
Commit Date  2023-08-03 19:08:57 +0300   2023-08-03 19:08:57 +0300
Branch       GSoC-2023-Ahmet             GSoC-2023-Ahmet

* Cumulative time for all iterations; smaller values are better.
Results for Roboto_subset.ttf

TestN* Baseline (µs)* Benchmark (µs)Difference (%)
Load12544769548050-0.6
Load_Advances (Normal)12472392483467-2.3
Load_Advances (Fast)12281828040.5
Load_Advances (Unscaled)1227742875-3.6
Render12407268425227-4.4
Get_Glyph12160786166644-3.6
Get_Char_Index940002728231815.0
Iterate CMap1000177117183.0
New_Face100039404390151.0
Embolden122140852099871.9
Stroke68160 | 65880162217116184290.2
Get_BBox12101134101693-0.6
Get_CBox1281055792772.2
New_Face & load glyph(s)1297837100719-2.9
TOTAL2726040375099237822230.8

Results for Arial_subset.ttf

TestN* Baseline (µs)* Benchmark (µs)Difference (%)
Load95000696891751976-7.9
Load_Advances (Normal)95000614680740438-20.5
Load_Advances (Fast)9500022922519-9.9
Load_Advances (Unscaled)9500021682516-16.1
Render950003335063251032.5
Get_Glyph950001425961378

Re: -warmup

2023-08-03 Thread Werner LEMBERG


> It warms up for the number of seconds given with the -w flag before
> every benchmark test.
>
> There are still differences like 100%.  Also, a 1 sec warmup means
> (test count)*(font count) = 70 secs for the results.

Mhmm, I'm not sure whether a warmup *time span* makes sense.  I would
rather have thought that every test would get a certain number of
warmup *loops*.  For example, '--warmup 100' means that for a value of
N=5, the first 100 loops of each test are not taken into account
for timing so that effects of the various processor and memory caches,
the operating system's memory page swapping, etc., etc., doesn't have
too much influence.  This should be just a very small fraction of
time, not 70s.
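A sketch of warmup expressed as untimed loops rather than a time span;
the function-pointer interface and names are illustrative only, not
taken from ftbench.c:

```
#include <time.h>

/* run `warmup` untimed iterations first so caches, page mappings, */
/* and branch predictors settle, then time the `n` real iterations */
static double
bench_with_warmup( void  (*op)( void ), long  n, long  warmup )
{
  struct timespec  start, stop;
  long             i;

  for ( i = 0; i < warmup; i++ )   /* not measured */
    op();

  clock_gettime( CLOCK_MONOTONIC, &start );
  for ( i = 0; i < n; i++ )
    op();
  clock_gettime( CLOCK_MONOTONIC, &stop );

  return ( stop.tv_sec  - start.tv_sec  ) * 1e6 +
         ( stop.tv_nsec - start.tv_nsec ) / 1e3;   /* total µs */
}
```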

> I am thinking of what else can be done and waiting for your test.

Just looking at your most recent HTML page I see some peculiarities.

* What exactly means 'Baseline (ms)'?  Is the shown number the time
  for one loop?  For all loops together?  Please clarify and mention
  this on the HTML page.

* There seems to be a fundamental math problem in calculating the
  percentage numbers.  For example, looking at the 'TOTAL' field, the
  percental difference between 2.788 and 2.740 is not -6.1% but -1.7%!
  What am I missing?

* Looking at the 'Load_Advances (Unscaled)' row, I think that 100%
  difference between 0.001 and 0.002 doesn't make any sense.  How do
  you compute the percentage?  Is this based on the cumulative time of
  all loops?  If so, and you really get such small numbers, there must
  be some fine-tuning for high-speed tests (for example, increasing N
  for this particular test by a factor of 10, say) to get meaningful
  timing values.


 Werner



-warmup

2023-08-03 Thread Ahmet Göksu
Hi,
I have updated the bench code.
It warms up for the number of seconds given with the -w flag before every
benchmark test.

There are still differences like 100%.
Also, a 1 sec warmup means (test count)*(font count) = 70 secs for the results.

I am thinking of what else can be done and waiting for your test.

Best,
Goksu
goksu.in
On 3 Aug 2023 05:45 +0300, Werner LEMBERG wrote:
>
> > I have done the changes you want.
>
> Thanks!
>
> > > 36.5% run difference is bd. AFAICS, you haven't yet worked on
> > > omitting 'warmup' iterations, right?
> >
> > I am planning to increase the iteration count by 10% and ignore the
> > results for them.
>
> You mean you are going to ignore the first 10% of the iterations?
> This might be a good default. However, I still think that a
> `--warmup` command-line option makes sense to control this.
>
> > Trying to figure out the benchmarking program but actually drowning
> > in 1500 lines of code.
>
> :-) I hope you can eventually find what you need.
>
>
> Werner



Freetype Benchmark Results
Warning: Baseline and Benchmark have the same commit ID!
Info

Info         Baseline                    Benchmark
Parameters   -c 500 -w 1                 -c 500 -w 1
Commit ID    fae5e8c9                    fae5e8c9
Commit Date  2023-08-01 17:37:55 +0300   2023-08-01 17:37:55 +0300
Branch       GSoC-2023-Ahmet             GSoC-2023-Ahmet
* Smaller values mean faster operation
Results for Roboto_subset.ttf

Test                      N        * Baseline (ms)  * Benchmark (ms)  Difference (%)
Load                      59880    0.250            0.270             -8.0
Load_Advances (Normal)    59880    0.311            0.243             21.9
Load_Advances (Fast)      59880    0.002            0.002             0.0
Load_Advances (Unscaled)  59880    0.001            0.002             -100.0
Render                    59880    0.220            0.218             0.9
Get_Glyph                 59880    0.086            0.086             0.0
Get_Char_Index            46906    0.001            0.001             0.0
Iterate CMap              499      0.001            0.001             0.0
New_Face                  499      0.022            0.022             0.0
Embolden                  59880    0.115            0.115             0.0
Stroke                    57120    1.628            1.629             -0.1
Get_BBox                  59880    0.055            0.055             0.0
Get_CBox                  59880    0.043            0.042             2.3
New_Face & load glyph(s)  59880    0.053            0.054             -1.9
TOTAL                     1407648  2.788            2.740             -6.1

Results for Arial_subset.ttf

Test                      N        * Baseline (ms)  * Benchmark (ms)  Difference (%)
Load                      47405    0.390            0.376             3.6
Load_Advances (Normal)    47405    0.344            0.340             1.2
Load_Advances (Fast)      47405    0.001            0.001             0.0
Load_Advances (Unscaled)  47405    0.001            0.001             0.0
Render                    47405    0.179            0.186             -3.9
Get_Glyph                 47405    0.075            0.076             -1.3
Get_Char_Index            46906    0.001            0.001             0.0
Iterate CMap              499      0.001            0.001             0.0
New_Face                  499      0.028            0.025             10.7
Embolden                  47405    0.103            0.103             0.0
Stroke                    47405    1.324            1.329             -0.4
Get_BBox                  47405    0.049            0.051             -4.1
Get_CBox                  47405    0.035            0.036             -2.9
New_Face & load glyph(s)  47405    0.060            0.054             10.0
TOTAL                     1138718  2.591            2.580             0.9

Results for TimesNewRoman_subset.ttf

Test                      N        * Baseline (ms)  * Benchmark (ms)  Difference (%)
Load                      47405    0.455            0.464             -2.0
Load_Advances (Normal)    47405    0.419            0.412             1.7
Load_Advances (Fast)      47405    0.001            0.001             0.0
Load_Advances (Unscaled)  47405    0.001            0.001             0.0
Render                    47405    0.201            0.197             2.0
Get_Glyph                 47405    0.078            0.082             -5.1
Get_Char_Index            46906    0.001            0.001             0.0
Iterate CMap              499      0.001            0.001             0.0
New_Face                  499      0.027            0.027             0.0
Embolden                  47405    0.140            0.139             0.7
Stroke                    40185    1.545            1.545             0.0
Get_BBox                  47405    0.058            0.059             -1.7
Get_CBox                  47405    0.037            0.041             -10.8
New_Face & load glyph(s)  47405    0.069            0.076             -10.1
TOTAL                     1124278  3.033            3.046             -1.8

Results for Tahoma_subset.ttf

Test                      N        * Baseline (ms)  * Benchmark (ms)  Difference (%)
Load                      47405    0.265            0.289             -9.1
Load_Advances (Normal)    47405    0.245            0.239             2.4
Load_Advances (Fast)      47405    0.001            0.001             0.0
Load_Advances (Unscaled)  47405    0.001            0.001             0.0
Render                    47405    0.166            0.166             0.0
Get_Glyph                 47405    0.075            0.071             5.3
Get_Char_Index            46906    0.001            0.001             0.0
Iterate CMap              499      0.001            0.001             0.0
New_Face                  499      0.025            0.024             4.0
Embolden                  47405    0.118            0.108             8.5
Stroke                    47405    1.244            1.236             0.6
Get_BBox                  47405