Re: Proposal for a new applet: strings

2023-07-24 Thread Roberto A. Foglietta
On Sun, 23 Jul 2023 at 16:38, tito  wrote:
>
> On Sun, 23 Jul 2023 16:17:54 +0200
> "Roberto A. Foglietta"  wrote:
>
> > On Sun, 23 Jul 2023 at 13:18, tito  wrote:
> > >
> > > On Sun, 23 Jul 2023 12:00:56 +0200
> > > "Roberto A. Foglietta"  wrote:
> >
> > > > >
> > > > > 1) multiple file handling (a must i would dare to say)
> > > >
> > > > Which is not such a problem, after all
> > > >
> > > > for i in "$@"; do simply-strings "$i" | sed -e "s/^/$i:/"; done
> > > >
> > > > the sed will include also the file name in front of the string which
> > > > is useful for grepping. However, the single-file limitation brings to
> > > > personalize the approach:
> > > >
> > > > for i in "$@"; do simply-strings "$i" | grep -w "word" && break; done; 
> > > > echo $i
> > >
> > > Don't cheat, this change would break other people's scripts.
> >
> > Other people are not anymore into the scene, since the moment that we
> > established that reinventing the wheel is not efficient nor useful.
> >
> >
> > >
> > > > Yes, strings has a lot of options and also busybox have several
> > > > options. This is the best critic about proceeding with an integration.
> > > > I will check if I can put an optimization into bb strings, just for my
> > > > own curiosity.
> > >
> > > This would be far better than reinventing the wheel.
> > >
> >
> > Reinventing the wheel is a good way to understand how the wheel works
> > and improve it. We just concluded that there is no reason to reinvent
> > the wheel completely. However, the simple-strings can be useful when
> > its deployment fixes fulfill a void better than replacing a
> > fundamental system component like busybox which can break future OTA.
>
> Ever thought about compiling a busybox copy with only one applet
> or few applets that need fixes or updates or new features ?
> This was done a lot in the first android roms.
>
> > In particular, it is fine as a service/rescue/recovery image in which
> > the space is limited and the full compatibility with strings or
> > busybox strings is not necessary and for everything else custom
> > scripts can easily compensate.
> >
> > About improving busybox strings and more in general its printf
> > performance, it is about this:
> >
> > setvbuf(stdout, (char *)stdout_buffer, _IOFBF, BUFSIZE);
> >
> > Obviously a large static buffer can impact the footprint but as long
> > as malloc() is used into the busybox - and in its library I remember
> > there were sanitising wrappers for it - then it would not be such a
> > big deal to use a dynamically allocated buffer. The tricky aspect is
> > about the applet forking. A topic that I do not know but I saw an
> > option "no fork" in the config. I did not even start to see the code,
> > therefore I am just wondering about.
>
> Yes busybox code is tricky. This NO_FORK stuff is a black magic
> I really haven't understood yet.

I did not investigate that option or the code, but my impression is
that it would be useful in two different cases:

1 - single applet busybox
2 - NOMMU systems for which v/fork is a burden

My speculation is that when I call busybox, it forks into the applet
function, which drops everything it does not need, so each call is a
fully detached process. With NO_FORK, I suppose that everything stays
in memory and the kernel keeps as much of it as possible as a shared
object, for example the busybox code, while each call duplicates the
stack as a pthreads function does. Therefore, every buffer defined in
an applet is duplicated on each stack, unless a special definition
breaks this general principle.

About using setvbuf() in busybox:

setvbuf(stdout, (char *)stdout_buffer, _IOFBF, BUFSIZE);

It does not seem a viable solution for every applet. Therefore, I
would add it to strings and a few others only. A grep of the busybox
code shows that the function is used in just a few applets:

$ grep -rn setvbuf . 2>/dev/null | grep '\.c:'
./miscutils/hexedit.c:263: setvbuf(stdout, xmalloc(sz), _IOFBF, sz);
./coreutils/tee.c:126: setvbuf(stdout, NULL, _IONBF, 0);
./shell/match.c:105: setvbuf(stdout, NULL, _IONBF, 0);
./runit/svlogd.c:597: setvbuf(ld->filecur, NULL, _IOFBF, linelen); 
./runit/svlogd.c:860: setvbuf(ld->filecur, NULL, _IOFBF, linelen); 
./runit/svlogd.c:1128: setvbuf(stderr, NULL, _IOFBF, linelen);

The man page https://linux.die.net/man/3/setvbuf explains that stdout
is line buffered by default when it points to a terminal. In the grep
above, only hexedit and svlogd ask for full buffering, while tee and
match turn buffering off. Full buffering might also be useful in
strings, and in dd when it writes to stdout. Considering that defining
a buffer in a function (applet) increases the size of the executable,
it makes sense to allocate it with malloc() (through a BB wrapper such
as the xmalloc() used by hexedit). After all, the malloc() code is
already included in busybox and one more use adds just the ASM needed
for that function call.

Moreover, in my simple-strings, I have used a 4096-byte buffer because
I suppose that it is the size of a kernel memory page, therefore it is
necessarily a contiguous physical RAM allocation, the 

Re: Proposal for a new applet: strings

2023-07-23 Thread tito
On Sun, 23 Jul 2023 16:17:54 +0200
"Roberto A. Foglietta"  wrote:

> On Sun, 23 Jul 2023 at 13:18, tito  wrote:
> >
> > On Sun, 23 Jul 2023 12:00:56 +0200
> > "Roberto A. Foglietta"  wrote:
> 
> > > >
> > > > 1) multiple file handling (a must i would dare to say)
> > >
> > > Which is not such a problem, after all
> > >
> > > for i in "$@"; do simply-strings "$i" | sed -e "s/^/$i:/"; done
> > >
> > > the sed will include also the file name in front of the string which
> > > is useful for grepping. However, the single-file limitation brings to
> > > personalize the approach:
> > >
> > > for i in "$@"; do simply-strings "$i" | grep -w "word" && break; done; 
> > > echo $i
> >
> > Don't cheat, this change would break other people's scripts.
> 
> Other people are not anymore into the scene, since the moment that we
> established that reinventing the wheel is not efficient nor useful.
> 
> 
> >
> > > Yes, strings has a lot of options and also busybox have several
> > > options. This is the best critic about proceeding with an integration.
> > > I will check if I can put an optimization into bb strings, just for my
> > > own curiosity.
> >
> > This would be far better than reinventing the wheel.
> >
> 
> Reinventing the wheel is a good way to understand how the wheel works
> and improve it. We just concluded that there is no reason to reinvent
> the wheel completely. However, the simple-strings can be useful when
> its deployment fixes fulfill a void better than replacing a
> fundamental system component like busybox which can break future OTA.

Ever thought about compiling a busybox copy with only the one applet,
or the few applets, that need fixes, updates or new features?
This was done a lot in the early Android ROMs.

> In particular, it is fine as a service/rescue/recovery image in which
> the space is limited and the full compatibility with strings or
> busybox strings is not necessary and for everything else custom
> scripts can easily compensate.
> 
> About improving busybox strings and more in general its printf
> performance, it is about this:
> 
> setvbuf(stdout, (char *)stdout_buffer, _IOFBF, BUFSIZE);
> 
> Obviously a large static buffer can impact the footprint but as long
> as malloc() is used into the busybox - and in its library I remember
> there were sanitising wrappers for it - then it would not be such a
> big deal to use a dynamically allocated buffer. The tricky aspect is
> about the applet forking. A topic that I do not know but I saw an
> option "no fork" in the config. I did not even start to see the code,
> therefore I am just wondering about.

Yes, busybox code is tricky. This NO_FORK stuff is black magic
that I really haven't understood yet.

> 
> > >
> > > > 3) output compatible with original gnu strings
> > > >
> > > > > In attachment the new version with the test suite and the benchmark
> > > > > suite in the header. The benchmark suite did not change with respect
> > > > > to the script file I just sent.
> > > > >
> > > > > Best regards, R-
> > > >
> > > > BTW: there still seem to be corner-cases:
> > > > list=`find /usr`
> > > > for i in $list; do if test -f $i; then  ./strings $i > out1.txt; 
> > > > strings $i > out2.txt; diff -q out1.txt out2.txt; fi; done
> > > > Files out1.txt and out2.txt differ
> > > > Files out1.txt and out2.txt differ
> > > > Files out1.txt and out2.txt differ
> > > > Files out1.txt and out2.txt differ
> > > >
> > > > test is still running
> > >
> > > ok, I will do a run. Can you please echo the filenames, instead?
> > >
> > > for i in $list; do if test -f $i; then  ./strings $i > out1.txt;
> > > strings $i > out2.txt; diff -q out1.txt out2.txt >/dev/null || echo
> > > $i; fi; done
> > >
> 
> The version in attachment also solves the rest of the problem that my
> /usr could have raised with the previous version.  Moreover, I have
> further developed the benchmark and the testing suites. You might find
> interesting the new part of the benchmark suite about 'dd' used as an
> alternative of /dev/null for giving us a transfer speed. As you can
> see, if you wish to do strings on tmpfs then for each different file
> you need to copy it into the tmpfs. For this reason, copying in tmpfs
> + 100 strings run on the same file is like cheating <-- you started!
> ;-)

I will study it.

> 
> 
> >
> > if you hire me as beta tester, at least you owe me a beer if we ever meet
> > in person.
> >
> 
> Sure, you are welcome. I live in Genoa, at the moment - you can easily
> find my mobile telephone number by googling my name (well, to be
> precise: it is a brand strongly based on my name).

It is rather far; maybe one day, to visit the aquarium.

> 
> In another context, I saw that there is the policy of paying by paypal
> & co. a small amount of money IMHO, it is a very bad marketing policy
> which seriously impair the value of a professionist. However, when

Forget about the money, I prefer beer anyway.

> someone acts outside its professional sector like - blogging,
> 

Re: Proposal for a new applet: strings

2023-07-23 Thread Roberto A. Foglietta
On Sun, 23 Jul 2023 at 13:18, tito  wrote:
>
> On Sun, 23 Jul 2023 12:00:56 +0200
> "Roberto A. Foglietta"  wrote:

> > >
> > > 1) multiple file handling (a must i would dare to say)
> >
> > Which is not such a problem, after all
> >
> > for i in "$@"; do simply-strings "$i" | sed -e "s/^/$i:/"; done
> >
> > the sed will include also the file name in front of the string which
> > is useful for grepping. However, the single-file limitation brings to
> > personalize the approach:
> >
> > for i in "$@"; do simply-strings "$i" | grep -w "word" && break; done; echo 
> > $i
>
> Don't cheat, this change would break other people's scripts.

Other people are no longer in the picture, from the moment we
established that reinventing the wheel is neither efficient nor useful.


>
> > Yes, strings has a lot of options and also busybox have several
> > options. This is the best critic about proceeding with an integration.
> > I will check if I can put an optimization into bb strings, just for my
> > own curiosity.
>
> This would be far better than reinventing the wheel.
>

Reinventing the wheel is a good way to understand how the wheel works
and to improve it. We just concluded that there is no reason to reinvent
the wheel completely. However, simple-strings can be useful when
deploying it fills a gap better than replacing a fundamental system
component like busybox, which can break future OTA updates.
In particular, it is fine in a service/rescue/recovery image in which
space is limited and full compatibility with strings or
busybox strings is not necessary; for everything else, custom
scripts can easily compensate.

About improving busybox strings, and more generally its printf
performance, it comes down to this:

setvbuf(stdout, (char *)stdout_buffer, _IOFBF, BUFSIZE);

Obviously a large static buffer can impact the footprint, but since
malloc() is already used in busybox - and I remember that its library
has sanitising wrappers for it - using a dynamically allocated buffer
would not be such a big deal. The tricky aspect is applet forking, a
topic that I do not know, but I saw a "no fork" option in the config.
I have not even started to look at the code, so I am just speculating.


> >
> > > 3) output compatible with original gnu strings
> > >
> > > > In attachment the new version with the test suite and the benchmark
> > > > suite in the header. The benchmark suite did not change with respect
> > > > to the script file I just sent.
> > > >
> > > > Best regards, R-
> > >
> > > BTW: there still seem to be corner-cases:
> > > list=`find /usr`
> > > for i in $list; do if test -f $i; then  ./strings $i > out1.txt; strings 
> > > $i > out2.txt; diff -q out1.txt out2.txt; fi; done
> > > Files out1.txt and out2.txt differ
> > > Files out1.txt and out2.txt differ
> > > Files out1.txt and out2.txt differ
> > > Files out1.txt and out2.txt differ
> > >
> > > test is still running
> >
> > ok, I will do a run. Can you please echo the filenames, instead?
> >
> > for i in $list; do if test -f $i; then  ./strings $i > out1.txt;
> > strings $i > out2.txt; diff -q out1.txt out2.txt >/dev/null || echo
> > $i; fi; done
> >
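
A minimal sketch of the same comparison as a POSIX shell function. The
unquoted "for i in $list" splits file names at whitespace, so this variant
reads the names from find one line at a time and prints only the names of
the files whose outputs differ. The function name compare_outputs and its
arguments are placeholders of this sketch, and file names that contain a
newline are still not handled.

```sh
#!/bin/sh
# compare_outputs DIR CMD_A CMD_B  (hypothetical helper, not part of any tool)
# Runs both commands on every regular file below DIR and prints the names of
# the files for which the two outputs differ.
compare_outputs() {
    dir=$1; cmd_a=$2; cmd_b=$3
    out_a=$(mktemp) || return 1
    out_b=$(mktemp) || { rm -f "$out_a"; return 1; }
    # Read one file name per line, so names with spaces survive intact.
    find "$dir" -type f | while IFS= read -r f; do
        # stdin is redirected so the commands cannot eat the file list.
        "$cmd_a" "$f" > "$out_a" 2>/dev/null < /dev/null
        "$cmd_b" "$f" > "$out_b" 2>/dev/null < /dev/null
        cmp -s "$out_a" "$out_b" || printf '%s\n' "$f"
    done
    rm -f "$out_a" "$out_b"
}
```

For the test discussed here it would be called as
compare_outputs /usr ./strings strings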

The attached version also solves the remaining problems that my
/usr could have raised with the previous version.  Moreover, I have
further developed the benchmark and the test suites. You might find
the new part of the benchmark suite interesting: it uses 'dd' as an
alternative to /dev/null to give us a transfer speed. As you can
see, if you wish to run strings on tmpfs, then each different file
has to be copied into the tmpfs first. For this reason, copying to tmpfs
+ 100 strings runs on the same file is like cheating <-- you started!
;-)


>
> if you hire me as beta tester, at least you owe me a beer if we ever meet in
> person.
>

Sure, you are welcome. I live in Genoa, at the moment - you can easily
find my mobile telephone number by googling my name (well, to be
precise: it is a brand strongly based on my name).

In another context, I saw that there is a policy of paying a small
amount of money by paypal & co. IMHO, it is a very bad marketing policy
which seriously impairs the value of a professional. However, when
someone acts outside their professional sector - blogging,
zero-hope commercial projects, end-user guides, et similia - then it
is fine to ask, IMHO. As long as it is clear what someone asks.

More in general, my usual attitude is to raise and save money to
start my own company and pay people to work for/with me. But every time
my income or my company business is going well, the people around me
go mad and f*ck-up everything without any reasonable way to stop them.
Now, I got quite a clear picture about it but this is definitely
off-topic.

Cheers, R-
/*
 * (C) 2023, Roberto A. Foglietta 
 *   Released under the GPLv2 license terms.
 *
 * This is a rework of the original source code in public domain which is here:
 *
 * 

Re: Proposal for a new applet: strings

2023-07-23 Thread tito
On Sun, 23 Jul 2023 12:00:56 +0200
"Roberto A. Foglietta"  wrote:

> On Sun, 23 Jul 2023 at 11:42, tito  wrote:
> >
> > On Sun, 23 Jul 2023 00:36:09 +0200
> > "Roberto A. Foglietta"  wrote:
> >
> > > On Sat, 22 Jul 2023 at 21:29, tito  wrote:
> > > >
> > > > On Sat, 22 Jul 2023 19:31:28 +0200
> > > > "Roberto A. Foglietta"  wrote:
> > > >
> > > > > On Sat, 22 Jul 2023 at 15:40, tito  wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I'm not the maintainer so I can say nothing about integration,
> > > > > > I can just point out things that look strange to me and my limited 
> > > > > > knowledge.
> > > > > > When I read that this code is faster vs other code as I'm a curious
> > > > > > person I just try to see how much faster it is and why as there
> > > > > > is always something to learn on the busybox mailing list.
> > > > > > If in my little tests it is not faster then I think I'm entitled
> > > > > > to ask questions about it as science results should be reproducible.
> > > > > >
> > > > > > For simple benchmarking maybe reading a big enough file
> > > > > > into memory and feeding it to strings in a few 1000 iterations
> > > > > > should do to avoid bias from hdd/sdd and system load, one shot 
> > > > > > shows:
> > > > > >
> > > > > > ramtmp="$(mktemp -p /dev/shm/)"
> > > > > >  dd if=vmlinux.o of=$ramtmp
> > > > > > echo $ramtmp
> > > > > > /dev/shm/tmp.ll3G2kzKE1
> > > > > >
> > > > > > 1) coreutils strings
> > > > > > time  strings $ramtmp > /dev/null
> > > > >
> > > > > This is not correct because you are reading a file in tmpfs while the
> > > >
> > > > Yes, this was exactly the purpose of the test to eliminate all
> > > > factors connected to underlying block devices and time
> > > > the speed of code of the different implementations.
> > > >
> > >
> > > Which is wrong because you did a hypothesis which is far away from the
> > > typical usage and in some cases you can even use it because strings
> > > over a 4GB ISO image would not necessarily fit into a tmpfs in every
> > > system. Abstract benchmarks can be funny but do not depict/measure the
> > > reality as usual. Extending this logic, we can trash the Ohm law
> > > because we can reach in the laboratory a near zero temperature!
> >
> > I see but dropping the caches etc doesn't seem to be a typical use case 
> > either.
> 
> Dropping the cache is a trick to bring the system in its state after
> the boot or as much as possible at that point. It is indispensable for
> a confrontation with the normal functioning which has a larger
> variance in completion time for each runs.
> 
> >
> > Using the same optimization flag -O3 the busybox applet in a real life
> > system gives close empirical results, which is the results most
> > people in their normal life use cases (one shot, no loops running,
> > no files in memory, no dropped caches, no giant multi-GB files)
> > will see so the performance increase is swallowed by the system
> > or by other bottlenecks.
> >
> 
> This is correct, AFAIK my busybox has been compiled with -O2. I have to check.
> 
> 
> > I think the size will rather increase as there are a bunch of features
> > missing that the original bb implementation already has:
> >
> > 1) multiple file handling (a must i would dare to say)
> 
> Which is not such a problem, after all
> 
> for i in "$@"; do simply-strings "$i" | sed -e "s/^/$i:/"; done
> 
> the sed will include also the file name in front of the string which
> is useful for grepping. However, the single-file limitation brings to
> personalize the approach:
> 
> for i in "$@"; do simply-strings "$i" | grep -w "word" && break; done; echo $i

Don't cheat, this change would break other people's scripts.

> For example. However, I admit that you are right about multiple-files
> input. Personally, I do not need at all and if I need, I do with a
> custom for.
> 
> 
> > 2) -a -f -o -n -t command line options
> > The options are:
> >   -a - --all            Scan the entire file, not just the data section [default]
> >   -f --print-file-name  Print the name of the file before each string
> >   -n --bytes=[number]   Locate & print any NUL-terminated sequence of at
> >                         least [number] characters (default 4).
> >   -t --radix={o,d,x}    Print the location of the string in base 8, 10 or 16
> >   -o                    An alias for --radix=o
> >
> 
> Yes, strings has a lot of options and also busybox have several
> options. This is the best critic about proceeding with an integration.
> I will check if I can put an optimization into bb strings, just for my
> own curiosity.

This would be far better than reinventing the wheel.

> 
> > 3) output compatible with original gnu strings
> >
> > > In attachment the new version with the test suite and the benchmark
> > > suite in the header. The benchmark suite did not change with respect
> > > to the script file I just sent.
> > >
> > > Best regards, R-
> >
> > 

Re: Proposal for a new applet: strings

2023-07-23 Thread Roberto A. Foglietta
On Sun, 23 Jul 2023 at 11:42, tito  wrote:
>
> On Sun, 23 Jul 2023 00:36:09 +0200
> "Roberto A. Foglietta"  wrote:
>
> > On Sat, 22 Jul 2023 at 21:29, tito  wrote:
> > >
> > > On Sat, 22 Jul 2023 19:31:28 +0200
> > > "Roberto A. Foglietta"  wrote:
> > >
> > > > On Sat, 22 Jul 2023 at 15:40, tito  wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > I'm not the maintainer so I can say nothing about integration,
> > > > > I can just point out things that look strange to me and my limited 
> > > > > knowledge.
> > > > > When I read that this code is faster vs other code as I'm a curious
> > > > > person I just try to see how much faster it is and why as there
> > > > > is always something to learn on the busybox mailing list.
> > > > > If in my little tests it is not faster then I think I'm entitled
> > > > > to ask questions about it as science results should be reproducible.
> > > > >
> > > > > For simple benchmarking maybe reading a big enough file
> > > > > into memory and feeding it to strings in a few 1000 iterations
> > > > > should do to avoid bias from hdd/sdd and system load, one shot shows:
> > > > >
> > > > > ramtmp="$(mktemp -p /dev/shm/)"
> > > > >  dd if=vmlinux.o of=$ramtmp
> > > > > echo $ramtmp
> > > > > /dev/shm/tmp.ll3G2kzKE1
> > > > >
> > > > > 1) coreutils strings
> > > > > time  strings $ramtmp > /dev/null
> > > >
> > > > This is not correct because you are reading a file in tmpfs while the
> > >
> > > Yes, this was exactly the purpose of the test to eliminate all
> > > factors connected to underlying block devices and time
> > > the speed of code of the different implementations.
> > >
> >
> > Which is wrong because you did a hypothesis which is far away from the
> > typical usage and in some cases you can even use it because strings
> > over a 4GB ISO image would not necessarily fit into a tmpfs in every
> > system. Abstract benchmarks can be funny but do not depict/measure the
> > reality as usual. Extending this logic, we can trash the Ohm law
> > because we can reach in the laboratory a near zero temperature!
>
> I see but dropping the caches etc doesn't seem to be a typical use case 
> either.

Dropping the caches is a trick to bring the system as close as
possible to its state right after boot. It is indispensable for
a comparison with normal operation, which has a larger
variance in completion time across runs.
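
A minimal sketch of the two measurements being compared here, assuming a
Linux system and a date command that supports +%s. The helper names
bench_runs and cold_run are placeholders of this sketch; cold_run needs
root because it writes to /proc/sys/vm/drop_caches.

```sh
#!/bin/sh
# bench_runs N CMD FILE: run "CMD FILE" N times with stdout discarded and
# print the elapsed wall-clock time in whole seconds (warm-cache case).
bench_runs() {
    n=$1; cmd=$2; file=$3
    start=$(date +%s)
    i=0
    while [ "$i" -lt "$n" ]; do
        "$cmd" "$file" > /dev/null || return 1
        i=$((i + 1))
    done
    end=$(date +%s)
    echo $((end - start))
}

# cold_run CMD FILE: flush dirty pages, drop the page cache, then time a
# single run (cold-cache case, close to the state right after boot).
cold_run() {
    sync
    echo 3 > /proc/sys/vm/drop_caches || return 1
    bench_runs 1 "$1" "$2"
}
```

With one-second resolution the warm-cache case only gives meaningful
numbers over many iterations, e.g. bench_runs 1000 strings "$ramtmp".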

>
> Using the same optimization flag -O3 the busybox applet in a real life
> system gives close empirical results, which is the results most
> people in their normal life use cases (one shot, no loops running,
> no files in memory, no dropped caches, no giant multi-GB files)
> will see so the performance increase is swallowed by the system
> or by other bottlenecks.
>

This is correct, AFAIK my busybox has been compiled with -O2. I have to check.


> I think the size will rather increase as there are a bunch of features
> missing that the original bb implementation already has:
>
> 1) multiple file handling (a must i would dare to say)

Which is not such a problem, after all

for i in "$@"; do simply-strings "$i" | sed -e "s/^/$i:/"; done

the sed also puts the file name in front of each string, which
is useful for grepping. However, the single-file limitation leads to
customizing the approach:

for i in "$@"; do simply-strings "$i" | grep -w "word" && break; done; echo $i

For example. However, I admit that you are right about multiple-file
input. Personally, I do not need it at all and, if I did, I would use a
custom for loop.
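
A minimal sketch of the two loops above as POSIX shell functions. It
assumes that the extractor takes a single file argument; STRINGS_CMD and
the function names are placeholders of this sketch, not part of
simply-strings. Compared with the one-liners, the file name is prepended
with awk rather than sed, so a name containing / or & cannot break the
sed expression, and the name is printed only when grep really matched
(the one-liner echoes the last file even when nothing matched).

```sh
#!/bin/sh
# STRINGS_CMD is an assumption of this sketch: any single-file extractor
# (simply-strings, busybox strings, ...) can be substituted.
STRINGS_CMD=${STRINGS_CMD:-strings}

# Print every string of every file as "file:string".
strings_with_name() {
    for f in "$@"; do
        # The name goes through the environment, so awk takes it literally.
        "$STRINGS_CMD" "$f" | FNAME="$f" awk '{ print ENVIRON["FNAME"] ":" $0 }'
    done
}

# Print the first file that contains WORD as a whole word; return 1 if none.
first_file_with_word() {
    word=$1
    shift
    for f in "$@"; do
        if "$STRINGS_CMD" "$f" | grep -qw -- "$word"; then
            printf '%s\n' "$f"
            return 0
        fi
    done
    return 1
}
```

Usage would be strings_with_name "$@" | grep word, or
first_file_with_word word "$@".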


> 2) -a -f -o -n -t command line options
> The options are:
>   -a - --all            Scan the entire file, not just the data section [default]
>   -f --print-file-name  Print the name of the file before each string
>   -n --bytes=[number]   Locate & print any NUL-terminated sequence of at
>                         least [number] characters (default 4).
>   -t --radix={o,d,x}    Print the location of the string in base 8, 10 or 16
>   -o                    An alias for --radix=o
>

Yes, strings has a lot of options and busybox strings also has several
of them. This is the strongest criticism of proceeding with an integration.
I will check whether I can put an optimization into bb strings, just for my
own curiosity.


> 3) output compatible with original gnu strings
>
> > In attachment the new version with the test suite and the benchmark
> > suite in the header. The benchmark suite did not change with respect
> > to the script file I just sent.
> >
> > Best regards, R-
>
> BTW: there still seem to be corner-cases:
> list=`find /usr`
> for i in $list; do if test -f $i; then  ./strings $i > out1.txt; strings $i > 
> out2.txt; diff -q out1.txt out2.txt; fi; done
> Files out1.txt and out2.txt differ
> Files out1.txt and out2.txt differ
> Files out1.txt and out2.txt differ
> Files out1.txt and out2.txt differ
>
> test is still running

ok, I will do a run. Can you please echo the filenames, 

Re: Proposal for a new applet: strings

2023-07-23 Thread tito
On Sun, 23 Jul 2023 00:36:09 +0200
"Roberto A. Foglietta"  wrote:

> On Sat, 22 Jul 2023 at 21:29, tito  wrote:
> >
> > On Sat, 22 Jul 2023 19:31:28 +0200
> > "Roberto A. Foglietta"  wrote:
> >
> > > On Sat, 22 Jul 2023 at 15:40, tito  wrote:
> > >
> > > > Hi,
> > > >
> > > > I'm not the maintainer so I can say nothing about integration,
> > > > I can just point out things that look strange to me and my limited 
> > > > knowledge.
> > > > When I read that this code is faster vs other code as I'm a curious
> > > > person I just try to see how much faster it is and why as there
> > > > is always something to learn on the busybox mailing list.
> > > > If in my little tests it is not faster then I think I'm entitled
> > > > to ask questions about it as science results should be reproducible.
> > > >
> > > > For simple benchmarking maybe reading a big enough file
> > > > into memory and feeding it to strings in a few 1000 iterations
> > > > should do to avoid bias from hdd/sdd and system load, one shot shows:
> > > >
> > > > ramtmp="$(mktemp -p /dev/shm/)"
> > > >  dd if=vmlinux.o of=$ramtmp
> > > > echo $ramtmp
> > > > /dev/shm/tmp.ll3G2kzKE1
> > > >
> > > > 1) coreutils strings
> > > > time  strings $ramtmp > /dev/null
> > >
> > > This is not correct because you are reading a file in tmpfs while the
> >
> > Yes, this was exactly the purpose of the test to eliminate all
> > factors connected to underlying block devices and time
> > the speed of code of the different implementations.
> >
> 
> Which is wrong because you did a hypothesis which is far away from the
> typical usage and in some cases you can even use it because strings
> over a 4GB ISO image would not necessarily fit into a tmpfs in every
> system. Abstract benchmarks can be funny but do not depict/measure the
> reality as usual. Extending this logic, we can trash the Ohm law
> because we can reach in the laboratory a near zero temperature!

I see, but dropping the caches etc. doesn't seem to be a typical use case either.

With the same optimization flag -O3, the busybox applet gives close
empirical results on a real-life system. These are the results most
people will see in their normal use cases (one shot, no loops running,
no files in memory, no dropped caches, no giant multi-GB files),
so the performance increase is swallowed by the system
or by other bottlenecks.

> > >
> > > Lines particularly long, more than 4096 characters are divided into
> > > blocks with \n. It is clearly a corner case for which \n should be
> >
> > It was 35 corner cases in a handful of files due to a arbitrary hard-coded 
> > limitation.
> > You should maybe run it on `find /`
> 
> Which can be solved quite easily with a cast of a bool:
> 
> bool pr = isPrintable(*ch);
> 
> use this bool instead of the check into the code and change this one
> 
> #define print_text(p,b,c) if(p-b >= 4) { *p++ = 0; printf("%s%c",b,c); }
> 
> print_text(p, buffer, pr ? *ch : '\n'); // print collected text
> 
> in such a way it prints the current char rather than the new line.
> After this change the average of 2x speed has been maintained.
> 
> > >
> > > > I suspect this could be a problem for integration  and also size of 
> > > > code after integration is relevant.
> > >
> 
> The size can be reduced using two buffers only without losing
> performances because three are redundant. Which is the current size of
> the strings applet? Just to have an idea because the size of a single
> binary with main() et company cannot immediately be compared with a
> busybox applet. For sure is a lot more smaller than the strings which
> also require a large shared library:
> 
> size /usr/bin/strings
>    text    data     bss     dec     hex filename
>   20943    1472      64   22479    57cf /usr/bin/strings
> 
> ldd /usr/bin/strings
> linux-vdso.so.1 (0x7ffde4fbc000)
> libbfd-2.38-system.so => /lib/x86_64-linux-gnu/libbfd-2.38-system.so
> (0x7f6d64cf1000)
> libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x7f6d64ac9000)
> libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x7f6d64aad000)
> /lib64/ld-linux-x86-64.so.2 (0x7f6d64e8c000)
> 
> size /lib/x86_64-linux-gnu/libbfd-2.38-system.so
>    text    data     bss     dec     hex filename
> 1434786   94000     680 1529466  17567a /lib/x86_64-linux-gnu/libbfd-2.38-system.so
> 
> > > It is a corner case that could be addressed. I did not check the size
> > > of strings in busybox. However, once confirmed that the size is more
> > > important than the speed for busybox - I agree on this - then it can
> > > be proposed to binutils (or coreutils) depending on which package is
> > > included. I found the binary version for aarch64 on binutils, AFAIR.
> >
> > I wonder why should they be wanting to change their stable code for
> > a new implementation?
> 
> Because It is very easy to check that it works, it is 2x faster on
> average and on a fine-tuned system can reach 4x, the size drops
> dramatically considering to free the binary from the large 

Re: Proposal for a new applet: strings

2023-07-22 Thread Roberto A. Foglietta
On Sat, 22 Jul 2023 at 21:29, tito  wrote:
>
> On Sat, 22 Jul 2023 19:31:28 +0200
> "Roberto A. Foglietta"  wrote:
>
> > On Sat, 22 Jul 2023 at 15:40, tito  wrote:
> >
> > > Hi,
> > >
> > > I'm not the maintainer so I can say nothing about integration,
> > > I can just point out things that look strange to me and my limited 
> > > knowledge.
> > > When I read that this code is faster vs other code as I'm a curious
> > > person I just try to see how much faster it is and why as there
> > > is always something to learn on the busybox mailing list.
> > > If in my little tests it is not faster then I think I'm entitled
> > > to ask questions about it as science results should be reproducible.
> > >
> > > For simple benchmarking maybe reading a big enough file
> > > into memory and feeding it to strings in a few 1000 iterations
> > > should do to avoid bias from hdd/sdd and system load, one shot shows:
> > >
> > > ramtmp="$(mktemp -p /dev/shm/)"
> > >  dd if=vmlinux.o of=$ramtmp
> > > echo $ramtmp
> > > /dev/shm/tmp.ll3G2kzKE1
> > >
> > > 1) coreutils strings
> > > time  strings $ramtmp > /dev/null
> >
> > This is not correct because you are reading a file in tmpfs while the
>
> Yes, this was exactly the purpose of the test to eliminate all
> factors connected to underlying block devices and time
> the speed of code of the different implementations.
>

Which is wrong, because your hypothesis is far away from the
typical usage, and in some cases you cannot even use it, because
a 4GB ISO image to run strings over would not necessarily fit into a
tmpfs on every system. Abstract benchmarks can be fun but, as usual,
do not depict/measure reality. Extending this logic, we could trash
Ohm's law because in the laboratory we can reach a near-zero temperature!

> >
> > Lines particularly long, more than 4096 characters are divided into
> > blocks with \n. It is clearly a corner case for which \n should be
>
> It was 35 corner cases in a handful of files due to a arbitrary hard-coded 
> limitation.
> You should maybe run it on `find /`

Which can be solved quite easily by storing the check in a bool:

bool pr = isPrintable(*ch);

use this bool instead of the check in the code, and change this macro
so that it takes the terminating character as a parameter:

#define print_text(p,b,c) if(p-b >= 4) { *p++ = 0; printf("%s%c",b,c); }

print_text(p, buffer, pr ? *ch : '\n'); // print collected text

In this way, when the buffer fills up in the middle of a printable run,
it prints the current char rather than the newline. After this change
the average 2x speed-up has been maintained.
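
The idea above can be sketched as a small self-contained function. This
is only an illustration under stated assumptions, not the attached
strings.c: the names extract_strings and emitted, and the BUF_SIZE and
MIN_LEN constants, are hypothetical. Only the isPrintable macro, the
4096 buffer and the minimum length of 4 come from this thread. When the
buffer fills up inside a printable run, it is flushed without a newline
and the run continues.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define MIN_LEN  4     /* shortest run reported, as in print_text() */
#define BUF_SIZE 4096  /* fixed buffer size discussed in the thread */

#define isPrintable(c) ((c) == 0x09 || ((c) >= 0x20 && (c) <= 0x7e))

/* Scan data[0..len) and write every printable run of at least MIN_LEN
 * characters to out, one run per line. Returns the number of runs
 * written. Hypothetical helper, not the code of the attachment. */
static size_t extract_strings(const unsigned char *data, size_t len, FILE *out)
{
    char buf[BUF_SIZE];
    size_t used = 0;   /* characters currently held in buf */
    int emitted = 0;   /* part of the current run was already written */
    size_t count = 0;

    assert(data != NULL && out != NULL);

    /* One extra iteration (i == len) closes a run that reaches the end. */
    for (size_t i = 0; i <= len; i++) {
        int pr = (i < len) && isPrintable(data[i]);

        if (pr) {
            if (used == BUF_SIZE) {
                /* Buffer full in the middle of a run: flush, no '\n'. */
                fwrite(buf, 1, used, out);
                used = 0;
                emitted = 1;
            }
            buf[used++] = (char)data[i];
            continue;
        }
        /* The run ended on a non-printable byte or at end of input. */
        if (emitted || used >= MIN_LEN) {
            fwrite(buf, 1, used, out);
            fputc('\n', out);
            count++;
        }
        used = 0;
        emitted = 0;
    }
    return count;
}
```

Because BUF_SIZE is larger than MIN_LEN, a run that fills the buffer is
always long enough to be reported, so the flush never prints text that
should have been discarded.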

> >
> > > I suspect this could be a problem for integration  and also size of code 
> > > after integration is relevant.
> >

The size can be reduced by using only two buffers, without losing
performance, because a third one is redundant. What is the current size
of the strings applet? Just to have an idea, because the size of a
standalone binary with main() and company cannot be compared directly
with a busybox applet. For sure it is a lot smaller than the system
strings, which also requires a large shared library:

size /usr/bin/strings
   text    data     bss     dec     hex filename
  20943    1472      64   22479    57cf /usr/bin/strings

ldd /usr/bin/strings
linux-vdso.so.1 (0x7ffde4fbc000)
libbfd-2.38-system.so => /lib/x86_64-linux-gnu/libbfd-2.38-system.so (0x7f6d64cf1000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x7f6d64ac9000)
libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x7f6d64aad000)
/lib64/ld-linux-x86-64.so.2 (0x7f6d64e8c000)

size /lib/x86_64-linux-gnu/libbfd-2.38-system.so
   text    data     bss     dec     hex filename
1434786   94000     680 1529466  17567a /lib/x86_64-linux-gnu/libbfd-2.38-system.so

> > It is a corner case that could be addressed. I did not check the size
> > of strings in busybox. However, once confirmed that the size is more
> > important than the speed for busybox - I agree on this - then it can
> > be proposed to binutils (or coreutils) depending on which package is
> > included. I found the binary version for aarch64 on binutils, AFAIR.
>
> I wonder why should they be wanting to change their stable code for
> a new implementation?

Because it is very easy to check that it works, it is 2x faster on
average and can reach 4x on a fine-tuned system, and the size drops
dramatically once the binary is freed from the large shared library.

Attached is the new version, with the test suite and the benchmark
suite in the header. The benchmark suite did not change with respect
to the script file I just sent.

Best regards, R-
/*
 * (C) 2023, Roberto A. Foglietta 
 *   Released under the GPLv2 license terms.
 *
 * This is a rework of the original source code in public domain which is here:
 *
 * https://stackoverflow.com/questions/51389969/\
 *  implementing-my-own-strings-tool-missing-sequences-gnu-strings-finds
 *
*** HOW TO COMPILE *
 
 gcc -Wall -O3 strings.c -o strings && strip strings
 
*** HOW TO TEST 

Re: Proposal for a new applet: strings

2023-07-22 Thread tito
On Sat, 22 Jul 2023 19:31:28 +0200
"Roberto A. Foglietta"  wrote:

> On Sat, 22 Jul 2023 at 15:40, tito  wrote:
> 
> > Hi,
> >
> > I'm not the maintainer so I can say nothing about integration,
> > I can just point out things that look strange to me and my limited 
> > knowledge.
> > When I read that this code is faster vs other code as I'm a curious
> > person I just try to see how much faster it is and why as there
> > is always something to learn on the busybox mailing list.
> > If in my little tests it is not faster then I think I'm entitled
> > to ask questions about it as science results should be reproducible.
> >
> > For simple benchmarking maybe reading a big enough file
> > into memory and feeding it to strings in a few 1000 iterations
> > should do to avoid bias from hdd/sdd and system load, one shot shows:
> >
> > ramtmp="$(mktemp -p /dev/shm/)"
> >  dd if=vmlinux.o of=$ramtmp
> > echo $ramtmp
> > /dev/shm/tmp.ll3G2kzKE1
> >
> > 1) coreutils strings
> > time  strings $ramtmp > /dev/null
> 
> This is not correct because you are reading a file in tmpfs while the

Yes, this was exactly the purpose of the test: to eliminate all
factors connected to the underlying block devices and to time
the speed of the code of the different implementations.

> normal operations do not happen in this way for almost all the cases.
> Sometimes in ramfs, usually not. While it makes perfectly sense that
> the output will be sent to a tmpfs especially for those devices that
> the hdd/sdd/flash is particularly slow. After all, the strings output
> is temporary for its nature and IMHO is piped with grep, usually.
> 
> >
> > of course a few more iterations would give statistically better results.
> 
> The suite I provided with benchmark.sh is the answer because with
> dropping cache en/disabled check the two most important system states
> with all the cases that matter in real life, AFAIK.
> 
> > 2)  busybox strings vs  new strings:
> >
> > for i in $list; do if test -f $i; then  ./Desktop/strings $i > out1.txt; 
> > ./Desktop/busybox strings $i > out2.txt; diff -q out1.txt out2.txt; fi; done
> > Files out1.txt and out2.txt differ
> 
> Confirmed that exists some differences in output with this:
> 
> for i in /usr/bin/*; do if test -f $i; then ./strings $i > out1.txt;
> busybox strings $i > out2.txt; diff -q out1.txt out2.txt || break; fi;
> done
> 
> diff -pruN out1.txt out2.txt
> 
> Lines particularly long, more than 4096 characters are divided into
> blocks with \n. It is clearly a corner case for which \n should be

It was 35 corner cases in a handful of files, due to an arbitrary
hard-coded limitation.
You should maybe run it on `find /`

> omitted in printing. Thanks for this test, I did some but I did not
> catch the 4096 buffer overrun.
> 
> > I suspect this could be a problem for integration  and also size of code 
> > after integration is relevant.
> 
> It is a corner case that could be addressed. I did not check the size
> of strings in busybox. However, once confirmed that the size is more
> important than the speed for busybox - I agree on this - then it can
> be proposed to binutils (or coreutils) depending on which package is
> included. I found the binary version for aarch64 on binutils, AFAIR.

I wonder why they would want to change their stable code for
a new implementation?

> Best regards, R-
Have a nice weekend!

Ciao,
Tito
___
busybox mailing list
busybox@busybox.net
http://lists.busybox.net/mailman/listinfo/busybox


Re: Proposal for a new applet: strings

2023-07-22 Thread Roberto A. Foglietta
On Sat, 22 Jul 2023 at 15:40, tito  wrote:

> Hi,
>
> I'm not the maintainer so I can say nothing about integration,
> I can just point out things that look strange to me and my limited knowledge.
> When I read that this code is faster vs other code as I'm a curious
> person I just try to see how much faster it is and why as there
> is always something to learn on the busybox mailing list.
> If in my little tests it is not faster then I think I'm entitled
> to ask questions about it as science results should be reproducible.
>
> For simple benchmarking maybe reading a big enough file
> into memory and feeding it to strings in a few 1000 iterations
> should do to avoid bias from hdd/sdd and system load, one shot shows:
>
> ramtmp="$(mktemp -p /dev/shm/)"
>  dd if=vmlinux.o of=$ramtmp
> echo $ramtmp
> /dev/shm/tmp.ll3G2kzKE1
>
> 1) coreutils strings
> time  strings $ramtmp > /dev/null

This is not correct, because you are reading a file from tmpfs while in
almost all cases normal operation does not happen this way: the input
is sometimes in ramfs, but usually not. It makes perfect sense,
instead, that the output is sent to a tmpfs, especially on devices
where the hdd/ssd/flash is particularly slow. After all, the strings
output is temporary by nature and, IMHO, is usually piped into grep.

>
> of course a few more iterations would give statistically better results.

The suite I provided with benchmark.sh is the answer: run with cache
dropping enabled and disabled, it checks the two most important system
states and covers all the cases that matter in real life, AFAIK.

> 2)  busybox strings vs  new strings:
>
> for i in $list; do if test -f $i; then  ./Desktop/strings $i > out1.txt; 
> ./Desktop/busybox strings $i > out2.txt; diff -q out1.txt out2.txt; fi; done
> Files out1.txt and out2.txt differ

Confirmed, there are some differences in output, found with this:

for i in /usr/bin/*; do if test -f $i; then ./strings $i > out1.txt;
busybox strings $i > out2.txt; diff -q out1.txt out2.txt || break; fi;
done

diff -pruN out1.txt out2.txt

Particularly long lines, of more than 4096 characters, are divided into
blocks with \n. It is clearly a corner case in which the \n should be
omitted when printing. Thanks for this test; I did some but I did not
catch the 4096 buffer overrun.

> I suspect this could be a problem for integration  and also size of code 
> after integration is relevant.

It is a corner case that can be addressed. I did not check the size
of strings in busybox. However, once it is confirmed that size is more
important than speed for busybox (I agree on this), it can be proposed
to binutils or coreutils, depending on which package includes it. I
found the binary version for aarch64 in binutils, AFAIR.

Best regards, R-


Re: Proposal for a new applet: strings

2023-07-22 Thread tito
On Sat, 22 Jul 2023 13:35:56 +0200
"Roberto A. Foglietta"  wrote:

> On Sat, 22 Jul 2023 at 08:02, tito  wrote:
> >
> > Hi,
> > I just adopted the test in the PERFORMANCE section of your source
> >
> > ** PERFORMANCES 
> > ***
> >  *
> >  * gcc -Wall -O3 strings.orig.c -o strings && strip strings
> >  * rm -f [12].txt
> >  * time   strings /usr/bin/busybox >1.txt
> >  * real 0m0.035s
> >  * time ./strings /usr/bin/busybox >2.txt
> >  * real 0m1.843s
> >  *
> >  * gcc -Wall -O3 strings.c -o strings && strip strings
> >  * rm -f [12].txt
> >  * time   strings /usr/bin/busybox >1.txt
> >  * real 0m0.033s
> >  * time ./strings /usr/bin/busybox >2.txt
> >  * real 0m0.011s
> >  *
> >  ** FOOTPRINT 
> > **
> >
> 
> Sorry Tito,
> 
> I do not want to be pedantic but that is not a benchmark suite. It is
> just the presentation of cherry-picked results that are meaningful
> about performances.
> 
> After having introduced the use of tmpfs, the "real 0m0.033s" should
> be changed to "real 0m0.022s" or better the 11 in 16 in such a way to
> maintain a reasonable proportion with the original code but who cares
> anymore about the original code?.
> 
> Now, we can argue - why I did not share the entire benchmark suite in
> the header since the beginning. Well, after all the benchmark suite is
> something like for i in 1..100 do time. This means that the "core
> engine" behind the source of completion time is the same (in good and
> bad shapes, both). Everything else is just a matter of simple math and
> presentation.
> 
> Your three repetitions might seem very similar to my results
> presentation with a tiny sensitive difference: one single result is
> NOT a statistics, it does not even take in consideration the
> hypothesis of a variance in performances therefore it is a statement
> (a source of truth!). While three repetitions makes me think that
> something about statistics has been overlooked or like I did - I
> appreciated your sense of humor in suggesting that I should present
> statistics instead of a statement in the header.
> 
> Finally, the mktmp on tmpfs is a must. Otherwise we are going to test
> the performance of our hdd/ssd when the execution time is faster than
> the I/O throughput. I have one of the fastest commercial SSD mounted
> on my laptop and therefore I do not care very much but an
> embedded/mobile system can have 8 CPU pipes like my laptop but a much
> slower flash on SoC.
> 
> This is just to say that a benchmark should take care about caching
> and disk I/O. Obviously, the disk I/O gets into the picture also when
> we read the file to strings. Unless the file is cached (and a cache
> system exists and works efficiently) but in input is part of the
> benchmark. How the file/stdin is read, is part of the stings way of
> working, obviously. Probably we also need to try something like "sync;
> echo 3 > /proc/sys/vm/drop_caches" before every execution and use
> something different to string than a busybox, something which is
> independent from strings and bb, both.
> 
> Finally, the benchmark suite should also do a single run before all
> the tests just to charge the cache and unleash the CPU at its maximum
> performance. Plus putting the CPU set to performance instead of
> anything else just to quickly make it work at its best.
> 
> The benchmark.sh in attachment does stuff like that and can run with
> drop-the-cache or not. For this reason, it requires root privileges to
> run. It shows at my home that in fact 33 / 11 are a very good
> estimation and another one could be 32 / 12 or anything in between.
> Therefore my statement in the header was a source of truth under the
> limitation of the tests I did, obviously.
> 
> In the code I have replaced the static inline function with a macro
> 
> #define isPrintable(c) ((c) == 0x09 || ((c) >= 0x20 && (c) <= 0x7e))
> 
> because the inline is not always granted and the code is used just a
> single time. Probably the -O3 gcc optimization got it and inline by
> default.
> 
> == QUESTION ==
> 
> Is this a preliminary work for the integration task or just an
> educated academic ping pong e-mail exchange?
> 
> Best regards, R-

Hi,

I'm not the maintainer, so I can say nothing about integration;
I can just point out things that look strange to me and my limited knowledge.
When I read that this code is faster than other code, as I'm a curious
person, I just try to see how much faster it is and why, as there
is always something to learn on the busybox mailing list.
If in my little tests it is not faster, then I think I'm entitled
to ask questions about it, as scientific results should be reproducible.

For simple benchmarking, maybe reading a big enough file
into memory and feeding it to strings for a few thousand iterations
should do, to avoid bias from hdd/ssd and system load. One shot shows:

ramtmp="$(mktemp -p /dev/shm/)"
 dd if=vmlinux.o of=$ramtmp

Re: Proposal for a new applet: strings

2023-07-22 Thread Roberto A. Foglietta
On Sat, 22 Jul 2023 at 08:02, tito  wrote:
>
> Hi,
> I just adopted the test in the PERFORMANCE section of your source
>
> ** PERFORMANCES 
> ***
>  *
>  * gcc -Wall -O3 strings.orig.c -o strings && strip strings
>  * rm -f [12].txt
>  * time   strings /usr/bin/busybox >1.txt
>  * real 0m0.035s
>  * time ./strings /usr/bin/busybox >2.txt
>  * real 0m1.843s
>  *
>  * gcc -Wall -O3 strings.c -o strings && strip strings
>  * rm -f [12].txt
>  * time   strings /usr/bin/busybox >1.txt
>  * real 0m0.033s
>  * time ./strings /usr/bin/busybox >2.txt
>  * real 0m0.011s
>  *
>  ** FOOTPRINT 
> **
>

Sorry Tito,

I do not want to be pedantic, but that is not a benchmark suite. It is
just the presentation of cherry-picked results that are representative
of the performance.

After having introduced the use of tmpfs, the "real 0m0.033s" should
be changed to "real 0m0.022s" or, better, the 11 should be changed to
16, so as to maintain a reasonable proportion with the original code.
But who cares about the original code anymore?

Now, we can argue about why I did not share the entire benchmark suite
in the header from the beginning. Well, after all, the benchmark suite
is something like "for i in 1..100; do time ...". This means that the
"core engine" that produces the completion times is the same (for
better and for worse). Everything else is just a matter of simple math
and presentation.

Your three repetitions might seem very similar to my results
presentation, with a tiny but significant difference: one single result
is NOT a statistic; it does not even take into consideration the
hypothesis of a variance in performance, therefore it is a statement
(a source of truth!). Three repetitions, instead, make me think that
something about statistics has been overlooked or, as I did, I
appreciate your sense of humor in suggesting that I should present
statistics instead of a statement in the header.

Finally, the mktemp on tmpfs is a must. Otherwise we are going to test
the performance of our hdd/ssd whenever the execution is faster than
the I/O throughput. I have one of the fastest commercial SSDs mounted
in my laptop and therefore I do not care very much, but an
embedded/mobile system can have 8 CPU threads like my laptop but a much
slower flash on its SoC.

This is just to say that a benchmark should take care of caching
and disk I/O. Obviously, disk I/O also gets into the picture when
strings reads the file, unless the file is cached (and a cache system
exists and works efficiently), but the input is part of the benchmark:
how the file/stdin is read is part of the way strings works. Probably
we also need to try something like "sync;
echo 3 > /proc/sys/vm/drop_caches" before every execution, and to run
strings on something other than busybox, something which is
independent from both strings and bb.

Finally, the benchmark suite should also do a single run before all
the tests, just to warm up the cache and bring the CPU to its maximum
performance, plus set the CPU governor to performance rather than
anything else, so that it quickly works at its best.
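
The warm-up run described above can be sketched in C as follows. This
is a rough illustration, not the attached benchmark.sh: the function
names time_command_ms and bench_avg_ms are hypothetical, and because
the command is started through system(), the shell start-up time is
included in every measurement. Dropping the caches and setting the CPU
governor need root and are left out of the sketch.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* Wall-clock time of one run of cmd in milliseconds, or -1.0 if the
 * clock cannot be read or the command does not exit with status 0. */
static double time_command_ms(const char *cmd)
{
    struct timespec t0, t1;

    assert(cmd != NULL);
    if (clock_gettime(CLOCK_MONOTONIC, &t0) != 0)
        return -1.0;
    if (system(cmd) != 0)
        return -1.0;
    if (clock_gettime(CLOCK_MONOTONIC, &t1) != 0)
        return -1.0;
    return (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
}

/* Average over n timed runs, after one untimed warm-up run that loads
 * the page cache and lets the CPU ramp up. Returns -1.0 on failure. */
static double bench_avg_ms(const char *cmd, int n)
{
    double sum = 0.0;

    assert(n > 0);
    if (time_command_ms(cmd) < 0)   /* warm-up: result discarded */
        return -1.0;
    for (int i = 0; i < n; i++) {
        double ms = time_command_ms(cmd);

        if (ms < 0)
            return -1.0;
        sum += ms;
    }
    return sum / n;
}
```

A call such as bench_avg_ms("./strings /usr/bin/busybox >/dev/null", 100)
would then correspond to one cell of the test matrix.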

The attached benchmark.sh does things like that and can run with or
without dropping the cache. For this reason, it requires root
privileges to run. On my machine it shows that 33 / 11 is in fact a
very good estimate, and another one could be 32 / 12 or anything in
between. Therefore my statement in the header was a source of truth
within the limits of the tests I did, obviously.

In the code I have replaced the static inline function with a macro

#define isPrintable(c) ((c) == 0x09 || ((c) >= 0x20 && (c) <= 0x7e))

because inlining is not always guaranteed and the code is used only
once. Probably the gcc -O3 optimization catches it and inlines it by
default.
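
For comparison, here is a minimal sketch of the two forms side by side.
The macro is the one quoted above; the is_printable and count_printable
functions are hypothetical names added only for this illustration. One
thing to keep in mind with the macro form is that its argument may be
evaluated up to three times, so it should be given a plain variable
rather than an expression with side effects such as *p++.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Macro form: always expanded in place, argument evaluated up to 3 times. */
#define isPrintable(c) ((c) == 0x09 || ((c) >= 0x20 && (c) <= 0x7e))

/* Function form: argument evaluated once, inlining left to the compiler. */
static inline bool is_printable(unsigned char c)
{
    return c == 0x09 || (c >= 0x20 && c <= 0x7e);
}

/* Count the printable bytes in p[0..len). */
static size_t count_printable(const unsigned char *p, size_t len)
{
    size_t n = 0;

    assert(p != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        unsigned char c = p[i];   /* read once, so the macro is safe */

        if (isPrintable(c))
            n++;
    }
    return n;
}
```

Both forms accept TAB (0x09) and the ASCII range 0x20..0x7e and reject
everything else, including newline.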

== QUESTION ==

Is this a preliminary work for the integration task or just an
educated academic ping pong e-mail exchange?

Best regards, R-


benchmark.sh
Description: application/shellscript


Re: Proposal for a new applet: strings

2023-07-22 Thread tito
On Sat, 22 Jul 2023 03:36:52 +0200
"Roberto A. Foglietta"  wrote:

> On Fri, 21 Jul 2023 at 22:37, tito  wrote:
> >
> > On Fri, 21 Jul 2023 21:39:57 +0200
> > "Roberto A. Foglietta"  wrote:
> >
> > > To the maintainers and everyone else whom can be interested,
> > >
> [...]
> >
> > Hi,
> > seems to me that the current strings busybox implementation is faster.
> >
> [...]
> 
> @Tito, first of all thanks for the test.
> 
> The 'strings' applet into busybox is good enough for me and moreover,
> it is not about performance. The busybox installed into the system can
> theoretically be replaced but adding the only thing I am missing is
> easier. For my personal needs, this thread could be ended here.
> 
> Just for sake of completeness, I took some tests too. There are
> several ways to use 'strings' and they belong to a matrix of 2x3 cases
> in:{file, stdin} X out:{file, stdout, /dev/null}.
> 
> We can agree about the stdin case that the time of completion greatly
> depends on the way in which stdin feeds the command and among many
> ways the 'cat' is probably the most typical. Another aspect that can
> change the performances is about pre-compiled VS locally-compiled
> binaries. Therefore, the standard pre-compiled busybox has been
> introduced into the tests which finally create a 4 dimensions matrix
> like this {gnu strings, system busybox, compiled busybox, compiled
> strings} x :{file, stdin} X out:{file, stdout, /dev/null}.
> 
> Finally, we need a test function with parameters and a basic but
> reasonably accurate average calculation. Using this function over the
> above 4D matrix, I am feeling confident in saying that:
> 
> 1. there is no difference between pre-compiled and locally-compiled busybox
> 2. the busybox strings is as fast as GNU strings
> 3. the code I sent is on average 2x faster
> 4. the code I sent worst case is just 23% slower of the best of the
> others 3 cases
> 5. the code I sent best case is above 4x faster than the best of the
> others 3 cases
> 
> Here below all the information to replicate the test on different
> systems and architectures.
> 
> cat /proc/cpuinfo  | grep "model name" | tail -n1
> model name : Intel(R) Core(TM) i5-8350U CPU @ 1.70GHz
> 
> GNOME Terminal
> Version 3.44.0 for GNOME 42
> Using VTE version 0.68.0 +BIDI +GNUTLS +ICU +SYSTEMD
> 
> stats() {
> rm -f t.time
> local cmd=${1:-/usr/bin/busybox strings /usr/bin/busybox} n=${2:-100}
> for i in $(seq 1 $n); do eval time $cmd; done 2>t.time
> {
> echo
> echo "$cmd"
> sed -ne "s,real\t,min: ,p" t.time | sort -n | head -n1
> let avg=$(sed -ne "s,real\t0m0.[0]*\([0-9]*\)s,\\1,p" t.time | tr '\n' '+')0
> printf "avg: 0m0.%03ds\n" $(( (50+$avg)/100 ))
> sed -ne "s,real\t,max: ,p" t.time | sort -n | tail -n1
> } >&2
> }
> 
> =
> 
> /usr/bin/busybox | head -n1
> BusyBox v1.30.1 (Ubuntu 1:1.30.1-7ubuntu3) multi-call binary.
> 
> rm -f 1.txt; cmd="/usr/bin/busybox strings /usr/bin/busybox";
> stats "$cmd "; stats "$cmd" >/dev/null; stats "$cmd" >1.txt
> 
> /usr/bin/busybox strings /usr/bin/busybox
> min: 0m0.067s
> avg: 0m0.090s
> max: 0m0.134s
> 
> /usr/bin/busybox strings /usr/bin/busybox
> min: 0m0.013s
> avg: 0m0.013s
> max: 0m0.016s
> 
> /usr/bin/busybox strings /usr/bin/busybox
> min: 0m0.013s
> avg: 0m0.014s
> max: 0m0.018s
> 
> rm -f 1.txt; cmd="cat /usr/bin/busybox | /usr/bin/busybox strings";
> stats "$cmd "; stats "$cmd" >/dev/null; stats "$cmd" >1.txt
> 
> cat /usr/bin/busybox | /usr/bin/busybox strings
> min: 0m0.072s
> avg: 0m0.095s
> max: 0m0.141s
> 
> cat /usr/bin/busybox | /usr/bin/busybox strings
> min: 0m0.021s
> avg: 0m0.022s
> max: 0m0.026s
> 
> cat /usr/bin/busybox | /usr/bin/busybox strings
> min: 0m0.021s
> avg: 0m0.022s
> max: 0m0.025s
> 
> =
> 
> ./busybox | head -n1
> BusyBox v1.37.0.git (2023-07-21 15:06:26 CEST) multi-call binary.
> gcc --version | head -n1
> gcc (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0
> 
> rm -f 1.txt; cmd="./busybox strings /usr/bin/busybox";
> stats "$cmd "; stats "$cmd" >/dev/null; stats "$cmd" >1.tx
> 
> ./busybox strings /usr/bin/busybox
> min: 0m0.064s
> avg: 0m0.089s
> max: 0m0.123s
> 
> ./busybox strings /usr/bin/busybox
> min: 0m0.014s
> avg: 0m0.014s
> max: 0m0.016s
> 
> ./busybox strings /usr/bin/busybox
> min: 0m0.014s
> avg: 0m0.014s
> max: 0m0.016s
> 
> rm -f 1.txt; cmd="cat /usr/bin/busybox | ./busybox strings";
> stats "$cmd "; stats "$cmd" >/dev/null; stats "$cmd" >1.txt
> 
> cat /usr/bin/busybox | ./busybox strings
> min: 0m0.075s
> avg: 0m0.098s
> max: 0m0.148s
> 
> cat /usr/bin/busybox | ./busybox strings
> min: 0m0.022s
> avg: 0m0.023s
> max: 0m0.033s
> 
> cat /usr/bin/busybox | ./busybox strings
> min: 0m0.023s
> avg: 0m0.023s
> max: 0m0.027s
> 
> =
> 
> gcc -Wall -O3 strings.c -o strings && strip strings
> gcc --version | head -n1
> gcc (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0
> 
> rm -f 1.txt; cmd="../redfishos/recovery/strings /usr/bin/busybox";
> stats "$cmd "; stats "$cmd" >/dev/null; stats "$cmd" >1.txt
> 
> 

Re: Proposal for a new applet: strings

2023-07-21 Thread Roberto A. Foglietta
On Sat, 22 Jul 2023 at 03:36, Roberto A. Foglietta
 wrote:
>
> On Fri, 21 Jul 2023 at 22:37, tito  wrote:
> >
> > On Fri, 21 Jul 2023 21:39:57 +0200
> > "Roberto A. Foglietta"  wrote:

[...]

ERRATA CORRIGE

> rm -f 1.txt; cmd="cat /usr/bin/busybox | ../redfishos/recovery/strings 
> strings";
> stats "$cmd "; stats "$cmd" >/dev/null; stats "$cmd" >1.txt

The first row above is wrong, it should be

rm -f 1.txt; cmd="cat /usr/bin/busybox | ../redfishos/recovery/strings";

However looking at the results, it was used correctly in the tests:

>
> cat /usr/bin/busybox | ../redfishos/recovery/strings
> min: 0m0.041s
> avg: 0m0.050s
> max: 0m0.082s
>

Simply, I did not update my notes after having fixed it on the command line.

Here below a quick test suite for 4 meaningful cases:

bbcmd=$(which busybox)

rm -f 1.txt; cmd="$bbcmd strings $bbcmd";
stats "$cmd "; stats "$cmd" >/dev/null; stats "$cmd" >1.txt

rm -f 1.txt; cmd="cat $bbcmd | $bbcmd strings";
stats "$cmd "; stats "$cmd" >/dev/null; stats "$cmd" >1.txt

rm -f 1.txt; cmd="./strings $bbcmd";
stats "$cmd "; stats "$cmd" >/dev/null; stats "$cmd" >1.txt

rm -f 1.txt; cmd="cat $bbcmd | ./strings";
stats "$cmd "; stats "$cmd" >/dev/null; stats "$cmd" >1.txt

You can easily replace the system busybox with your compiled version.

Finally, the stats() function should use tmpfile=$(mktemp) instead of
t.time, just in case someone wishes to adopt it for general use. In
this case it is (c) 2023, me, under GPLv2 as well.


Re: Proposal for a new applet: strings

2023-07-21 Thread Roberto A. Foglietta
On Fri, 21 Jul 2023 at 22:37, tito  wrote:
>
> On Fri, 21 Jul 2023 21:39:57 +0200
> "Roberto A. Foglietta"  wrote:
>
> > To the maintainers and everyone else whom can be interested,
> >
[...]
>
> Hi,
> seems to me that the current strings busybox implementation is faster.
>
[...]

@Tito, first of all thanks for the test.

The 'strings' applet in busybox is good enough for me and, moreover,
it is not about performance. The busybox installed in the system can
theoretically be replaced, but adding the only thing I am missing is
easier. For my personal needs, this thread could end here.

Just for the sake of completeness, I ran some tests too. There are
several ways to use 'strings' and they belong to a matrix of 2x3 cases
in:{file, stdin} X out:{file, stdout, /dev/null}.

We can agree that, in the stdin case, the completion time greatly
depends on the way in which stdin feeds the command, and among the many
ways 'cat' is probably the most typical. Another aspect that can
change the performance is pre-compiled VS locally-compiled
binaries. Therefore, the standard pre-compiled busybox has been
introduced into the tests, which finally creates a 4x2x3 matrix
like this: {gnu strings, system busybox, compiled busybox, compiled
strings} X in:{file, stdin} X out:{file, stdout, /dev/null}.

Finally, we need a test function with parameters and a basic but
reasonably accurate average calculation. Using this function over the
above matrix, I feel confident in saying that:

1. there is no difference between the pre-compiled and the locally-compiled busybox
2. the busybox strings is as fast as GNU strings
3. the code I sent is on average 2x faster
4. in the worst case, the code I sent is just 23% slower than the best
of the other 3 cases
5. in the best case, the code I sent is more than 4x faster than the
best of the other 3 cases

Here below all the information to replicate the test on different
systems and architectures.

cat /proc/cpuinfo  | grep "model name" | tail -n1
model name : Intel(R) Core(TM) i5-8350U CPU @ 1.70GHz

GNOME Terminal
Version 3.44.0 for GNOME 42
Using VTE version 0.68.0 +BIDI +GNUTLS +ICU +SYSTEMD

stats() {
rm -f t.time
local cmd=${1:-/usr/bin/busybox strings /usr/bin/busybox} n=${2:-100}
for i in $(seq 1 $n); do eval time $cmd; done 2>t.time
{
echo
echo "$cmd"
sed -ne "s,real\t,min: ,p" t.time | sort -n | head -n1
let avg=$(sed -ne "s,real\t0m0.[0]*\([0-9]*\)s,\\1,p" t.time | tr '\n' '+')0
printf "avg: 0m0.%03ds\n" $(( (avg + n/2) / n ))
sed -ne "s,real\t,max: ,p" t.time | sort -n | tail -n1
} >&2
}
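
The min/avg/max summary that stats() prints can also be sketched in C,
for anyone who prefers to collect the timings programmatically. This is
only an illustration: the run_stats struct and the summarize function
are hypothetical names, and the samples are assumed to be already
available as whole milliseconds. The average is rounded to the nearest
millisecond over the actual number of samples.

```c
#include <assert.h>
#include <stddef.h>

/* Summary of a series of timings, all in milliseconds. */
struct run_stats {
    long min_ms;
    long avg_ms;
    long max_ms;
};

/* Compute min, rounded average and max of ms[0..n). Requires n > 0. */
static struct run_stats summarize(const long *ms, size_t n)
{
    struct run_stats s;
    long long sum = 0;

    assert(ms != NULL && n > 0);
    s.min_ms = s.max_ms = ms[0];
    for (size_t i = 0; i < n; i++) {
        if (ms[i] < s.min_ms)
            s.min_ms = ms[i];
        if (ms[i] > s.max_ms)
            s.max_ms = ms[i];
        sum += ms[i];
    }
    /* Adding n/2 before dividing rounds to the nearest integer. */
    s.avg_ms = (long)((sum + (long long)(n / 2)) / (long long)n);
    return s;
}
```

With three samples of 13, 14 and 18 ms, this gives min 13, avg 15 and
max 18.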

=

/usr/bin/busybox | head -n1
BusyBox v1.30.1 (Ubuntu 1:1.30.1-7ubuntu3) multi-call binary.

rm -f 1.txt; cmd="/usr/bin/busybox strings /usr/bin/busybox";
stats "$cmd "; stats "$cmd" >/dev/null; stats "$cmd" >1.txt

/usr/bin/busybox strings /usr/bin/busybox
min: 0m0.067s
avg: 0m0.090s
max: 0m0.134s

/usr/bin/busybox strings /usr/bin/busybox
min: 0m0.013s
avg: 0m0.013s
max: 0m0.016s

/usr/bin/busybox strings /usr/bin/busybox
min: 0m0.013s
avg: 0m0.014s
max: 0m0.018s

rm -f 1.txt; cmd="cat /usr/bin/busybox | /usr/bin/busybox strings";
stats "$cmd "; stats "$cmd" >/dev/null; stats "$cmd" >1.txt

cat /usr/bin/busybox | /usr/bin/busybox strings
min: 0m0.072s
avg: 0m0.095s
max: 0m0.141s

cat /usr/bin/busybox | /usr/bin/busybox strings
min: 0m0.021s
avg: 0m0.022s
max: 0m0.026s

cat /usr/bin/busybox | /usr/bin/busybox strings
min: 0m0.021s
avg: 0m0.022s
max: 0m0.025s

=

./busybox | head -n1
BusyBox v1.37.0.git (2023-07-21 15:06:26 CEST) multi-call binary.
gcc --version | head -n1
gcc (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0

rm -f 1.txt; cmd="./busybox strings /usr/bin/busybox";
stats "$cmd "; stats "$cmd" >/dev/null; stats "$cmd" >1.txt

./busybox strings /usr/bin/busybox
min: 0m0.064s
avg: 0m0.089s
max: 0m0.123s

./busybox strings /usr/bin/busybox
min: 0m0.014s
avg: 0m0.014s
max: 0m0.016s

./busybox strings /usr/bin/busybox
min: 0m0.014s
avg: 0m0.014s
max: 0m0.016s

rm -f 1.txt; cmd="cat /usr/bin/busybox | ./busybox strings";
stats "$cmd "; stats "$cmd" >/dev/null; stats "$cmd" >1.txt

cat /usr/bin/busybox | ./busybox strings
min: 0m0.075s
avg: 0m0.098s
max: 0m0.148s

cat /usr/bin/busybox | ./busybox strings
min: 0m0.022s
avg: 0m0.023s
max: 0m0.033s

cat /usr/bin/busybox | ./busybox strings
min: 0m0.023s
avg: 0m0.023s
max: 0m0.027s

=

gcc -Wall -O3 strings.c -o strings && strip strings
gcc --version | head -n1
gcc (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0

rm -f 1.txt; cmd="../redfishos/recovery/strings /usr/bin/busybox";
stats "$cmd "; stats "$cmd" >/dev/null; stats "$cmd" >1.txt

../redfishos/recovery/strings /usr/bin/busybox
min: 0m0.015s
avg: 0m0.040s
max: 0m0.079s

../redfishos/recovery/strings /usr/bin/busybox
min: 0m0.009s
avg: 0m0.009s
max: 0m0.010s

../redfishos/recovery/strings /usr/bin/busybox
min: 0m0.009s
avg: 0m0.010s
max: 0m0.021s

rm -f 1.txt; cmd="cat /usr/bin/busybox | ../redfishos/recovery/strings strings";
stats "$cmd "; stats "$cmd" >/dev/null; stats 

Re: Proposal for a new applet: strings

2023-07-21 Thread tito
On Fri, 21 Jul 2023 21:39:57 +0200
"Roberto A. Foglietta"  wrote:

> To the maintainers and everyone else whom can be interested,
> 
>  today I developed for a mobile device running a Linux kernel and
> related GNU system (almost) a quite promising version of 'strings'
> faster than the original and giving out the same output with the same
> syntax. Good for me!
> 
>  I hope it will be good enough for you also. However, before starting
> the integration with busybox I wish to know if this applet and/or its
> implementation or an improved one, would have a good chance and
> appreciation to enter in the busybox development stream.
> 
>  Here in the attachment I put the .c file in GPLv2, like busybox. In
> its header just a raw estimation of its performance that can vary from
> system to system and among different platforms but at least comparable
> with the original from binutils.
> 
>  Best regards, R-

Hi,
seems to me that the current strings busybox implementation is faster.

tito@devuan:~/Desktop$ time ./strings vmlinux.o > out.txt

real    0m0.369s
user    0m0.365s
sys     0m0.004s
tito@devuan:~/Desktop$ time ./busybox strings vmlinux.o > out1.txt

real    0m0.289s
user    0m0.277s
sys     0m0.012s

tito@devuan:~/Desktop$ rm out*.txt
tito@devuan:~/Desktop$ time ./strings vmlinux.o > out.txt

real    0m0.355s
user    0m0.347s
sys     0m0.008s
tito@devuan:~/Desktop$ time ./busybox strings vmlinux.o > out1.txt

real    0m0.291s
user    0m0.286s
sys     0m0.004s

tito@devuan:~/Desktop$ rm out*.txt
tito@devuan:~/Desktop$ time ./strings vmlinux.o > out.txt

real    0m0.356s
user    0m0.335s
sys     0m0.020s
tito@devuan:~/Desktop$ time ./busybox strings vmlinux.o > out1.txt

real    0m0.291s
user    0m0.274s
sys     0m0.016s

Ciao,
Tito
