Re: Bug#738981: Fwd: Bug#738981: Switch to use generic_fpu for ARM
Riku Voipio wrote:
> I think nofpu would be good for Raspbian. Any lost audio quality would be unnoticeable on the Raspberry's analog audio output ;)
>
> Peter, what's the recommended way to recognize Raspbian in debian/rules?

dpkg-vendor --derives-from raspbian

Archive: https://lists.debian.org/531922fc.40...@p10link.net
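For illustration only, here is a minimal Python sketch of the selection logic discussed in this thread. It assumes dpkg-vendor is on the PATH and relies on its documented behaviour of exiting with status 0 when the current vendor derives from the named one. The helper names derives_from and pick_cpu_flag are hypothetical, and the Raspbian branch reflects Riku's suggestion, not a settled decision. A real package would do this in debian/rules rather than in Python.

```python
import subprocess


def derives_from(vendor, run=subprocess.run):
    """Return True if dpkg-vendor says the current vendor derives from `vendor`.

    `run` is injectable so the logic can be exercised without dpkg installed.
    """
    try:
        result = run(["dpkg-vendor", "--derives-from", vendor], check=False)
    except FileNotFoundError:
        # dpkg-vendor is not available, so this is not a Debian-like build host.
        return False
    return result.returncode == 0


def pick_cpu_flag(arch, is_raspbian):
    """Map a Debian architecture to the mpg123 configure flag proposed in the thread.

    Hypothetical helper: armel -> arm_nofpu, armhf -> arm_fpu,
    with Raspbian (armhf without NEON) using arm_nofpu as Riku suggests.
    """
    if arch == "armel":
        return "--with-cpu=arm_nofpu"
    if arch == "armhf":
        return "--with-cpu=arm_nofpu" if is_raspbian else "--with-cpu=arm_fpu"
    # Other architectures: leave the upstream default untouched.
    return None


if __name__ == "__main__":
    print(pick_cpu_flag("armhf", derives_from("raspbian")))
```

Passing the flag explicitly for both armel and armhf matches the preference, voiced elsewhere in the thread, for not depending on upstream defaults.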
Re: Bug#738981: Fwd: Bug#738981: Switch to use generic_fpu for ARM
On Tue, Mar 04, 2014 at 11:49:45AM +0100, Thomas Orgis wrote:
> In any case ... Riku: Care to run timings of MAD on your configurations? I'm interested in how fast it is producing that 24 bit output on limited CPUs.

time madplay -d -o null: convergence_-_points_of_view/*.mp3 > /dev/null

Cortex A15:
real    0m33.154s
user    0m33.045s
sys     0m0.110s

ARMv5:
real    1m35.923s
user    1m18.290s
sys     0m0.070s

Seems mpg123 wins bragging rights :)

thanks, awesome work!

Riku

Archive: https://lists.debian.org/20140305091407.ga16...@afflict.kos.to
Re: Bug#738981: Fwd: Bug#738981: Switch to use generic_fpu for ARM
On Tue, Mar 04, 2014 at 02:59:44AM +0000, peter green wrote:
> On Sun, Mar 02, 2014 at 09:06:44AM -0500, Reinhard Tartler wrote:
> > That sounds as if the mpg123 package should use:
> >
> > on armel: --with-cpu=arm_nofpu
> > on armhf: --with-cpu=arm_fpu
> >
> > Does this make sense to everybody?
>
> Seems sane to me. armv7 devices without neon are relatively uncommon, so while it's important that they are supported, it's IMO not vitally important to squeeze out every last drop of performance from them.
>
> I wonder what we should use on raspbian? I haven't tested on a Pi yet, but it seems that in all tests I've seen so far the generic fpu code is quite a bit slower than the arm nofpu code. Is there any quality difference from using a fpu vs nonfpu decoder? If so, how much performance degradation do you believe should be accepted in exchange for that quality improvement?

I think nofpu would be good for Raspbian. Any lost audio quality would be unnoticeable on the Raspberry's analog audio output ;)

Peter, what's the recommended way to recognize Raspbian in debian/rules?

Riku

Archive: https://lists.debian.org/20140305093430.gb16...@afflict.kos.to
Re: Bug#738981: Fwd: Bug#738981: Switch to use generic_fpu for ARM
On Tue, 04 Mar 2014 02:59:44 +0000, peter green plugw...@p10link.net wrote:
> Is there any quality difference from using a fpu vs nonfpu decoder?

Technically, there is. See those numbers for generic fpu and non-fpu code with and without --enable-int-quality given to configure (enables better rounding for a small performance hit; you might want to activate that by default). In numbers, the difference is this:

== src/mpg123.fpu_accurate.compliance.txt ==
Layer 3
-- 16 bit signed integer output
compl.bit: RMS=4.300914e-06 (PASS) maxdiff=7.688999e-06 (PASS)
-- 32 bit integer output
compl.bit: RMS=2.152784e-08 (PASS) maxdiff=1.769513e-07 (PASS)
-- 24 bit integer output
compl.bit: RMS=4.206462e-08 (PASS) maxdiff=1.788139e-07 (PASS)
-- 32 bit floating point output
compl.bit: RMS=2.153045e-08 (PASS) maxdiff=1.769513e-07 (PASS)

== src/mpg123.fpu.compliance.txt ==
Layer 3
-- 16 bit signed integer output
compl.bit: RMS=8.907757e-06 (LIMITED) maxdiff=1.531839e-05 (PASS)
-- 32 bit integer output
compl.bit: RMS=2.152589e-08 (PASS) maxdiff=1.769513e-07 (PASS)
-- 24 bit integer output
compl.bit: RMS=4.205495e-08 (PASS) maxdiff=1.788139e-07 (PASS)
-- 32 bit floating point output
compl.bit: RMS=2.153045e-08 (PASS) maxdiff=1.769513e-07 (PASS)

== src/mpg123.nofpu_accurate.compliance.txt ==
Layer 3
-- 16 bit signed integer output
compl.bit: RMS=4.344827e-06 (PASS) maxdiff=1.275539e-05 (PASS)
-- 32 bit integer output
compl.bit: RMS=4.344827e-06 (PASS) maxdiff=1.275539e-05 (PASS)
-- 24 bit integer output
compl.bit: RMS=4.344827e-06 (PASS) maxdiff=1.275539e-05 (PASS)
-- 32 bit floating point output
compl.bit: RMS=4.344827e-06 (PASS) maxdiff=1.275539e-05 (PASS)

== src/mpg123.nofpu.compliance.txt ==
Layer 3
-- 16 bit signed integer output
compl.bit: RMS=7.927192e-06 (PASS) maxdiff=2.676249e-05 (PASS)
-- 32 bit integer output
compl.bit: RMS=7.927192e-06 (PASS) maxdiff=2.676249e-05 (PASS)
-- 24 bit integer output
compl.bit: RMS=7.927192e-06 (PASS) maxdiff=2.676249e-05 (PASS)
-- 32 bit floating point output
compl.bit: RMS=7.927192e-06 (PASS) maxdiff=2.676249e-05 (PASS)

With a nofpu decoder, you always get the precision of 16 bit output, because floating point numbers are converted from 16 bit. But, especially with --enable-int-quality, this is a fully compliant MPEG audio decoder with all the precision that you need for normal playback situations.

MAD claims 24 bit precision with integer math (just about matching mpg123's 24 bit output with the FPU decoder, see http://www.underbit.com/resources/mpeg/audio/compliance, RMS=4.906e-08). I suspect, though, that MAD will be considerably slower than mpg123's arm_nofpu decoder. On my Core2Duo P8800, madplay with libmad 0.15.1 needs about 7.4 s to 8.5 s decoding to null output (with either speed or accuracy optimization). The mpg123 numbers for the generic variants (accurate == --enable-int-quality):

== src/mpg123.fpu_accurate.bench.txt ==
#mpg123 benchmark (user CPU time in seconds for decoding)
#decoder        t_s16/s  t_f32/s
generic         6.16     5.85

== src/mpg123.fpu.bench.txt ==
#mpg123 benchmark (user CPU time in seconds for decoding)
#decoder        t_s16/s  t_f32/s
generic         6.05     5.83

== src/mpg123.nofpu_accurate.bench.txt ==
#mpg123 benchmark (user CPU time in seconds for decoding)
#decoder        t_s16/s  t_f32/s
generic         6.67     6.81

== src/mpg123.nofpu.bench.txt ==
#mpg123 benchmark (user CPU time in seconds for decoding)
#decoder        t_s16/s  t_f32/s
generic         6.01     6.16

You see, there is some hit from accurate rounding, but it is in a different league compared to the difference between fpu and nofpu on a NEON-less ARM device (and yes, on an x86 CPU, generic FPU code is faster when actually producing float output). Oh, and remember: This is for mpg123 with handbrakes on; using Taihei's assembly optimizations, the decoding time is about halved on the Core2.

Similarly, I'd like to see numbers for madplay on ARM (best on machines with and without fpu, to get a picture of what difference we are talking about):

sh$ time -d -o null convergence_-_points_of_view/*.mp3

I don't know offhand how mpg123 nofpu stacks up against that, but there should be a considerable difference in speed.

My guess is that, on limited hardware without NEON, you'd prefer stutter-free playback with least CPU power draw. When utmost theoretical quality really matters or you intend extensive post-processing of the data --- especially using an audio player that works with floating point math internally, like audacious --- then employing a more capable CPU with NEON is something I expect. The mpg123 nofpu decoder, according to Riku's numbers, is still a good choice for systems with a FPU but no NEON, but the generic floating point decoder is not that far behind in speed (compared to softfloat) and offers proper floating point accuracy as a bonus.

Generally, it is a safe bet that any normal
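As a hedged aside on how the PASS and LIMITED labels in the compliance output above can be read: the sketch below assumes the thresholds commonly quoted from the ISO/IEC 11172-4 compliance procedure (full accuracy needs RMS error below 2^-15/sqrt(12) and maximum difference of at most 2^-14 relative to full scale; limited accuracy needs RMS below 2^-11/sqrt(12)). Whether mpg123's compliance tool uses exactly these constants is an assumption, and the function names and the "FAIL" label are hypothetical. The figures fed in are the 16 bit rows from the message above.

```python
import math

# Assumed thresholds (relative to full scale), as commonly quoted from
# ISO/IEC 11172-4; not taken from the mpg123 sources.
FULL_RMS = 2.0 ** -15 / math.sqrt(12)      # about 8.81e-06
FULL_MAXDIFF = 2.0 ** -14                  # about 6.10e-05
LIMITED_RMS = 2.0 ** -11 / math.sqrt(12)   # about 1.41e-04


def rms_verdict(rms):
    """Classify an RMS error against the assumed full/limited accuracy limits."""
    if rms < FULL_RMS:
        return "PASS"
    if rms < LIMITED_RMS:
        return "LIMITED"
    return "FAIL"  # hypothetical label; this case does not occur in the thread


def maxdiff_verdict(maxdiff):
    """Classify the maximum sample difference against the assumed full accuracy limit."""
    return "PASS" if maxdiff <= FULL_MAXDIFF else "FAIL"


# 16 bit signed integer output rows quoted in the message above.
rows = {
    "fpu_accurate": (4.300914e-06, 7.688999e-06),
    "fpu": (8.907757e-06, 1.531839e-05),
    "nofpu_accurate": (4.344827e-06, 1.275539e-05),
    "nofpu": (7.927192e-06, 2.676249e-05),
}

for name, (rms, maxdiff) in rows.items():
    print(f"{name:15s} RMS {rms_verdict(rms):8s} maxdiff {maxdiff_verdict(maxdiff)}")
```

Under these assumed limits, the plain fpu build's 16 bit RMS of 8.907757e-06 sits just above the full accuracy bound, which would explain its LIMITED label.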
Re: Bug#738981: Fwd: Bug#738981: Switch to use generic_fpu for ARM
On Tue, 4 Mar 2014 11:49:45 +0100, Thomas Orgis thomas-fo...@orgis.org wrote:
> sh$ time -d -o null convergence_-_points_of_view/*.mp3

That should be

sh$ time madplay -d -o null: convergence_-_points_of_view/*.mp3

... as you may have guessed (notice the added :).

Alrighty then,

Thomas
Re: Bug#738981: Fwd: Bug#738981: Switch to use generic_fpu for ARM
On Mon, Mar 3, 2014 at 11:59 PM, peter green plugw...@p10link.net wrote:
> I wonder what we should use on raspbian? I haven't tested on a Pi yet, but it seems that in all tests I've seen so far the generic fpu code is quite a bit slower than the arm nofpu code.

Indeed, it seems to be:

==
felipe@felipepi:mpg123-20140302115523-nofpu% ./scripts/benchmark-cpu.pl src/mpg123 ../convergence_-_points_of_view/*.mp3
Found 1 CPU optimizations to test...
#mpg123 benchmark (user CPU time in seconds for decoding)
#decoder        t_s16/s  t_f32/s
ARM             86.26    90.66
==
felipe@felipepi:mpg123-20140302115523% ./scripts/benchmark-cpu.pl src/mpg123 ../convergence_-_points_of_view/*.mp3
Found 2 CPU optimizations to test...
#mpg123 benchmark (user CPU time in seconds for decoding)
#decoder        t_s16/s  t_f32/s
generic         102.80   100.06
generic_dither  121.10   100.84
==

--
Saludos,
Felipe Sateler

Archive: https://lists.debian.org/caafdzj8kwhvhz62pwgrx9xepzc_zyzqr9gf4kozlluwnq6b...@mail.gmail.com
Re: Bug#738981: Fwd: Bug#738981: Switch to use generic_fpu for ARM
On Tue, Mar 04, 2014 at 02:59:44AM +0000, peter green wrote:
> Seems sane to me. armv7 devices without neon are relatively uncommon, so while it's important that they are supported, it's IMO not vitally important to squeeze out every last drop of performance from them.

I don't agree. At least the first Tegra chips did not have neon, and the marvell chips often don't have neon (the newer ones are starting to, now that they are moving to Cortex-A designs rather than marvell custom cores (like the PJ4 used in the armada 510 in the cubox, for example), but many chips don't have neon). Do the qualcomm designs have neon? I have been mostly ignoring them due to the anti open source attitude of qualcomm.

If --with-cpu=arm_fpu selects neon or VFP3 automatically, then I think armhf is perfect for all armv7 devices.

> I wonder what we should use on raspbian? I haven't tested on a Pi yet, but it seems that in all tests I've seen so far the generic fpu code is quite a bit slower than the arm nofpu code. Is there any quality difference from using a fpu vs nonfpu decoder? If so, how much performance degradation do you believe should be accepted in exchange for that quality improvement?

So VFP2 is slower than integer math? Interesting.

> IMO it's often better to be explicit about this sort of thing. While upstream's defaults may align with debian armhf's requirements at the present time and on the present build hardware, such defaults are subject to change, either as a result of upstream changes in new versions or as a result of different build hardware.

I suppose that makes sense. Avoids unexpected surprises later.

--
Len Sorensen

Archive: https://lists.debian.org/20140304155447.gv17...@csclub.uwaterloo.ca
Re: Bug#738981: Fwd: Bug#738981: Switch to use generic_fpu for ARM
On Tue, 4 Mar 2014 11:10:25 -0300, Felipe Sateler fsate...@debian.org wrote:
> #decoder        t_s16/s  t_f32/s
> ARM             86.26    90.66
> generic         102.80   100.06
> generic_dither  121.10   100.84

Yes, a difference, but arguably a lot less than comparing VFP code to NEON. With the feature to produce float output from all decoders, it is your (debian's) option to prefer decoding speed by building a libmpg123 with arm_nofpu and using it on armhf machines without NEON via the library loading mechanism. Or you decide to offer proper floating point output that needs some 25-50 % more CPU time.

I am even more interested in a comparison with the runtime of madplay in that configuration. Perhaps its fixed-point math with 24 bit output is still faster than using the VFP with mpg123. Of course, I'd be interested to know if that's not the case (mpg123 rulez!;-). But if it is, it wouldn't totally surprise me.

Alrighty then,

Thomas

PS: You still have to decide on --enable-int-quality or not, for a smaller impact on CPU time and basically one bit of precision.
Re: Bug#738981: Fwd: Bug#738981: Switch to use generic_fpu for ARM
On Tue, Mar 4, 2014 at 2:26 PM, Thomas Orgis thomas-fo...@orgis.org wrote:
> On Tue, 4 Mar 2014 11:10:25 -0300, Felipe Sateler fsate...@debian.org wrote:
> > #decoder        t_s16/s  t_f32/s
> > ARM             86.26    90.66
> > generic         102.80   100.06
> > generic_dither  121.10   100.84
>
> Yes, a difference, but arguably a lot less than comparing VFP code to NEON. With the feature to produce float output from all decoders, it is your (debian's) option to prefer decoding speed by building a libmpg123 with arm_nofpu and using it on armhf machines without NEON via the library loading mechanism. Or you decide to offer proper floating point output that needs some 25-50 % more CPU time.
>
> I am even more interested in a comparison with the runtime of madplay in that configuration. Perhaps its fixed-point math with 24 bit output is still faster than using the VFP with mpg123. Of course, I'd be interested to know if that's not the case (mpg123 rulez!;-). But if it is, it wouldn't totally surprise me.

madplay -d -o null: convergence_-_points_of_view/*.mp3 > /dev/null  130.22s user 1.88s system 93% cpu 2:21.91 total

That's with the following mad:

MPEG Audio Decoder 0.15.1 (beta)
Copyright (C) 2000-2004 Underbit Technologies, Inc.
Build options: NDEBUG FPM_ARM ASO_IMDCT ASO_INTERLEAVE1

ID3 Tag Library 0.15.1 (beta)
Copyright (C) 2000-2004 Underbit Technologies, Inc.
Build options: NDEBUG

madplay 0.15.2 (beta)
Copyright (C) 2000-2004 Robert Leslie
Build options: AUDIO_DEFAULT=audio_alsa ENABLE_NLS

This is the madplay straight from raspbian; not sure if some other configure flag was to be tested.

--
Saludos,
Felipe Sateler

Archive: https://lists.debian.org/caafdzj-qw8nx-4gujcj+kvtn9lz76mp1tcnaszh1tdzkftq...@mail.gmail.com
Re: Bug#738981: Fwd: Bug#738981: Switch to use generic_fpu for ARM
On Tue, 4 Mar 2014 16:25:17 -0300, Felipe Sateler fsate...@debian.org wrote:
> #decoder        t_s16/s  t_f32/s
> ARM             86.26    90.66
> generic         102.80   100.06
> generic_dither  121.10   100.84
>
> madplay -d -o null: convergence_-_points_of_view/*.mp3 > /dev/null  130.22s user 1.88s system 93% cpu 2:21.91 total

Interesting. So the VFP is not that bad: You get superior output (not noticeably, but measurably in the digital domain) from mpg123's generic decoder in about 75 % of the decoding time. The lower-quality 16 bit integer decoder of mpg123 is considerably faster.

So, on an armel system without VFP, it makes sense to employ libmad to achieve 24 bit accuracy with reasonable CPU cost, if you insist on that accuracy. But with VFP, using mpg123 gives you full 32 bit floating point output with less CPU load. For NEON, it's not even a question.

I think I can live with that situation ;-) Both MAD and mpg123 achieve their goals. MAD gets the best precision out of integer math; mpg123 offers something faster everywhere, possibly with less, but also possibly with more (irrelevant, 24 bit is _really_ enough) precision.

One might also benchmark a decoder based on ffmpeg, which has both fixed-point and floating-point decoders, but I don't have a good command line for that at hand (used mplayer -ac mpg123 and mplayer -ac ffmp3[float] in the past).

Anyhow, leaving scope here. I should get going and release mpg123 1.19.0.

> This is the madplay straight from raspbian; not sure if some other configure flag was to be tested.

Optimizing for speed vs. quality might be an option ... but that's somehow missing the point of preferring libmad.

Alrighty then,

Thomas
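To make the "about 75 %" figure reproducible, here is a small Python sketch that divides each mpg123 timing from Felipe's Raspberry Pi run by the madplay user time. The numbers are copied from the thread; the dictionary layout and the relative_time helper are illustrative names, not part of any mpg123 tooling.

```python
# User CPU seconds on the Raspberry Pi, copied from the figures quoted above.
mpg123_seconds = {
    ("ARM", "s16"): 86.26,
    ("ARM", "f32"): 90.66,
    ("generic", "s16"): 102.80,
    ("generic", "f32"): 100.06,
    ("generic_dither", "s16"): 121.10,
    ("generic_dither", "f32"): 100.84,
}
madplay_seconds = 130.22  # "130.22s user" from the madplay run


def relative_time(seconds, baseline):
    """Return `seconds` as a fraction of `baseline` (smaller is faster)."""
    return seconds / baseline


for (decoder, fmt), secs in mpg123_seconds.items():
    share = relative_time(secs, madplay_seconds)
    print(f"{decoder:15s} {fmt}: {share:.0%} of madplay's user time")
```

The comparison uses user CPU time on both sides, since that is what the mpg123 benchmark script reports.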
Re: Bug#738981: Fwd: Bug#738981: Switch to use generic_fpu for ARM
On Sun, Mar 02, 2014 at 12:02:40PM +0100, Thomas Orgis wrote:
> On Sat, 01 Mar 2014 01:00:02 +0900, Taihei Momma t...@mac.com wrote:
> > OK, after some investigation with armhf cross environment and qemu, finally the current mpg123 svn (r3517) should work
>
> After Taihei didn't stop at this (big thanks from here!), we got a new snapshot, http://mpg123.org/snapshot/mpg123-20140302115523.tar.bz2 , that will hopefully become mpg123 1.19.0 soon (not 1.18.x because of feature additions regarding this very debian issue). The main points:
>
> - float output with all decoders (also arm_nofpu)
> - ARM decoders (esp. NEON) working with debian toolchain
> - new --with-cpu=arm_fpu choice with runtime detection to switch between NEON or normal FPU
>
> So, the number of builds for optimal treatment of differing platforms reduces to two:
>
> 1. --with-cpu=arm_nofpu
> 2. --with-cpu=arm_fpu

Awesome work!

> I hope we can all be happy about that. I'd also be glad to get some confirmation from debian that it really works now. Release will be imminent, then.

Here are some test results.

On a cortex-a15 system, arm_nofpu (ubuntu armhf):
#decoder        t_s16/s  t_f32/s
ARM             24.22    25.02

On a cortex-a15 system, arm_fpu (ubuntu armhf):
#decoder        t_s16/s  t_f32/s
NEON            14.33    14.90
generic         36.25    27.46
generic_dither  39.52    27.44

The A15 core was downclocked and cpufreq disabled to ensure stable results.

ARMv5 system, arm_nofpu (debian armel):
#mpg123 benchmark (user CPU time in seconds for decoding)
#decoder        t_s16/s  t_f32/s
ARM             49.12    63.17

ARMv5 system, arm_fpu (debian sid):
#mpg123 benchmark (user CPU time in seconds for decoding)
#decoder        t_s16/s  t_f32/s
generic         491.75   468.37
generic_dither  535.50   468.38

armel is with softfloat emulation, so horrible times were expected. The main point of that last run was to verify that NEON runtime detection works (seems so).

Riku

Archive: https://lists.debian.org/20140303085058.ga1...@afflict.kos.to
Re: Bug#738981: Fwd: Bug#738981: Switch to use generic_fpu for ARM
On Sun, Mar 02, 2014 at 12:02:40PM +0100, Thomas Orgis wrote:
> After Taihei didn't stop at this (big thanks from here!), we got a new snapshot, http://mpg123.org/snapshot/mpg123-20140302115523.tar.bz2 , that will hopefully become mpg123 1.19.0 soon (not 1.18.x because of feature additions regarding this very debian issue). The main points:
>
> - float output with all decoders (also arm_nofpu)
> - ARM decoders (esp. NEON) working with debian toolchain
> - new --with-cpu=arm_fpu choice with runtime detection to switch between NEON or normal FPU
>
> So, the number of builds for optimal treatment of differing platforms reduces to two:
>
> 1. --with-cpu=arm_nofpu
> 2. --with-cpu=arm_fpu
>
> I hope we can all be happy about that. I'd also be glad to get some confirmation from debian that it really works now. Release will be imminent, then.
>
> Thanks for staying with us with all the chattering about this ...

I now see (with arm_fpu of course, which it seems to have auto detected correctly):

perl scripts/benchmark-cpu.pl `which mpg123` /convergence_-_points_of_view/*mp3
Found 3 CPU optimizations to test...
#mpg123 benchmark (user CPU time in seconds for decoding)
#decoder        t_s16/s  t_f32/s
NEON            7.58     7.84
generic         19.23    14.56
generic_dither  20.97    14.54

Looks good. I ran it 3 times and the results were very close; the cpu pinned itself at 1.5GHz during the test and went back to 1.0GHz when idle again. One of the two cores was very bored though, with nothing to do.

--
Len Sorensen

Archive: https://lists.debian.org/20140303170248.gt17...@csclub.uwaterloo.ca
Re: Bug#738981: Fwd: Bug#738981: Switch to use generic_fpu for ARM
On Sun, Mar 02, 2014 at 09:06:44AM -0500, Reinhard Tartler wrote:
> That sounds as if the mpg123 package should use:
>
> on armel: --with-cpu=arm_nofpu
> on armhf: --with-cpu=arm_fpu
>
> Does this make sense to everybody?

I think so. armhf's current debian rules automatically picked arm_fpu with the new version's configure script, so at least that one doesn't seem to need any explicit help. armel might though.

> Thank you for handling this issue (and basically every other issue that popped up in Debian for mpg123) so quickly!

--
Len Sorensen

Archive: https://lists.debian.org/20140303170444.gu17...@csclub.uwaterloo.ca
Re: Bug#738981: Fwd: Bug#738981: Switch to use generic_fpu for ARM
On Sun, Mar 02, 2014 at 09:06:44AM -0500, Reinhard Tartler wrote:
> That sounds as if the mpg123 package should use:
>
> on armel: --with-cpu=arm_nofpu
> on armhf: --with-cpu=arm_fpu
>
> Does this make sense to everybody?

Seems sane to me. armv7 devices without neon are relatively uncommon, so while it's important that they are supported, it's IMO not vitally important to squeeze out every last drop of performance from them.

I wonder what we should use on raspbian? I haven't tested on a Pi yet, but it seems that in all tests I've seen so far the generic fpu code is quite a bit slower than the arm nofpu code. Is there any quality difference from using a fpu vs nonfpu decoder? If so, how much performance degradation do you believe should be accepted in exchange for that quality improvement?

Lennart Sorensen wrote:
> I think so. armhf's current debian rules automatically picked arm_fpu with the new version's configure script, so at least that one doesn't seem to need any explicit help. armel might though.

IMO it's often better to be explicit about this sort of thing. While upstream's defaults may align with debian armhf's requirements at the present time and on the present build hardware, such defaults are subject to change, either as a result of upstream changes in new versions or as a result of different build hardware.

Archive: https://lists.debian.org/531541a0.3010...@p10link.net
Re: Bug#738981: Fwd: Bug#738981: Switch to use generic_fpu for ARM
On Sat, 01 Mar 2014 01:00:02 +0900, Taihei Momma t...@mac.com wrote:
> OK, after some investigation with armhf cross environment and qemu, finally the current mpg123 svn (r3517) should work

After Taihei didn't stop at this (big thanks from here!), we got a new snapshot, http://mpg123.org/snapshot/mpg123-20140302115523.tar.bz2 , that will hopefully become mpg123 1.19.0 soon (not 1.18.x because of feature additions regarding this very debian issue). The main points:

- float output with all decoders (also arm_nofpu)
- ARM decoders (esp. NEON) working with debian toolchain
- new --with-cpu=arm_fpu choice with runtime detection to switch between NEON or normal FPU

So, the number of builds for optimal treatment of differing platforms reduces to two:

1. --with-cpu=arm_nofpu
2. --with-cpu=arm_fpu

I hope we can all be happy about that. I'd also be glad to get some confirmation from debian that it really works now. Release will be imminent, then.

Thanks for staying with us with all the chattering about this ...

Alrighty then,

Thomas
Re: Bug#738981: Fwd: Bug#738981: Switch to use generic_fpu for ARM
On Sun, Mar 2, 2014 at 6:02 AM, Thomas Orgis thomas-fo...@orgis.org wrote:
> On Sat, 01 Mar 2014 01:00:02 +0900, Taihei Momma t...@mac.com wrote:
> > OK, after some investigation with armhf cross environment and qemu, finally the current mpg123 svn (r3517) should work
>
> After Taihei didn't stop at this (big thanks from here!), we got a new snapshot, http://mpg123.org/snapshot/mpg123-20140302115523.tar.bz2 , that will hopefully become mpg123 1.19.0 soon (not 1.18.x because of feature additions regarding this very debian issue). The main points:
>
> - float output with all decoders (also arm_nofpu)
> - ARM decoders (esp. NEON) working with debian toolchain
> - new --with-cpu=arm_fpu choice with runtime detection to switch between NEON or normal FPU
>
> So, the number of builds for optimal treatment of differing platforms reduces to two:
>
> 1. --with-cpu=arm_nofpu
> 2. --with-cpu=arm_fpu
>
> I hope we can all be happy about that. I'd also be glad to get some confirmation from debian that it really works now. Release will be imminent, then.

That sounds as if the mpg123 package should use:

on armel: --with-cpu=arm_nofpu
on armhf: --with-cpu=arm_fpu

Does this make sense to everybody?

> Thanks for staying with us with all the chattering about this ...

Thank you for handling this issue (and basically every other issue that popped up in Debian for mpg123) so quickly!

--
regards,
Reinhard

Archive: https://lists.debian.org/caj0cceaxdetftr8svyg7vnvkgvxcy80xrxt32ljtmxl_pfo...@mail.gmail.com
Re: Bug#738981: Fwd: Bug#738981: Switch to use generic_fpu for ARM
On Sat, 01 Mar 2014 01:00:02 +0900, Taihei Momma t...@mac.com wrote:
> OK, after some investigation with armhf cross environment and qemu, finally the current mpg123 svn (r3517) should work (including arm_nofpu decoder). The point is the .type directive. Without this directive, a linker doesn't distinguish arm functions from thumb functions, and interworking doesn't work properly.

Great! So, folks, please check that http://mpg123.de/snapshot/mpg123-2014030100.tar.bz2 does the trick with all decoders now. Performance numbers from the benchmark script would be nice. I'll release 1.18.1 after confirmation and we can finally settle this.

Alrighty then,

Thomas
Re: Bug#738981: Fwd: Bug#738981: Switch to use generic_fpu for ARM
On Sat, 1 Mar 2014 09:56:46 +0100, Thomas Orgis thomas-fo...@orgis.org wrote:
> Great! So, folks, please check that http://mpg123.de/snapshot/mpg123-2014030100.tar.bz2 does the trick with all decoders now. Performance numbers from the benchmark script would be nice. I'll release 1.18.1 after confirmation

Sorry, I meant 1.18.2, of course. Also, I fixed the benchmark script to check the return value with http://mpg123.de/snapshot/mpg123-20140301101020.tar.bz2 just in case things are still broken.

Alrighty then,

Thomas
Re: Bug#738981: Fwd: Bug#738981: Switch to use generic_fpu for ARM
OK, after some investigation with armhf cross environment and qemu, finally the current mpg123 svn (r3517) should work (including arm_nofpu decoder). The point is the .type directive. Without this directive, a linker doesn't distinguish arm functions from thumb functions, and interworking doesn't work properly.

Regards,
Taihei Momma

Archive: https://lists.debian.org/c4774c20-17e6-47a5-8cb2-c71dbeff3...@mac.com
Re: Bug#738981: Fwd: Bug#738981: Switch to use generic_fpu for ARM
On Mon, 24 Feb 2014 12:27:36 -0500, Lennart Sorensen lsore...@csclub.uwaterloo.ca wrote:
> Any help from this:
>
> Program received signal SIGILL, Illegal instruction.
> 0xb6fb9332 in INT123_dct64_neon () at dct64_neon.S:48
> 48          vpush {q4-q7}

What the ... ? This does not make sense to me (and actually, with me, I mean Taihei, who knows more about ARM assembly ;-). The vpush pseudo instruction should be harmless in our context. Quote from Taihei:

    I don't know why. Actually vpush is a pseudo instruction, and vpush {q4-q7} should be assembled into vstmdb sp!, {d8-d15} (machine code is ed2d8b10). I'm curious what their assembler (gnu as?) assembles it into.

Well ... what does

sh$ objdump -S src/libmpg123/.libs/dct64_neon.o

say? Any hint from the debian ARM folks with experience about funny behaviour for stand-alone assembly files? I also wonder if this is generally broken on debian (since a certain toolchain version) or on certain CPUs only.

I repeat: This code worked before: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=667653#35

Alrighty then,

Thomas
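As a cross-check of Taihei's statement that vpush {q4-q7} corresponds to vstmdb sp!, {d8-d15} with machine code ed2d8b10, the following Python sketch pulls the fields out of that 32-bit word. It assumes the ARM (A1) encoding of the 64-bit VSTM form as laid out in the ARM Architecture Reference Manual (cond, 110, P, U, D, W, 0, Rn, Vd, 1011, imm8); the function name decode_vstm64 and the dictionary keys are hypothetical.

```python
def decode_vstm64(word):
    """Decode an ARM-mode VSTM (doubleword registers) instruction word.

    Assumed field layout (A1 encoding): cond[31:28] 110 P U D W 0 Rn Vd 1011 imm8.
    """
    is_vstm64 = (
        (word >> 25) & 0b111 == 0b110      # coprocessor load/store class
        and (word >> 20) & 1 == 0          # L bit clear means store
        and (word >> 8) & 0xF == 0b1011    # 64-bit (doubleword) register form
    )
    if not is_vstm64:
        raise ValueError(f"{word:#010x} is not a 64-bit VSTM encoding")

    p = (word >> 24) & 1
    u = (word >> 23) & 1
    d = (word >> 22) & 1
    w = (word >> 21) & 1
    rn = (word >> 16) & 0xF
    vd = (word >> 12) & 0xF
    imm8 = word & 0xFF

    first = (d << 4) | vd      # first D register in the list
    count = imm8 // 2          # imm8 counts 32-bit words, two per D register
    return {
        "rn": rn,
        "decrement_before": p == 1 and u == 0,
        "writeback": w == 1,
        "first_dreg": first,
        "last_dreg": first + count - 1,
    }


fields = decode_vstm64(0xED2D8B10)
print(fields)
```

A base register of 13 (sp) with decrement-before addressing and writeback is the combination that assemblers print as vpush.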
Re: Bug#738981: Fwd: Bug#738981: Switch to use generic_fpu for ARM
Wait, code alignment issue?

#0  0xb6fb9332 in INT123_dct64_neon () at dct64_neon.S:48
    ^ not a multiple of 4.

I've just committed a fix to the mpg123 repository to align the function by 4 bytes. I supposed this was fixed before, but actually the dct64 part was omitted:
http://www.mpg123.de/cgi-bin/scm/mpg123?view=revision&revision=3003

I hope this should fix the SIGILL issue.

Regards,
Taihei Momma

Archive: https://lists.debian.org/b4a4b91e-2ab4-447f-835c-3be85d411...@mac.com
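The "not a multiple of 4" observation is easy to check mechanically. This small Python sketch (the helper name misalignment is illustrative) takes the faulting address from the gdb output and reports how far it is from a 4-byte boundary, which is the instruction size in ARM mode.

```python
def misalignment(address, boundary=4):
    """Return how many bytes `address` lies past the previous `boundary`-aligned address."""
    return address % boundary


fault_pc = 0xB6FB9332  # address reported with the SIGILL in gdb

offset = misalignment(fault_pc)
print(f"{fault_pc:#x} is {offset} byte(s) past a 4-byte boundary")
print(f"nearest aligned address below: {fault_pc - offset:#x}")
```

An offset of 2 is still a valid instruction boundary in Thumb mode, where instructions can be 2 bytes long, so such an address is not by itself proof of a misplaced symbol.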
Re: Bug#738981: Fwd: Bug#738981: Switch to use generic_fpu for ARM
On Tue, 25 Feb 2014 17:37:41 +0900, Taihei Momma t...@mac.com wrote:
> #0  0xb6fb9332 in INT123_dct64_neon () at dct64_neon.S:48
>     ^ not a multiple of 4.

Oh, d'oh! It could be that simple.

> I've just committed a fix to mpg123 repository

I generated a new snapshot, http://mpg123.org/snapshot/mpg123-20140225111416.tar.bz2 , and also attached the patch for the rather small change that hopefully has a big effect.

Care to test this?

Alrighty then,

Thomas
--
Thomas Orgis - Source Mage GNU/Linux Developer (http://www.sourcemage.org)
OrgisNetzOrganisation ---)=- http://orgis.org
GPG public key D446D524: http://thomas.orgis.org/public_key
Fingerprint: 7236 3885 A742 B736 E0C8 9721 9B4C 52BC D446 D524

Index: src/libmpg123/dct64_neon_float.S
===================================================================
--- src/libmpg123/dct64_neon_float.S (Revision 3514)
+++ src/libmpg123/dct64_neon_float.S (Revision 3515)
@@ -44,6 +44,7 @@
 	.word 1060439283
 	.word 1060439283
 	.globl ASM_NAME(dct64_real_neon)
+	ALIGN4
 ASM_NAME(dct64_real_neon):
 	vpush {q4-q7}
Index: src/libmpg123/dct64_neon.S
===================================================================
--- src/libmpg123/dct64_neon.S (Revision 3514)
+++ src/libmpg123/dct64_neon.S (Revision 3515)
@@ -44,6 +44,7 @@
 	.word 1060439283
 	.word 1060439283
 	.globl ASM_NAME(dct64_neon)
+	ALIGN4
 ASM_NAME(dct64_neon):
 	vpush {q4-q7}
Re: Bug#738981: Fwd: Bug#738981: Switch to use generic_fpu for ARM
On Tue, Feb 25, 2014 at 11:18:50AM +0100, Thomas Orgis wrote:
> On Tue, 25 Feb 2014 17:37:41 +0900, Taihei Momma t...@mac.com wrote:
> > #0  0xb6fb9332 in INT123_dct64_neon () at dct64_neon.S:48
> >     ^ not a multiple of 4.
>
> Oh, d'oh! It could be that simple.
>
> > I've just committed a fix to mpg123 repository
>
> I generated a new snapshot, http://mpg123.org/snapshot/mpg123-20140225111416.tar.bz2 , and also attached the patch for the rather small change that hopefully has a big effect.
>
> Care to test this?
>
> Alrighty then,
>
> Thomas
> --
> Thomas Orgis - Source Mage GNU/Linux Developer (http://www.sourcemage.org)
> OrgisNetzOrganisation ---)=- http://orgis.org
> GPG public key D446D524: http://thomas.orgis.org/public_key
> Fingerprint: 7236 3885 A742 B736 E0C8 9721 9B4C 52BC D446 D524
>
> Index: src/libmpg123/dct64_neon_float.S
> ===================================================================
> --- src/libmpg123/dct64_neon_float.S (Revision 3514)
> +++ src/libmpg123/dct64_neon_float.S (Revision 3515)
> @@ -44,6 +44,7 @@
>  	.word 1060439283
>  	.word 1060439283
>  	.globl ASM_NAME(dct64_real_neon)
> +	ALIGN4
>  ASM_NAME(dct64_real_neon):
>  	vpush {q4-q7}
> Index: src/libmpg123/dct64_neon.S
> ===================================================================
> --- src/libmpg123/dct64_neon.S (Revision 3514)
> +++ src/libmpg123/dct64_neon.S (Revision 3515)
> @@ -44,6 +44,7 @@
>  	.word 1060439283
>  	.word 1060439283
>  	.globl ASM_NAME(dct64_neon)
> +	ALIGN4
>  ASM_NAME(dct64_neon):
>  	vpush {q4-q7}

root@rceng05:/mpg123-20140225111416# gdb --args /tmp/mpginst/usr/local/bin/mpg123 -e s16 -q --cpu NEON -t /convergence_-_points_of_view/*mp3
GNU gdb (GDB) 7.6.2 (Debian 7.6.2-1)
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type show copying and show warranty for details.
This GDB was configured as arm-linux-gnueabihf.
For bug reporting instructions, please see:
http://www.gnu.org/software/gdb/bugs/...
Reading symbols from /tmp/mpginst/usr/local/bin/mpg123...done.
(gdb) run
Starting program: /tmp/mpginst/usr/local/bin/mpg123 -e s16 -q --cpu NEON -t /convergence_-_points_of_view/01\ -\ Bleed.mp3 /convergence_-_points_of_view/02\ -\ Strike\ the\ end.mp3 /convergence_-_points_of_view/03\ -\ Listen.mp3 /convergence_-_points_of_view/04\ -\ Six\ feet\ under.mp3 /convergence_-_points_of_view/05\ -\ Always\ the\ same.mp3 /convergence_-_points_of_view/06\ -\ Breath.mp3 /convergence_-_points_of_view/07\ -\ Vanished\ memories.mp3 /convergence_-_points_of_view/08\ -\ Silent.mp3 /convergence_-_points_of_view/09\ -\ Nothing\ else.mp3 /convergence_-_points_of_view/10\ -\ Train\ to\ leave.mp3

Program received signal SIGILL, Illegal instruction.
0xb6fb9332 in INT123_dct64_neon () at dct64_neon.S:49
49          vpush {q4-q7}
(gdb) disassemble
Dump of assembler code for function INT123_dct64_neon:
   0xb6fb9330 <+0>:     vpush     {d8-d15}
   0xb6fb9334 <+4>:     sub       r3, pc, #140    ; 0x8c
   0xb6fb9338 <+8>:     vld1.32   {d0-d3}, [r2]!
   0xb6fb933c <+12>:    vld1.32   {d4-d7}, [r2]!
   0xb6fb9340 <+16>:    vld1.32   {d8-d11}, [r2]!
   0xb6fb9344 <+20>:    vld1.32   {d12-d15}, [r2]
   0xb6fb9348 <+24>:    vld1.32   {d24-d27}, [r3 :128]!
   0xb6fb934c <+28>:    vld1.32   {d28-d31}, [r3 :128]!
   0xb6fb9350 <+32>:    vrev64.32 q4, q4
   0xb6fb9354 <+36>:    vrev64.32 q5, q5
   0xb6fb9358 <+40>:    vrev64.32 q6, q6
   0xb6fb935c <+44>:    vrev64.32 q7, q7
   0xb6fb9360 <+48>:    vswp      d8, d9
   0xb6fb9364 <+52>:    vswp      d10, d11
   0xb6fb9368 <+56>:    vswp      d12, d13
   0xb6fb936c <+60>:    vswp      d14, d15
   0xb6fb9370 <+64>:    vsub.f32  q8, q0, q7
   0xb6fb9374 <+68>:    vsub.f32  q9, q1, q6
   0xb6fb9378 <+72>:    vsub.f32  q10, q2, q5
   0xb6fb937c <+76>:    vsub.f32  q11, q3, q4
   0xb6fb9380 <+80>:    vadd.f32  q0, q0, q7
   0xb6fb9384 <+84>:    vadd.f32  q1, q1, q6
   0xb6fb9388 <+88>:    vadd.f32  q2, q2, q5
   0xb6fb938c <+92>:    vadd.f32  q3, q3, q4
   0xb6fb9390 <+96>:    vmul.f32  q4, q8, q12
   0xb6fb9394 <+100>:   vmul.f32  q5, q9, q13
   0xb6fb9398 <+104>:   vmul.f32  q6, q10, q14
   0xb6fb939c <+108>:   vmul.f32  q7, q11, q15
   0xb6fb93a0 <+112>:   vld1.32   {d24-d27}, [r3 :128]!
   0xb6fb93a4 <+116>:   vld1.32   {d28-d31}, [r3 :128]
   0xb6fb93a8 <+120>:   vrev64.32 q2, q2
   0xb6fb93ac <+124>:   vrev64.32 q3, q3
   0xb6fb93b0 <+128>:   vrev64.32 q6, q6
   0xb6fb93b4 <+132>:   vrev64.32 q7, q7
   0xb6fb93b8 <+136>:   vswp      d4, d5
   0xb6fb93bc <+140>:   vswp      d6, d7
   0xb6fb93c0 <+144>:   vswp      d12, d13
   0xb6fb93c4 <+148>:   vswp      d14, d15
   0xb6fb93c8 <+152>:   vsub.f32  q8, q0, q3
   0xb6fb93cc <+156>:   vsub.f32  q9, q1, q2
Re: Bug#738981: Fwd: Bug#738981: Switch to use generic_fpu for ARM
On Tue, 25 Feb 2014 11:20:06 -0500, Lennart Sorensen lsore...@csclub.uwaterloo.ca wrote:
> On Tue, Feb 25, 2014 at 11:18:50AM +0100, Thomas Orgis wrote:
> > On Tue, 25 Feb 2014 17:37:41 +0900, Taihei Momma t...@mac.com wrote:
> > > #0 0xb6fb9332 in INT123_dct64_neon () at dct64_neon.S:48
> > >    ^ not a multiple of 4.
> >
> > Index: src/libmpg123/dct64_neon.S
> > ===
> > --- src/libmpg123/dct64_neon.S (Revision 3514)
> > +++ src/libmpg123/dct64_neon.S (Revision 3515)
> > @@ -44,6 +44,7 @@
> >   .word 1060439283
> >   .word 1060439283
> >   .globl ASM_NAME(dct64_neon)
> > + ALIGN4
> >  ASM_NAME(dct64_neon):
> >   vpush {q4-q7}

Now ...

> Program received signal SIGILL, Illegal instruction.
> 0xb6fb9332 in INT123_dct64_neon () at dct64_neon.S:49
> 49      vpush {q4-q7}

That address didn't change. I suggest we align the function symbol itself
instead; it seems we accidentally missed by one line:

        ALIGN4
        .globl ASM_NAME(dct64_neon)
ASM_NAME(dct64_neon):

looks better to me (at least that's how we did it for all other
functions ;-).

Care to test the current
http://mpg123.org/snapshot/mpg123-20140225173909.tar.bz2 ?

Sorry for the inconvenience, but I don't have a setup handy to test this
myself.

Alrighty then,

Thomas
Re: Bug#738981: Fwd: Bug#738981: Switch to use generic_fpu for ARM
On Wed, Feb 26, 2014 at 01:59:12AM +0900, Taihei Momma wrote:
> On 2014/02/26, at 1:44, Thomas Orgis wrote:
> > That address didn't change.
>
> Well, the function itself is properly aligned (so my fix didn't take
> effect anyway).
>
>    0xb6fb9330 <+0>:     vpush   {d8-d15}
>    0xb6fb9334 <+4>:     sub     r3, pc, #140    ; 0x8c
>
> But the processor decoded the first instruction as 2-byte (thumb?), then
> increased PC by 2. And it raised SIGILL at
>
> 0xb6fb9332 in INT123_dct64_neon () at dct64_neon.S:49
>
> So, I guess
> - assembler emits a bad machine code for vpush or
> - kernel is not configured properly to run vfp instructions

Is that a kernel option? I wouldn't have thought armhf would run without
that (unless no floating point code is ever being run).

Well, the kernel that is running has this:

CONFIG_VFP=y
CONFIG_VFPv3=y
CONFIG_NEON=y

> I'd like to look into objdump -d result to check the machine code.

Remember Debian armhf is -mthumb by default. Any assembly code needs to be
properly flagged with .arm, or .syntax unified, or whatever is appropriate
(still trying to wrap my head around this myself). That is, if the
assembly code is written in ARM rather than Thumb-2 assembly. At least
that's my understanding so far.

If I add .syntax unified and .fpu neon, then I no longer have to pass
-mfpu=neon in the CFLAGS to get it to compile, but it still fails.

I am just about to test the new version to see if that helps anything.

The disassembly in gdb shows 4-byte alignment, but the address of the
illegal instruction is 2 bytes past the vpush instruction's address.

In fact, if I add -marm to the CFLAGS, then it seems to work, so the .S
files are not being flagged correctly as being ARM code, or they are
missing Thumb interworking bits or something.

root@rceng05:/mpg123-20140225173909# perl scripts/benchmark-cpu.pl src/mpg123 /convergence_-_points_of_view/*mp3
Found 1 CPU optimizations to test...

#mpg123 benchmark (user CPU time in seconds for decoding)
#decoder        t_s16/s t_f32/s
NEON            7.52    7.65

That was with CFLAGS=-g -mcpu=cortex-a15 -mfpu=neon -marm

Without -marm, it crashes with illegal instruction. But since -mthumb is
the default on armhf, passing -marm seems wrong.

-- 
Len Sorensen
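The observation above (SIGILL reported 2 bytes into a 4-byte-aligned function) can be turned into a tiny check. This is only a sketch of the reasoning, not code from mpg123: the helper name `pc_rules_out_arm_state` is hypothetical. It relies on the fact that in ARM (A32) state every instruction is 4 bytes long and word-aligned, so a faulting PC that is not a multiple of 4 cannot have been produced while executing ARM-state code.

```c
#include <stdint.h>

/*
 * Hypothetical helper for illustration only.
 * Returns 1 if the reported fault address cannot belong to ARM (A32)
 * state, because A32 instructions always sit on 4-byte boundaries.
 * A return value of 0 proves nothing: Thumb instructions may also
 * start on a 4-byte boundary.
 */
int pc_rules_out_arm_state(uintptr_t pc)
{
    return (pc & 0x3u) != 0;
}
```

Applied to the session above, 0xb6fb9332 is flagged while the function start 0xb6fb9330 is not, which fits the suspicion that the CPU entered the ARM-mode routine in Thumb state.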
Re: Bug#738981: Fwd: Bug#738981: Switch to use generic_fpu for ARM
Taihei Momma wrote:
> But the processor decoded the first instruction as 2-byte (thumb?),

Note that Debian armhf builds C code in Thumb-2 mode by default.
Re: Bug#738981: Fwd: Bug#738981: Switch to use generic_fpu for ARM
On Sat, Feb 22, 2014 at 10:05:35AM +0100, Thomas Orgis wrote:
> On Fri, 21 Feb 2014 11:25:12 -0500, Lennart Sorensen lsore...@csclub.uwaterloo.ca wrote:
> > Testing with the neon build I get a return code of 4, and it seems to
> > be failing to run. It was a pain to even get it to compile. Using just
> > the configure option, the assembler complained about the NEON
> > instructions being invalid for the chosen cpu type. Adding -mfpu=neon
> > to the CFLAGS made it able to compile, but it still crashes with
> > illegal instruction. I tried with CFLAGS set to -mcpu=cortex-a15
> > -mfpu=neon, and that still gives illegal instruction when running it.
>
> This is weird. What happened on the Debian side since
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=667653#35 ?
> We have the current code working on this setup:
>
> device: iPod touch 4G with iOS 5.1.1
> toolchain: gcc 4.2.1 (from Xcode 3.2.6) on OSX 10.6.8,
>            clang 3.3 (from Xcode 5.0.2) on OSX 10.9.1 (double checked)
> configure script option: --host=armv7-apple-darwin
>   --with-cpu=arm_nofpu[neon] --with-audio=dummy --disable-shared
>   --enable-static [--enable-int-quality]
>
> Taihei also just checked the compliance of the decoder choices including
> NEON.
>
> That illegal instruction ... care to fire up the debugger to tell us
> where it actually occurs? The NEON assembly is written as plain
> assembler input (cpp + as), you can see the instructions we use right
> there and it doesn't differ from iOS.
>
> > It might be a good idea to have the benchmark script actually check
> > the return code of system()
>
> Yes.
>
> > I was building and testing under Debian armhf sid.
> > gcc (Debian 4.8.2-16) 4.8.2
> > CPU is a dual Cortex-A15 1.5GHz (TI OMAP 57xx).
>
> Alrighty then,

Any help from this:

(gdb) run
Starting program: /tmp/mpginst/usr/local/bin/mpg123 -e s16 -q --cpu NEON -t /convergence_-_points_of_view/01\ -\ Bleed.mp3 /convergence_-_points_of_view/02\ -\ Strike\ the\ end.mp3 /convergence_-_points_of_view/03\ -\ Listen.mp3 /convergence_-_points_of_view/04\ -\ Six\ feet\ under.mp3 /convergence_-_points_of_view/05\ -\ Always\ the\ same.mp3 /convergence_-_points_of_view/06\ -\ Breath.mp3 /convergence_-_points_of_view/07\ -\ Vanished\ memories.mp3 /convergence_-_points_of_view/08\ -\ Silent.mp3 /convergence_-_points_of_view/09\ -\ Nothing\ else.mp3 /convergence_-_points_of_view/10\ -\ Train\ to\ leave.mp3

Program received signal SIGILL, Illegal instruction.
0xb6fb9332 in INT123_dct64_neon () at dct64_neon.S:48
48      vpush {q4-q7}
(gdb) where
#0  0xb6fb9332 in INT123_dct64_neon () at dct64_neon.S:48
#1  0xb6fab71c in INT123_synth_1to1_stereo_neon (bandPtr_l=<optimized out>, bandPtr_r=0x36400, fr=0x291d8) at synth.c:892
#2  0xb6fb8328 in INT123_do_layer3 (fr=0x291d8) at layer3.c:2060
#3  0xb6fa725e in decode_the_frame (fr=fr@entry=0x291d8) at libmpg123.c:699
#4  0xb6fa823e in mpg123_decode_frame_64 (mh=0x291d8, num=num@entry=0x28490 <framenum>, audio=audio@entry=0xbefff8e8, bytes=bytes@entry=0xbefff8f0) at libmpg123.c:838
#5  0x00012fce in play_frame () at mpg123.c:667
#6  0xb2f0 in main (sys_argc=<optimized out>, sys_argv=<optimized out>) at mpg123.c:1177

-- 
Len Sorensen
Re: Bug#738981: Fwd: Bug#738981: Switch to use generic_fpu for ARM
On Fri, 21 Feb 2014 11:25:12 -0500, Lennart Sorensen lsore...@csclub.uwaterloo.ca wrote:
> Testing with the neon build I get a return code of 4, and it seems to be
> failing to run. It was a pain to even get it to compile. Using just the
> configure option, the assembler complained about the NEON instructions
> being invalid for the chosen cpu type. Adding -mfpu=neon to the CFLAGS
> made it able to compile, but it still crashes with illegal instruction.
> I tried with CFLAGS set to -mcpu=cortex-a15 -mfpu=neon, and that still
> gives illegal instruction when running it.

This is weird. What happened on the Debian side since
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=667653#35 ?
We have the current code working on this setup:

device: iPod touch 4G with iOS 5.1.1
toolchain: gcc 4.2.1 (from Xcode 3.2.6) on OSX 10.6.8,
           clang 3.3 (from Xcode 5.0.2) on OSX 10.9.1 (double checked)
configure script option: --host=armv7-apple-darwin
  --with-cpu=arm_nofpu[neon] --with-audio=dummy --disable-shared
  --enable-static [--enable-int-quality]

Taihei also just checked the compliance of the decoder choices including
NEON.

That illegal instruction ... care to fire up the debugger to tell us where
it actually occurs? The NEON assembly is written as plain assembler input
(cpp + as), you can see the instructions we use right there and it doesn't
differ from iOS.

> It might be a good idea to have the benchmark script actually check the
> return code of system()

Yes.

> I was building and testing under Debian armhf sid.
> gcc (Debian 4.8.2-16) 4.8.2
> CPU is a dual Cortex-A15 1.5GHz (TI OMAP 57xx).

Alrighty then,

Thomas
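The suggestion to check the return code of system() could look roughly like the following. The real benchmark script is Perl and is not shown in this thread, so this is a C sketch of the same idea using the POSIX wait-status macros; `run_checked` is a hypothetical name. As a side note, if the "return code of 4" reported above was a raw wait status, it would decode as "killed by signal 4", which is SIGILL on Linux; that reading is an assumption, since the thread does not say how the value was obtained.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>

/*
 * Hypothetical helper: run a command through the shell and report
 * whether it really succeeded. Returns 0 only for a normal exit with
 * status 0, and -1 for anything else (spawn failure, death by signal,
 * non-zero exit status).
 */
int run_checked(const char *cmd)
{
    int status = system(cmd);

    if (status == -1) {
        perror("system");
        return -1;
    }
    if (WIFSIGNALED(status)) {
        /* e.g. signal 4 (SIGILL) for an illegal instruction */
        fprintf(stderr, "'%s' was killed by signal %d\n",
                cmd, WTERMSIG(status));
        return -1;
    }
    if (WIFEXITED(status) && WEXITSTATUS(status) != 0) {
        /* a shell may also report a crashed child as 128 + signal */
        fprintf(stderr, "'%s' exited with status %d\n",
                cmd, WEXITSTATUS(status));
        return -1;
    }
    return 0;
}
```

A benchmark driver built on this would discard the timing of any run for which `run_checked` returns -1, instead of recording the near-zero time of a decoder that crashed immediately.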
Re: Bug#738981: Fwd: Bug#738981: Switch to use generic_fpu for ARM
> I see. In that case, I'll have to leave the package as it is until
> something along those lines is implemented.

So, I got conversion to float implemented now and tested with the
generic_nofpu decoder on x86-64. It _should_ of course work with ARM,
too ;-)

If you'd like to check the current snapshot of mpg123,
http://mpg123.org/snapshot/mpg123-20140220132548.tar.bz2 , you hopefully
will find that any normal build of mpg123 (unless specifying
--disable-float explicitly) now offers all usual formats. As a bonus, I
even implemented the 8 Bit A-Law output, which has always just been a
placeholder (nobody missed it, apparently).

I'd be interested in some timings of

        mpg123 -t -e s16 test.mp3
        mpg123 -t -e f32 test.mp3

with the various builds you'll do for the ARM variants. Best would be
running

        perl scripts/benchmark-cpu.pl src/mpg123 convergence_-_points_of_view/*.mp3

with http://mpg123.orgis.org/convergence_-_points_of_view.tar.gz as
reference album, as mentioned on
http://mpg123.orgis.org/benchmarking.shtml to be able to compare the
performance of the code and machine to others. This yields output like
this:

#mpg123 benchmark (user CPU time in seconds for decoding)
#decoder        t_s16/s t_f32/s
x86-64          3.39    4.05
generic         6.15    6.01
generic_dither  6.36    5.97

... or this, with --with-cpu=generic_fpu:

#mpg123 benchmark (user CPU time in seconds for decoding)
#decoder        t_s16/s t_f32/s
generic         6.14    6.29

(on a Core2Duo machine).

> Yes, you can do that - build several copies of the library and use the
> hwcaps / auxv approach to pick the best one for the hardware at link
> time.
>
> > NEON detection may come... but if we have linker selection, that would
> > be covered right now.
>
> Yup.

Seconding the second part: Linker selection it is. NEON runtime detection
just isn't fun in user code.

The bright side: If the multiple builds are set up and tested, I can
safely release mpg123-1.19.0 with the changes and we finally have this
settled.

Alrighty then,

Thomas
Re: Bug#738981: Fwd: Bug#738981: Switch to use generic_fpu for ARM
Thomas Orgis wrote:
> So, I got conversion to float implemented now and tested with the
> generic_nofpu decoder on x86-64. It _should_ of course work with ARM,
> too ;-)
>
> If you'd like to check the current snapshot of mpg123,
> http://mpg123.org/snapshot/mpg123-20140220132548.tar.bz2 , you hopefully
> will find that any normal build of mpg123 (unless specifying
> --disable-float explicitly) now offers all usual formats. As a bonus, I
> even implemented the 8 Bit A-Law output, which has always just been a
> placeholder (nobody missed it, apparently).
>
> I'd be interested in some timings of
>
>         mpg123 -t -e s16 test.mp3
>         mpg123 -t -e f32 test.mp3
>
> with the various builds you'll do for the ARM variants. Best would be
> running
>
>         perl scripts/benchmark-cpu.pl src/mpg123 convergence_-_points_of_view/*.mp3
>
> with http://mpg123.orgis.org/convergence_-_points_of_view.tar.gz as
> reference album, as mentioned on
> http://mpg123.orgis.org/benchmarking.shtml to be able to compare the
> performance of the code and machine to others. This yields output like
> this:
>
> #mpg123 benchmark (user CPU time in seconds for decoding)
> #decoder        t_s16/s t_f32/s
> x86-64          3.39    4.05
> generic         6.15    6.01
> generic_dither  6.36    5.97
>
> ... or this, with --with-cpu=generic_fpu:
>
> #mpg123 benchmark (user CPU time in seconds for decoding)
> #decoder        t_s16/s t_f32/s
> generic         6.14    6.29
>
> (on a Core2Duo machine)

Ok, on a 1GHz Freescale IMX53 (Cortex A8) in a (probably somewhat out of
date) Debian sid armhf chroot I tested with

        perl scripts/benchmark-cpu.pl src/mpg123 convergence_-_points_of_view/*.mp3

in the following configurations.

Built with ./configure --with-cpu=arm_nofpu

#mpg123 benchmark (user CPU time in seconds for decoding)
#decoder        t_s16/s t_f32/s
ARM             30.36   34.26

Built with ./configure --with-cpu=generic_fpu

#mpg123 benchmark (user CPU time in seconds for decoding)
#decoder        t_s16/s t_f32/s
generic         148.66  138.49

Built with CFLAGS=-mfpu=neon ./configure --with-cpu=neon

#mpg123 benchmark (user CPU time in seconds for decoding)
#decoder        t_s16/s t_f32/s
NEON            0.03    0.04

I found the NEON result unbelievable, so I decided to run the test program
you mentioned to me in my private mail asking about how to run the
benchmarks.

root@plugwash:/mpg123-test# LD_LIBRARY_PATH=/mpg123-20140220132548-arm_nofpu/src/libmpg123/.libs/ perl compliance.pl /mpg123-20140220132548-arm_nofpu/src/mpg123
Layer 1
-- 16 bit signed integer output
fl1.bit: RMS=3.486054e-02 (FAIL) maxdiff=5.002832e-02 (FAIL)
fl2.bit: RMS=3.485670e-02 (FAIL) maxdiff=5.008233e-02 (FAIL)
fl3.bit: RMS=3.485293e-02 (FAIL) maxdiff=5.008245e-02 (FAIL)
fl4.bit: RMS=1.510105e-01 (FAIL) maxdiff=5.277658e-01 (FAIL)
fl5.bit: RMS=3.109439e-01 (FAIL) maxdiff=4.475173e-01 (FAIL)
fl6.bit: RMS=1.649138e-01 (FAIL) maxdiff=4.589995e-01 (FAIL)
fl7.bit: RMS=2.211659e-02 (FAIL) maxdiff=2.959942e-01 (FAIL)
fl8.bit: RMS=3.484906e-02 (FAIL) maxdiff=5.002034e-02 (FAIL)
-- 32 bit integer output
fl1.bit: RMS=3.486054e-02 (FAIL) maxdiff=5.002832e-02 (FAIL)
fl2.bit: RMS=3.485670e-02 (FAIL) maxdiff=5.008233e-02 (FAIL)
fl3.bit: RMS=3.485293e-02 (FAIL) maxdiff=5.008245e-02 (FAIL)
fl4.bit: RMS=1.513207e-01 (FAIL) maxdiff=4.787517e-01 (FAIL)
fl5.bit: RMS=3.109439e-01 (FAIL) maxdiff=4.475173e-01 (FAIL)
fl6.bit: RMS=1.649138e-01 (FAIL) maxdiff=4.589995e-01 (FAIL)
fl7.bit: RMS=2.211659e-02 (FAIL) maxdiff=2.959942e-01 (FAIL)
fl8.bit: RMS=3.484906e-02 (FAIL) maxdiff=5.002034e-02 (FAIL)
-- 24 bit integer output
fl1.bit: RMS=3.486054e-02 (FAIL) maxdiff=5.002832e-02 (FAIL)
fl2.bit: RMS=3.485670e-02 (FAIL) maxdiff=5.008233e-02 (FAIL)
fl3.bit: RMS=3.485293e-02 (FAIL) maxdiff=5.008245e-02 (FAIL)
fl4.bit: RMS=1.494715e-01 (FAIL) maxdiff=4.984906e-01 (FAIL)
fl5.bit: RMS=3.109439e-01 (FAIL) maxdiff=4.475173e-01 (FAIL)
fl6.bit: RMS=1.649138e-01 (FAIL) maxdiff=4.589995e-01 (FAIL)
fl7.bit: RMS=2.211659e-02 (FAIL) maxdiff=2.959942e-01 (FAIL)
fl8.bit: RMS=3.484906e-02 (FAIL) maxdiff=5.002034e-02 (FAIL)
-- 32 bit floating point output
fl1.bit: RMS=3.486054e-02 (FAIL) maxdiff=5.002832e-02 (FAIL)
fl2.bit: RMS=3.485670e-02 (FAIL) maxdiff=5.008233e-02 (FAIL)
fl3.bit: RMS=3.485293e-02 (FAIL) maxdiff=5.008245e-02 (FAIL)
fl4.bit: RMS=1.137037e-01 (FAIL) maxdiff=4.459082e-01 (FAIL)
fl5.bit: RMS=3.109439e-01 (FAIL) maxdiff=4.475173e-01 (FAIL)
fl6.bit: RMS=1.649138e-01 (FAIL) maxdiff=4.589995e-01 (FAIL)
fl7.bit: RMS=2.211659e-02 (FAIL) maxdiff=2.959942e-01 (FAIL)
fl8.bit: RMS=3.484906e-02 (FAIL) maxdiff=5.002034e-02 (FAIL)
Layer 2
-- 16 bit signed integer output
fl10.bit: RMS=3.528939e-02 (FAIL) maxdiff=6.501251e-02 (FAIL)
fl11.bit: RMS=3.528947e-02 (FAIL) maxdiff=6.501383e-02 (FAIL)
fl12.bit: RMS=3.528948e-02
Re: Bug#738981: Fwd: Bug#738981: Switch to use generic_fpu for ARM
I'm adding the mpg123 assembly guru to the CC list, as I imagine he would
be interested in why his ARM NEON code doesn't work on a Cortex A8 chip
here. Needless to say, it worked before (on other systems). Also, the
precision of the arm_nofpu code does not look right.

This topic is now shifting towards mpg123 development, but as long as it's
only on this Debian platform that it's not working, I guess it is on-topic
for Debian, too.

On Fri, 21 Feb 2014 01:29:40 +0000, peter green plugw...@p10link.net wrote:
> Ok, on a 1GHz Freescale IMX53 (Cortex A8) in a (probably somewhat out of
> date) Debian sid armhf chroot
>
> Built with ./configure --with-cpu=arm_nofpu
>
> #mpg123 benchmark (user CPU time in seconds for decoding)
> #decoder        t_s16/s t_f32/s
> ARM             30.36   34.26
>
> Built with ./configure --with-cpu=generic_fpu
>
> #mpg123 benchmark (user CPU time in seconds for decoding)
> #decoder        t_s16/s t_f32/s
> generic         148.66  138.49

That seems to prove a point about trying to use the nofpu build. How does
--with-cpu=generic_nofpu stack up for this machine? Also regarding the
compliance test later on ...

> Built with CFLAGS=-mfpu=neon ./configure --with-cpu=neon
>
> #mpg123 benchmark (user CPU time in seconds for decoding)
> #decoder        t_s16/s t_f32/s
> NEON            0.03    0.04

Yeah, as we see "Illegal instruction", this is most interesting. I refer to
Taihei, as I don't have a NEON setup at hand (need to get a Debian chroot
going on my phone).

> root@plugwash:/mpg123-test# LD_LIBRARY_PATH=/mpg123-20140220132548-arm_nofpu/src/libmpg123/.libs/ perl compliance.pl /mpg123-20140220132548-arm_nofpu/src/mpg123
> Layer 1
> -- 16 bit signed integer output
> fl1.bit: RMS=3.486054e-02 (FAIL) maxdiff=5.002832e-02 (FAIL)
> fl2.bit: RMS=3.485670e-02 (FAIL) maxdiff=5.008233e-02 (FAIL)

That doesn't look pretty to me. Does it _sound_ like (metal) music? (In
case there is no audio chip, decode to WAV with -w output.wav; I happily
accept snippets, limit the number of frames via -n 500.)

> root@plugwash:/mpg123-test# LD_LIBRARY_PATH=/mpg123-20140220132548-generic_fpu/src/libmpg123/.libs/ perl compliance.pl /mpg123-20140220132548-generic_fpu/src/mpg123
> Layer 1
> -- 16 bit signed integer output
> fl1.bit: RMS=8.683659e-06 (PASS) maxdiff=1.525879e-05 (PASS)
> fl2.bit: RMS=8.686681e-06 (PASS) maxdiff=1.525879e-05 (PASS)
> fl3.bit: RMS=8.737660e-06 (PASS) maxdiff=1.525879e-05 (PASS)

Yes, that is better. Can you compare --with-cpu=generic_nofpu to isolate
this to the assembly version for ARM? This is how it looks with
generic_nofpu on my box:

sh$ perl ../test/compliance.pl src/mpg123
Layer 1
-- 16 bit signed integer output
fl1.bit: RMS=7.936754e-06 (PASS) maxdiff=2.533197e-05 (PASS)
fl2.bit: RMS=7.837830e-06 (PASS) maxdiff=2.342463e-05 (PASS)
fl3.bit: RMS=7.928321e-06 (PASS) maxdiff=2.485514e-05 (PASS)
fl4.bit: RMS=7.784658e-06 (PASS) maxdiff=2.521276e-05 (PASS)
fl5.bit: RMS=1.677634e-05 (LIMITED) maxdiff=6.681681e-05 (FAIL)
fl6.bit: RMS=1.071518e-05 (LIMITED) maxdiff=4.619360e-05 (PASS)
fl7.bit: RMS=7.469690e-06 (PASS) maxdiff=2.658367e-05 (PASS)
fl8.bit: RMS=7.923985e-06 (PASS) maxdiff=2.604723e-05 (PASS)
-- 32 bit integer output
fl1.bit: RMS=7.936754e-06 (PASS) maxdiff=2.533197e-05 (PASS)
fl2.bit: RMS=7.837830e-06 (PASS) maxdiff=2.342463e-05 (PASS)
fl3.bit: RMS=7.928321e-06 (PASS) maxdiff=2.485514e-05 (PASS)
fl4.bit: RMS=7.784658e-06 (PASS) maxdiff=2.521276e-05 (PASS)
fl5.bit: RMS=1.677634e-05 (LIMITED) maxdiff=6.681681e-05 (FAIL)
fl6.bit: RMS=1.071518e-05 (LIMITED) maxdiff=4.619360e-05 (PASS)
fl7.bit: RMS=7.469690e-06 (PASS) maxdiff=2.658367e-05 (PASS)
fl8.bit: RMS=7.923985e-06 (PASS) maxdiff=2.604723e-05 (PASS)
-- 24 bit integer output
fl1.bit: RMS=7.936754e-06 (PASS) maxdiff=2.533197e-05 (PASS)
fl2.bit: RMS=7.837830e-06 (PASS) maxdiff=2.342463e-05 (PASS)
fl3.bit: RMS=7.928321e-06 (PASS) maxdiff=2.485514e-05 (PASS)
fl4.bit: RMS=7.784658e-06 (PASS) maxdiff=2.521276e-05 (PASS)
fl5.bit: RMS=1.677634e-05 (LIMITED) maxdiff=6.681681e-05 (FAIL)
fl6.bit: RMS=1.071518e-05 (LIMITED) maxdiff=4.619360e-05 (PASS)
fl7.bit: RMS=7.469690e-06 (PASS) maxdiff=2.658367e-05 (PASS)
fl8.bit: RMS=7.923985e-06 (PASS) maxdiff=2.604723e-05 (PASS)
-- 32 bit floating point output
fl1.bit: RMS=7.936754e-06 (PASS) maxdiff=2.533197e-05 (PASS)
fl2.bit: RMS=7.837830e-06 (PASS) maxdiff=2.342463e-05 (PASS)
fl3.bit: RMS=7.928321e-06 (PASS) maxdiff=2.485514e-05 (PASS)
fl4.bit: RMS=7.784658e-06 (PASS) maxdiff=2.521276e-05 (PASS)
fl5.bit: RMS=1.677634e-05 (LIMITED) maxdiff=6.681681e-05 (FAIL)
fl6.bit: RMS=1.071518e-05 (LIMITED) maxdiff=4.619360e-05 (PASS)
fl7.bit: RMS=7.469690e-06 (PASS) maxdiff=2.658367e-05 (PASS)
fl8.bit: RMS=7.923985e-06 (PASS) maxdiff=2.604723e-05 (PASS)
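For readers wondering how the PASS / LIMITED / FAIL labels above relate to the numbers, here is a minimal sketch. The source of compliance.pl is not shown in this thread, so the thresholds below are an assumption: they are the ISO/IEC 11172-4 accuracy limits as I recall them (full accuracy: RMS below 2^-15/sqrt(12) and maximum difference at most 2^-14; limited accuracy: RMS below 2^-11/sqrt(12)). They are at least consistent with the verdicts printed above. The names `classify_rms` and `maxdiff_passes` are hypothetical.

```c
/* Verdicts as printed in the compliance output above. */
enum verdict { VERDICT_PASS, VERDICT_LIMITED, VERDICT_FAIL };

/* sqrt(12) as a constant, so the sketch does not need libm. */
#define SQRT12 3.4641016151377544

/*
 * Assumed thresholds (not taken from compliance.pl itself):
 * full accuracy if RMS < 2^-15 / sqrt(12),
 * limited accuracy if RMS < 2^-11 / sqrt(12).
 */
enum verdict classify_rms(double rms)
{
    const double full_limit    = (1.0 / 32768.0) / SQRT12; /* ~8.81e-06 */
    const double limited_limit = (1.0 / 2048.0) / SQRT12;  /* ~1.41e-04 */

    if (rms < full_limit)
        return VERDICT_PASS;
    if (rms < limited_limit)
        return VERDICT_LIMITED;
    return VERDICT_FAIL;
}

/* Assumed limit for the maximum sample difference: 2^-14 (~6.10e-05). */
int maxdiff_passes(double maxdiff)
{
    return maxdiff <= 1.0 / 16384.0;
}
```

Under these assumptions the arm_nofpu RMS values around 3.5e-02 miss even the limited-accuracy limit by more than two orders of magnitude, which is why that output looks like a real decoding defect rather than a rounding issue.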
Re: Bug#738981: Fwd: Bug#738981: Switch to use generic_fpu for ARM
On Mon, 17 Feb 2014 10:00:48 +0200, Riku Voipio riku.voi...@iki.fi wrote:
> Thanks Peter for explaining, this was how I ended up with the suggestion
> in the bug.
>
> > I see. In that case, I'll have to leave the package as it is until
> > something along those lines is implemented.
>
> Yes. The ideal solution is for the upstream to implement cpu runtime
> detection that:
>
> 1) uses neon if it is available
> 2) falls back to fixed point if app requested 16-bit playback
> 3) finally falls back to generic fpu code if neither of above applies
>
> Any packaging level workaround is going to be suboptimal for someone.

Isn't the approach for the linker to select libraries like libavcodec on
the table anymore? I see that I'll have to add that float conversion code
to keep the features along all builds, but selecting a vfp and non-vfp
variant for fixed point or floating point via the linker seems like the
cleanest approach you are going to get.

NEON detection may come... but if we have linker selection, that would be
covered right now.

So ... can I get away with adding that stupid float conversion, so folks
have reasonable performance in likely applications of Debian on ARM,
please? ;-)

Alrighty then,

Thomas

PS: I'll have to remove those "experimental" markings from the nofpu
variants in configure help. They are getting old.
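Riku's three-step fallback order quoted above can be written down as a small selection function. This is only an illustration of the proposed policy, not mpg123 code; the enum values and the name `choose_decoder` are hypothetical, and how the two inputs would be obtained is left open.

```c
/* Hypothetical labels for the three decoder families discussed here. */
enum decoder { DEC_NEON, DEC_FIXED_POINT, DEC_GENERIC_FPU };

/*
 * Sketch of the proposed policy:
 *   1) use NEON if the CPU has it,
 *   2) otherwise use fixed point if the application asked for 16-bit output,
 *   3) otherwise fall back to the generic floating point code.
 */
enum decoder choose_decoder(int cpu_has_neon, int app_wants_s16)
{
    if (cpu_has_neon)
        return DEC_NEON;
    if (app_wants_s16)
        return DEC_FIXED_POINT;
    return DEC_GENERIC_FPU;
}
```

Note that step 2 depends on the output format requested by the application, which is exactly the part a purely packaging-level (build-time) choice cannot see.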
Re: Bug#738981: Fwd: Bug#738981: Switch to use generic_fpu for ARM
On Mon, Feb 17, 2014 at 11:43:16AM +0100, Thomas Orgis wrote:
> On Mon, 17 Feb 2014 10:00:48 +0200, Riku Voipio riku.voi...@iki.fi wrote:
> > Thanks Peter for explaining, this was how I ended up with the
> > suggestion in the bug.
> >
> > > I see. In that case, I'll have to leave the package as it is until
> > > something along those lines is implemented.
> >
> > Yes. The ideal solution is for the upstream to implement cpu runtime
> > detection that:
> >
> > 1) uses neon if it is available
> > 2) falls back to fixed point if app requested 16-bit playback
> > 3) finally falls back to generic fpu code if neither of above applies
> >
> > Any packaging level workaround is going to be suboptimal for someone.
>
> Isn't the approach for the linker to select libraries like libavcodec on
> the table anymore? I see that I'll have to add that float conversion
> code to keep the features along all builds, but selecting a vfp and
> non-vfp variant for fixed point or floating point via the linker seems
> like the cleanest approach you are going to get.

Yes, you can do that - build several copies of the library and use the
hwcaps / auxv approach to pick the best one for the hardware at link time.

> NEON detection may come... but if we have linker selection, that would
> be covered right now.

Yup.

-- 
Steve McIntyre, Cambridge, UK.                                st...@einval.com
"Is there anybody out there?"
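With the hwcaps / auxv approach the dynamic loader makes the choice, so the library needs no detection code of its own. For illustration, the same capability bit can be read from user code on Linux via the auxiliary vector. This is a sketch under assumptions: it uses glibc's `getauxval()` (available since glibc 2.16) and the `HWCAP_NEON` bit from the kernel's 32-bit ARM headers; the function name `cpu_has_neon` is hypothetical and nothing here is taken from mpg123.

```c
#if defined(__linux__)
#include <sys/auxv.h>   /* getauxval(), AT_HWCAP */
#endif
#if defined(__linux__) && defined(__arm__)
#include <asm/hwcap.h>  /* HWCAP_NEON on 32-bit ARM Linux */
#endif

/*
 * Returns 1 if the kernel advertises NEON for this process, else 0.
 * On anything other than 32-bit ARM Linux this sketch simply answers 0,
 * which also covers 64-bit ARM, where the hwcap bits are laid out
 * differently.
 */
int cpu_has_neon(void)
{
#if defined(__linux__) && defined(__arm__) && defined(HWCAP_NEON)
    return (getauxval(AT_HWCAP) & HWCAP_NEON) != 0;
#else
    return 0;
#endif
}
```

The loader-based selection has the advantage that each library copy is compiled as a whole for its target, so there is no mixing of instruction sets inside one binary.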
Re: Bug#738981: Fwd: Bug#738981: Switch to use generic_fpu for ARM
On 2014-02-17, Steve McIntyre st...@einval.com wrote:
> Yes, you can do that - build several copies of the library and use the
> hwcaps / auxv approach to pick the best one for the hardware at link
> time.
>
>> NEON detection may come... but if we have linker selection, that would
>> be covered right now.
>
> Yup.

Qt is heading from doing autodetection at runtime to the hwcaps/auxv
approach on all archs, because runtime detection also can have its issues
sometimes, especially when you have inlinable code ...

(Yes. I can expand if requested)

/Sune