Re: Endianness of data files in MultiArch (was: Please test gzip -9n - related to dpkg with multiarch support)
On Sun, Feb 12, 2012 at 08:00, Carsten Hey cars...@debian.org wrote:
> * Aron Xu [2012-02-09 01:22 +0800]:
>> Some packages come with data files for which endianness matters, and many of them are large enough to be split into a separate arch:all package if endianness were not something to care about.
>> ...
>
> Debian Policy, beginning of section 5.6.8:
>
> | Depending on context and the control file used, the Architecture field
> | can include the following sets of values:
> |   * A unique single word identifying a Debian machine architecture as
> |     described in Architecture specification strings, Section 11.1.
> |   * An architecture wildcard identifying a set of Debian machine
> |     architectures, see Architecture wildcards, Section 11.1.1. any
> |     matches all Debian machine architectures and is the most frequently
> |     used.
> |   * all, which indicates an architecture-independent package.
> |   * source, which indicates a source package.
>
> Possible addition to solve your problem:
>
>   * littleendian[1], which indicates a package that is installable on all little endian architectures.
>   * bigendian[1], which indicates a package that is installable on all big endian architectures.

I agree this would help a lot, and the names could be shortened to le and be. There would still be file collisions if a maintainer doesn't install the files in different paths, but that's another story.

debian-policy people, would you like to take up this idea? What are the steps to make this (possibly) happen?

--
Regards,
Aron Xu

--
To UNSUBSCRIBE, email to debian-devel-requ...@lists.debian.org
with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org
Archive: http://lists.debian.org/CAMr=8w6zX=2_u9qgswi-sno+axxskr4fpk4au1n7cbxtkjd...@mail.gmail.com
Re: Endianness of data files in MultiArch (was: Please test gzip -9n - related to dpkg with multiarch support)
* Aron Xu [2012-02-09 01:22 +0800]:
> Some packages come with data files for which endianness matters, and many of them are large enough to be split into a separate arch:all package if endianness were not something to care about.
> ...

Debian Policy, beginning of section 5.6.8:

| Depending on context and the control file used, the Architecture field
| can include the following sets of values:
|   * A unique single word identifying a Debian machine architecture as
|     described in Architecture specification strings, Section 11.1.
|   * An architecture wildcard identifying a set of Debian machine
|     architectures, see Architecture wildcards, Section 11.1.1. any
|     matches all Debian machine architectures and is the most frequently
|     used.
|   * all, which indicates an architecture-independent package.
|   * source, which indicates a source package.

Possible addition to solve your problem:

  * littleendian[1], which indicates a package that is installable on all little endian architectures.
  * bigendian[1], which indicates a package that is installable on all big endian architectures.

The following paragraph could be (changes are marked in a wdiff-like format):

| In the main debian/control file in the source package, this field may
| contain the special value all, the special architecture wildcard{+s+}
| any{+ or endian (which matches littleendian and bigendian)+}, or
| a list of specific and wildcard architectures separated by spaces. If
| all{+, endian+} or any appears, that value must be the entire contents
| of the field. Most packages will use either all or any.

[1] The dash before endian, which would make it more readable, is omitted to make the resulting architecture wildcards (see Debian Policy, section 11.1.1) more consistent with the existing ones.
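To illustrate what the proposed littleendian and bigendian values would mean for installability, here is a minimal sketch. It is not how dpkg resolves architectures; the function name `installable` and the small lookup table are assumptions made up for this example, and the table lists only a handful of architectures.

```python
# Illustrative table only: byte order of a few Debian architectures.
ARCH_BYTE_ORDER = {
    "amd64": "little",
    "i386": "little",
    "armel": "little",
    "mipsel": "little",
    "powerpc": "big",
    "s390": "big",
    "sparc": "big",
    "mips": "big",
}


def installable(package_architecture: str, host_arch: str) -> bool:
    """Return True if a binary package with the given Architecture value
    could be installed on host_arch under the proposed scheme.

    "littleendian" and "bigendian" are the proposed additions; "all" and
    concrete architecture names behave as they do today.
    """
    if package_architecture == "all":
        return True
    if package_architecture == "littleendian":
        return ARCH_BYTE_ORDER[host_arch] == "little"
    if package_architecture == "bigendian":
        return ARCH_BYTE_ORDER[host_arch] == "big"
    # Otherwise treat the value as one concrete architecture name.
    return package_architecture == host_arch
```

Under this reading, a libfoo-data package marked littleendian would be accepted on amd64 and rejected on powerpc, without any list of architectures in debian/control.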
Re: Endianness of data files in MultiArch (was: Please test gzip -9n - related to dpkg with multiarch support)
On Thu, Feb 09, 2012 at 01:58:28AM +0800, Aron Xu wrote:
> On Thu, Feb 9, 2012 at 01:35, Simon McVittie s...@debian.org wrote:
>> On 08/02/12 17:22, Aron Xu wrote:
>>> Some packages come with data files for which endianness matters, and many of them are large enough to be split into a separate arch:all package if endianness were not something to care about. AFAIK some maintainers are not aware of endianness issues in their packages and have simply ignored them (I am not sure how many, but any that are discovered should lead to an RC bug).
>>
>> Hopefully Jakub Wilk's automatic checks for conflicting files http://people.debian.org/~jwilk/multi-arch/ will already be picking this up, in cases where the less-used-endianness architectures aren't broken already.
>>
>> If the less-used-endianness architectures are already broken, that's also a bug (potentially an RC one), just like code that compiles but doesn't work on a particular endianness due to other assumptions - and if nobody has noticed it yet, presumably the package doesn't have any users (or regression tests) on those architectures.
>
> Or some of them just gave up because it is a less-used architecture.
>
>>> It would be great to have some mechanism to handle this kind of problem in Debian, to avoid forcing such data into arch:any packages.
>>
>> If the right endianness is critical: libfoo:i386 Depends: libfoo-data-le, libfoo:powerpc Depends: libfoo-data-be, both data packages arch:all, data files in /usr/share/foo/le and /usr/share/foo/be respectively?
>
> This does not look very nice, because we would need to maintain a list of architectures in debian/control, and the package is potentially broken whenever a new architecture is added. Also, arch:all packages are usually built by the uploading DD on one architecture, mostly amd64 or i386 today; how can they generate be data files without access to such a machine? Adding an option to the data generator/parser so that it can generate be/le data on any architecture does not seem to be a reasonable approach.
>
>> Or just make sure the data has an endianness marker, and enhance the reading package to do the right byteswapping based on the endianness marker - e.g. this has been discussed for gettext, which ended up just writing out the same endianness on all platforms. Many formats (particularly those that originated on Windows) are always little-endian, and big-endian platforms reading them just take the minor performance hit; formats that respect network byte order have the opposite situation.
>
> This is valid for widely used applications and formats such as gettext or images, which are designed to behave this way, but there are upstreams that do not want to accept such a change, especially because of the complexity and the performance impact.
>
> Currently I am using arch:any for data files that aren't affected by multiarch, i.e. not same or foreign. For endianness-critical data that is required to make a library work, I have to force it to be installed into /usr/lib/triplet/$package/data/ and mark it as Multi-Arch: same. This is sufficient to avoid breakage, but again it consumes a lot of space on the mirrors.

Actually, what is a lot here? I mean, how many libraries are there containing endianness-critical data and how big are the actual files?

Not that I'm any kind of expert, but this solution sounds reasonable to me.

Hauke

--
 .''`.   Jan Hauke Rahm j...@debian.org                  www.jhr-online.de
: :'  :  Debian Developer                                   www.debian.org
`. `'`   Member of the Linux Foundation                      www.linux.com
  `-     Fellow of the Free Software Foundation Europe        www.fsfe.org
Re: Endianness of data files in MultiArch (was: Please test gzip -9n - related to dpkg with multiarch support)
Sorry, the thread was broken and I saw your reply just now.

On Thu, Feb 9, 2012 at 16:23, Jan Hauke Rahm j...@debian.org wrote:
> On Thu, Feb 09, 2012 at 01:58:28AM +0800, Aron Xu wrote:
>> This is valid for widely used applications and formats such as gettext or images, which are designed to behave this way, but there are upstreams that do not want to accept such a change, especially because of the complexity and the performance impact.
>>
>> Currently I am using arch:any for data files that aren't affected by multiarch, i.e. not same or foreign. For endianness-critical data that is required to make a library work, I have to force it to be installed into /usr/lib/triplet/$package/data/ and mark it as Multi-Arch: same. This is sufficient to avoid breakage, but again it consumes a lot of space on the mirrors.
>
> Actually, what is a lot here? I mean, how many libraries are there containing endianness-critical data and how big are the actual files?
>
> Not that I'm any kind of expert, but this solution sounds reasonable to me.
>
> Hauke

As far as I know, there aren't many libraries known to have endianness-critical data, but there might be landmines because maintainers just aren't aware of it. I happened to notice this problem because my team maintains several input method stacks, which usually need to deal with linguistic data. [1]

I have a library named libpinyin at hand to package, which has data files of ~7.5MiB after gzip -9 (the total size of the library is no more than 9MiB after gzip -9). We have 14 architectures on ftp-master, so the data files eat up 105MiB, whereas if we find some way to have only one copy each for be and le, they will use only 15MiB. Also consider that once it is released in stable, another copy of the data makes its way into the archive whenever a new version is uploaded to unstable.

The same concern applies to other endianness-critical data that is not affected by Multi-Arch at present: we need to make it arch:any, and in the end it eats more and more space.

[1] Performance is critical for these applications. This doesn't mean they consume a lot of CPU time, but they must respond very quickly to the user's input: they do some complex calculations to split a sentence into words and find a list of the most relevant suggestions, which requires querying 10^5 ~ 10^6 lines of data several times per action. One project tried to use something like SQLite3, but the performance was frustrating, so they have decided not to pursue that and to design a data format that fits their requirements instead.

--
Regards,
Aron Xu
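The archive-space figures for libpinyin quoted in this message can be checked with a few lines of arithmetic. The variable names below are only illustrative; the numbers (7.5MiB of data, 14 architectures, one copy per byte order) are the ones given in the message.

```python
DATA_MIB = 7.5        # compressed data files, per copy
ARCHITECTURES = 14    # architectures on ftp-master at the time
BYTE_ORDERS = 2       # one copy for be, one for le

# One copy per architecture (arch:any or Multi-Arch: same).
per_architecture_total = DATA_MIB * ARCHITECTURES

# One copy per byte order (the proposed be/le split).
per_byte_order_total = DATA_MIB * BYTE_ORDERS

saved = per_architecture_total - per_byte_order_total
print(per_architecture_total, per_byte_order_total, saved)
```

This gives 105MiB against 15MiB, a difference of 90MiB per uploaded version of this one package.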
Endianness of data files in MultiArch (was: Please test gzip -9n - related to dpkg with multiarch support)
I want to raise the endianness of data files. This is a suggestion rather than a flaw report: I would like to explore whether the current situation can be improved while Multi-Arch is being implemented in Debian.

Some packages come with data files for which endianness matters, and many of them are large enough to be split into a separate arch:all package if endianness were not something to care about. AFAIK some maintainers are not aware of endianness issues in their packages and have simply ignored them (I am not sure how many, but any that are discovered should lead to an RC bug).

It would be great to have some mechanism to handle this kind of problem in Debian, to avoid forcing such data into arch:any packages.

--
Regards,
Aron Xu
Re: Endianness of data files in MultiArch (was: Please test gzip -9n - related to dpkg with multiarch support)
On 08/02/12 17:22, Aron Xu wrote:
> Some packages come with data files for which endianness matters, and many of them are large enough to be split into a separate arch:all package if endianness were not something to care about. AFAIK some maintainers are not aware of endianness issues in their packages and have simply ignored them (I am not sure how many, but any that are discovered should lead to an RC bug).

Hopefully Jakub Wilk's automatic checks for conflicting files http://people.debian.org/~jwilk/multi-arch/ will already be picking this up, in cases where the less-used-endianness architectures aren't broken already.

If the less-used-endianness architectures are already broken, that's also a bug (potentially an RC one), just like code that compiles but doesn't work on a particular endianness due to other assumptions - and if nobody has noticed it yet, presumably the package doesn't have any users (or regression tests) on those architectures.

> It would be great to have some mechanism to handle this kind of problem in Debian, to avoid forcing such data into arch:any packages.

If the right endianness is critical: libfoo:i386 Depends: libfoo-data-le, libfoo:powerpc Depends: libfoo-data-be, both data packages arch:all, data files in /usr/share/foo/le and /usr/share/foo/be respectively?

Or just make sure the data has an endianness marker, and enhance the reading package to do the right byteswapping based on the endianness marker - e.g. this has been discussed for gettext, which ended up just writing out the same endianness on all platforms. Many formats (particularly those that originated on Windows) are always little-endian, and big-endian platforms reading them just take the minor performance hit; formats that respect network byte order have the opposite situation.

S
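The endianness-marker idea from the last paragraph above can be sketched as follows. This is a toy format invented for illustration, not the layout used by gettext or any real package: the marker value `MAGIC`, the header layout (marker, count, then 32-bit values) and the function names are all assumptions.

```python
import struct

# Hypothetical marker. Its bytes are not symmetric, so reading it with the
# wrong byte order yields a different number.
MAGIC = 0x1A2B3C4D


def write_table(values, byte_order):
    """Serialise unsigned 32-bit values; byte_order is '<' (le) or '>' (be)."""
    header = struct.pack(f"{byte_order}II", MAGIC, len(values))
    body = struct.pack(f"{byte_order}{len(values)}I", *values)
    return header + body


def read_table(blob):
    """Detect the writer's byte order from the marker, then decode."""
    for byte_order in ("<", ">"):
        (magic,) = struct.unpack_from(f"{byte_order}I", blob, 0)
        if magic == MAGIC:
            break
    else:
        raise ValueError("unrecognised endianness marker")
    (count,) = struct.unpack_from(f"{byte_order}I", blob, 4)
    # struct performs the byte swapping when the order differs from the host.
    return list(struct.unpack_from(f"{byte_order}{count}I", blob, 8))
```

A reader written this way accepts a file produced in either byte order, so a single arch:all copy would suffice; the cost is the swapping work on hosts whose native order differs from the file's.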
Re: Endianness of data files in MultiArch (was: Please test gzip -9n - related to dpkg with multiarch support)
On Thu, Feb 9, 2012 at 01:35, Simon McVittie s...@debian.org wrote:
> On 08/02/12 17:22, Aron Xu wrote:
>> Some packages come with data files for which endianness matters, and many of them are large enough to be split into a separate arch:all package if endianness were not something to care about. AFAIK some maintainers are not aware of endianness issues in their packages and have simply ignored them (I am not sure how many, but any that are discovered should lead to an RC bug).
>
> Hopefully Jakub Wilk's automatic checks for conflicting files http://people.debian.org/~jwilk/multi-arch/ will already be picking this up, in cases where the less-used-endianness architectures aren't broken already.
>
> If the less-used-endianness architectures are already broken, that's also a bug (potentially an RC one), just like code that compiles but doesn't work on a particular endianness due to other assumptions - and if nobody has noticed it yet, presumably the package doesn't have any users (or regression tests) on those architectures.

Or some of them just gave up because it is a less-used architecture.

>> It would be great to have some mechanism to handle this kind of problem in Debian, to avoid forcing such data into arch:any packages.
>
> If the right endianness is critical: libfoo:i386 Depends: libfoo-data-le, libfoo:powerpc Depends: libfoo-data-be, both data packages arch:all, data files in /usr/share/foo/le and /usr/share/foo/be respectively?

This does not look very nice, because we would need to maintain a list of architectures in debian/control, and the package is potentially broken whenever a new architecture is added. Also, arch:all packages are usually built by the uploading DD on one architecture, mostly amd64 or i386 today; how can they generate be data files without access to such a machine? Adding an option to the data generator/parser so that it can generate be/le data on any architecture does not seem to be a reasonable approach.

> Or just make sure the data has an endianness marker, and enhance the reading package to do the right byteswapping based on the endianness marker - e.g. this has been discussed for gettext, which ended up just writing out the same endianness on all platforms. Many formats (particularly those that originated on Windows) are always little-endian, and big-endian platforms reading them just take the minor performance hit; formats that respect network byte order have the opposite situation.

This is valid for widely used applications and formats such as gettext or images, which are designed to behave this way, but there are upstreams that do not want to accept such a change, especially because of the complexity and the performance impact.

Currently I am using arch:any for data files that aren't affected by multiarch, i.e. not same or foreign. For endianness-critical data that is required to make a library work, I have to force it to be installed into /usr/lib/triplet/$package/data/ and mark it as Multi-Arch: same. This is sufficient to avoid breakage, but again it consumes a lot of space on the mirrors.

I thought about something like /usr/share/$package/data/{be,le} in arch:all, but that does not appear to be a reasonable solution because we would need to modify the data generator/parser.

--
Regards,
Aron Xu
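For reference, the reader-side change that a /usr/share/$package/data/{be,le} layout would require can be quite small. The sketch below only shows path selection; the function name `data_dir` is hypothetical, and it does not address the harder part raised in this message, namely producing the be files on a little-endian build machine.

```python
import sys
from pathlib import PurePosixPath


def data_dir(package, byteorder=sys.byteorder):
    """Return the per-endianness data directory for a package.

    byteorder defaults to the running interpreter's native order, which
    sys.byteorder reports as "little" or "big".
    """
    suffix = {"little": "le", "big": "be"}[byteorder]
    return PurePosixPath("/usr/share") / package / "data" / suffix
```

On an amd64 or i386 host, `data_dir("foo")` yields `/usr/share/foo/data/le`; on powerpc it would yield `/usr/share/foo/data/be`.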