----- Forwarded message from PyTables <[EMAIL PROTECTED]> -----

#152: creating a custom filter
---------------------------------+------------------------------------------
 Reporter:  [EMAIL PROTECTED]    |       Owner:  somebody
     Type:  enhancement          |      Status:  new
 Priority:  major                |   Component:  PyTables
  Version:  trunk                |    Keywords:
---------------------------------+------------------------------------------
 Can I get some pointers on how to create my own filter for PyTables? It is fairly clear to me how this should be done at HDF5 API level, but in PyTables a list of filters seems to be quite hard-wired.

 I am considering PyTables for a database that stores large amount of genomic sequence (5 letter alphabet - 4 nucleotides - ACTG plus N for "unknown" nucleotide). Such sequences can be efficiently encoded with 2 bits per nucleotide for storage (4x compression), but processing them is more convenient in decoded "byte per nucleotide" form. Encoding/decoding looks like a natural filter procedure.

 For typical real sequences, further zlib compression of bit-encoded data has no benefit or even inflates the data. The best specialized and very costly algorithms give < 15% additional compression. So, bit encoding is all that is needed.

 I will be willing to contribute the filter code back to PyTables source.

--
Ticket URL: <http://www.pytables.org/trac/ticket/152>
PyTables <http://www.pytables.org/>
Hierarchical datasets in Python

----- End forwarded message -----

(I'm moving the discussion here since the list is the preferred location for asking questions.)

I guess that your proposal would be very useful for people using PyTables in genomics (there are already some users in that field). If any of them are reading this, it would be interesting to hear their opinion on this matter, so that we can build a richer picture.

Just one question: if your alphabet has 5 letters, how are you planning to use only 2 bits to encode it? Or does the N never actually appear in the data? Otherwise, I guess you need at least 3 bits. Also, users should be warned that using the filter on non-ACTGN data will render that data useless... well, good documentation on the valid input domain should do the trick.

So, well, there are no pointers as such for adding a new filter to PyTables, but you can for instance look for "bzip2", which is the most recently added compressor, and copy from what you find (I guess your filter is simpler since it doesn't have external dependencies like bzip2)::

  debian/ptrepack.1           -- ptrepack manual page
  doc/xml/usersguide.xml      -- documentation!
  setup.py                    -- pyrex_extnames and Extension entry
  src/_comp_bzip2.pyx         -- bzip2 extension
  src/H5ARRAY.c               -- add support to H5ARRAYmake
  src/H5TB-opt.c              -- add support to H5TBOmake_table
  src/H5VLARRAY.c             -- add support to H5VLARRAYmake
  src/H5Zbzip2.c              -- define registration function and implement filter
  src/H5Zbzip2.h              -- declare filter id and registration function
  src/utils.c                 -- include header
  src/utilsExtension.pyx      -- initialize and register, whichLibVersion
  tables/filters.py           -- all_complibs, docstrings
  tables/scripts/ptrepack.py  -- usage string
  tables/tests/test_all.py    -- print_versions
  tables/tests/test_....py    -- VERY IMPORTANT: add some tests!

Now I'm listing the changesets related to the addition of bzip2 support.
Since the files have been changed several times, I don't think this will be of much help, but here they are: 764, 765, 767, 844, 1256, 1446, 1451, 1462, 1471, 2515, 3051 (try http://www.pytables.org/trac/changeset/NUMBER).

Thanks for your support, and good luck!

PS: I'm closing the ticket since there is not much we can do with it for now, but feel free to reopen it when you have a patch. In the meantime, please use the list for support.

::

  Ivan Vilata i Balaguer   >qo<   http://www.carabos.com/
         Cárabos Coop. V.  V  V   Enjoy Data
                            ""
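
To make the 2-bit versus 3-bit question concrete, here is a minimal sketch of the encoding step on its own, in plain Python. It is not PyTables or HDF5 filter code; every name in it (CODES, LETTERS, bits_per_symbol, pack_2bit, unpack_2bit) is hypothetical, and the letter-to-code ordering is an arbitrary assumption. It packs four nucleotides per byte and refuses anything outside ACGT, which is one possible way to enforce the valid input domain:

```python
# Hypothetical 2-bit code table; the ordering is an arbitrary assumption.
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
LETTERS = "ACGT"


def bits_per_symbol(alphabet_size):
    """Smallest fixed width (in whole bits) that can index the alphabet."""
    return max(1, (alphabet_size - 1).bit_length())


def pack_2bit(seq):
    """Pack an ACGT-only string into bytes, four nucleotides per byte."""
    out = bytearray((len(seq) + 3) // 4)
    for i, letter in enumerate(seq):
        if letter not in CODES:
            # N (or any other symbol) does not fit in a 2-bit code.
            raise ValueError(f"cannot encode {letter!r} at position {i}")
        # Nucleotide i goes into bits 2*(i % 4) .. 2*(i % 4) + 1 of byte i // 4.
        out[i // 4] |= CODES[letter] << (2 * (i % 4))
    return bytes(out)


def unpack_2bit(packed, length):
    """Decode `length` nucleotides; the length has to be stored separately."""
    return "".join(
        LETTERS[(packed[i // 4] >> (2 * (i % 4))) & 3] for i in range(length)
    )
```

With these definitions bits_per_symbol(4) is 2 and bits_per_symbol(5) is 3, so a fixed-width code that also covers N needs the third bit. A pure 2-bit scheme has to keep N out of the packed stream, or record it elsewhere. Note also that the original sequence length is needed for decoding, because the last byte may be only partly used.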
_______________________________________________
Pytables-users mailing list
Pytables-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/pytables-users