----- Forwarded message from PyTables <[EMAIL PROTECTED]> -----

#152: creating a custom filter
---------------------------------+------------------------------------------
Reporter:  [EMAIL PROTECTED]          |       Owner:  somebody
    Type:  enhancement           |      Status:  new     
Priority:  major                 |   Component:  PyTables
 Version:  trunk                 |    Keywords:          
---------------------------------+------------------------------------------
 Can I get some pointers on how to create my own filter for PyTables? It is
 fairly clear to me how this should be done at HDF5 API level, but in
 PyTables a list of filters seems to be quite hard-wired. I am considering
 PyTables for a database that stores large amount of genomic sequence (5
 letter alphabet - 4 nucleotides - ACTG plus N for "unknown" nucleotide).
 Such sequences can be efficiently encoded with 2 bits per nucleotide for
 storage (4x compression), but processing them is more convenient in
 decoded "byte per nucleotide" form. Encoding/decoding looks like a natural
 filter procedure. For typical real sequences, further zlib compression of
 bit-encoded data has no benefit or even inflates the data. The best
 specialized and very costly algorithms give < 15% additional compression.
 So, bit encoding is all that is needed. I will be willing to contribute
 the filter code back to PyTables source.

-- 
Ticket URL: <http://www.pytables.org/trac/ticket/152>
PyTables <http://www.pytables.org/>
Hierarchical datasets in Python

----- End forwarded message -----

(I'm moving the discussion here since the list is the preferred location
for asking questions.)

I guess that your proposal would be very useful for people using
PyTables in genomics (there are already some users in that field).  If
any of them are reading this, it would be interesting to hear their
opinion on this matter, so that we can get a fuller picture.

Just one question: if your alphabet has 5 letters, how are you planning
to encode it using only 2 bits?  Or does the N never actually appear in
the data?  Otherwise, I guess you need at least 3 bits.  Also, users
should be warned that applying the filter to non-ACTGN data will render
that data useless... well, good documentation on the valid input domain
should do the trick.
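
To make the bit-count question concrete, here is a minimal Python sketch
of the packing arithmetic.  It is not PyTables or HDF5 filter code; the
names (``CODES``, ``pack_2bit``, ``unpack_2bit``, ``bits_per_symbol``)
and the code assignment are my own assumptions for illustration.  It
shows that four symbols fit in 2 bits, that a fifth one pushes the
minimum to 3, and that a 2-bit packer has to reject anything outside
ACGT.

```python
# Hypothetical 2-bit code table; the ordering is an arbitrary choice.
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
LETTERS = "ACGT"


def bits_per_symbol(alphabet_size):
    """Minimum whole number of bits needed to distinguish that many symbols."""
    return max(1, (alphabet_size - 1).bit_length())


def pack_2bit(seq):
    """Pack an A/C/G/T string into bytes, four nucleotides per byte."""
    out = bytearray((len(seq) + 3) // 4)
    for i, letter in enumerate(seq):
        if letter not in CODES:
            # "N" (or any other letter) has no 2-bit code left to use.
            raise ValueError("cannot encode %r in 2 bits" % letter)
        # Nucleotide i goes into byte i // 4, at bit offset 2 * (i % 4).
        out[i // 4] |= CODES[letter] << (2 * (i % 4))
    return bytes(out)


def unpack_2bit(data, length):
    """Decode `length` nucleotides; the length must be stored separately."""
    return "".join(
        LETTERS[(data[i // 4] >> (2 * (i % 4))) & 3] for i in range(length)
    )
```

Here ``bits_per_symbol(4)`` gives 2 and ``bits_per_symbol(5)`` gives 3.
Note that the decoder needs the original sequence length, since the last
byte may be only partly used; a real filter would have to keep that
information somewhere.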

So, well, there is no guide as such for adding a new filter to PyTables,
but you can, for instance, search the sources for "bzip2", which is the
most recently added compressor, and work from what you find (I guess
your filter will be simpler, since it has no external dependencies like
bzip2 does)::

  debian/ptrepack.1 -- ptrepack manual page
  doc/xml/usersguide.xml -- documentation!
  setup.py -- pyrex_extnames and Extension entry
  src/_comp_bzip2.pyx -- bzip2 extension
  src/H5ARRAY.c -- add support to H5ARRAYmake
  src/H5TB-opt.c -- add support to H5TBOmake_table
  src/H5VLARRAY.c -- add support to H5VLARRAYmake
  src/H5Zbzip2.c -- define registration function and implement filter
  src/H5Zbzip2.h -- declare filter id and registration function
  src/utils.c -- include header
  src/utilsExtension.pyx -- initialize and register, whichLibVersion
  tables/filters.py -- all_complibs, docstrings
  tables/scripts/ptrepack.py -- usage string
  tables/tests/test_all.py -- print_versions
  tables/tests/test_....py -- VERY IMPORTANT: add some tests!

Below are the changesets related to the addition of bzip2 support.
Since the files have changed several times since then, I don't think
they will be of much help, but here they go: 764, 765, 767, 844, 1256,
1446, 1451, 1462, 1471, 2515, 3051 (try
http://www.pytables.org/trac/changeset/NUMBER).

Thanks for your support, and good luck!

PS: I'm closing the ticket since there is not much we can do with it for
now, but feel free to reopen it when you have a patch.  In the meantime,
please use the list for support.

::

        Ivan Vilata i Balaguer   >qo<   http://www.carabos.com/
               Cárabos Coop. V.  V  V   Enjoy Data
                                  ""
